
Cost Controls — Caching, Rate Limiting & Spend Limits

Task

Enable caching, rate limiting, and spend limits to control AI inference costs at the gateway level.

Why

AI inference costs can spike unpredictably — a single runaway application, a burst of traffic, or one team's heavy usage can burn through budget. AI Gateway enforces cost guardrails at the infrastructure level, with no application code changes needed.

What You Are Configuring

  • Caching — serve identical prompt responses from cache at $0 cost
  • Rate limiting — cap the number of requests in a time window
  • Spend limits — set dollar-amount budgets scoped by model, provider, or user

Step 1: Enable Caching

  1. Navigate to AI > AI Gateway > bootcamp-lab > Settings
  2. Under Cache Responses, toggle to ON
  3. Set the TTL (time to live) to 300 seconds (5 minutes)

[Screenshot: Cache settings toggle and TTL configuration]

  4. Click Save
How caching works

AI Gateway caches responses using a SHA-256 hash of the provider + endpoint + model + auth header + full request body. If an identical request arrives within the TTL, the cached response is served instantly at $0 cost. Cache keys are exact-match — even a single character difference in the prompt produces a different key.
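
To make the exact-match behaviour concrete, here is a minimal sketch of how such a key could be derived. The field order, the newline separator, and the example provider, endpoint, and token values are assumptions for illustration only; the gateway's actual byte layout is not described on this page.

```python
import hashlib


def cache_key(provider: str, endpoint: str, model: str, auth_header: str, body: str) -> str:
    """Hash the five fields named above into one SHA-256 hex digest."""
    # Assumption: fields are joined with newlines before hashing.
    material = "\n".join([provider, endpoint, model, auth_header, body])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


# Hypothetical request fields, used only to show the comparison.
fields = ("workers-ai", "chat/completions", "llama-3.2-3b", "Bearer example-token")

key_a = cache_key(*fields, '{"prompt": "What is the capital of New Zealand?"}')
key_b = cache_key(*fields, '{"prompt": "What is the capital of New Zealand?"}')
key_c = cache_key(*fields, '{"prompt": "What is the capital of New Zealand? "}')  # trailing space

print(key_a == key_b)  # True: identical request, same key
print(key_a == key_c)  # False: one extra character, different key
```

Because the whole body is hashed, any change to the prompt, the model, or the auth header yields an unrelated digest, which is why near-duplicate prompts never share a cache entry.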


Step 2: Test Caching — Observe HIT vs MISS

Untick the "Custom Cost" checkbox

Make sure the Custom Cost checkbox in the Explorer app is unticked before sending the request.

  1. In the AI Gateway Explorer app, set User ID to user-anna
  2. Set Model to Llama 3.2 3B (small/cheap)
  3. Type a prompt: "What is the capital of New Zealand?"
  4. Click Send — note the response in the panel
  5. Now click Send Same Again — this resends the exact same prompt

Expected Result

Request | Cache Status | Cost    | Duration
------- | ------------ | ------- | -------------------------------
First   | MISS         | $0.000x | Longer (inference runs)
Second  | HIT          | $0.00   | Much faster (served from cache)

The Explorer app shows the cache status in the response panel. The second request is noticeably faster and costs nothing.
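
The MISS-then-HIT pattern can be reproduced with a small TTL cache. This is a sketch under assumptions: the class name, the injected `now` timestamps, and the choice not to refresh the TTL on a hit are illustrative and not confirmed gateway behaviour.

```python
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy response cache that expires entries after ttl_seconds."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, response)

    def fetch(self, key: str, now: float, run_inference: Callable[[], str]) -> Tuple[str, str]:
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return "HIT", entry[1]  # served from cache, no inference
        response = run_inference()  # cache miss: the model is called
        self._entries[key] = (now, response)
        return "MISS", response


inference_calls = []


def fake_inference() -> str:
    inference_calls.append(1)  # count how often the "model" runs
    return "Wellington"


cache = TtlCache(ttl_seconds=300)
first = cache.fetch("prompt-key", now=0.0, run_inference=fake_inference)
second = cache.fetch("prompt-key", now=10.0, run_inference=fake_inference)
third = cache.fetch("prompt-key", now=301.0, run_inference=fake_inference)

print(first[0], second[0], third[0])  # MISS HIT MISS
print(len(inference_calls))           # 2
```

The third request misses because it arrives after the 300-second TTL from Step 1 has elapsed, so inference runs again.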

[Screenshot: Cache HIT in Explorer app]


Step 3: Verify Cache HIT in Logs

  1. Navigate to AI > AI Gateway > bootcamp-lab > Logs
  2. Find the two requests you just sent
  3. Compare them:

Field  | First Request      | Second Request
------ | ------------------ | -----------------
Status | Success            | Cache
Cost   | $0.000x            | $0.00
Tokens | Token counts shown | Same token counts

The second request shows status Cache (with a hard drive icon), confirming it was served from the AI Gateway cache.

[Screenshot: Cache HIT in Logs]


Step 4: Disable Caching & Enable Rate Limiting

Before testing rate limiting, disable caching so each request counts against the rate limit, then configure the rate limit.

  1. Navigate to AI > AI Gateway > bootcamp-lab > Settings
  2. Under Cache Responses, toggle to OFF
Why disable caching first?

Cached responses are served directly from the gateway without counting against the rate limit. If caching is still enabled from the previous exercise, the "Send 10x Rapid" button will serve most requests from cache and the rate limit will never trigger. Disable caching to ensure each request reaches the model and counts toward the quota.

In production, caching and rate limiting work together — cached responses are fast and free, while rate limiting controls the volume of new requests that reach the model.
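
The interaction can be simulated in a few lines. This sketch assumes that all requests arrive within one rate-limit window and that the rapid-send button repeats the same prompt; the function and its outcome labels are hypothetical.

```python
from typing import List


def simulate(prompts: List[str], limit: int, caching_enabled: bool) -> List[str]:
    """Return one outcome per request, all sent inside a single window."""
    cached = set()
    counted = 0  # requests that counted against the rate limit
    outcomes = []
    for prompt in prompts:
        if caching_enabled and prompt in cached:
            outcomes.append("HIT")  # served from cache, not counted
            continue
        if counted >= limit:
            outcomes.append("429")  # rejected before reaching the model
            continue
        counted += 1
        if caching_enabled:
            cached.add(prompt)
        outcomes.append("MISS" if caching_enabled else "200")
    return outcomes


same_prompt = ["What is the capital of New Zealand?"] * 10
with_cache = simulate(same_prompt, limit=5, caching_enabled=True)
without_cache = simulate(same_prompt, limit=5, caching_enabled=False)

print(with_cache.count("429"))     # 0
print(without_cache.count("429"))  # 5
```

With caching on, only the first request counts toward the quota, so the limit of 5 is never reached. With caching off, requests 6 through 10 are rejected.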

  3. Under Rate Limiting, configure:

Field       | Value
----------- | --------------
Requests    | 5
Time window | 60 seconds
Technique   | Sliding window

[Screenshot: Rate limiting settings configured]

  4. Click Save
Fixed vs. sliding window
  • Fixed window: resets every N seconds on a clock boundary. You could send 5 requests at 12:00:59 and 5 more at 12:01:01.
  • Sliding window: evaluates the last N seconds from each request. More consistent enforcement — use this for the lab.
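
The difference between the two techniques shows up clearly in a small simulation of the boundary burst described above. Both classes are illustrative sketches: the sliding variant is a simple timestamp log, which may differ from how the gateway implements its sliding window.

```python
from collections import deque


class FixedWindowLimiter:
    """Counter that resets on each clock-aligned window boundary."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit, self.window = limit, window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_index = int(now // self.window)
        if window_index != self.current_window:
            self.current_window, self.count = window_index, 0  # new window, reset
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Keeps timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit, self.window = limit, window
        self.accepted = deque()

    def allow(self, now: float) -> bool:
        while self.accepted and now - self.accepted[0] >= self.window:
            self.accepted.popleft()  # drop requests older than the window
        if len(self.accepted) >= self.limit:
            return False
        self.accepted.append(now)
        return True


# Seconds since 12:00:00: five requests at 12:00:59, five more at 12:01:01.
burst = [59.0] * 5 + [61.0] * 5

fixed = FixedWindowLimiter(limit=5, window=60)
sliding = SlidingWindowLimiter(limit=5, window=60)
fixed_allowed = sum(fixed.allow(t) for t in burst)
sliding_allowed = sum(sliding.allow(t) for t in burst)

print(fixed_allowed)    # 10
print(sliding_allowed)  # 5
```

The fixed window lets all ten requests through because the counter resets at 12:01:00, while the sliding window still sees the first five inside the last 60 seconds.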

Step 5: Trigger the Rate Limit

  1. In the Explorer app, click Send 10x Rapid
  2. Watch the results panel

Expected Result

The first 5 requests succeed (status 200). Starting from request 6, you see 429 Too Many Requests.


Step 6: Verify Rate Limit in Logs

  1. Go to Logs
  2. Find the rate-limited requests — they show Error status
  3. Click to expand a rate-limited entry:

Field         | What You Should See
------------- | ------------------------------
Status        | Error
Error message | Rate limited
Tokens        | 0 (no inference was performed)
Cost          | $0.00 (no cost incurred)

[Screenshot: Rate limited log entry]

Tip: Rate-limited requests consume zero tokens and incur zero cost. The request is rejected before it reaches the AI model.


Step 7: Create a Global Spend Limit

Spend limits track actual dollar cost — not request counts — and block requests when the budget is exhausted.

Info: For now, keep the rule simple and global so that it applies across all providers, models, and users.

  1. Navigate to AI > AI Gateway > bootcamp-lab > Settings
  2. First, toggle OFF the rate limit so it does not interfere with spend limit testing
  3. Toggle ON Spend Limits, then click Add Rule
  4. Configure:

Field            | Value
---------------- | --------------------------------------
Budget           | $1
Time window      | 1 hour
Window technique | Sliding window
Dimensions       | None (global: applies to all requests)

[Screenshot: Spend limit rule creation form]

  5. Click Save

Step 8: Test the Global Spend Limit

Speed up budget testing with Custom Cost

Workers AI costs are fractions of a cent per request. To trigger spend limits quickly, enable the Custom Cost toggle in the Explorer app (Token In: $0.20, Token Out: $0.20). This overrides the per-token pricing so each request "costs" ~$50 in the gateway's accounting. See the Custom Cost Override section in the Dynamic Routing page for details on this feature and its real-world use cases.

Disable the toggle after testing.
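
As a worked example of where the ~$50 figure comes from, the sketch below multiplies token counts by the overridden per-token prices. The token counts (30 in, 220 out) are hypothetical; any request totalling about 250 tokens gives the same result.

```python
def request_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Cost of one request when pricing is per token."""
    return tokens_in * price_in + tokens_out * price_out


# Custom Cost values from the Explorer app: $0.20 per input and per output token.
cost = request_cost(tokens_in=30, tokens_out=220, price_in=0.20, price_out=0.20)

print(round(cost, 2))  # 50.0
```

At these prices a single short exchange is far above a $1 budget, which is why one request is enough to exhaust the limit.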

  1. In the Explorer app, enable the Custom Cost toggle (Token In: $0.20, Token Out: $0.20)
  2. Send a prompt. The inflated cost will exceed the budget configured in Step 7 in a single request
  3. Send a second prompt
  4. Observe: the second request returns 429

[Screenshot: Spend limit blocked request]

  5. Disable the Custom Cost toggle after testing

Expected Result

The first request succeeds (with inflated cost shown in logs). The second request is blocked with 429 because the spend limit has been exceeded.


Step 9: Verify Spend Limit in Logs

  1. Go to Logs and find the blocked request

[Screenshot: Logs showing spend limit blocked request]

  2. Switch to the Analytics tab to compare the cumulative cost graph with the spend limit configured in Step 7

Step 10: Create a Per-User Spend Limit

The global limit blocks everyone when the total budget runs out. A per-user limit gives each user their own budget.

  1. Go to Spend Limits settings
  2. Edit the existing rule or Add a new rule:

Field       | Value
----------- | -------------------------------
Budget      | $1
Time window | 1 hour
Dimensions  | metadata.user_id (Split by key)

Info: Split by key gives each distinct user_id value its own spend counter.

[Screenshot: Spend limit rule with per-user dimension]

  3. Click Save

Step 11: Test Per-User Budget Isolation

  1. In the Explorer app, enable the Custom Cost toggle if not already enabled (Token In: $0.20, Token Out: $0.20)
  2. Set User ID to user-anna
  3. Send multiple prompts until Anna's budget is exhausted
  4. Now switch User ID to user-bob
  5. Send a prompt

Expected Result

  • user-anna → blocked (429) — her budget is exhausted
  • user-bob → succeeds: he has his own fresh budget counter, and user-anna's requests do not count against it

Step 12: Verify Per-User Budget Isolation in Logs

  1. Go to Logs
  2. If the log filter supports metadata, filter by user_id:
    • user-anna requests: mix of Success and Error (blocked after budget)
    • user-bob requests: Success, even after Anna's budget is exhausted
The CTO question answered

This is the answer to: "How do I give each team a budget without one team burning through everyone else's allocation?" Per-user (or per-team, per-app) spend limits with the Split by key dimension. Each dimension value gets its own independent budget bucket.
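
The independent-bucket idea can be sketched as a dictionary of counters keyed by the metadata value. This simplified version ignores the time window (it assumes all requests fall within the same hour) and puts requests without a user_id into a shared empty-string bucket; both choices are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict


class SplitByKeySpendLimit:
    """One spend counter per distinct metadata.user_id value."""

    def __init__(self, budget: float) -> None:
        self.budget = budget
        self.spent: Dict[str, float] = defaultdict(float)

    def handle(self, metadata: Dict[str, str], cost: float) -> int:
        key = metadata.get("user_id", "")  # assumption: missing IDs share one bucket
        if self.spent[key] >= self.budget:
            return 429  # this user's bucket is exhausted
        self.spent[key] += cost
        return 200


limit = SplitByKeySpendLimit(budget=1.00)
anna = {"user_id": "user-anna"}
bob = {"user_id": "user-bob"}

results = [
    limit.handle(anna, cost=50.0),
    limit.handle(anna, cost=50.0),
    limit.handle(bob, cost=50.0),
]

print(results)  # [200, 429, 200]
```

Anna's second request is blocked by her own counter, while Bob's first request is evaluated against a separate counter that starts at zero.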


Step 13: Review Everything in Logs (Module Recap)

You have now configured every AI Gateway feature across this module. Take a moment to review the full log history.

  1. Go to Logs and remove all filters to see everything
  2. Scan the Status column — you should see a mix of:

Status  | Cause                         | Module Page
------- | ----------------------------- | ----------------------
Success | Normal requests               | All pages
Cache   | Served from cache             | This page (Step 2)
Error   | Rate limited (code 2003)      | This page (Step 5)
Error   | Spend limit exceeded (429)    | This page (Steps 8–11)
Error   | Guardrail blocked (code 2016) | Page 2
Error   | DLP blocked                   | Page 2

  3. Click through several entries to see:
    • Guardrail blocks with safety category details
    • DLP blocks with matched profile information
    • Dynamic routing model switches (70B → 1B)
    • Cache HITs at $0 cost
    • Rate-limited requests at $0 cost
    • Per-user metadata on every request
  4. Switch to Analytics to see the full picture:

Metric         | What It Shows
-------------- | ------------------------------------------------------------------
Requests       | Total requests sent across the module
Tokens         | Token usage by model
Costs          | Total spend with cache savings visible
Cache hit rate | Percentage of requests served from cache
Errors         | Breakdown of rate-limited, spend-limited, guardrail, and DLP blocks
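
For intuition about how these metrics relate to individual log entries, the sketch below derives them from a handful of hand-written records. The record shape, the cost values, and the choice to divide cache hits by all requests (including errors) are assumptions; the dashboard may compute its figures differently.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical log entries; statuses mirror the Status column in Logs.
logs: List[Dict] = [
    {"status": "Success", "cost": 0.0004},
    {"status": "Cache", "cost": 0.0},
    {"status": "Error", "cost": 0.0},
    {"status": "Success", "cost": 0.0004},
]


def summarize(entries: List[Dict]) -> Dict[str, float]:
    """Aggregate request count, cache hit rate, error count, and total cost."""
    counts = Counter(entry["status"] for entry in entries)
    total = len(entries)
    return {
        "requests": total,
        "cache_hit_rate": counts["Cache"] / total if total else 0.0,
        "errors": counts["Error"],
        "cost": sum(entry["cost"] for entry in entries),
    }


summary = summarize(logs)
print(summary["requests"])        # 4
print(summary["cache_hit_rate"])  # 0.25
print(summary["errors"])          # 1
```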

Customer Talk Track

"How do I prevent a runaway AI bill? Three layers. Caching eliminates cost on repeat queries — same question, same answer, zero cost. Rate limiting caps throughput so a burst of traffic cannot blow through your budget in seconds. And spend limits set per-user or per-team dollar budgets that enforce automatically. All configured in the dashboard, no code changes, and every decision is logged."


Congratulations

You have completed the AI Gateway Operations module. Here is everything you built:

Feature | What It Does | How You Configured It
--- | --- | ---
Gateway + Logs | Full observability of every AI call | Created gateway, reviewed logs with prompt/response/tokens/cost/metadata
Guardrails | Block harmful content and prompt injection | Llama Guard 3: block violence, hate, injection; flag others
DLP | Block PII in prompts | Financial Information profile with block action on requests
Dynamic Routing | Conditional model selection, budget fallback | Route: admin → 70B (with budget fallback), default → 3B. A/B percentage routing available via API (beta).
Caching | Serve repeat prompts at $0 | Enabled with 300s TTL
Rate Limiting | Cap request throughput | 5 req / 60s sliding window
Spend Limits | Per-user dollar budgets | Hourly budget split by metadata.user_id

Architecture

Partner (Explorer App)
│
├── metadata: { user_id: "user-anna" }
│
▼
AI Gateway (bootcamp-lab)
├── Guardrails ──── block harmful prompts + prompt injection (Llama Guard 3)
├── DLP ─────────── block PII in prompts (Financial Info profile)
├── Cache ────────── serve repeat prompts at $0
├── Rate Limit ──── cap request throughput
├── Spend Limit ─── per-user dollar budgets
└── Dynamic Route ─ smart-route
    ├── user-admin → Budget($0.01) → 70B → fallback 1B
    └── default ───→ 1B model
│
▼
Workers AI (inference)

Validation

  • Caching enabled — cache HIT observed on duplicate prompt with $0 cost
  • Cache status visible in logs (Success vs Cache)
  • Rate limiting configured — 429 triggered after exceeding 5 requests/minute
  • Rate-limited requests show error code 2003 in logs with $0 cost
  • Global spend limit configured: requests blocked when the budget is exhausted
  • Per-user spend limit configured: user-anna blocked while user-bob still succeeds
  • Per-user budget isolation visible in logs
  • Combined log view reviewed — all status types visible (Success, Cache, Error)
  • Analytics dashboard reviewed — requests, tokens, costs, cache hit rate
  • Can explain the three cost control layers (caching, rate limiting, spend limits) to a customer

Troubleshooting

Cache always shows MISS
  • The prompt text must be identical — even a trailing space creates a different cache key
  • Use the Send Same Again button to ensure exact same prompt
  • Verify caching is enabled in gateway settings
  • Check that the TTL has not expired (300s as set in Step 1)
  • Different User IDs with the same prompt will still produce the same cache key (metadata is not part of the cache key)
Rate limit not triggering
  • The window may be too large — ensure it is set to 60 seconds
  • Cached responses do not count against the rate limit: disable caching or send unique prompts
  • Verify you selected Sliding window technique
  • Try reducing the limit to 3 requests to make it easier to trigger
Spend limit not enforcing
  • Spend limit tracking is eventually consistent — a burst of concurrent requests may briefly exceed the limit
  • The budget may not be reached yet: send more unique prompts (cached prompts cost $0) or enable Custom Cost as in Step 8
  • Verify the rule was saved
  • Check that the time window has not already reset
Per-user spend limit not isolating users
  • Verify the dimension is set to metadata.user_id with Split by key
  • Ensure the User ID field in the Explorer app is set to different values for each user
  • Check the logs — both users should have distinct user_id metadata values