rate limits, quotas, and spend controls for llm and agentic apps
once a request is identified, transformed, and authorized, the next governance question is:
how much can this user (or org, or agent) consume?
limitabl answers that question.
it enforces per-user, per-tenant, per-model, and per-provider limits, ensuring predictable costs, fair usage, and controlled spend across your entire llm ecosystem.
limitabl is the usage, rate-limit, and spend-governance layer of gatewaystack.
📦 implementation:
ai-rate-limit-gateway + ai-cost-gateway (roadmap)
as llm adoption increases, so do cost overruns, unexpected spikes, and unbounded agent behavior.
organizations need granular control over usage and spend. limitabl delivers exactly that.
all gatewaystack modules operate on a shared RequestContext object.
limitabl operates in two phases:
phase 1 (pre-flight):
reads: identity, modelRequest, policyDecision (optional)
writes: limitsDecision, the constraints and effects applied before execution (ok | throttle | deny | fallback | degrade)
phase 2 (post-execution):
writes: usage, the actual tokens, cost, and latency for the call; updates quota/budget stores; emits usage events
limitabl runs immediately before llm execution (pre-flight checks) and again after the provider responds (usage accounting).
in the pre-flight phase, limitabl evaluates rate limits, quotas, budgets, and risk-cost rules.
if a limit is exceeded, limitabl returns a structured governance decision (deny, throttle, fallback, or degrade).
in the post-execution phase, limitabl records actual usage and updates internal counters and budgets.
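The two phases can be sketched against a minimal RequestContext. The field names below follow the fields named in this document (identity, modelRequest, policyDecision, limitsDecision, usage), but the class shapes and function signatures are illustrative assumptions, not the gatewaystack API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LimitsDecision:
    effect: str          # "ok" | "throttle" | "deny" | "fallback" | "degrade"
    reason: str = ""

@dataclass
class Usage:
    tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0

@dataclass
class RequestContext:
    identity: str
    model_request: dict
    policy_decision: Optional[str] = None             # optional input from validatabl
    limits_decision: Optional[LimitsDecision] = None  # written in phase 1
    usage: Optional[Usage] = None                     # written in phase 2

def pre_flight(ctx: RequestContext) -> RequestContext:
    # phase 1: evaluate limits before the provider call and
    # attach a structured decision for downstream modules
    ctx.limits_decision = LimitsDecision(effect="ok")
    return ctx

def account(ctx: RequestContext, tokens: int, cost: float, latency: float) -> RequestContext:
    # phase 2: record actual usage after the provider responds
    ctx.usage = Usage(tokens=tokens, cost_usd=cost, latency_ms=latency)
    return ctx
```

In a real deployment, `pre_flight` would consult the quota and budget stores before choosing an effect; here it always returns "ok" for brevity.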
1. checkRateLimit β enforce per-user and per-tenant rates
requests per second/minute/hour, sliding windows, token buckets.
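A token bucket, one of the strategies named above, fits in a few lines. This is a minimal single-process sketch (class and parameter names are illustrative); distributed enforcement is covered later in this document:

```python
import time

class TokenBucket:
    """allow roughly `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice you would keep one bucket per (user, tenant) key so per-user and per-tenant rates are enforced independently.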
2. checkQuota β validate remaining quota
daily, weekly, monthly usage ceilings.
3. checkBudget β enforce cost ceilings
budget ceilings per user, tenant, environment, or provider.
4. checkRiskCost β risk-based spend rules
for example, only certain scopes can call expensive reasoning models.
5. applyThrottling β delay or reject when limits are hit
adaptive or static throttling policies.
6. fallbackProvider β reroute to cheaper models
automatic degradation modes when quotas or budgets are reached.
7. emitUsageEvents β log usage for observability
produces structured records for explicabl's audit pipeline.
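Taken together, the pre-flight checks above collapse into a short decision chain. A minimal sketch, assuming the caller has already computed rate status, remaining quota and budget, and per-request estimates (all names here are illustrative):

```python
def pre_flight_decision(rate_ok: bool,
                        quota_remaining: int, est_tokens: int,
                        budget_remaining: float, est_cost: float) -> str:
    """return one of the structured effects: ok | throttle | deny | fallback."""
    if not rate_ok:
        return "throttle"        # applyThrottling: delay or reject
    if quota_remaining < est_tokens:
        return "deny"            # checkQuota: hard usage ceiling reached
    if budget_remaining < est_cost:
        return "fallback"        # fallbackProvider: reroute to a cheaper model
    return "ok"
```

The ordering is a design choice: rate violations are transient (throttle), exhausted quotas are hard stops (deny), and budget pressure degrades gracefully to a cheaper provider (fallback).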
limitabl writes the limitsDecision and usage fields in RequestContext, sitting downstream of identifiabl, transformabl, and validatabl, and upstream of proxyabl and explicabl.
limitabl's two-phase design ensures both preventive controls and accurate accounting:
phase 1: pre-flight checks (before routing) feed proxyabl via limitsDecision
phase 2: usage accounting (after execution) feeds explicabl via usage in RequestContext
limits are defined hierarchically:
```yaml
limits:
  global:
    rate: 10000/min
    budget: $5000/day
  organizations:
    org_healthcare:
      rate: 1000/min
      budget: $500/day
  models:
    gpt-4:
      quota: 100000 tokens/day
      budget: $200/day
  users:
    user_doctor_123:
      rate: 100/min
      budget: $50/day
```
precedence: user limits override org limits, which override global limits.
enforcement: the most restrictive limit applies.
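The most-restrictive rule can be illustrated with a small resolver. The per-level numbers come from the hierarchy above; the function itself is an assumption about how resolution might work, not the shipped implementation:

```python
def effective_limit(*levels: dict) -> dict:
    """merge global -> org -> user limits; the most restrictive value wins."""
    merged: dict = {}
    for level in levels:
        for key, value in level.items():
            # taking the minimum enforces the tightest ceiling seen at any level
            merged[key] = min(merged.get(key, value), value)
    return merged

global_limits = {"rate_per_min": 10000, "budget_per_day": 5000.0}
org_limits    = {"rate_per_min": 1000,  "budget_per_day": 500.0}
user_limits   = {"rate_per_min": 100,   "budget_per_day": 50.0}
```

With these values the user-level ceiling (100/min, $50/day) is the binding one, matching the precedence and enforcement rules stated above.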
limitabl prevents runaway agentic behavior:
```yaml
agent_protection:
  max_tool_calls_per_workflow: 20
  max_recursion_depth: 5
  max_workflow_cost: $2.00
  max_workflow_duration: 120s
  duplicate_tool_detection:
    enabled: true
    threshold: 3  # same tool with same params
```
example: an agent enters an infinite web_search loop. after 10 identical calls, limitabl terminates the workflow and returns an error.
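The duplicate-detection rule in the config above (same tool, same params, threshold 3) can be sketched like this; the class is illustrative, not the shipped implementation:

```python
from collections import Counter
import json

class LoopGuard:
    """stop a workflow when the same tool call repeats or total calls run away."""
    def __init__(self, threshold: int = 3, max_tool_calls: int = 20):
        self.threshold = threshold
        self.max_tool_calls = max_tool_calls
        self.calls: Counter = Counter()
        self.total = 0

    def record(self, tool: str, params: dict) -> bool:
        """return True if the workflow may continue, False if it must stop."""
        self.total += 1
        # canonicalize params so {"a":1,"b":2} and {"b":2,"a":1} count as the same call
        key = (tool, json.dumps(params, sort_keys=True))
        self.calls[key] += 1
        if self.calls[key] >= self.threshold:
            return False  # duplicate_tool_detection tripped
        if self.total > self.max_tool_calls:
            return False  # max_tool_calls_per_workflow exceeded
        return True
```

In the web_search example above, the third identical call would return False and the workflow would be terminated with an error.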
in multi-instance deployments, limitabl uses shared state:
```yaml
storage:
  backend: "redis"
  cluster:
    - "redis://primary:6379"
    - "redis://replica-1:6379"
  consistency: "strong"
  ttl: "3600s"
```
rate limits are enforced across all gatewaystack instances with strong consistency guarantees.
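A common way to enforce a shared limit is an atomic INCR on a per-window key with a TTL. This sketch swaps a tiny in-memory store for Redis so it runs standalone; the `incr`/`expire` calls are meant to map onto the Redis commands of the same name, and the fixed-window scheme here is a simplification of what a production limiter would use:

```python
import time

class MemoryStore:
    """stand-in for Redis: supports INCR and EXPIRE semantics."""
    def __init__(self):
        self.data = {}  # key -> (value, expires_at)

    def incr(self, key: str) -> int:
        value, expires = self.data.get(key, (0, None))
        if expires is not None and time.monotonic() > expires:
            value = 0  # key expired: restart the count
        value += 1
        self.data[key] = (value, expires)
        return value

    def expire(self, key: str, ttl: float) -> None:
        value, _ = self.data.get(key, (0, None))
        self.data[key] = (value, time.monotonic() + ttl)

def allow(store, user: str, limit: int, window_s: int = 60) -> bool:
    """fixed-window limiter: at most `limit` requests per user per window."""
    window = int(time.time() // window_s)
    key = f"rate:{user}:{window}"
    count = store.incr(key)
    if count == 1:
        store.expire(key, window_s)  # first hit in the window sets its TTL
    return count <= limit
```

Because INCR is atomic in Redis, every gatewaystack instance sharing the cluster sees the same count, which is what makes cross-instance enforcement possible.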
user
  ↓ identifiabl (who is calling?)
  ↓ transformabl (prepare, clean, classify, anonymize)
  ↓ validatabl (is this allowed?)
  ↓ limitabl (how much can they use? pre-flight constraints)
  ↓ proxyabl (where does it go? execute)
  ↓ llm provider (model call)
  ↓ [limitabl] (deduct actual usage, update quotas/budgets)
  ↓ explicabl (what happened?)
  ↓ response
limitabl enforces predictable usage, stable cost, and controlled access, preventing abuse, overspend, and runaway agents.
limitabl plugs into gatewaystack and your existing llm stack without requiring application-level changes, exposing http middleware and sdk hooks for enforcement and accounting.
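As an illustration of the middleware shape (not the shipped API), a pre-flight/accounting wrapper around a request handler might look like the following; the `CountingLimiter` is a toy stand-in for the real limit stores:

```python
import time

class CountingLimiter:
    """toy limiter: a fixed per-user request allowance plus a usage ledger."""
    def __init__(self, max_requests: int):
        self.max_requests = max_requests
        self.counts: dict = {}
        self.usage: dict = {}

    def allow(self, user: str) -> bool:
        self.counts[user] = self.counts.get(user, 0) + 1
        return self.counts[user] <= self.max_requests

    def record_usage(self, user: str, tokens: int, latency_ms: float) -> None:
        totals = self.usage.setdefault(user, {"tokens": 0, "latency_ms": 0.0})
        totals["tokens"] += tokens
        totals["latency_ms"] += latency_ms

def limit_middleware(handler, limiter: CountingLimiter):
    """wrap a handler: pre-flight check before the call, accounting after it."""
    def wrapped(request: dict) -> dict:
        user = request.get("user", "anonymous")
        if not limiter.allow(user):                 # phase 1: pre-flight
            return {"status": 429, "body": "rate limit exceeded"}
        start = time.monotonic()
        response = handler(request)                 # the actual llm call
        limiter.record_usage(user,                  # phase 2: accounting
                             tokens=response.get("tokens", 0),
                             latency_ms=(time.monotonic() - start) * 1000)
        return response
    return wrapped
```

The same two-phase split shown earlier appears here in miniature: the wrapper rejects before execution and only deducts actual usage after the provider responds.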
for limit configuration examples:
→ rate limit patterns
→ budget configuration guide
→ agent loop protection
for implementation:
→ integration guide
want to explore the full gatewaystack architecture?
→ view the gatewaystack github repo
want to contact us for enterprise deployments?
→ reducibl applied ai studio
every request flows from your app through gatewaystack's modules before it reaches an llm provider: identified, transformed, validated, constrained, routed, and audited.