Launching AIgateway: 150 AI Models Behind One Key
Every AI startup I’ve built (and I’ve built twenty-seven) rebuilds the same plumbing.
OpenAI SDK here. Anthropic SDK there. Gemini’s own format. Groq, Together, Fireworks, Replicate, Cerebras, DeepSeek, Moonshot. Each with its own auth scheme, retry semantics, streaming protocol, rate-limit behavior, and pricing math.
I paid that integration tax thirty times. That’s thirty wheels, re-invented, all slightly out of round. Never again.
Today I’m launching AIgateway: one API, one key, one bill for every major AI provider. 150+ models across text, image, audio, video, vision, and embeddings. Edge-native on Cloudflare Workers. OpenAI-compatible at the wire level, so it drops into code you’ve already written.
And until April 30, Kimi K2.6 is free through AIgateway. Zero cost, zero commitment. Bench it against your current stack and see.
- 150+ models behind one key
- ~15ms gateway overhead
- 300+ edge cities
- Free Kimi K2.6 through Apr 30
Why: the thesis behind aggregation
Models are commoditizing faster than any technology wave in history. GPT-4 class intelligence went from $60 per million tokens to $0.15 in eighteen months. A 400x collapse. Claude, Gemini, Llama, Qwen, DeepSeek, Kimi. Every new release pushes the frontier forward and the price curve down.
When the input to your product is a commodity, the aggregation layer wins. This is not a hunch. It’s the pattern of every prior infrastructure wave.
[Table: prior infrastructure waves vs. this wave]
Every serious AI app I’ve shipped now sits on three to eight models. Reasoning on one. Cheap classification on another. Vision on a third. TTS on a fourth. Embeddings on a fifth. The model-agnostic app is the default, not the exception.
That means the gateway is not a nice-to-have. It is the control plane for the AI era.
The motivation: thirty times, same plumbing
I want to be specific about what “integration tax” actually means, because the word “just” does a lot of dishonest work when founders talk about AI integrations.
What “just integrate OpenAI” actually costs per project:

- Auth scheme and key management
- Retry semantics and backoff
- Streaming protocol handling
- Rate-limit behavior and 429 handling
- Pricing math and usage tracking
Multiply that by five providers (because no serious app uses only one) and you’re a week of solo work into plumbing before you’ve written a line of product code. Now multiply by thirty projects. That’s my personal number. It is absurd.
The bitterest part: every version of this plumbing is slightly worse than the last one the builder wrote, because they’re doing it in a hurry, late at night, trying to ship a feature. This is how you end up with production AI apps that silently swallow 429s, leak tokens in logs, double-bill users on retries, or fail silently when a provider has an outage.
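To make “plumbing” concrete, here’s a sketch of the retry wrapper every project ends up rewriting: one that actually respects Retry-After on 429s instead of swallowing them. Illustrative only; the names are mine, not AIgateway APIs.

```ts
// A sketch of the retry plumbing every project rebuilds. Illustrative only.
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  maxRetries = 3,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    // Retry only on throttles and transient server errors.
    if (res.status !== 429 && res.status < 500) return res;
    // Out of retries: surface the failure instead of swallowing it.
    if (attempt >= maxRetries) return res;
    // Respect the provider's Retry-After header when present,
    // otherwise fall back to exponential backoff with jitter.
    const retryAfter = Number(res.headers.get("retry-after"));
    const delayMs =
      Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : 2 ** attempt * 500 + Math.random() * 250;
    await new Promise((r) => setTimeout(r, delayMs));
  }
}
```

And that’s just retries. Streaming, rate-limit accounting, and cost math each need their own version of this, per provider.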
AIgateway is what I wish I’d had at project one.
What: the API surface
The design principle is boring on purpose: OpenAI-compatible at the wire level.
You already have an OpenAI client in your codebase. Point it at AIgateway. Change a model string. Done.
```ts
import OpenAI from "openai";

const ai = new OpenAI({
  apiKey: process.env.AIGATEWAY_KEY,
  baseURL: "https://api.aigateway.sh/v1",
});

// GPT-4o
const gpt = await ai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});

// Swap to Claude by changing the string. Nothing else changes.
const claude = await ai.chat.completions.create({
  model: "anthropic/claude-sonnet-4.5",
  messages: [{ role: "user", content: "Hello" }],
});

// Swap to Kimi (free until Apr 30). Still no code change.
const kimi = await ai.chat.completions.create({
  model: "moonshot/kimi-k2.6",
  messages: [{ role: "user", content: "Hello" }],
});
```
Same for streaming:
```ts
const stream = await ai.chat.completions.create({
  model: "google/gemini-2.5-pro",
  messages: [{ role: "user", content: "Write me a haiku." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```
Same for vision, embeddings, image generation, audio transcription, text-to-speech, and video. One interface, every modality, every provider.
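One concrete example: because the wire format is OpenAI-compatible, embeddings reuse the same `ai` client from above. The model string below is one plausible choice, not a guarantee of what’s live.

```ts
// Embeddings through the same client. Model string is illustrative.
const embedding = await ai.embeddings.create({
  model: "openai/text-embedding-3-small",
  input: "One key, every model.",
});
console.log(embedding.data[0].embedding.length);
```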
Providers unified under one roof
Text / reasoning: OpenAI · Anthropic · Google · Meta · Mistral · xAI · DeepSeek · Moonshot · Qwen · Cohere · Perplexity · Nous

Inference hosts: Groq · Cerebras · Together · Fireworks · Replicate · Hugging Face · Lepton · OpenRouter

Modalities: Text · Vision · Image gen · Audio (ASR + TTS) · Video · Embeddings · Classification · Translation
New providers land weekly. The promise is simple: if a model exists and has an API, within a week it exists on AIgateway under the same interface.
How: the architecture at the edge
The hardest decision I made early was not building this on a traditional server.
Every existing “AI gateway” runs on Kubernetes or AWS Fargate in one region. That works until your users are in Mumbai and your gateway is in Virginia. At that point, every call pays for a transatlantic hop twice: once to your gateway, once to the model provider the gateway forwards to.
AIgateway runs on Cloudflare Workers, deployed to 300+ cities. The gateway sits in the city nearest your user. The counter-intuitive outcome: calling OpenAI through AIgateway is often faster than calling OpenAI directly from your own backend. Because your backend is in one region. We are in every region.
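Don’t take that on faith. Here’s a rough sketch for benchmarking it yourself, from wherever your backend actually runs; results will depend heavily on your region.

```ts
// Rough round-trip timing: gateway vs. direct. Run from your backend's region.
import OpenAI from "openai";

const viaGateway = new OpenAI({
  apiKey: process.env.AIGATEWAY_KEY,
  baseURL: "https://api.aigateway.sh/v1",
});
const direct = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function timeIt(name: string, client: OpenAI, model: string) {
  const start = performance.now();
  await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: "ping" }],
  });
  console.log(`${name}: ${(performance.now() - start).toFixed(0)}ms`);
}

await timeIt("gateway", viaGateway, "openai/gpt-4o");
await timeIt("direct", direct, "gpt-4o");
```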
[Diagram: request flow]
The stack
| Layer | Tech | Why |
|---|---|---|
| Runtime | Cloudflare Workers | 300+ cities, sub-5ms cold starts, no container tax |
| Router | Hono | ~10kb, zero-dep, streams-first, TypeScript native |
| Rate limits | KV + Durable Objects | Eventually-consistent KV for cheap reads, DO for per-key atomic counters |
| Usage logs | D1 | SQLite at the edge; logged via waitUntil so billing never blocks the response |
| Media payloads | R2 | Zero egress. Image, audio, and video responses stream through without an S3 bandwidth bill |
| Streaming state | Durable Objects | Per-session SSE state that survives provider flaps and failovers mid-stream |
| Observability | Workers Analytics + Logpush | p50/p95/p99 per model per region, piped to R2 for long-term query |
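To make the “billing never blocks the response” row concrete, here’s a minimal sketch of the Workers pattern. This is not AIgateway’s actual source; the table schema and env bindings are mine. The idea: proxy the call, then do the usage write inside ctx.waitUntil so it happens off the response path.

```ts
// Sketch of the waitUntil pattern: respond first, account for usage after.
interface Env {
  DB: D1Database;
}

export default {
  async fetch(req: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const started = Date.now();
    const upstream = await fetch("https://api.openai.com/v1/chat/completions", req);

    // Log only a key suffix; never write full credentials to storage.
    const keySuffix = req.headers.get("authorization")?.slice(-6) ?? "anon";

    // The usage write rides on waitUntil: it runs after the response
    // has started flowing back, so billing never adds latency.
    ctx.waitUntil(
      env.DB.prepare(
        "INSERT INTO usage (key_suffix, status, latency_ms) VALUES (?, ?, ?)",
      )
        .bind(keySuffix, upstream.status, Date.now() - started)
        .run(),
    );

    return upstream;
  },
};
```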
The part I’m proudest of: fallback routing
Every model is scored continuously on three signals: latency p95, error rate, and throttle signals (429s, upstream timeouts, circuit trips). Scores update per request, rolling one-minute window.
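For intuition, here’s a sketch of what a rolling per-model score can look like. This is illustrative, not the production scorer, and the weights are arbitrary.

```ts
// Illustrative health score over a rolling one-minute window. Lower is healthier.
type Sample = { at: number; latencyMs: number; error: boolean; throttled: boolean };

class ModelHealth {
  private samples: Sample[] = [];

  record(s: Sample) {
    this.samples.push(s);
    const cutoff = Date.now() - 60_000; // keep only the last minute
    this.samples = this.samples.filter((x) => x.at >= cutoff);
  }

  score(): number {
    if (this.samples.length === 0) return 0;
    const latencies = this.samples.map((s) => s.latencyMs).sort((a, b) => a - b);
    const p95 = latencies[Math.floor(latencies.length * 0.95)] ?? latencies.at(-1)!;
    const errRate = this.samples.filter((s) => s.error).length / this.samples.length;
    const throttleRate =
      this.samples.filter((s) => s.throttled).length / this.samples.length;
    // Weighted blend of the three signals; weights here are placeholders.
    return p95 / 1000 + errRate * 10 + throttleRate * 5;
  }
}
```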
When a call arrives for openai/gpt-4o and OpenAI is hiccuping, the router can transparently re-route to anthropic/claude-sonnet-4.5 or google/gemini-2.5-pro, matched on capability, context window, and tool-use support. You opt into this per-request via a fallback header, so it never happens without your consent, and every fallback is logged with the exact reason.
```ts
await ai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [...],
}, {
  headers: {
    "x-aigateway-fallback": "anthropic/claude-sonnet-4.5, google/gemini-2.5-pro",
  },
});
```
Your users never see a 503. You never page yourself at 3am because a single provider had a rough night.
Streaming that survives provider flaps
This is the boring-looking feature that took the longest. Streamed responses over SSE feel simple in a happy-path demo. In production, you get:
- Mid-stream failures: the provider dies after emitting 400 tokens of a 2000-token response
- Chunked connection drops: the client keeps the socket open but stops receiving bytes
- Provider framing differences: one backend sends `data:` frames, another sends bare JSON (see the sketch after this list)
- Token-level billing reconciliation: you billed for 400 tokens and need to refund if the stream never completed
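A hedged sketch of normalizing just that framing difference; a real gateway handles many more edge cases than this:

```ts
// Normalize one line of an upstream stream into a JSON payload string,
// or null to skip. Handles both SSE `data:` frames and bare JSON lines.
function normalizeStreamLine(line: string): string | null {
  const trimmed = line.trim();
  if (trimmed === "" || trimmed.startsWith(":")) return null; // SSE comments / keep-alives
  if (trimmed.startsWith("data:")) {
    const payload = trimmed.slice("data:".length).trim();
    return payload === "[DONE]" ? null : payload;
  }
  // Bare JSON line from a non-SSE backend.
  return trimmed.startsWith("{") ? trimmed : null;
}
```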
Durable Objects hold per-session state so we can resume, replay, or fail cleanly, and guarantee accurate billing no matter how the stream died. The client sees one clean SSE stream. We eat the complexity.
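For intuition, a minimal sketch of the per-session idea. The Durable Objects storage API here is real; the routes and schema are mine, not AIgateway’s source. Count tokens as they are relayed, so a dead stream can be reconciled to exactly what was delivered.

```ts
// Sketch: a Durable Object tracking tokens actually delivered per stream session.
export class StreamSession {
  constructor(private state: DurableObjectState) {}

  async fetch(req: Request): Promise<Response> {
    const url = new URL(req.url);
    if (url.pathname === "/tick") {
      // Called as each chunk is relayed to the client.
      const delivered = ((await this.state.storage.get<number>("delivered")) ?? 0) + 1;
      await this.state.storage.put("delivered", delivered);
      return new Response(String(delivered));
    }
    // On stream end (clean or not), billing reads the true delivered count.
    const delivered = (await this.state.storage.get<number>("delivered")) ?? 0;
    return new Response(JSON.stringify({ delivered }));
  }
}
```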
Pricing: transparent by default
The worst pattern in this market is opaque “credits.” You buy 1,000 credits. A GPT-4o call costs 12 credits but a Claude call costs 9 credits but an image gen costs 50 credits, except on Tuesdays when there’s a multiplier. You never actually know what you’re paying for.
AIgateway pricing is the opposite:
[Pricing table: what you pay vs. what you get]
Every response carries two headers you actually want:
```
x-aigateway-cost-usd: 0.00248
x-aigateway-provider: anthropic
```
You can log those to your own observability stack and build per-customer cost attribution in a weekend.
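With the openai-node SDK, `.withResponse()` exposes the raw Response, so you can read those headers on any call:

```ts
// Read the cost headers off any call via the SDK's raw-response escape hatch.
const { data, response } = await ai.chat.completions
  .create({
    model: "anthropic/claude-sonnet-4.5",
    messages: [{ role: "user", content: "Hello" }],
  })
  .withResponse();

console.log({
  costUsd: response.headers.get("x-aigateway-cost-usd"),
  provider: response.headers.get("x-aigateway-provider"),
  id: data.id,
});
```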
The launch offer: free Kimi K2.6 until April 30
Moonshot’s Kimi K2.6 is one of the most interesting frontier models I’ve tested in 2026. 1M context, strong reasoning, and priced aggressively even before this promo.
From now until April 30, every AIgateway account gets unlimited Kimi K2.6 calls, free. No credit card, no caveats, no “pro tier upgrade” after hour one. Benchmark it against whatever you’re running today, on your real prompts, your real users, your real eval set. If it’s better, switch. If it isn’t, the eval still cost you nothing.
How to use the free window
- Sign up at aigateway.sh (no card required)
- Point your existing OpenAI client at `https://api.aigateway.sh/v1` with your AIgateway key
- Set the model string to `moonshot/kimi-k2.6`
- Run your eval suite. Your usage page will show `$0.00` through Apr 30.
The roadmap: what lands in the next 60 days
This is v1. What I’m actively building:
Semantic caching
Embedding-based exact + near-match cache. On chat apps with repeated queries, expect 40–60% cost reduction. Opt-in per request.
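Nothing in this roadmap is built yet, but to make the caching idea concrete, here is one plausible shape of the lookup, fully hypothetical: cosine similarity over prompt embeddings with a near-match threshold.

```ts
// Hypothetical near-match cache lookup: serve a cached answer if a prior
// prompt's embedding is close enough to the new one.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function lookup(
  query: number[],
  cache: { embedding: number[]; answer: string }[],
  threshold = 0.97, // threshold is a made-up starting point, not a tuned value
): string | null {
  let best: { sim: number; answer: string } | null = null;
  for (const entry of cache) {
    const sim = cosine(query, entry.embedding);
    if (sim >= threshold && (!best || sim > best.sim)) {
      best = { sim, answer: entry.answer };
    }
  }
  return best?.answer ?? null;
}
```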
Prompt-level A/B testing
Split a percentage of traffic across two models, log outcomes, let the dashboard surface the winner by cost-per-successful-output.
BYO-key mode
Bring your own provider keys, pay AIgateway only for routing + observability. For regulated industries that can’t have a third party hold the provider key.
Eval-based auto-routing
Define an eval. AIgateway picks the cheapest model that passes it. Every week, re-runs the eval as new models land. Prices fall automatically; quality never regresses.
On-device fallback
For mobile and edge clients: a small local model (via WebGPU or llama.cpp) that takes over when the network is flaky. Same API surface on the wire.
Who this is for
Be honest with yourself about the answer:
Use AIgateway if
- You ship AI features and use more than one model provider
- You’ve copy-pasted retry/streaming code across projects
- You want per-request cost visibility without building a pipeline
- You need fallback routing for production reliability
- Your users are distributed globally and latency matters
Skip AIgateway if
- You use only one model and will only ever use one model
- You’re a research team that needs raw provider quirks
- Compliance mandates a direct contract with each provider
- You already run your own gateway and it works fine (genuinely, don’t switch for the sake of switching)
The bet I’m making
Every infrastructure wave produces one or two gateways that swallow the margin of the layer below them. Stripe for payments. Twilio for telecom. Cloudflare for networking. Plaid for banking.
Intelligence is entering its commodity phase. The gateway is the winning shape.
I don’t know yet if AIgateway will be the gateway. I do know that the problem it solves is real (I felt it thirty times), and that the architecture (edge-native, OpenAI-compatible, fallback-aware) is the right one.
If you’re shipping anything with AI, try it. The downside is an afternoon. The upside is every future model your competitor hasn’t added yet.
Get started
- OpenAI-compatible base URL: `https://api.aigateway.sh/v1`
- Model string for the free window: `moonshot/kimi-k2.6`

DMs are open for feedback. Built solo, shipped fast, still rough in places. That’s the honest version. If you hit a sharp edge, tell me and I’ll fix it that week.
One key. One hundred and fifty models. Zero lock-in.
Let’s see what you build.