
Launching AIgateway: 150 AI Models Behind One Key


Every AI startup I’ve built (and I’ve built twenty-seven) rebuilds the same plumbing.

OpenAI SDK here. Anthropic SDK there. Gemini’s own format. Groq, Together, Fireworks, Replicate, Cerebras, DeepSeek, Moonshot. Each with its own auth scheme, retry semantics, streaming protocol, rate-limit behavior, and pricing math.

I paid that integration tax thirty times. That’s thirty wheels, re-invented, all slightly out of round. Never again.

Today I’m launching AIgateway: one API, one key, one bill for every major AI provider. 150+ models across text, image, audio, video, vision, and embeddings. Edge-native on Cloudflare Workers. OpenAI-compatible at the wire level, so it drops into code you’ve already written.

And until April 30, Kimi K2.6 is free through AIgateway. Zero cost, zero commitment. Bench it against your current stack and see.

  • 150+ models, one key
  • ~15ms gateway overhead
  • 300+ edge cities
  • Free Kimi K2.6 until Apr 30


Why: the thesis behind aggregation

Models are commoditizing faster than any technology wave in history. GPT-4 class intelligence went from $60 per million tokens to $0.15 in eighteen months. A 400x collapse. Claude, Gemini, Llama, Qwen, DeepSeek, Kimi. Every new release pushes the frontier forward and the price curve down.

When the input to your product is a commodity, the aggregation layer wins. This is not a hunch. It’s the pattern of every prior infrastructure wave.

Prior waves

Electricity → grid operators, not generators, captured the margin
Bandwidth → CDNs and ISPs, not fiber installers
Compute → AWS, not Dell
Payments → Stripe, not banks

This wave

Intelligence → the gateway captures the margin, not the lab

The builder you’re selling to already knows they’ll run five models in production. The only question is who charges them once for that.

Every serious AI app I’ve shipped now sits on three to eight models. Reasoning on one. Cheap classification on another. Vision on a third. TTS on a fourth. Embeddings on a fifth. The model-agnostic app is the default, not the exception.

That means the gateway is not a nice-to-have. It is the control plane for the AI era.


The motivation: thirty times, same plumbing

I want to be specific about what “integration tax” actually means, because the word “just” does a lot of dishonest work when founders talk about AI integrations.

What “just integrate OpenAI” actually costs per project

Auth + key rotation + env vars across envs: 2–3 hrs
Streaming (SSE parse, backpressure, abort handling): 4–6 hrs
Retry + backoff + idempotency: 2–4 hrs
Rate-limit handling (429s, reservoirs, per-org caps): 3–5 hrs
Usage logging, cost attribution, pricing math: 4–8 hrs
Fallback routing when a provider 503s: 6–10 hrs
Prompt / response caching layer: 4–8 hrs
Total per product: 25–44 hrs

Multiply that by five providers (because no serious app uses only one) and you’re a month of solo work deep in plumbing before you’ve written a line of product code. Now multiply by thirty projects. That’s my personal number. It is absurd.
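To make the "retry + backoff + idempotency" line concrete, here is a minimal sketch of the plumbing each integration ends up re-growing. The names (`backoffMs`, `withRetry`) and the retry policy are illustrative, not AIgateway internals:

```typescript
// Exponential backoff with a cap: attempt 0 -> 500ms, 1 -> 1s, 2 -> 2s, ...
function backoffMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry a request on 429/5xx, preferring the server's Retry-After hint
// over our own schedule. Pair this with an idempotency key so a retried
// write can't double-bill.
async function withRetry<T>(
  fn: () => Promise<{ status: number; retryAfterMs?: number; body?: T }>,
  maxAttempts = 4,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fn();
    if (res.status < 400) return res.body as T;
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt === maxAttempts - 1) {
      throw new Error(`request failed with status ${res.status}`);
    }
    const delay = res.retryAfterMs ?? backoffMs(attempt);
    await new Promise((r) => setTimeout(r, delay));
  }
  throw new Error("unreachable");
}
```

Thirty lines, and it is maybe a fifth of the table above. That is the tax.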

The bitterest part: every version of this plumbing is slightly worse than the last one the builder wrote, because they’re doing it in a hurry, late at night, trying to ship a feature. This is how you end up with production AI apps that silently swallow 429s, leak tokens in logs, double-bill users on retries, or fail silently when a provider has an outage.

AIgateway is what I wish I’d had at project one.


What: the API surface

The design principle is boring on purpose: OpenAI-compatible at the wire level.

You already have an OpenAI client in your codebase. Point it at AIgateway. Change a model string. Done.

import OpenAI from "openai";

const ai = new OpenAI({
  apiKey: process.env.AIGATEWAY_KEY,
  baseURL: "https://api.aigateway.sh/v1",
});

// GPT-4o
const gpt = await ai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});

// Swap to Claude by changing the string. Nothing else changes.
const claude = await ai.chat.completions.create({
  model: "anthropic/claude-sonnet-4.5",
  messages: [{ role: "user", content: "Hello" }],
});

// Swap to Kimi (free until Apr 30). Still no code change.
const kimi = await ai.chat.completions.create({
  model: "moonshot/kimi-k2.6",
  messages: [{ role: "user", content: "Hello" }],
});

Same for streaming:

const stream = await ai.chat.completions.create({
  model: "google/gemini-2.5-pro",
  messages: [{ role: "user", content: "Write me a haiku." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

Same for vision, embeddings, image generation, audio transcription, text-to-speech, and video. One interface, every modality, every provider.

Providers unified under one roof

TEXT / REASONING

OpenAI · Anthropic · Google · Meta · Mistral · xAI · DeepSeek · Moonshot · Qwen · Cohere · Perplexity · Nous

INFERENCE HOSTS

Groq · Cerebras · Together · Fireworks · Replicate · Hugging Face · Lepton · OpenRouter

MODALITIES

Text · Vision · Image gen · Audio (ASR + TTS) · Video · Embeddings · Classification · Translation

New providers land weekly. The promise is simple: if a model exists and has an API, within a week it exists on AIgateway under the same interface.


How: the architecture at the edge

The hardest decision I made early was not building this on a traditional server.

Every existing “AI gateway” runs on Kubernetes or AWS Fargate in one region. That works until your users are in Mumbai and your gateway is in Virginia. At that point, every call pays an intercontinental hop twice: once to reach your gateway, and once more when the gateway forwards to the model provider.

AIgateway runs on Cloudflare Workers, deployed to 300+ cities. The gateway sits in the city nearest your user. The counter-intuitive outcome: calling OpenAI through AIgateway is often faster than calling OpenAI directly from your own backend. Because your backend is in one region. We are in every region.

Request flow

  1. User in Bengaluru hits your API
  2. Your backend calls api.aigateway.sh
  3. Cloudflare routes to the Worker running in Bengaluru (<5ms cold start)
  4. KV: rate-limit check (<1ms)
  5. Durable Object: per-session state
  6. Model router picks best provider
  7. Streamed SSE response from provider, relayed verbatim
  8. D1: usage + cost logged async (never on critical path)

The stack

Layer · Tech · Why
Runtime · Cloudflare Workers · 300+ cities, sub-5ms cold starts, no container tax
Router · Hono · ~10kb, zero-dep, streams-first, TypeScript native
Rate limits · KV + Durable Objects · eventually-consistent KV for cheap reads, DO for per-key atomic counters
Usage logs · D1 · SQLite at the edge; logged via waitUntil so billing never blocks the response
Media payloads · R2 · zero egress; image, audio, and video responses stream through without an S3 bandwidth bill
Streaming state · Durable Objects · per-session SSE state that survives provider flaps and failovers mid-stream
Observability · Workers Analytics + Logpush · p50/p95/p99 per model per region, piped to R2 for long-term query
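The "never blocks the response" row is the key trick: in a Worker, `ctx.waitUntil` lets the insert run after the response has already gone out. A sketch of that pattern, with an illustrative row shape (this is not AIgateway's actual D1 schema):

```typescript
// Build the usage row synchronously; it is pure and cheap.
interface UsageRow {
  key: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
}

function buildUsageRow(
  key: string,
  model: string,
  promptTokens: number,
  completionTokens: number,
  usdPerMTokIn: number,
  usdPerMTokOut: number,
): UsageRow {
  const costUsd =
    (promptTokens / 1e6) * usdPerMTokIn +
    (completionTokens / 1e6) * usdPerMTokOut;
  return { key, model, promptTokens, completionTokens, costUsd };
}

// In the Worker handler, only the D1 write is deferred:
//
//   const row = buildUsageRow(key, model, pIn, pOut, rateIn, rateOut);
//   ctx.waitUntil(env.DB.prepare(INSERT_SQL).bind(...Object.values(row)).run());
//   return response; // the stream is never held hostage by billing
```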

The part I’m proudest of: fallback routing

Every model is scored continuously on three signals: latency p95, error rate, and throttle signals (429s, upstream timeouts, circuit trips). Scores update per request, rolling one-minute window.
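The rolling-window scoring can be sketched roughly like this. The one-minute window and the three signals come from the description above; the exact weighting is my illustration, not AIgateway's formula:

```typescript
interface Sample { ts: number; latencyMs: number; ok: boolean; throttled: boolean }

class ModelHealth {
  private samples: Sample[] = [];
  constructor(private windowMs = 60_000) {}

  record(s: Sample) {
    this.samples.push(s);
    // Prune anything older than the rolling window.
    const cutoff = s.ts - this.windowMs;
    this.samples = this.samples.filter((x) => x.ts >= cutoff);
  }

  // Lower is healthier: p95 latency in seconds, plus heavy penalties
  // for errors and throttle signals.
  score(now: number): number {
    const live = this.samples.filter((x) => x.ts >= now - this.windowMs);
    if (live.length === 0) return 0;
    const lat = live.map((x) => x.latencyMs).sort((a, b) => a - b);
    const p95 = lat[Math.min(lat.length - 1, Math.floor(lat.length * 0.95))];
    const errRate = live.filter((x) => !x.ok).length / live.length;
    const throttleRate = live.filter((x) => x.throttled).length / live.length;
    return p95 / 1000 + errRate * 10 + throttleRate * 5;
  }
}
```

The router then simply prefers the candidate with the lowest score among capability-matched models.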

When a call arrives for openai/gpt-4o and OpenAI is hiccuping, the router can transparently re-route to anthropic/claude-sonnet-4.5 or google/gemini-2.5-pro, matched on capability, context window, and tool-use support. You opt into this per-request via a fallback header, so it never happens without your consent, and every fallback is logged with the exact reason.

await ai.chat.completions.create({
  model: "openai/gpt-4o",
  messages: [...],
}, {
  headers: {
    "x-aigateway-fallback": "anthropic/claude-sonnet-4.5, google/gemini-2.5-pro",
  },
});

Your users never see a 503. You never page yourself at 3am because a single provider had a rough night.

Streaming that survives provider flaps

This is the boring-looking feature that took the longest. Streamed responses over SSE feel simple in a happy-path demo. In production, you get:

  • Mid-stream failures: the provider dies after emitting 400 tokens of a 2000-token response
  • Chunked connection drops: the client keeps the socket open but stops receiving bytes
  • Protocol drift: one provider backend sends data: frames, another sends bare JSON
  • Token-level billing reconciliation: you metered 400 tokens and need to refund them if the stream never completed

Durable Objects hold per-session state so we can resume, replay, or fail cleanly, and guarantee accurate billing no matter how the stream died. The client sees one clean SSE stream. We eat the complexity.
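The tolerant-parsing half of this can be sketched as a tiny line-fed state machine: accept both data: frames and bare-JSON lines, drop malformed frames instead of killing the stream, and track exactly what was delivered so billing can be reconciled. The frame shapes are illustrative:

```typescript
interface ParseState { delivered: string[]; done: boolean }

function feedLine(state: ParseState, line: string): ParseState {
  // Tolerate both `data: {...}` SSE frames and bare JSON lines.
  const payload = line.startsWith("data:") ? line.slice(5).trim() : line.trim();
  if (payload === "") return state;
  if (payload === "[DONE]") return { ...state, done: true };
  try {
    const obj = JSON.parse(payload);
    const delta = obj?.choices?.[0]?.delta?.content;
    if (typeof delta === "string") {
      return { ...state, delivered: [...state.delivered, delta] };
    }
  } catch {
    // Malformed frame mid-flap: drop it rather than abort the stream.
  }
  return state;
}
```

If the stream dies before `done`, `delivered` is the ground truth for what the user actually received, which is what the refund logic keys off.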


Pricing: transparent by default

The worst pattern in this market is opaque “credits.” You buy 1,000 credits. A GPT-4o call costs 12 credits but a Claude call costs 9 credits but an image gen costs 50 credits, except on Tuesdays when there’s a multiplier. You never actually know what you’re paying for.

AIgateway pricing is the opposite:

What you pay

Pass-through provider cost (exact published rate)
+ 5% platform fee

That’s it. No markups hidden in credits. No tier-gating basic features. No surprise “infrastructure fees.”

What you get

Per-request token cost visible in the response header
Per-model, per-day rollups in the dashboard
CSV export of every call (for accounting / cost attribution)

Every response carries two headers you actually want:

x-aigateway-cost-usd: 0.00248
x-aigateway-provider: anthropic

You can log those to your own observability stack and build per-customer cost attribution in a weekend.
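The weekend version is a few lines: read the cost header off each response and fold it into a per-customer total. The aggregation shape here is my sketch, not a library:

```typescript
// Accumulate per-customer spend from the x-aigateway-cost-usd header.
// `headers` is modeled as a Map for self-containment; a fetch Response's
// Headers object works the same way via .get().
function recordCost(
  totals: Map<string, number>,
  customerId: string,
  headers: Map<string, string>,
): Map<string, number> {
  const cost = parseFloat(headers.get("x-aigateway-cost-usd") ?? "0");
  totals.set(
    customerId,
    (totals.get(customerId) ?? 0) + (Number.isNaN(cost) ? 0 : cost),
  );
  return totals;
}
```

Flush `totals` to your metrics store on whatever cadence you like, and per-customer margin reporting falls out for free.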


The launch offer: free Kimi K2.6 until April 30

Moonshot’s Kimi K2.6 is one of the most interesting frontier models I’ve tested in 2026. 1M context, strong reasoning, and priced aggressively even before this promo.

From now until April 30, every AIgateway account gets unlimited Kimi K2.6 calls, free. No credit card, no caveats, no “pro tier upgrade” after hour one. Benchmark it against whatever you’re running today, on your real prompts, your real users, your real eval set. If it’s better, switch. If it isn’t, the eval still cost you nothing.

How to use the free window

  1. Sign up at aigateway.sh (no card required)
  2. Point your existing OpenAI client at https://api.aigateway.sh/v1 with your AIgateway key
  3. Set the model string to moonshot/kimi-k2.6
  4. Run your eval suite. Your usage page will show $0.00 through Apr 30.

The roadmap: what lands in the next 60 days

This is v1. What I’m actively building:

MAY

Semantic caching

Embedding-based exact + near-match cache. On chat apps with repeated queries, expect 40–60% cost reduction. Opt-in per request.
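Since this is a roadmap item, here is only a hypothetical sketch of how an embedding-based near-match cache works: embed the prompt, cosine-compare against cached entries, return a hit above a similarity threshold. The 0.95 cutoff and the in-memory store are assumptions for illustration:

```typescript
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache {
  private entries: { vec: Vec; response: string }[] = [];
  constructor(private threshold = 0.95) {}

  // Return the best cached response whose embedding clears the threshold.
  get(vec: Vec): string | undefined {
    let best: { sim: number; response: string } | undefined;
    for (const e of this.entries) {
      const sim = cosine(vec, e.vec);
      if (sim >= this.threshold && (!best || sim > best.sim)) {
        best = { sim, response: e.response };
      }
    }
    return best?.response;
  }

  set(vec: Vec, response: string) {
    this.entries.push({ vec, response });
  }
}
```

In production this would sit behind a real embedding call and a vector index rather than a linear scan, but the hit/miss semantics are the same.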

MAY

Prompt-level A/B testing

Split a percentage of traffic across two models, log outcomes, let the dashboard surface the winner by cost-per-successful-output.

JUN

BYO-key mode

Bring your own provider keys, pay AIgateway only for routing + observability. For regulated industries that can’t have a third party hold the provider key.

JUN

Eval-based auto-routing

Define an eval. AIgateway picks the cheapest model that passes it. Every week, re-runs the eval as new models land. Prices fall automatically; quality never regresses.

JUL

On-device fallback

For mobile and edge clients: a small local model (via WebGPU or llama.cpp) that takes over when the network is flaky. Same API surface on the wire.


Who this is for

Be honest with yourself about the answer:

Use AIgateway if

  • You ship AI features and use more than one model provider
  • You’ve copy-pasted retry/streaming code across projects
  • You want per-request cost visibility without building a pipeline
  • You need fallback routing for production reliability
  • Your users are distributed globally and latency matters

Skip AIgateway if

  • You only use one model, will only ever use one model
  • You’re a research team that needs raw provider quirks
  • Compliance mandates a direct contract with each provider
  • You already run your own gateway and it works fine (genuinely, don’t switch for the sake of switching)

The bet I’m making

Every infrastructure wave produces one or two gateways that swallow the margin of the layer below them. Stripe for payments. Twilio for telecom. Cloudflare for networking. Plaid for banking.

Intelligence is entering its commodity phase. The gateway is the winning shape.

I don’t know yet if AIgateway will be the gateway. I do know that the problem it solves is real (I felt it thirty times), and that the architecture (edge-native, OpenAI-compatible, fallback-aware) is the right one.

If you’re shipping anything with AI, try it. The downside is an afternoon. The upside is every future model your competitor hasn’t added yet.

Get started

aigateway.sh: sign up, grab a key
https://api.aigateway.sh/v1: OpenAI-compatible base URL
→ Free Kimi K2.6 until Apr 30. Just set model: "moonshot/kimi-k2.6"

DMs are open for feedback. Built solo, shipped fast, still rough in places. That’s the honest version. If you hit a sharp edge, tell me and I’ll fix it that week.

One key. One hundred and fifty models. Zero lock-in.

Let’s see what you build.
