Model Release · June 8, 2026 · 7 min read

Gemma 4 12B has no API price. Google built it to run on your laptop, not on its meter.

Google shipped Gemma 4 12B on June 3. It reads text, images, and audio, fits inside 16GB of memory, and carries an Apache 2.0 license. The one thing it does not have is a per-token rate, because Google is not selling it by the token. You download the weights or you call it for free in AI Studio. For a site that exists to compare API prices, that absence is the story. Below we work out what “free” actually costs you, where the 12B lands next to the Gemmas you can rent and the cheap hosted multimodal models, and the point at which running it yourself beats paying someone else.

[Figure: Gemma 4 12B unified transformer graphic with text, image, and audio inputs. Image source: Google]

There is no rate card, and that is deliberate

Most launches we cover open with a number: GPT-5.4 Mini is $0.75 in and $4.50 out per million tokens, Gemini 3 Flash is $0.50 and $3.00. Gemma 4 12B does not give you one. Google released it the way it releases every Gemma, as downloadable weights on Hugging Face, Kaggle, Ollama, and LM Studio, plus a free rate-limited slot in AI Studio. There is no paid first-party token endpoint. And unlike the bigger 26B and 31B, which a handful of providers already host for a few cents per million tokens, the 12B had not picked up a hosted price tag in its first week.

That is not an oversight. The 12B is the laptop model in the family. Google sized it to run on the hardware you already have, so the comparison it wants you to make is not “our price versus theirs,” it is “free on your own GPU versus a per-token bill somewhere else.” So that is the comparison we will run.

What “free” actually costs

Free is doing some work in that sentence, so pin it down. There are two ways to run the 12B without paying a token rate, and each has a real cost hiding behind the zero.

The first is Google AI Studio. You get the model behind a free, rate-limited endpoint, no card required. Fine for prototyping and light personal use, useless for anything with sustained traffic, because the rate limits are the whole point. The second is self-hosting. The weights ship with quantization-aware-trained int4 checkpoints that compress to roughly 7 to 8GB, which is what lets the model fit a 16GB consumer GPU laptop or a 16GB Apple Silicon Mac. Pull it from Ollama with ollama run gemma4:12b and you are serving locally in minutes.
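Once the model is pulled, you can also reach it programmatically through Ollama's local HTTP server. The following is a minimal sketch, assuming Ollama is running on its default port (11434) and that the `gemma4:12b` tag from the command above is installed; the helper names `build_generate_request` and `generate` are ours, not part of any library.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="gemma4:12b"):
    """Build a non-streaming POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="gemma4:12b", timeout=120.0):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, Ollama returns one JSON object whose
    # "response" field holds the full completion.
    return body["response"]
```

Calling `generate("Summarize the Apache 2.0 license in one sentence.")` returns the completion as a string, provided the local server is up. There is no API key and no per-token charge anywhere in the request, which is the point of the exercise.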

On hardware you already own, the marginal cost of a token is electricity, call it nothing. The catch is throughput. A laptop will answer one user at a time at modest speed; it will not stand up a busy API. The moment you need real concurrency you are renting a GPU, and that rental, not a per-token charge, is the actual bill. We will size it in a minute.

The Gemmas you can rent, and the one you can't

If you would rather not run anything yourself, the family does have rentable members. The larger two are hosted for cents on the million. The 12B is the odd one out: no hosted rate, because it is built to live on your own machine.

| Gemma 4 model | Input / 1M | Output / 1M | How you run it |
|---|---|---|---|
| Gemma 4 12B | Self-host | Self-host | Local / free AI Studio tier |
| Gemma 4 26B A4B | $0.06 | $0.30 | Hosted (OpenRouter) |
| Gemma 4 31B | $0.12 | $0.36 | Hosted (OpenRouter) |

Worth knowing: the 26B and 31B are text-and-image models. The 12B is the only one in the launch wave with native audio, so if your use case involves sound, the cheap hosted siblings will not do it for you anyway. You are self-hosting the 12B or reaching for a paid multimodal model elsewhere.

What the same multimodal work costs hosted

Here is the field if you skip self-hosting and pay for a multimodal API instead. These are the models you would actually reach for to read images and audio at scale, with their current rates.

| Model | Input / 1M | Output / 1M | Modalities |
|---|---|---|---|
| Gemma 4 12B (self-host) | ~$0 | ~$0 | Text, image, audio |
| Gemini 3 Flash | $0.50 | $3.00 | Text, image, audio |
| GPT-5.4 Mini | $0.75 | $4.50 | Text, image |
| Gemini 3.5 Flash | $1.50 | $9.00 | Text, image, audio |

The closest paid match on capability is Gemini 3 Flash, which also takes audio and is the cheapest paid option in this table. So the honest framing of the 12B is not “cheapest multimodal API,” because it is not an API. It is “the multimodal model you can run for free if you are willing to host it, against roughly $0.50 to $1.50 per million input tokens if you are not.”

When running it yourself actually wins

This is the math that matters. Self-hosting trades a per-token bill for a fixed hardware bill, so the answer turns entirely on volume. Below is what a month of multimodal traffic costs on the two cheapest hosted options, at a 70/30 input-output split, next to the 12B on your own GPU.

| Monthly volume | Gemini 3 Flash | GPT-5.4 Mini | Gemma 4 12B, owned GPU |
|---|---|---|---|
| 10M tokens | $12.50 | $18.75 | ~$0 + power |
| 100M tokens | $125 | $187.50 | ~$0 + power |
| 1B tokens | $1,250 | $1,875 | ~$0 + power |
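The hosted columns follow from a blended per-million rate: 70% of tokens billed at the input price, 30% at the output price. A short sketch of that arithmetic, using the rates quoted in this article (the function names are ours, chosen for illustration):

```python
# Hosted rates in USD per 1M tokens (input, output), as quoted in this article.
RATES = {
    "Gemini 3 Flash": (0.50, 3.00),
    "GPT-5.4 Mini": (0.75, 4.50),
}


def blended_rate(input_price, output_price, input_share=0.70):
    """USD per 1M tokens for a given input/output mix."""
    return input_share * input_price + (1.0 - input_share) * output_price


def monthly_cost(tokens_per_month, input_price, output_price, input_share=0.70):
    """USD per month for a token volume at the blended rate."""
    rate = blended_rate(input_price, output_price, input_share)
    return tokens_per_month / 1_000_000 * rate


for name, (inp, out) in RATES.items():
    for volume in (10_000_000, 100_000_000, 1_000_000_000):
        print(f"{name:>15}  {volume:>13,} tokens  ${monthly_cost(volume, inp, out):,.2f}")
```

At this split the blended rates come to $1.25 per million for Gemini 3 Flash and about $1.88 for GPT-5.4 Mini. Change `input_share` to match your own traffic; output-heavy workloads move the hosted bill up quickly because output tokens cost six times as much as input on both models.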

The owned-GPU column reads like a cheat because it half is. If you already have a 16GB laptop or Mac sitting idle, the 12B is free from the first token and the only ceiling is how fast that one machine can serve. The real decision shows up when you do not own the hardware and have to rent it.

An always-on cloud GPU big enough to serve the 12B runs somewhere around $0.50 an hour, or roughly $365 a month if you keep it up around the clock. At the 70/30 split used above, Gemini 3 Flash blends to $1.25 per million tokens and GPT-5.4 Mini to about $1.88. At that fixed cost, self-hosting pulls ahead of Gemini 3 Flash once you cross about 290 million tokens a month, and ahead of GPT-5.4 Mini closer to 195 million. Below that, the rented GPU sits mostly idle and you are better off paying per token. Above it, the flat hardware bill wins and keeps winning as you scale, which is the entire argument for an open model in the first place.
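The break-even volume is the monthly GPU bill divided by the blended per-token rate. A minimal sketch under this article's assumptions ($0.50 an hour, around-the-clock uptime, 70/30 input-output split); it ignores throughput limits, so it presumes one GPU can actually serve the volume in question:

```python
HOURS_PER_MONTH = 730  # 365 days * 24 hours / 12 months


def gpu_monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Fixed monthly bill for a GPU that is never switched off."""
    return hourly_rate * hours


def break_even_tokens(gpu_cost_per_month, input_price, output_price, input_share=0.70):
    """Monthly token volume at which the hosted bill equals the GPU bill."""
    blended = input_share * input_price + (1.0 - input_share) * output_price
    return gpu_cost_per_month / blended * 1_000_000


gpu_bill = gpu_monthly_cost(0.50)
flash = break_even_tokens(gpu_bill, 0.50, 3.00)
mini = break_even_tokens(gpu_bill, 0.75, 4.50)
print(f"GPU bill: ${gpu_bill:,.0f}/month")
print(f"Break-even vs Gemini 3 Flash: {flash / 1e6:,.0f}M tokens/month")
print(f"Break-even vs GPT-5.4 Mini:   {mini / 1e6:,.0f}M tokens/month")
```

With these inputs the thresholds land near 292 million tokens a month against Gemini 3 Flash and near 195 million against GPT-5.4 Mini. A cheaper GPU or a more output-heavy mix pulls both thresholds down.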

Encoder-free, and why that touches the bill

The architecture is the genuinely new thing here, and it is not just a research flourish. Most multimodal models bolt a separate vision encoder and audio encoder onto a language model. Google stripped those out. In its words, it “removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.” Image inputs flow in the same direct way. Fewer moving parts means a smaller memory footprint, which is a chunk of how a model that handles three input types still fits in 16GB.
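The 16GB claim is easier to judge with the raw weight arithmetic in front of you. This is a back-of-the-envelope sketch, assuming a round 12 billion parameters and counting weights only; it deliberately leaves out quantization scales, any layers kept at higher precision, the KV cache, and runtime buffers, all of which add to the real figure.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes, ignoring all overhead."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 12e9  # assumption: taken from the "12B" in the model name

for label, bits in (("bf16", 16), ("int8", 8), ("int4", 4)):
    print(f"{label}: {weight_footprint_gb(PARAMS, bits):.1f} GB")
```

At 4 bits the weights alone come to about 6 GB, which sits plausibly below the 7 to 8GB quoted for the int4 checkpoints once overhead is added, and leaves room on a 16GB machine for context. At 16 bits the same weights would need about 24 GB, which is why the quantized checkpoints are what make the laptop pitch work.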

It is also the first mid-sized Gemma that ingests audio natively rather than routing it through a bolted-on speech model. Inputs are text, image, and audio; output is text only. Context is 256K tokens, the knowledge cutoff is January 2025, and it speaks 35-plus languages out of the box, on top of pre-training across 140-plus. None of that needs an encoder stack you have to feed and pay for separately.

Does a laptop model hold up on the benchmarks

Reasonably well for its size. These are the numbers from Google's official model card, so treat them as the vendor's own, but the spread is wide enough to tell you where the 12B is strong and where it is not.

| Benchmark | Score | What it measures |
|---|---|---|
| MMLU Pro | 77.2 | General knowledge and reasoning |
| GPQA Diamond | 78.8 | Graduate-level science |
| MMMU Pro | 69.1 | Multimodal understanding |
| MATH-Vision | 79.7 | Math from images |
| LiveCodeBench v6 | 72.0 | Competitive coding |
| CoVoST | 38.5 | Speech translation |

The knowledge and vision scores are genuinely good for 12 billion parameters, landing the model near the larger 26B on the headline reasoning tests. Audio is the soft spot: a CoVoST speech-translation score of 38.5 says the native-audio feature is real but early, more a capable transcription-and-understanding tool than a polished speech model. Run it for what it is, a free multimodal workhorse, not a frontier system, and the numbers line up with the price of zero.

Where this leaves you

Gemma 4 12B is the right call when you have steady multimodal volume, you can host your own model, and audio input is on the menu. In that lane nothing else is free, and the economics tilt harder in your favor with every additional token, because you are paying a flat hardware cost instead of a per-token bill that grows with usage. If you have an idle 16GB machine, the decision is close to made.

It is the wrong call when your volume is low or spiky, when you do not want to run infrastructure, or when you need frontier-grade quality rather than competent-and-free. For sporadic traffic, a hosted model like Gemini 3 Flash at $0.50 a million input is cheaper than a GPU that sits idle most of the day, and you skip the ops entirely. The break-even is not a vibe, it is a number, and that number is your monthly token count.
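One way to see the low-volume penalty is to turn the flat GPU bill into an effective per-million rate at your actual volume. A small sketch, assuming the roughly $365 a month rented GPU from earlier in this article; the function name is ours:

```python
def effective_rate_per_million(gpu_cost_per_month, tokens_per_month):
    """What each 1M tokens really costs when the GPU bill is fixed."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return gpu_cost_per_month / (tokens_per_month / 1_000_000)


GPU_BILL = 365.0  # USD per month, assumption carried over from the rental estimate

for volume in (10_000_000, 100_000_000, 1_000_000_000):
    rate = effective_rate_per_million(GPU_BILL, volume)
    print(f"{volume:>13,} tokens/month -> ${rate:,.2f} per 1M")
```

At 10 million tokens a month the rented GPU works out to $36.50 per million, far above any hosted rate in this article. At 1 billion it falls to about $0.37 per million, well under Gemini 3 Flash's blended $1.25.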

Put the hosted options side by side on the full pricing table, or run your own token mix through the calculator to find the volume where self-hosting Gemma 4 12B finally beats paying per token.
