How much does Step 3.7 Flash cost via API?

Step 3.7 Flash costs $0.20 per million input tokens and $1.15 per million output tokens on StepFun's platform. A cache hit drops input to $0.04 per million. There is no surcharge for vision input or for higher reasoning effort, and the same rate applies on OpenRouter and other hosts.

Is Step 3.7 Flash multimodal?

Yes. Step 3.7 Flash pairs a 196B-parameter language backbone with a 1.8B vision encoder and natively understands images and video. It also exposes a Python tool for high-resolution image probing. That makes it one of the cheapest models in 2026 that can actually see, at roughly an eighth of Gemini 3.5 Flash's per-token price.

Is Step 3.7 Flash open source?

Yes. StepFun released the weights under Apache 2.0 on Hugging Face at stepfun-ai/Step-3.7-Flash. It is a 198B-parameter mixture-of-experts model with about 11B active parameters per token and a 256K context window. Running it locally needs roughly 120 GB of VRAM or unified memory.

How does Step 3.7 Flash compare to Gemini 3.5 Flash?

Both are multimodal Flash-tier models. Step 3.7 Flash lists at $0.20/$1.15 per million tokens against Gemini 3.5 Flash's $1.50/$9.00, so Step is roughly 7-8x cheaper on both input and output. Gemini scores higher on the independent Artificial Analysis Intelligence Index and has Google's infrastructure behind it. Most of Step's coding and vision benchmarks are StepFun's own.

Model ReleaseJune 14, 2026·8 min read

Step 3.7 Flash reads images and codes like a mid-tier model, for an eighth of Gemini Flash's price

StepFun's open-weight vision model lists at $0.20 in and $1.15 out per million tokens, undercutting Gemini 3.5 Flash by roughly eight to one while doing the same job: reading images, watching video, and writing code. The independent score backs up the price. Almost everything else on the spec sheet is a number StepFun gave itself.

Blue light streaks against a dark background, suggesting speed and motion

Photo by Roxy Aln on Unsplash

Most of the cheap models that shipped this spring are text only. DeepSeek V4-Flash at fourteen cents, the budget Qwen coders, the little Mistrals: they are fast and they are cheap, and none of them can look at a screenshot. Step 3.7 Flash, which StepFun put out on May 29, is the exception worth knowing about. It reads images and video, it codes, and it costs about what the text-only floor costs.

The list price is $0.20 per million input tokens and $1.15 per million output. A cache hit takes input down to $0.04. There is no extra charge for sending it a picture, and no extra charge for turning the reasoning up. For a model that can do vision, that is the part that stops you scrolling.

On the meter, next to Gemini Flash

Model	Input / 1M	Cached / 1M	Output / 1M	Context
Step 3.7 Flash	$0.20	$0.04	$1.15	256K
Gemini 3.5 Flash	$1.50	$0.15	$9.00	1M

Source: platform.stepfun.ai. Model ID: step-3.7-flash

What twenty cents buys you now

Under the hood Step 3.7 Flash is a 198B-parameter mixture of experts that fires about 11B parameters per token, with a 1.8B vision encoder bolted to a 196B language backbone. The architecture is not the headline. The headline is that this combination ships at twenty cents, because until recently a model that read images well started somewhere north of a dollar.

StepFun built the vision path for work, not demos. There is a Python tool the model calls to zoom into high-resolution images, draw bounding boxes, and probe regions it cannot resolve in one pass, plus a visual search tool for long-tail entity recognition. The closest thing at this price is Qwen3.7 Plus, which also reads images and video and lists at $0.40 in, $1.60 out. Step undercuts it on both, and unlike Qwen it ships open weights.

If your product turns documents, screenshots, or video frames into structured output, this is the tier that just got cheaper. The question is whether the quality holds up, and that is where the spec sheet gets thinner than it looks.

One independent number, and a wall of in-house ones

Start with the figure StepFun did not control. Artificial Analysis puts Step 3.7 Flash at 43 on its Intelligence Index, which lands it mid-pack among the open-weight field and a notch below Gemini 3.5 Flash. The measured output speed is about 382 tokens a second with a 0.94s time to first token, which is genuinely fast for a model this size. Those are third-party measurements, and they are the ones to trust.

Everything else in the launch post is StepFun grading its own work. The coding claims are SWE-Bench Pro at 56.3% and Terminal-Bench 2.1 at 59.6%, both up from Step 3.5 Flash. The vision claims run higher: HR-Bench 4K at 89.1%, V* at 95.3%, SimpleVQA with search at 79.2%. These are plausible for the tier, but no outside lab has reproduced them yet, and the individual sub-scores behind the Artificial Analysis index were not broken out at the time of writing.

Benchmark	Step 3.7 Flash	Who ran it
AA Intelligence Index	43	Artificial Analysis
SWE-Bench Pro	56.3%	StepFun
Terminal-Bench 2.1	59.6%	StepFun
HR-Bench 4K (vision)	89.1%	StepFun

None of that makes the numbers wrong. It makes them unconfirmed, which is a different thing, and the right posture for a model that is two weeks old. The index of 43 is what you can bank on today.

The Opus comparison hiding in the footnotes

StepFun's most aggressive claim is not about Step 3.7 Flash alone. It is about a setup called Advisor Mode, where the cheap model runs the task and only escalates to a larger advisor model at the hard moments. On SWE-Bench Verified, StepFun says that combination reaches 76.3% at about $0.19 a task, against Claude Opus 4.6 at 78.7% and roughly $1.76 a task. Same ballpark on quality, one-ninth the cost.

Read that as a marketing chart until someone else runs it, because the comparison is StepFun's and the advisor escalation is the kind of harness that is easy to tune for a benchmark and hard to reproduce on your own repo. The honest version of the claim is narrower: a twenty-cent model that offloads its hardest steps can get close to a frontier coder on a specific test, and the per-task math is striking enough to be worth checking yourself.

Where it sits in the cheap-multimodal field

The useful comparison is not against Opus. It is against the other models you would actually pick for high-volume vision work. Among those, Step is the floor on price. DeepSeek V4-Flash is cheaper still, but it is text only, so it is in the table as the reminder of what you give up to save the last few cents.

Model	Input / 1M	Output / 1M	Vision
DeepSeek V4-Flash	$0.14	$0.28	No
Step 3.7 Flash	$0.20	$1.15	Image + video
Qwen3.7 Plus	$0.40	$1.60	Image + video
MiniMax M3	$0.60	$2.40	Image + video
Gemini 3.5 Flash	$1.50	$9.00	Image + video

List prices. MiniMax M3 shows here at its standard rate; a launch promo halves it for input under 512K. See tokencost.app/pricing for live rates.

The same vision workload, five invoices

Vision work is input-heavy in a way text chat is not, because images burn tokens. Take a pipeline that pushes 100M input tokens and writes 20M output tokens a month, a 5-to-1 split that matches a service reading documents and emitting structured JSON. Here is what the same month costs on each:

Model	Monthly cost	vs Step
DeepSeek V4-Flash (text only)	$20	0.5x
Step 3.7 Flash	$43	1x
Qwen3.7 Plus	$72	1.7x
MiniMax M3	$108	2.5x
Gemini 3.5 Flash	$330	7.7x

Caching cuts the Step number further, since repeated document templates or system prompts bill at four cents instead of twenty. The gap to Gemini is the real story: if you are running Flash because it is Google's cheap multimodal option and your bill has three digits, Step does the same shape of work for a fraction, and you can move the weights in-house if the API ever gets in your way.

Open weights, and the hardware they ask for

Step 3.7 Flash ships under Apache 2.0, weights on Hugging Face at stepfun-ai/Step-3.7-Flash, in BF16, FP8, NVFP4, and GGUF builds. That is the permissive license, not a research-only one, so commercial self-hosting is on the table. The floor to run it is around 120 GB of VRAM or unified memory, which a single high-end node or an Apple Silicon box with enough RAM can clear.

On the hosted side the model is OpenAI-compatible at platform.stepfun.ai, with the same step-3.7-flash id also live on OpenRouter and NVIDIA NIM, with DeepInfra and Fireworks listed as coming soon. Reasoning effort is a low/medium/high dial set per request, and turning it up does not change the price.

Swap it in for vision work, think twice for code

If you run a high-volume vision pipeline on Gemini 3.5 Flash and the bill matters, Step 3.7 Flash is the first thing to A/B this month. Same job, a fraction of the cost, and an escape hatch to open weights if you need it. Send it a representative batch of your real images and grade the structured output against what Gemini gives you. The price gap is wide enough that Step can lose a little accuracy and still win on cost.

If you are choosing a coder, slow down. The 43 on the Artificial Analysis index is solid but mid-pack, and the standout coding numbers, Advisor Mode included, are StepFun's own and unreproduced. Treat them as a reason to run the model, not a reason to trust the chart. For pure text reasoning with no images in sight, DeepSeek V4-Flash is still cheaper, so the case for Step is the vision, not the per-token floor.

Sources

Step 3.7 Flash API docs - platform.stepfun.ai (input $0.20 / cached $0.04 / output $1.15, 256K context)
Step 3.7 Flash launch post - StepFun (architecture, vision and coding benchmarks, Advisor Mode)
Step 3.7 Flash analysis - Artificial Analysis (Intelligence Index 43, 382 tok/s, 0.94s TTFT)
Step 3.7 Flash model card - Hugging Face (Apache 2.0 weights, quant builds, specs)
StepFun releases Step 3.7 Flash - MarkTechPost, May 29, 2026

Compare all model prices Calculate your API cost