Step 3.7 Flash reads images and codes like a mid-tier model, for an eighth of Gemini Flash's price
StepFun's open-weight vision model lists at $0.20 in and $1.15 out per million tokens, undercutting Gemini 3.5 Flash by roughly eight to one while doing the same job: reading images, watching video, and writing code. The independent score backs up the price. Almost everything else on the spec sheet is a number StepFun gave itself.

Most of the cheap models that shipped this spring are text only. DeepSeek V4-Flash at fourteen cents, the budget Qwen coders, the little Mistrals: they are fast and they are cheap, and none of them can look at a screenshot. Step 3.7 Flash, which StepFun put out on May 29, is the exception worth knowing about. It reads images and video, it codes, and it costs about what the text-only floor costs.
The list price is $0.20 per million input tokens and $1.15 per million output. A cache hit takes input down to $0.04. There is no extra charge for sending it a picture, and no extra charge for turning the reasoning up. For a model that can do vision, that is the part that stops you scrolling.
On the meter, next to Gemini Flash
| Model | Input / 1M | Cached / 1M | Output / 1M | Context |
|---|---|---|---|---|
| Step 3.7 Flash | $0.20 | $0.04 | $1.15 | 256K |
| Gemini 3.5 Flash | $1.50 | $0.15 | $9.00 | 1M |
Source: platform.stepfun.ai. Model ID: step-3.7-flash
What twenty cents buys you now
Under the hood Step 3.7 Flash is a 198B-parameter mixture of experts that fires about 11B parameters per token, with a 1.8B vision encoder bolted to a 196B language backbone. The architecture is not the headline. The headline is that this combination ships at twenty cents, because until recently a model that read images well started somewhere north of a dollar.
StepFun built the vision path for work, not demos. There is a Python tool the model calls to zoom into high-resolution images, draw bounding boxes, and probe regions it cannot resolve in one pass, plus a visual search tool for long-tail entity recognition. The closest thing at this price is Qwen3.7 Plus, which also reads images and video and lists at $0.40 in, $1.60 out. Step undercuts it on both, and unlike Qwen it ships open weights.
If your product turns documents, screenshots, or video frames into structured output, this is the tier that just got cheaper. The question is whether the quality holds up, and that is where the spec sheet gets thinner than it looks.
One independent number, and a wall of in-house ones
Start with the figure StepFun did not control. Artificial Analysis puts Step 3.7 Flash at 43 on its Intelligence Index, which lands it mid-pack among the open-weight field and a notch below Gemini 3.5 Flash. The measured output speed is about 382 tokens a second with a 0.94s time to first token, which is genuinely fast for a model this size. Those are third-party measurements, and they are the ones to trust.
Everything else in the launch post is StepFun grading its own work. The coding claims are SWE-Bench Pro at 56.3% and Terminal-Bench 2.1 at 59.6%, both up from Step 3.5 Flash. The vision claims run higher: HR-Bench 4K at 89.1%, V* at 95.3%, SimpleVQA with search at 79.2%. These are plausible for the tier, but no outside lab has reproduced them yet, and the individual sub-scores behind the Artificial Analysis index were not broken out at the time of writing.
| Benchmark | Step 3.7 Flash | Who ran it |
|---|---|---|
| AA Intelligence Index | 43 | Artificial Analysis |
| SWE-Bench Pro | 56.3% | StepFun |
| Terminal-Bench 2.1 | 59.6% | StepFun |
| HR-Bench 4K (vision) | 89.1% | StepFun |
None of that makes the numbers wrong. It makes them unconfirmed, which is a different thing, and the right posture for a model that is two weeks old. The index of 43 is what you can bank on today.
The Opus comparison hiding in the footnotes
StepFun's most aggressive claim is not about Step 3.7 Flash alone. It is about a setup called Advisor Mode, where the cheap model runs the task and only escalates to a larger advisor model at the hard moments. On SWE-Bench Verified, StepFun says that combination reaches 76.3% at about $0.19 a task, against Claude Opus 4.6 at 78.7% and roughly $1.76 a task. Same ballpark on quality, one-ninth the cost.
Read that as a marketing chart until someone else runs it, because the comparison is StepFun's and the advisor escalation is the kind of harness that is easy to tune for a benchmark and hard to reproduce on your own repo. The honest version of the claim is narrower: a twenty-cent model that offloads its hardest steps can get close to a frontier coder on a specific test, and the per-task math is striking enough to be worth checking yourself.
Where it sits in the cheap-multimodal field
The useful comparison is not against Opus. It is against the other models you would actually pick for high-volume vision work. Among those, Step is the floor on price. DeepSeek V4-Flash is cheaper still, but it is text only, so it is in the table as the reminder of what you give up to save the last few cents.
| Model | Input / 1M | Output / 1M | Vision |
|---|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 | No |
| Step 3.7 Flash | $0.20 | $1.15 | Image + video |
| Qwen3.7 Plus | $0.40 | $1.60 | Image + video |
| MiniMax M3 | $0.60 | $2.40 | Image + video |
| Gemini 3.5 Flash | $1.50 | $9.00 | Image + video |
List prices. MiniMax M3 shows here at its standard rate; a launch promo halves it for input under 512K. See tokencost.app/pricing for live rates.
The same vision workload, five invoices
Vision work is input-heavy in a way text chat is not, because images burn tokens. Take a pipeline that pushes 100M input tokens and writes 20M output tokens a month, a 5-to-1 split that matches a service reading documents and emitting structured JSON. Here is what the same month costs on each:
| Model | Monthly cost | vs Step |
|---|---|---|
| DeepSeek V4-Flash (text only) | $20 | 0.5x |
| Step 3.7 Flash | $43 | 1x |
| Qwen3.7 Plus | $72 | 1.7x |
| MiniMax M3 | $108 | 2.5x |
| Gemini 3.5 Flash | $330 | 7.7x |
Caching cuts the Step number further, since repeated document templates or system prompts bill at four cents instead of twenty. The gap to Gemini is the real story: if you are running Flash because it is Google's cheap multimodal option and your bill has three digits, Step does the same shape of work for a fraction, and you can move the weights in-house if the API ever gets in your way.
Open weights, and the hardware they ask for
Step 3.7 Flash ships under Apache 2.0, weights on Hugging Face at stepfun-ai/Step-3.7-Flash, in BF16, FP8, NVFP4, and GGUF builds. That is the permissive license, not a research-only one, so commercial self-hosting is on the table. The floor to run it is around 120 GB of VRAM or unified memory, which a single high-end node or an Apple Silicon box with enough RAM can clear.
On the hosted side the model is OpenAI-compatible at platform.stepfun.ai, with the same step-3.7-flash id also live on OpenRouter and NVIDIA NIM, with DeepInfra and Fireworks listed as coming soon. Reasoning effort is a low/medium/high dial set per request, and turning it up does not change the price.
Swap it in for vision work, think twice for code
If you run a high-volume vision pipeline on Gemini 3.5 Flash and the bill matters, Step 3.7 Flash is the first thing to A/B this month. Same job, a fraction of the cost, and an escape hatch to open weights if you need it. Send it a representative batch of your real images and grade the structured output against what Gemini gives you. The price gap is wide enough that Step can lose a little accuracy and still win on cost.
If you are choosing a coder, slow down. The 43 on the Artificial Analysis index is solid but mid-pack, and the standout coding numbers, Advisor Mode included, are StepFun's own and unreproduced. Treat them as a reason to run the model, not a reason to trust the chart. For pure text reasoning with no images in sight, DeepSeek V4-Flash is still cheaper, so the case for Step is the vision, not the per-token floor.
Sources
- Step 3.7 Flash API docs - platform.stepfun.ai (input $0.20 / cached $0.04 / output $1.15, 256K context)
- Step 3.7 Flash launch post - StepFun (architecture, vision and coding benchmarks, Advisor Mode)
- Step 3.7 Flash analysis - Artificial Analysis (Intelligence Index 43, 382 tok/s, 0.94s TTFT)
- Step 3.7 Flash model card - Hugging Face (Apache 2.0 weights, quant builds, specs)
- StepFun releases Step 3.7 Flash - MarkTechPost, May 29, 2026