Meta Muse Spark: no API, no open weights, and one area where it's best in the world
Meta shipped its first proprietary model on April 8. Benchmarks are out - it's fourth overall but leads on healthcare. There's still no public API and no pricing. Here's what the numbers say and what to use while you wait.

- Muse Spark is free on meta.ai. No public API, no pricing, no timeline. Developers cannot integrate it yet.
- Fourth on the Artificial Analysis Intelligence Index (52), behind Gemini 3.1 Pro and GPT-5.4 (both 57) and Claude Opus 4.6 (53). Leads specifically on healthcare: 42.8% on HealthBench Hard vs GPT-5.4's 40.1%.
- Llama 4 is unchanged. Scout is still $0.10/$0.30 and Maverick is still $0.20/$0.60 on Together AI. Nothing about Meta's open-source Llama changed on April 8.
What Meta actually built
Muse Spark is the first model from Meta Superintelligence Labs (MSL), the team Alexandr Wang has been running since joining from Scale AI in mid-2025. Nine months of building from scratch - new infrastructure, new architecture, new data pipelines. Wang posted this himself on X when announcing the model.
It handles text, voice, and image inputs. Two modes: "Instant" for fast responses, "Contemplating" for harder problems - the second mode runs multiple sub-agents in parallel. Meta calls the underlying technique "thought compression." The model reportedly reaches this reasoning capability on less than a tenth of the compute Llama 4 Maverick needs.
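Meta hasn't published how "Contemplating" mode actually works beyond that parallel sub-agent description. As a purely generic illustration of the fan-out/merge pattern it implies - not Meta's implementation - here's a minimal sketch; the `ask` stub is hypothetical, since no public API exists.

```python
import asyncio

async def ask(prompt: str) -> str:
    """Stub model call. Hypothetical: Muse Spark has no public API."""
    await asyncio.sleep(0.1)  # stand-in for network latency
    return f"draft for: {prompt!r}"

async def contemplate(question: str, n_agents: int = 4) -> str:
    """Fan out n sub-agent calls in parallel, then merge the drafts."""
    angles = [f"{question} (approach {i + 1})" for i in range(n_agents)]
    drafts = await asyncio.gather(*(ask(a) for a in angles))
    # A production system would score or synthesize the drafts; simple
    # concatenation keeps the sketch honest about what we actually know.
    return await ask("synthesize one answer from:\n" + "\n".join(drafts))

print(asyncio.run(contemplate("plan a clinical triage workflow")))
```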
You can try it at meta.ai with a Meta account. It's also rolling out inside WhatsApp, Instagram, Facebook, and Messenger over the coming weeks. Everything outside of that - building products, integrating into your stack - requires a public API that doesn't exist yet.
Where it stands vs the models you can actually use
Artificial Analysis ran the full Intelligence Index on Muse Spark using early access Meta provided. The index covers reasoning, coding, math, science, vision, and agentic tasks. Here's the full breakdown.
| Model | AA Index | GPQA Diamond | ARC-AGI-2 | HealthBench Hard |
|---|---|---|---|---|
| Gemini 3.1 Pro | 57 | 94.3% | 76.5 | 20.6% |
| GPT-5.4 | 57 | 92.8% | 76.1 | 40.1% |
| Claude Opus 4.6 | 53 | 92.7% | n/a | 14.8% |
| Muse Spark | 52 | 89.5% | 42.5 | 42.8% |
Source: Artificial Analysis Intelligence Index v4.0, April 8, 2026.
The ARC-AGI-2 number is the one that stands out. Gemini 3.1 Pro and GPT-5.4 both score around 76. Muse Spark scores 42.5. That's not a small gap - abstract reasoning is a real weakness here.
The health result is the opposite story. Muse Spark at 42.8% on HealthBench Hard beats GPT-5.4 (40.1%) by a meaningful margin. Claude Opus 4.6 scores 14.8% on the same benchmark. If your application touches healthcare or clinical reasoning, Muse Spark is - on paper - the best model available, assuming you can get access.
Coding performance and token efficiency
On SWE-bench Verified (real GitHub issues), Muse Spark scores 77.4%. Terminal-Bench 2.0 tells a different story: 59.0 vs GPT-5.4 at 75.1 - a 16-point gap on agentic coding tasks. If you're evaluating models for autonomous coding agents, that gap matters.
Token efficiency is more interesting. The Artificial Analysis run consumed 58M output tokens across the full Intelligence Index - comparable to Gemini 3.1 Pro at 57M. Claude Opus 4.6 used 157M and GPT-5.4 used 120M for the same evaluation. At identical per-token rates, a 2x token efficiency advantage translates directly to 2x cost savings. Whether that holds on production workloads is unknown, but it's a real data point.
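To make that concrete, here's the arithmetic as a minimal sketch. The token counts are the Artificial Analysis figures above; the per-token rate is a placeholder, since Muse Spark has no published pricing.

```python
# Output tokens (millions) each model consumed on the full Intelligence Index run.
EVAL_OUTPUT_TOKENS_M = {
    "Muse Spark": 58,
    "Gemini 3.1 Pro": 57,
    "GPT-5.4": 120,
    "Claude Opus 4.6": 157,
}

RATE_PER_M = 10.0  # placeholder $/1M output tokens, held identical across models

for model, tokens_m in EVAL_OUTPUT_TOKENS_M.items():
    ratio = tokens_m / EVAL_OUTPUT_TOKENS_M["Muse Spark"]
    print(f"{model:<16} ${tokens_m * RATE_PER_M:>6,.0f}  ({ratio:.1f}x Muse Spark)")
```

At any flat rate, GPT-5.4's run costs about 2.1x what Muse Spark's does and Claude Opus 4.6's about 2.7x - which is all the "2x cost savings" claim amounts to.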
The closed-source flip
In 2024, Zuckerberg wrote a 2,000-word post titled "Open Source AI is the Path Forward." The argument: if Meta were the only company using Llama, the ecosystem wouldn't develop. He drew comparisons to Linux.
Muse Spark has no open weights. The official statement from Meta Newsroom: "We hope to open-source future versions of the model." No timeline. No commitment.
Zuckerberg did say on Threads that Meta plans to "release increasingly advanced models that push the frontier" including "new open source models." The dual-track reading - Llama stays open, Muse is closed - is plausible. It's not something Meta said directly.
The model was codenamed "Avocado" internally and was reportedly delayed earlier in 2026 after falling short in internal evaluations for reasoning and coding. Given the ARC-AGI-2 score, those concerns weren't fully resolved before shipping. The Register's headline on April 8: "Meta's new model is as open as Zuckerberg's private school."
What this changes for Llama costs (not much, yet)
Llama 4 weights are still available. You can still self-host or use hosted providers at the same prices as before. Nothing changed on April 8 for existing Llama-based workloads.
| Model | Input / 1M | Output / 1M | Provider |
|---|---|---|---|
| Llama 4 Scout (hosted) | $0.10 | $0.30 | Together AI |
| Llama 4 Maverick (hosted) | $0.20 | $0.60 | Together AI |
| GPT-5.4 | $2.50 | $15.00 | OpenAI |
| Gemini 3.1 Pro | $2.00 | $12.00 | Google |
| Claude Opus 4.6 | $5.00 | $25.00 | Anthropic |
| Muse Spark | n/a | n/a | No public API |
Together AI pricing as of April 9, 2026. See our full pricing page for all providers.
Self-hosting Llama 4: Scout (109B total parameters, 17B active via MoE) needs roughly one H100 - around $2,500/month on-demand. Maverick (400B total, 17B active) needs eight H100s - about $20,000/month. Break-even against frontier hosted APIs at GPT-5.4-class rates typically lands somewhere between 500M and 1B tokens per month; the sketch below walks through the arithmetic.
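A minimal break-even sketch using the rates from the tables above. The 3:1 input:output token split is an assumption about workload shape, not a measured figure.

```python
def blended_rate(input_per_m: float, output_per_m: float,
                 input_share: float = 0.75) -> float:
    """Blended $ per 1M tokens, assuming a 3:1 input:output split."""
    return input_share * input_per_m + (1 - input_share) * output_per_m

H100_MONTHLY = 2_500  # one on-demand H100 running Llama 4 Scout, from above

gpt54 = blended_rate(2.50, 15.00)        # $5.625 per 1M tokens
scout_hosted = blended_rate(0.10, 0.30)  # $0.150 per 1M tokens on Together AI

# Monthly token volume where the GPU bill equals the hosted API bill.
print(f"vs GPT-5.4 rates:         {H100_MONTHLY / gpt54:,.0f}M tokens/month")
print(f"vs Together's Scout rate: {H100_MONTHLY / scout_hosted:,.0f}M tokens/month")
```

That works out to roughly 444M tokens per month against GPT-5.4 rates - squarely in the 500M-1B range once the traffic mix varies - versus about 16.7B tokens before the same GPU beats Together's own Scout pricing. Self-hosting pays off against frontier API rates long before it beats hosted Llama.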
The longer-term question is what happens when Muse 2 ships. Before April 8, the reasonable bet was that Meta's frontier capability would eventually appear as open weights. That bet got harder to hold. If the pattern continues, "run Meta's best model for free" becomes a past-tense statement.
The situation for developers right now
Muse Spark is available in "private preview via API to select partners." Meta didn't name those partners. There is no public API documentation, no pricing, and no waitlist. TechCrunch noted that Meta's competitors keep models at this tier behind paid APIs - it's unclear whether Meta will do the same or subsidize access to drive platform adoption.
Agentic performance is also below the competition today. On the GDPval-AA agentic benchmark, Muse Spark scores 1427 vs Claude Sonnet 4.6 at 1648 and GPT-5.4 at 1676. None of that changes your current stack decision.
If you're building in health - clinical decision support, patient-facing chat, medical documentation - it's worth keeping an eye on. The HealthBench Hard lead over GPT-5.4 is real and larger than similar gaps usually are at this tier. When API access opens, that specific use case has a reason to evaluate Muse Spark first.
Where things stand
Fourth on the overall intelligence index. First on healthcare. No public API. No pricing. Llama 4 unchanged.
The benchmarks suggest a model that's competitive but not yet ahead of GPT-5.4 or Gemini 3.1 Pro on most dimensions. The health lead is the exception. The abstract reasoning gap at ARC-AGI-2 is the problem. The token efficiency number is genuinely interesting and worth watching when API pricing eventually arrives.
For now, you can compare what's actually available on our pricing page, or run your own numbers with the cost calculator.