Llama 4 Scout

Name: Llama 4 Scout
Price: 0.15 USD per 1M input tokens
Author: Meta

Meta | mid

Meta's efficient mixture-of-experts (MoE) model with 109B total parameters, of which only 17B are active per token. It offers a 10M token context window and native multimodal (text and image) support. With quantized weights it fits on a single GPU for inference, which makes it a strong self-hosted option for long-context multimodal work. Open-weight under the Llama license.

Released: 2025-11-14 | Knowledge cutoff: 2025-08

Needs review | Updated 116d ago | 85% source confidence

Specifications

Context Window

10M tokens

Max Output

32K tokens

Input Price

$0.150 / 1M tokens

Output Price

$0.400 / 1M tokens

Latency Tier

Fast (speed score: 8/10)
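The listed per-token prices translate into request costs with simple arithmetic. The following is a minimal sketch, assuming the approximate hosted rates shown above apply uniformly; the function name `estimate_cost_usd` is hypothetical, and actual provider billing may differ.

```python
# Approximate hosted prices taken from the Specifications above (USD per 1M tokens).
# Real prices vary by provider; treat these constants as assumptions.
INPUT_PRICE_PER_M = 0.150
OUTPUT_PRICE_PER_M = 0.400


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the approximate cost in USD of one request at the listed rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # A 1M-token prompt with a 2K-token answer.
    print(f"{estimate_cost_usd(1_000_000, 2_000):.4f} USD")
    # A prompt that fills the full 10M-token window, with a 32K-token answer.
    print(f"{estimate_cost_usd(10_000_000, 32_000):.4f} USD")
```

At these rates the input side dominates long-context requests, so a prompt near the 10M-token limit costs on the order of 1.50 USD before any output is generated.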

Capability Profile

Long Context

10/10

Cost Efficiency

9/10

Speed

8/10

Reasoning

7/10

Coding

7/10

Structured Output

7/10

Multimodal

7/10

Factuality

7/10

Instruction Following

7/10

Tool Use

7/10

Conversational

7/10

Creativity

6.5/10

Safety & Enterprise

6.5/10

Feature Support

Vision: Yes

Audio In: No

Audio Out: No

Video: No

Image Generation: No

Image Editing: No

Function Calling: Yes

JSON Mode: Yes

Structured Output: Yes

Streaming: Yes

Reasoning: No

Realtime: No

Computer Use: No

Web Search: No

Best Use Cases

Self-hosted deployments needing multimodal + long context on a single GPU

Processing extremely long documents or codebases (up to 10M tokens)

Fine-tuning for domain-specific multimodal tasks

Cost-effective inference via hosted providers (Together, Fireworks)

Applications requiring open weights for compliance or customization

Not Ideal For

Tasks requiring the highest absolute quality (use frontier models)

Enterprise deployments needing Anthropic-level safety

Audio processing (the model accepts text and image input only)

Complex agentic workflows where tool use reliability matters most

Strengths

10M token context window, among the longest of any available model

Runs on a single H100 GPU with quantized weights; only 17B parameters are active per token, which keeps per-token compute low

Native multimodal (text + images) at an open-weight model price

Open weights enable full customization and fine-tuning

MoE architecture provides excellent throughput for self-hosting

Weaknesses

Quality at 10M context is best for retrieval-style tasks; reasoning degrades at extreme lengths

Coding quality is decent but well below Claude/GPT-5 tier models

Instruction following is less precise than closed-source competitors

Safety alignment is basic compared to Anthropic or OpenAI models

Structured output compliance can be inconsistent
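Because structured output compliance can be inconsistent, callers may want to validate responses before using them. The following is a minimal sketch of a validate-and-retry wrapper; `generate` is a hypothetical callable standing in for whatever client sends the prompt to the model and returns its raw text, and `generate_validated_json` is not part of any provider SDK.

```python
import json
from typing import Callable, Iterable, Optional


def generate_validated_json(
    generate: Callable[[str], str],
    prompt: str,
    required_keys: Iterable[str],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call `generate` until it returns a JSON object containing all required keys.

    Returns the parsed object, or None if every attempt fails validation.
    """
    required = set(required_keys)
    for _ in range(max_attempts):
        raw = generate(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON at all; try again
        # Accept only a JSON object that carries every required key.
        if isinstance(data, dict) and required.issubset(data):
            return data
    return None


if __name__ == "__main__":
    # Stub model (assumption for illustration): fails twice, then complies.
    replies = iter(["not json", '{"title": "x"}', '{"title": "x", "pages": 3}'])
    result = generate_validated_json(lambda _p: next(replies), "Summarize", ["title", "pages"])
    print(result)
```

Retrying is the simplest mitigation; a stricter setup could validate against a full JSON Schema, at the cost of an extra dependency.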

Edge Cases & Notes

10M context works well for needle-in-haystack but reasoning over the full context is limited

Single-GPU inference requires quantization: at 16-bit precision the 109B weights need roughly 218 GB, more than a single 80 GB H100 holds, while 4-bit weights (about 55 GB) fit on one H100

Pricing shown is approximate and based on hosted providers; it varies significantly by provider

MoE architecture means all 109B parameters must be held in memory, but only 17B are used in the computation for each token
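The distinction between total and active parameters matters for sizing hardware: memory scales with the total parameter count, while per-token compute scales with the active count. The following is a rough sketch of the weight-memory arithmetic, assuming an 80 GB GPU and counting weights only (KV cache, activations, and runtime overhead are ignored, and a 10M-token context would add substantially to the total). The helper names are hypothetical.

```python
TOTAL_PARAMS = 109e9   # every expert must be resident in memory
ACTIVE_PARAMS = 17e9   # parameters used per token (drives compute, not memory)

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(total_params: float, precision: str) -> float:
    """Approximate weight-only memory in GB (1 GB = 10^9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_gpu(total_params: float, precision: str, gpu_memory_gb: float = 80.0) -> bool:
    """True if the weights alone fit within the given GPU memory."""
    return weight_memory_gb(total_params, precision) <= gpu_memory_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(TOTAL_PARAMS, precision)
        print(f"{precision}: {gb:.1f} GB, fits on one 80 GB GPU: {fits_on_gpu(TOTAL_PARAMS, precision)}")
```

Under these assumptions only the 4-bit configuration leaves the weights within a single 80 GB device; the 17B active figure explains the throughput, not the memory footprint.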

Provider Notes

Open-weight under Meta's Llama license. Available through Together AI, Fireworks, Replicate, and self-hosted. Self-hosting on a single H100 is feasible with quantized weights. Pricing shown is approximate (Together AI).

Benchmarks

MMLU: 83%

HumanEval: 80.5%

Arena Elo: 1250

Benchmark Notes

Strong results for a model that can run on a single GPU. MMLU-Pro 83%. Long-context benchmarks are its standout: near-perfect recall at 1M tokens and good recall at 5M. The Open LLM Leaderboard shows competitive MoE performance.

Research Meta

Last Evaluated

2026-03-15

Source Confidence

85%

Evaluation Method

Open LLM Leaderboard, LMSYS Arena, long-context benchmarks, self-hosting evaluation

Needs Re-evaluation

Sources

Meta Llama 4 technical report
Open LLM Leaderboard
LMSYS Chatbot Arena
Together AI benchmarks
