Grok 4.20

xAI | frontier

xAI's flagship model, with a 2M-token context window, strong reasoning capabilities, and vision support. Known for its straightforward, less filtered conversational style and real-time information access through X integration. Competitive with GPT-5.4 on reasoning benchmarks.

Released: 2026-01-08 | Knowledge cutoff: 2025-12

Medium confidence | Updated 72d ago | 82% source confidence

Specifications

Context Window: 2M tokens
Max Output: 128K tokens
Input Price: $2.00 / 1M tokens
Output Price: $6.00 / 1M tokens
Latency Tier: Fast (speed score: 7/10)
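
The listed prices translate into per-request cost with simple arithmetic. The sketch below is illustrative only: the helper name `estimate_cost` and the constant names are hypothetical, and it assumes flat per-token billing at the listed rates with no caching or batch discounts.

```python
# Listed prices from the specifications (USD per 1M tokens).
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 6.00
TOKENS_PER_M = 1_000_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes flat per-token billing at the listed rates.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens / TOKENS_PER_M * INPUT_PRICE_PER_M
    output_cost = output_tokens / TOKENS_PER_M * OUTPUT_PRICE_PER_M
    return input_cost + output_cost


# Example: a 1M-token document with a 10K-token answer.
print(f"${estimate_cost(1_000_000, 10_000):.2f}")  # prints "$2.06"
```

At these rates, a prompt that fills the whole 2M-token window costs about $4.00 in input alone, so long-context workloads are dominated by input cost.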

Capability Profile

Long Context: 10/10
Reasoning: 9/10
Coding: 8.5/10
Conversational: 8.5/10
Structured Output: 8/10
Creativity: 8/10
Factuality: 8/10
Instruction Following: 8/10
Tool Use: 7.5/10
Speed: 7/10
Cost Efficiency: 7/10
Multimodal: 6.5/10
Safety & Enterprise: 6/10

Feature Support

Vision: Yes
Audio In: No
Audio Out: No
Video: No
Image Generation: No
Image Editing: No
Function Calling: Yes
JSON Mode: Yes
Structured Output: Yes
Streaming: Yes
Reasoning: Yes
Realtime: No
Computer Use: No
Web Search: No

Best Use Cases

Ultra-long document analysis leveraging the 2M context window

Reasoning-heavy tasks requiring strong analytical capability

Applications wanting a more direct, less filtered conversational AI

Real-time information synthesis through X/web integration

Large codebase analysis that benefits from massive context

Not Ideal For

Enterprise deployments requiring strict safety guardrails

Regulated industries with content policy requirements

Audio or video processing (image input is the only supported non-text modality)

Applications requiring the broadest ecosystem of integrations

Strengths

Largest context window available (2M tokens), with near-perfect needle-in-haystack recall at 1M tokens

Strong reasoning — competitive with o4-mini on many tasks

More permissive content policy than OpenAI or Anthropic models

Competitive pricing — $2/$6 is below GPT-5.4 and Claude Opus

Real-time information through X integration is unique

Good at code generation and analysis

Weaknesses

Weaker safety alignment may be a liability for enterprise/regulated use

Coding quality trails Claude Opus/Sonnet 4.6 on SWE-bench (~48%)

Smaller ecosystem — fewer integrations, libraries, and third-party support

No audio processing

API reliability and rate limits are less mature than OpenAI/Anthropic

Structured output compliance is good but not best-in-class

Edge Cases & Notes

2M context is genuine but quality does degrade at the extremes (past 1.5M)

X real-time data can introduce noise and requires careful prompt design

Content policy is notably more permissive — this can be a feature or a risk depending on use case

Rate limits may be more restrictive than established providers during peak usage
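
The context-length caveat above can be turned into a simple pre-flight check. This is a minimal sketch, not part of any xAI SDK: the function name `classify_prompt_size` and its labels are hypothetical, and the token count is assumed to come from whatever tokenizer the caller already uses.

```python
# Figures from this card: 2M-token window, quality degrading past 1.5M.
CONTEXT_WINDOW = 2_000_000
DEGRADATION_THRESHOLD = 1_500_000


def classify_prompt_size(prompt_tokens: int) -> str:
    """Label a prompt by where it falls in the context window (hypothetical helper)."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > CONTEXT_WINDOW:
        return "too_large"  # will not fit; split or summarize first
    if prompt_tokens > DEGRADATION_THRESHOLD:
        return "degraded"   # fits, but recall may suffer at this length
    return "ok"


for size in (1_000_000, 1_800_000, 2_500_000):
    print(size, classify_prompt_size(size))
```

A caller might route "degraded" prompts through chunking or retrieval rather than sending them whole.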

Provider Notes

Available through the xAI API. X/Twitter integration is available but optional. API maturity is improving but still trails OpenAI and Anthropic on features such as batching and caching.

Benchmarks

MMLU: 91%
HumanEval: 89.5%
Arena Elo: 1375

Benchmark Notes

MMLU-Pro 91%. Strong LMSYS Arena showing, especially in reasoning and conversational categories. SWE-bench ~48%. Long-context benchmarks are its standout — near-perfect needle-in-haystack at 1M tokens.

Research Meta

Last Evaluated: 2026-03-15
Source Confidence: 82%
Evaluation Method: LMSYS Arena, MMLU-Pro, long-context benchmarks, SWE-bench, conversational evaluation

Needs Re-evaluation

Sources

xAI Grok 4.20 announcement
LMSYS Chatbot Arena
Independent long-context evaluations
