o3

Name: o3
Price: 1 USD
Author: OpenAI

OpenAIspecialized

OpenAI's full-power reasoning model. Uses extended chain-of-thought to solve the hardest problems in math, science, and formal logic. Slower and more expensive than standard models but achieves state-of-the-art accuracy on competition-level benchmarks. Best reserved for genuinely hard reasoning tasks.

Released 2025-04-16Knowledge cutoff: 2025-06

Needs review|Updated 118d ago|91% source confidence

Specifications

Context Window

200K tokens

Max Output

100K tokens

Input Price

$1.00 / 1M tokens

Output Price

$4.00 / 1M tokens

Latency Tier

Slow (speed score: 3.5/10)

Capability Profile

Reasoning

10/10

Factuality

9.5/10

Coding

9/10

Safety & Enterprise

8.5/10

Structured Output

7.5/10

Instruction Following

7.5/10

Long Context

7/10

Tool Use

7/10

Creativity

6.5/10

Multimodal

5/10

Cost Efficiency

5/10

Conversational

5/10

Speed

3.5/10

Feature Support

Vision Yes

Audio In No

Audio Out No

Video No

Image Generation No

Image Editing No

Function Calling Yes

JSON Mode Yes

Structured Output Yes

Streaming Yes

Reasoning Yes

Realtime No

Computer Use No

Web Search No

Best Use Cases

Competition-level math (AIME, AMC, Putnam-style problems)

Formal logic, theorem proving, and abstract reasoning

PhD-level science questions requiring deep multi-step analysis

Code debugging that requires understanding complex state transitions

Any task where correctness matters far more than speed or cost

Not Ideal For

Casual chat or simple Q&A — massive overkill

High-throughput production workloads

Creative writing or conversational AI

Tasks where GPT-5.4 already gets the right answer

Real-time interactive applications (latency is 10-90s)

Strengths

Highest reasoning accuracy of any publicly available model

Self-corrects through internal chain-of-thought verification

Near-perfect on GSM8K, strong on AIME and GPQA Diamond

Can solve problems that stump all other models including GPT-5.4

Weaknesses

Very slow — 10 to 90 seconds for complex queries

Internal reasoning tokens are billed but invisible to the user

Overthinks simple problems, wasting time and money

Conversational ability is stilted and unnatural

Tool use is functional but less polished than GPT-5.4

Edge Cases & Notes

Reasoning tokens count toward output cost but are hidden — effective cost is often 3-5x the visible output

Adjustable reasoning_effort parameter (low/medium/high) controls depth-cost tradeoff

At 'low' reasoning effort, behaves more like a fast model but loses its core advantage

Vision support exists but is secondary — reasoning about images is slower and less reliable

Provider Notes

Use only when the task genuinely requires deep reasoning. For most coding and general tasks, GPT-5.4 or o4-mini are better value. Available through OpenAI API with Tier 3+ access.

Benchmarks

MMLU93.2%

HumanEval93.8%

Arena Elo1395

GSM8K99.1%

Benchmark Notes

GSM8K near-saturated at 99.1%. AIME 2025 score ~85%. GPQA Diamond >70%. Top reasoning model on most hard benchmarks. Arena Elo reflects reasoning tasks specifically.

Research Meta

Last Evaluated

2026-03-15

Source Confidence

91%

Evaluation Method

AIME 2025, GPQA Diamond, GSM8K, MATH, SWE-bench, LMSYS Arena (reasoning category)

Needs Re-evaluation

Sources

OpenAI o3 system card
LMSYS Chatbot Arena
AIME 2025 evaluation results
Independent reasoning benchmarks

Continue exploring

Route a prompt

See how o3 ranks

Compare models

Side-by-side analysis

Browse registry

Explore all 24 models