N
NexusRoute
Back to Models

o3

OpenAIspecialized

OpenAI's full-power reasoning model. Uses extended chain-of-thought to solve the hardest problems in math, science, and formal logic. Slower and more expensive than standard models but achieves state-of-the-art accuracy on competition-level benchmarks. Best reserved for genuinely hard reasoning tasks.

Released 2025-04-16Knowledge cutoff: 2025-06
Medium confidence|Updated 72d ago|91% source confidence

Specifications

Context Window

200K tokens

Max Output

100K tokens

Input Price

$1.00 / 1M tokens

Output Price

$4.00 / 1M tokens

Latency Tier

Slow (speed score: 3.5/10)

Capability Profile

Reasoning
10/10
Factuality
9.5/10
Coding
9/10
Safety & Enterprise
8.5/10
Structured Output
7.5/10
Instruction Following
7.5/10
Long Context
7/10
Tool Use
7/10
Creativity
6.5/10
Multimodal
5/10
Cost Efficiency
5/10
Conversational
5/10
Speed
3.5/10

Feature Support

Vision Yes
Audio In No
Audio Out No
Video No
Image Generation No
Image Editing No
Function Calling Yes
JSON Mode Yes
Structured Output Yes
Streaming Yes
Reasoning Yes
Realtime No
Computer Use No
Web Search No

Best Use Cases

Competition-level math (AIME, AMC, Putnam-style problems)
Formal logic, theorem proving, and abstract reasoning
PhD-level science questions requiring deep multi-step analysis
Code debugging that requires understanding complex state transitions
Any task where correctness matters far more than speed or cost

Not Ideal For

Casual chat or simple Q&A — massive overkill
High-throughput production workloads
Creative writing or conversational AI
Tasks where GPT-5.4 already gets the right answer
Real-time interactive applications (latency is 10-90s)

Strengths

Highest reasoning accuracy of any publicly available model
Self-corrects through internal chain-of-thought verification
Near-perfect on GSM8K, strong on AIME and GPQA Diamond
Can solve problems that stump all other models including GPT-5.4

Weaknesses

Very slow — 10 to 90 seconds for complex queries
Internal reasoning tokens are billed but invisible to the user
Overthinks simple problems, wasting time and money
Conversational ability is stilted and unnatural
Tool use is functional but less polished than GPT-5.4

Edge Cases & Notes

Reasoning tokens count toward output cost but are hidden — effective cost is often 3-5x the visible output
Adjustable reasoning_effort parameter (low/medium/high) controls depth-cost tradeoff
At 'low' reasoning effort, behaves more like a fast model but loses its core advantage
Vision support exists but is secondary — reasoning about images is slower and less reliable

Provider Notes

Use only when the task genuinely requires deep reasoning. For most coding and general tasks, GPT-5.4 or o4-mini are better value. Available through OpenAI API with Tier 3+ access.

Benchmarks

MMLU93.2%
HumanEval93.8%
Arena Elo1395
GSM8K99.1%

Benchmark Notes

GSM8K near-saturated at 99.1%. AIME 2025 score ~85%. GPQA Diamond >70%. Top reasoning model on most hard benchmarks. Arena Elo reflects reasoning tasks specifically.

Research Meta

Last Evaluated

2026-03-15

Source Confidence

91%

Evaluation Method

AIME 2025, GPQA Diamond, GSM8K, MATH, SWE-bench, LMSYS Arena (reasoning category)

Needs Re-evaluation

No

Sources

  • OpenAI o3 system card
  • LMSYS Chatbot Arena
  • AIME 2025 evaluation results
  • Independent reasoning benchmarks