The State Of LLMs 2025: Progress, Problems, and Predictions
Date: Dec 30, 2025
URL: https://magazine.sebastianraschka.com/p/state-of-llms-2025
Table of Contents
- 1. The Year of Reasoning, RLVR, and GRPO
- 1.1 The DeepSeek Moment
- 1.2 LLM Focus Points
- 2. GRPO, the Research Darling of the Year
- 3. LLM Architectures: A Fork in the Road?
- 4. It’s Also the Year of Inference-Scaling and Tool Use
- 5. Word of the Year: Benchmaxxing
- 6. AI for Coding, Writing, and Research
- 6.1 Coding
- 6.2 Codebases and Code Libraries
- 6.3 Technical Writing and Research
- 6.4 LLMs and Burnout
- 7. The Edge: Private Data
- 8. Building LLMs and Reasoning Models From Scratch
- 9. Surprises in 2025 and Predictions for 2026
- 9.1 Noteworthy and Surprising Things in 2025
- 9.2 Predictions for 2026
- Bonus: A Curated LLM Research Papers List
Key Takeaways
1. The DeepSeek Moment (January 2025)
- DeepSeek R1 released as open-weight model comparable to best proprietary models
- Training cost estimated ~$5M (much cheaper than previously assumed $50-500M)
- Introduced RLVR (Reinforcement Learning with Verifiable Rewards) with GRPO algorithm
- “Verifiable” = deterministic correctness labels (math, code)
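The "verifiable" part can be made concrete with a small sketch (illustrative only; the function names and the `solve` entry point are assumptions, not DeepSeek's actual pipeline): rewards come from deterministic checks such as exact-match answers or passing tests, rather than from a learned reward model.

```python
# Illustrative sketch of "verifiable" rewards for RLVR (not DeepSeek's code):
# correctness is a deterministic check, so there is no reward model to game.

def math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the completion's last line matches the gold answer."""
    predicted = completion.strip().splitlines()[-1].strip()
    return 1.0 if predicted == gold_answer else 0.0

def code_reward(program: str, tests: list[tuple[int, int]]) -> float:
    """Return 1.0 only if the generated function passes every test case."""
    namespace: dict = {}
    try:
        exec(program, namespace)   # run the generated program
        f = namespace["solve"]     # assumed (hypothetical) entry-point name
        return 1.0 if all(f(x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0                 # crashes and wrong formats score zero
```

Because these checks are binary and automatic, they scale to millions of rollouts, which is what makes RLVR practical for math and code domains.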
2. LLM Development Focus by Year
- 2022: RLHF + PPO
- 2023: LoRA SFT
- 2024: Mid-Training
- 2025: RLVR + GRPO
- 2026 prediction: RLVR extensions + inference-time scaling
- 2027 prediction: Continual learning
3. GRPO Improvements (Research Darling)
- OLMo 3 adopted:
  - Zero-gradient-signal filtering (DAPO)
  - Active sampling
  - Token-level loss
  - No KL loss
  - Clip higher
  - Truncated importance sampling
  - No standard-deviation normalization
- DeepSeek V3.2 adopted:
  - KL tuning with domain-specific strengths
  - Reweighted KL
  - Off-policy sequence masking
  - Keep sampling mask for top-p/top-k
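To show what some of these variants actually modify, here is a minimal sketch of GRPO's core idea, the group-relative advantage (simplified; it ignores the token-level loss, clipping, and KL terms). The `normalize_std=False` path corresponds to the "no standard deviation normalization" tweak mentioned above.

```python
# Sketch of GRPO's group-relative advantage (simplified, illustrative only).
# For each prompt, sample G completions, score them with a verifiable
# reward, and use the group's own statistics as the baseline.

def grpo_advantages(rewards, normalize_std=True, eps=1e-6):
    """rewards: scalar rewards for the G completions of one prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    centered = [r - mean for r in rewards]
    if not normalize_std:
        # "No standard deviation normalization" variant: keeps raw reward
        # gaps instead of amplifying noise on near-uniform groups.
        return centered
    std = (sum(c * c for c in centered) / g) ** 0.5
    return [c / (std + eps) for c in centered]
```

The appeal is that the group mean replaces a learned value function, so no critic network is needed; most of the 2025 variants adjust how (or whether) the centered rewards are rescaled and masked.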
4. Architecture Trends
- MoE (Mixture-of-Experts) now standard for large models
- Efficiency tweaks: GQA, sliding-window attention, MLA
- Linear attention variants emerging: Gated DeltaNet (Qwen3-Next, Kimi Linear), Mamba-2 (Nemotron 3)
- Text diffusion models gaining traction (LLaDA 2.0 at 100B params)
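A minimal sketch of the GQA head-sharing idea mentioned above (shapes and names are illustrative; real implementations batch this as tensor operations): with fewer key/value heads than query heads, each KV head serves a whole group of query heads, shrinking the KV cache by that ratio.

```python
# Illustrative grouped-query attention (GQA) scores, pure Python for clarity.
# With n_q query heads and n_kv < n_q key/value heads, each group of
# n_q // n_kv query heads shares one KV head, cutting KV-cache memory.
import math

def gqa_scores(q, k):
    """q: list of n_q query vectors; k: list of n_kv key sequences."""
    n_q, n_kv = len(q), len(k)
    group = n_q // n_kv              # query heads per shared KV head
    d = len(q[0])
    out = []
    for h in range(n_q):
        keys = k[h // group]         # all heads in a group read the same keys
        out.append([sum(a * b for a, b in zip(q[h], key)) / math.sqrt(d)
                    for key in keys])
    return out                       # (n_q heads) x (sequence length) scores
```

The same sharing trick is why GQA (and, in the limit of one KV head, MQA) reduces inference memory with little quality loss.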
5. Inference-Time Scaling & Tool Use
- GPT-4.5 showed pure scaling has diminishing returns
- Better training pipelines + inference scaling drove progress
- Tool use significantly reduces hallucinations
- gpt-oss models designed with tool use in mind
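A toy sketch of why tool use grounds answers (the `CALL`/`RESULT` protocol and all names here are hypothetical, not gpt-oss's actual format): instead of letting the model guess a fact, a controller loop detects a tool request, runs the tool, and feeds the result back into the context.

```python
# Minimal illustrative tool-use loop (all names hypothetical). The model
# either emits a final answer or a tool request; tool results are appended
# to the transcript so the next step can answer from real data.

def run_agent(model_step, tools, prompt, max_turns=5):
    """model_step: text-in/text-out LLM stub; tools: name -> callable."""
    transcript = prompt
    for _ in range(max_turns):
        reply = model_step(transcript)
        if reply.startswith("CALL "):          # toy protocol: "CALL name arg"
            _, name, arg = reply.split(" ", 2)
            transcript += f"\n{reply}\nRESULT {tools[name](arg)}"
        else:
            return reply                       # final, tool-grounded answer
    return reply                               # give up after max_turns
```

The hallucination reduction comes from the `RESULT` lines: the model conditions on verified outputs (calculator, search, code execution) instead of recalling facts from weights.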
6. “Benchmaxxing” Problem
- Llama 4 scored well on benchmarks but disappointed in practice
- Benchmark numbers no longer trustworthy indicators
- Useful as minimum thresholds, not for comparing models above threshold
7. AI for Coding/Writing/Research
- LLMs as “superpowers” for productivity
- Author writes core code himself, uses LLM for boilerplate
- Expert-crafted codebases not replaced by pure LLM generation
- Warning: Overusing LLMs can lead to burnout (hollow work)
- Chess analogy: AI as partner, not replacement
8. Private Data as Edge
- Companies declining to sell proprietary data to LLM providers
- LLM development becoming commoditized
- Prediction: Companies will develop in-house LLMs using private data
9. 2025 Surprises
- Gold-level math performance earlier than expected
- Llama fell out of favor; Qwen overtook in popularity
- Mistral 3 uses DeepSeek V3 architecture
- New contenders: Kimi, GLM, MiniMax, Yi
- MCP became standard for tool access
- OpenAI released open-weight model (gpt-oss)
10. 2026 Predictions
- Consumer-facing diffusion model (Gemini Diffusion first)
- Open-weight community adopts local tool use
- RLVR expands beyond math/code (chemistry, biology)
- Classical RAG fades as long-context handling improves
- Progress from tooling/inference rather than training alone
Article Structure Analysis
Length: ~6,500 words
Sections Pattern:
- Chronological narrative (starts with January 2025)
- Technical deep dives with figures
- Personal reflections (coding, writing, burnout)
- Industry analysis (private data, commoditization)
- Personal updates (books, research)
- Predictions (concrete, numbered)
- Bonus content for paid subscribers
Visual Elements:
- 22+ figures referenced
- Comparison tables
- Architecture diagrams
- Training loss curves
- Code snippets
- Book covers
Cross-references:
- Links to previous Ahead of AI articles
- Links to author’s books
- Links to GitHub repositories
Tone:
- First person, conversational
- Honest about limitations (“no one knows what these might look like”)
- Practical advice mixed with technical content
- Personal anecdotes (consulting, writing process)