LLM Research Papers: The 2025 List (July to December)

Date: DEC 30, 2025

URL: https://magazine.sebastianraschka.com/p/llm-research-papers-2025-part2


In June, I shared a bonus article containing my curated and bookmarked research paper lists with the paid subscribers who make this Substack possible.

In a similar vein, as a thank-you to all the kind supporters, I have prepared a list below of the interesting research articles I bookmarked and categorized from July to December 2025.

I skimmed the abstracts of these papers but read only a small fraction in full. However, I still like to keep these organized lists, as I often go back to them when working on a given project.

By the way, I was also working on my annual LLM review article, State of LLMs 2025: Progress, Problems, and Predictions, which I published today as well. You can find it here:

The State of LLMs 2025: Progress, Problems, and Predictions SEBASTIAN RASCHKA, PHD · DECEMBER 30, 2025 Read full story

Originally, I planned to include this list in the article above. However, the article was already getting quite long, so I decided to share the list here in a separate post instead. I hope you do not mind receiving two emails today. My thinking was that splitting things up would make both articles easier to read, scan, and revisit later without getting lost in an overly long page.

The categories for this research paper list are as follows (you can use the table of contents in the web view of this article to navigate to them directly):

1. Reasoning Models

1a. Training Reasoning Models

1b. Inference-Time Reasoning Strategies

1c. Evaluating LLMs and/or Understanding Reasoning

2. Other Reinforcement Learning Methods for LLMs

3. Other Inference-Time Scaling Methods

4. Model Releases / Technical Reports

5. Architectures

6. Efficient Training

7. Diffusion-Based Language Models

8. Multimodal & Vision-Language Models

9. Data & Pre-training Datasets

1. Reasoning Models

As you may be able to tell from the three subsections in the “Reasoning Models” category of the table of contents, this year's list is very heavy on reasoning models. This is because most of my work, including my recent book, centers on reasoning models.

Also, as I mentioned in my State of LLMs 2025 report, reasoning methods have been one of the biggest themes and drivers of LLM progress this year (kickstarted by DeepSeek R1 in January 2025).

So, I decided to subdivide it into three categories: training, inference-time scaling, and more general understanding/evaluation.

1a. Training Reasoning Models

This subsection focuses on training strategies designed to improve reasoning abilities in LLMs. In the first half of the year (January to June), much of the momentum centered around reinforcement learning with verifiable rewards (RLVR). For more background on RLVR, you might like my The State of Reinforcement Learning for LLM Reasoning article:

The State of Reinforcement Learning for LLM Reasoning SEBASTIAN RASCHKA, PHD · APRIL 19, 2025 Read full story

In the second half of the year, RL is still a major theme, but the emphasis has shifted from “RLVR works” to “how do we make it scale and generalize”. This includes papers on better exploration and credit assignment, more stable and efficient optimization, extensions to long-context and agentic settings, and even RL-style training beyond strictly verifiable domains.
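To make the RLVR idea concrete: its defining feature is that the reward comes from a deterministic check rather than a learned reward model. Below is a minimal sketch of what a verifiable reward for math problems might look like; the "Answer: ..." convention and the regex-based parser are simplifying assumptions for illustration, not the setup of any specific paper listed here.

```python
import re

def extract_final_answer(response: str) -> str | None:
    # Assumed convention: the model is prompted to end its reasoning
    # with a line such as "Answer: 42". Real pipelines use far more
    # robust parsing and answer normalization.
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    return match.group(1) if match else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    # Binary verifiable reward: 1.0 if the extracted final answer
    # matches the known-correct solution, 0.0 otherwise.
    answer = extract_final_answer(response)
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("7 + 35 = 42. Answer: 42", "42"))  # 1.0
```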

Figure 1: Annotated figure from Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

18 Dec, Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning, https://arxiv.org/abs/2512.16917

18 Dec, INTELLECT-3: Technical Report, https://arxiv.org/abs/2512.16144

14 Dec, QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management, https://arxiv.org/abs/2512.12967

8 Dec, Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning, https://arxiv.org/abs/2512.07461

4 Dec, Efficient Reinforcement Learning with Semantic and Token Entropy for LLM Reasoning, https://arxiv.org/abs/2512.04359

3 Dec, PretrainZero: Reinforcement Active Pretraining, https://arxiv.org/abs/2512.03442

17 Nov, P1: Mastering Physics Olympiads with Reinforcement Learning, https://arxiv.org/abs/2511.13612

11 Nov, SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization, https://arxiv.org/abs/2511.06411

9 Nov, Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B, https://arxiv.org/abs/2511.06221

28 Oct, Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning, https://arxiv.org/abs/2510.23038

2 Oct, ExGRPO: Learning to Reason from Experience, https://arxiv.org/abs/2510.02245

29 Sep, ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory, https://arxiv.org/abs/2509.25140

29 Sep, DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search, https://arxiv.org/abs/2509.25454

24 Sep, Language Models that Think, Chat Better, https://arxiv.org/abs/2509.20357

14 Sep, Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning, https://arxiv.org/abs/2509.11420

9 Sep, K2-Think: A Parameter-Efficient Reasoning System, https://arxiv.org/abs/2509.07604

3 Sep, Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning, https://arxiv.org/abs/2509.03646

28 Aug, rStar2-Agent: Agentic Reasoning Technical Report, https://arxiv.org/abs/2508.20722

27 Aug, Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning, https://arxiv.org/abs/2508.19828

14 Aug, SSRL: Self-Search Reinforcement Learning, https://arxiv.org/abs/2508.10874

7 Aug, Learning to Reason for Factuality, https://arxiv.org/abs/2508.05618

1 Aug, ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network, https://arxiv.org/abs/2508.00429

21 Jul, LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization, https://arxiv.org/abs/2507.15758

19 Jul, MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization, https://arxiv.org/abs/2507.14683

16 Jul, Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training, https://arxiv.org/abs/2507.12507

1b. Inference-Time Reasoning Strategies

This part of the list covers methods that improve reasoning dynamically at test time, without requiring retraining. Often, these papers focus on trading additional inference-time compute for better modeling performance.
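For example, self-consistency (one of the two strategies illustrated in Figure 2 below) simply samples several reasoning chains at a nonzero temperature and takes a majority vote over the final answers. Here is a minimal sketch, where generate and extract_answer are hypothetical stand-ins for a sampling-based LLM call and an answer parser:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    generate: Callable[[str], str],        # samples one response per call
    extract_answer: Callable[[str], str],  # parses the final answer
    prompt: str,
    num_samples: int = 8,
) -> str:
    # Sample multiple reasoning chains (temperature > 0), then return
    # the most common final answer across all samples.
    answers = [extract_answer(generate(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```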

Figure 2: Improvement via two types of inference scaling, (1) self-consistency and (2) self-refinement. Annotated figure from the DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning paper

15 Dec, Let’s (not) just put things in Context: Test-Time Training for Long-Context LLMs, https://arxiv.org/abs/2512.13898

1 Dec, The Art of Scaling Test-Time Compute for Large Language Models, https://arxiv.org/abs/2512.02008

27 Nov, DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning, https://arxiv.org/abs/2511.22570

11 Nov, Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models, https://arxiv.org/abs/2511.08577

16 Oct, Reasoning with Sampling: Your Base Model is Smarter Than You Think, https://arxiv.org/abs/2510.14901

7 Oct, MixReasoning: Switching Modes to Think, https://arxiv.org/abs/2510.06052

6 Oct, Less is More: Recursive Reasoning with Tiny Networks, https://arxiv.org/abs/2510.04871

30 Aug, ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute, https://arxiv.org/abs/2509.04475

25 Aug, MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains, https://arxiv.org/abs/2508.18260

22 Aug, Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling, https://arxiv.org/abs/2508.16745

21 Aug, Deep Think with Confidence, https://arxiv.org/abs/2508.15260

22 Jul, Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning, https://arxiv.org/abs/2507.16784

11 Jul, KV Cache Steering for Inducing Reasoning in Small Language Models, https://arxiv.org/abs/2507.08799

2 Jul, Test-Time Scaling with Reflective Generative Model, https://arxiv.org/abs/2507.01951

1c. Evaluating LLMs and/or Understanding Reasoning

In this section, I collected papers that try to analyze (or evaluate) reasoning models in various ways, which is useful for refining and improving the current generation of LLM-based reasoning models.

Figure 3: Annotated figure from the On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models paper

19 Dec, When Reasoning Meets Its Laws, https://arxiv.org/abs/2512.17901

16 Dec, Universal Reasoning Model, https://arxiv.org/abs/2512.14693

11 Dec, The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality, https://arxiv.org/abs/2512.10791

8 Dec, On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models, https://arxiv.org/abs/2512.07783

26 Nov, How to Correctly Report LLM-as-a-Judge Evaluations, https://arxiv.org/abs/2511.21140

24 Nov, Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?, https://arxiv.org/abs/2504.13837

24 Nov, What does it mean to understand language?, https://arxiv.org/abs/2511.19757

17 Nov, On the Fundamental Limits of LLMs at Scale, https://arxiv.org/abs/2511.12869

11 Nov, The Path Not Taken: RLVR Provably Learns Off the Principals, https://arxiv.org/abs/2511.08567

3 Nov, Towards Robust Mathematical Reasoning, https://arxiv.org/abs/2511.01846

28 Oct, Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs, https://arxiv.org/abs/2510.27246

15 Oct, LLMs Can Get “Brain Rot”!, https://arxiv.org/abs/2510.13928

13 Oct, Demystifying Reinforcement Learning in Agentic Reasoning, https://arxiv.org/abs/2510.11701

29 Sep, Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training, https://arxiv.org/abs/2509.25758

26 Sep, When Does Reasoning Matter? A Controlled Study of Reasoning’s Contribution to Model Performance, https://arxiv.org/abs/2509.22193

25 Sep, TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them, https://arxiv.org/abs/2509.21117

12 Sep, Is In-Context Learning Learning?, https://arxiv.org/abs/2509.10414

11 Sep, The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs, https://arxiv.org/abs/2509.09677

11 Sep, All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens, https://arxiv.org/abs/2509.09650

7 Sep, Reverse-Engineered Reasoning for Open-Ended Generation, https://arxiv.org/abs/2509.06160

5 Sep, Talk Isn’t Always Cheap: Understanding Failure Modes in Multi-Agent Debate, https://arxiv.org/abs/2509.05396

4 Sep, Why Language Models Hallucinate, https://arxiv.org/abs/2509.04664

28 Aug, On the Theoretical Limitations of Embedding-Based Retrieval, https://arxiv.org/abs/2508.21038

25 Aug, UQ: Assessing Language Models on Unsolved Questions, https://arxiv.org/abs/2508.17580

18 Aug, Has GPT-5 Achieved Spatial Intelligence? An Empirical Study, https://arxiv.org/abs/2508.13142

15 Aug, When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs, https://arxiv.org/abs/2508.11383

14 Aug, Why Cannot Large Language Models Ever Make True Correct Reasoning?, https://arxiv.org/abs/2508.10265

11 Aug, Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning, https://arxiv.org/abs/2508.08221

2 Aug, Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens, https://arxiv.org/abs/2508.01191

21 Jul, Learning without training: The implicit dynamics of in-context learning, https://arxiv.org/abs/2507.16003

19 Jul, Inverse Scaling in Test-Time Compute, https://arxiv.org/abs/2507.14417

15 Jul, Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety, https://arxiv.org/abs/2507.11473

14 Jul, Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination, https://arxiv.org/abs/2507.10532

11 Jul, One Token to Fool LLM-as-a-Judge, https://arxiv.org/abs/2507.08794

9 Jul, Rethinking Verification for LLM Code Generation: From Generation to Testing, https://arxiv.org/abs/2507.00885

8 Jul, A Survey on Latent Reasoning, https://arxiv.org/abs/2507.06203

1 Jul, Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, https://arxiv.org/abs/2507.00432

1 Jul, Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check, https://arxiv.org/abs/2507.00885

2. Other Reinforcement Learning Methods for LLMs

Beyond reasoning-focused reinforcement learning (RL), there is a broader body of RL work for LLMs. This includes RL for LLM alignment, optimization, and deployment at scale.

So, in this section, I list papers that explore preference modeling and reward design, alternatives and extensions to classical RLHF, and large-scale infrastructure and optimization techniques for training LLMs with RL.

All in all, these papers provide context for how reinforcement learning is evolving in the LLM landscape beyond directly improving reasoning ability.
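To make the preference-modeling theme concrete, below is a minimal sketch of the classic Bradley-Terry pairwise loss used to train RLHF-style reward models. This is the textbook formulation, not the method of any specific paper listed here:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(
    chosen_rewards: torch.Tensor,    # reward-model scores for preferred responses
    rejected_rewards: torch.Tensor,  # scores for dispreferred responses
) -> torch.Tensor:
    # Bradley-Terry objective: maximize the log-probability that the
    # reward model scores the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with dummy scores for a batch of 4 preference pairs:
loss = pairwise_preference_loss(torch.randn(4), torch.randn(4))
```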

Figure 4: Annotated figure from The Art of Scaling Reinforcement Learning Compute for LLMs

23 Dec, Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning, https://arxiv.org/abs/2512.20605

24 Nov, DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research, https://arxiv.org/abs/2511.19399

20 Nov, Evolution Strategies at the Hyperscale, https://arxiv.org/abs/2511.16652

18 Nov, Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning, https://arxiv.org/abs/2511.14617

10 Nov, GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning, https://arxiv.org/abs/2511.11653

15 Oct, The Art of Scaling Reinforcement Learning Compute for LLMs, https://arxiv.org/abs/2510.13786

9 Oct, Don’t Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting, https://arxiv.org/abs/2510.08696

8 Oct, Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense, https://arxiv.org/abs/2510.07242

7 Oct, Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels, https://arxiv.org/abs/2510.06499

6 Oct, Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning, https://arxiv.org/abs/2510.04786

25 Sep, Tree Search for LLM Agent Reinforcement Learning, https://arxiv.org/abs/2509.21240

23 Sep, Reinforcement Learning on Pre-Training Data, https://arxiv.org/abs/2509.19249

28 Aug, Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning, https://arxiv.org/abs/2508.20751

7 Aug, On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification, https://arxiv.org/abs/2508.05629

30 Jul, Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning, https://arxiv.org/abs/2507.22565

26 Jul, Agentic Reinforced Policy Optimization, https://arxiv.org/abs/2507.19849

25 Jul, Group Sequence Policy Optimization, https://arxiv.org/abs/2507.18071

8 Jul, The Delta Learning Hypothesis: Preference Tuning on Weak Data can Yield Strong Gains, https://arxiv.org/abs/2507.06187

7 Jul, Pre-Trained Policy Discriminators are General Reward Models, https://arxiv.org/abs/2507.05197

3. Other Inference-Time Scaling Methods

Just as reinforcement learning is not limited to improving reasoning, inference-time scaling is not only about improved reasoning either. It is also a general approach for improving LLM efficiency and deployment-time performance.

This section covers more general inference-time techniques (not specific to reasoning), such as adaptive routing and agent orchestration, memory and KV-cache management, and decoding strategies that improve quality or efficiency without additional training.
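As an illustration of the routing theme, here is a minimal sketch of budget-aware routing between a cheap and an expensive model. The difficulty threshold and cost values are made-up placeholders, not taken from the papers below; in practice, the difficulty score would come from a learned router.

```python
def route_query(difficulty_score: float, budget_left: float,
                expensive_cost: float = 10.0) -> str:
    # Route to the expensive model only if the query looks hard *and*
    # the remaining budget allows it; otherwise use the cheap model.
    if difficulty_score > 0.7 and budget_left >= expensive_cost:
        return "expensive-model"
    return "cheap-model"

print(route_query(difficulty_score=0.9, budget_left=50.0))  # expensive-model
print(route_query(difficulty_score=0.9, budget_left=5.0))   # cheap-model
```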

Figure 5: Annotated figure from XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

30 Oct, Context Engineering 2.0: The Context of Context Engineering, https://arxiv.org/abs/2510.26493

2 Oct, The Unreasonable Effectiveness of Scaling Agents for Computer Use, https://arxiv.org/abs/2510.02250

30 Aug, Universal Deep Research: Bring Your Own Model and Strategy, https://arxiv.org/abs/2509.00244

28 Aug, Adaptive LLM Routing under Budget Constraints, https://arxiv.org/abs/2508.21141

18 Aug, Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing, https://arxiv.org/abs/2508.12631

14 Aug, XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization, https://arxiv.org/abs/2508.10395

12 Aug, A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models, https://arxiv.org/abs/2508.08712

6 Jul, LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers, https://arxiv.org/abs/2507.04404

4. Model Releases / Technical Reports

This section collects papers on new LLM releases and architectural directions.

The included papers span pure transformer models (Olmo 3) as well as sparse and hybrid architectures (for example, the sparse attention mechanism in DeepSeek V3.2 and Nemotron 3 Nano’s Mamba-Transformer design).

Figure 6: The NVIDIA Nemotron 3 Nano transformer-mamba hybrid architecture described in Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

23 Dec, Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning, https://arxiv.org/abs/2512.20848

15 Dec, Olmo 3, https://arxiv.org/abs/2512.13961

2 Dec, DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models, https://arxiv.org/abs/2512.02556

30 Oct, Kimi Linear: An Expressive, Efficient Attention Architecture, https://arxiv.org/abs/2510.26692

30 Sep, CWM: An Open-Weights LLM for Research on Code Generation with World Models, https://arxiv.org/abs/2510.02387

17 Sep, Apertus: Democratizing Open and Compliant LLMs for Global Language Environments, https://arxiv.org/abs/2509.14233

25 Aug, Hermes 4 Technical Report, https://arxiv.org/abs/2508.18255

8 Aug, GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models, https://arxiv.org/abs/2508.06471

2 Aug, Motif 2.6B Technical Report, https://arxiv.org/abs/2508.09148

28 Jul, Kimi K2: Open Agentic Intelligence, https://arxiv.org/abs/2507.20534

17 Jul, Apple Intelligence Foundation Language Models: Tech Report 2025, https://arxiv.org/abs/2507.13575

5. Architectures

In this section, I collected architectural and training-time papers aimed at improving efficiency and other aspects that are not necessarily tied to a big open-weight model release.

This covers papers that explore alternatives to standard dense attention, such as sliding-window and hybrid attention schemes for long contexts, as well as normalization-free and energy-based training approaches.
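To illustrate the sliding-window idea just mentioned: instead of letting each token attend to all previous tokens, attention is restricted to the most recent w positions, which caps the KV-cache size. A minimal sketch of the corresponding attention mask:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = attention allowed. Token i may attend to token j only if
    # j <= i (causal) and j > i - window (within the sliding window).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=5, window=2).int())
```

With window=2, the last token can only attend to itself and its immediate predecessor, regardless of sequence length.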

Figure 7: Annotated figure from Stronger Normalization-Free Transformers

31 Dec, mHC: Manifold-Constrained Hyper-Connections, https://arxiv.org/abs/2512.24880

11 Dec, Sliding Window Attention Adaptation, https://arxiv.org/abs/2512.10411

11 Dec, Stronger Normalization-Free Transformers, https://arxiv.org/abs/2512.10938

20 Nov, Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs, https://arxiv.org/abs/2511.16664

12 Nov, DoPE: Denoising Rotary Position Embedding, https://arxiv.org/abs/2511.09146

22 Oct, Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning, https://arxiv.org/abs/2510.19338

14 Oct, Dr.LLM: Dynamic Layer Routing in LLMs, https://arxiv.org/abs/2510.12773

14 Jul, Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation, https://arxiv.org/abs/2507.10524

10 Jul, Dynamic Chunking for End-to-End Hierarchical Sequence Modeling, https://arxiv.org/abs/2507.07955

2 Jul, Energy-Based Transformers are Scalable Learners and Thinkers, https://arxiv.org/abs/2507.02092

6. Efficient Training

This section covers (efficient) training techniques that did not fit neatly into the previous categories: low-rank adaptation, quantization, and large-scale training infrastructure.
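Since low-rank adaptation is one of the themes here, below is a minimal sketch of a standard LoRA layer: the pretrained weight stays frozen, and only a low-rank update (scaled by alpha/r) is trained. This is the generic formulation, not, say, the single-matrix variant from the SingLoRA paper below:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # y = base(x) + (alpha / r) * x @ A^T @ B^T, with the base weights
    # frozen and only the low-rank factors A and B trained.
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 512))  # gradients flow only into A and B
```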

Figure 8: Annotated figure from Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful

9 Oct, Optimal Scaling Needs Optimal Norm, https://arxiv.org/abs/2510.03871

29 Sep, Pretraining Large Language Models with NVFP4, https://arxiv.org/abs/2509.25149

18 Sep, Pre-training under infinite compute, https://arxiv.org/abs/2509.14786

5 Sep, Scaling Performance of Large Language Model Pretraining, https://arxiv.org/abs/2509.05258

9 Jul, Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful, https://arxiv.org/abs/2507.07101

8 Jul, SingLoRA: Low Rank Adaptation Using a Single Matrix, https://arxiv.org/abs/2507.05566

7. Diffusion-Based Language Models

Of course, many researchers are looking for the next thing after transformer-based LLMs. Last year, state space models were considered a hot candidate (and they made a comeback this year in Nemotron 3 Nano’s Mamba-2 layers). This year, the focus was a bit more on language diffusion models. I recently discussed diffusion models for text data in my Beyond Standard LLMs article:

Beyond Standard LLMs SEBASTIAN RASCHKA, PHD · NOVEMBER 4, 2025 Read full story

Figure 9: Annotated figure from Diffusion Language Models are Super Data Learners
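At a high level, masked diffusion LLMs generate text by starting from a fully masked sequence and unmasking tokens in parallel over a fixed number of refinement steps, rather than decoding strictly left to right. Below is a heavily simplified conceptual sketch (assuming a model that maps a token sequence to per-position logits); it is meant to convey the idea, not the exact algorithm of any paper below:

```python
import torch

def masked_diffusion_decode(model, tokens: torch.Tensor, mask_id: int,
                            steps: int = 8) -> torch.Tensor:
    # tokens: 1D LongTensor, initially all mask_id. At each step, commit
    # the predictions the model is most confident about and leave the
    # rest masked for later refinement.
    for step in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)                  # (seq_len, vocab_size)
        conf, preds = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)  # only fill masked slots
        k = max(1, int(masked.sum().item()) // (steps - step))
        idx = conf.topk(k).indices
        tokens[idx] = preds[idx]
    return tokens
```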

17 Dec, DEER: Draft with Diffusion, Verify with Autoregressive Models, https://arxiv.org/abs/2512.15176

15 Dec, ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding, https://arxiv.org/abs/2512.13586

12 Nov, TiDAR: Think in Diffusion, Talk in Autoregression, https://arxiv.org/abs/2511.08923

5 Nov, Diffusion Language Models are Super Data Learners, https://arxiv.org/abs/2511.03276

6 Oct, ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs, https://arxiv.org/abs/2510.04767

29 Sep, Learning to Parallel: Accelerating Diffusion Large Language Models via Adaptive Parallel Decoding, https://arxiv.org/abs/2509.25188

8 Sep, Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models, https://arxiv.org/abs/2509.06949

27 Aug, Diffusion Language Models Know the Answer Before Decoding, https://arxiv.org/abs/2508.19982

4 Aug, Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference, https://arxiv.org/abs/2508.02193

1 Aug, Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models, https://arxiv.org/abs/2508.00819

21 Jul, Deep Researcher with Test-Time Diffusion, https://arxiv.org/abs/2507.16075

8. Multimodal & Vision-Language Models

Multimodal LLMs are a natural extension of text-based LLMs. Besides supporting data formats other than text, one big hope in the research community is that multimodality also unlocks more data that can be used during pre-training to make LLMs “smarter” and more knowledgeable in general. (I think this has not really panned out yet: yes, including images in pre-training is useful if you want your LLM to understand image inputs, but as far as I know, it has not had a big impact on, say, the text-based problem-solving capabilities of LLMs.)
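For context on how the image modality is typically wired in: most current multimodal LLMs encode images into patch embeddings and project them into the LLM's token-embedding space, so the decoder can treat them like ordinary tokens. A minimal sketch of that projection step, with all dimensions chosen purely for illustration:

```python
import torch
import torch.nn as nn

# Illustrative dimensions: a vision encoder emitting 1024-dim patch
# embeddings and an LLM with a 4096-dim token-embedding space.
vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)  # often a small MLP in practice

image_patches = torch.randn(1, 256, vision_dim)  # 256 patches from a ViT
text_embeds = torch.randn(1, 32, llm_dim)        # 32 embedded text tokens

# Project image patches into the LLM embedding space and prepend them,
# so the decoder processes them like regular input tokens.
image_tokens = projector(image_patches)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)  # (1, 288, 4096)
```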

Anyway, this section brings together research at the intersection of text, image, and video. If you are interested in multimodal LLMs, you may also find my introductory article helpful:

Understanding Multimodal LLMs SEBASTIAN RASCHKA, PHD · NOVEMBER 3, 2024 Read full story

Figure 10: Annotated figure from Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

26 Nov, Qwen3-VL Technical Report, https://arxiv.org/abs/2511.21631

20 Nov, SAM 3: Segment Anything with Concepts, https://arxiv.org/abs/2511.16719

20 Nov, Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation, https://arxiv.org/abs/2511.16671

17 Nov, Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework, https://arxiv.org/abs/2511.13189

14 Aug, We-Math 2.0: A Versatile MathBook System for Incentivizing Visual Mathematical Reasoning, https://arxiv.org/abs/2508.10433

14 Aug, NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale, https://arxiv.org/abs/2508.10711

11 Aug, Reinforcement Learning in Vision: A Survey, https://arxiv.org/abs/2508.08189

4 Aug, Qwen-Image Technical Report, https://arxiv.org/abs/2508.02324

21 Jul, GUI-G2: Gaussian Reward Modeling for GUI Grounding, https://arxiv.org/abs/2507.15846

10 Jul, Scaling RL to Long Videos, https://arxiv.org/abs/2507.07966

1 Jul, GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning, https://arxiv.org/abs/2507.01006

9. Data & Pre-training Datasets

Good models need good data. This final section collects papers focused on dataset creation, data quality, pre-training practices, and synthetic data generation.

As the saying goes, data work may not be the flashiest part of AI, but it’s some of the most essential.

Figure 11: Annotated figure from BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

26 Sep, In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners, https://arxiv.org/abs/2509.22230

14 Aug, BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining, https://arxiv.org/abs/2508.10975

8 Aug, Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs, https://arxiv.org/abs/2508.06601

12 Jul, Scaling Laws for Optimal Data Mixtures, https://arxiv.org/abs/2507.09404

I hope you find this list useful as a personal reference, and that you found a handful of interesting reads to check out in more detail.

However, please don’t treat this as a to-do list to work through. It’s meant as a reference to skim, which will hopefully help you find some of the interesting works published this year that may be relevant to the projects you are currently working on!

And in case you missed it above, here is the annual LLM review article I published today once more:

The State of LLMs 2025: Progress, Problems, and Predictions SEBASTIAN RASCHKA, PHD · DECEMBER 30, 2025

As 2025 comes to a close, I want to look back at some of the year’s most important developments in large language models, reflect on the limitations and open problems that remain, and share a few thoughts on what might come next.

Read full story

Thanks so much for subscribing to my Ahead of AI blog and for supporting my work this year. I really appreciate it. Your support makes this work feasible in a very real sense and allows me to keep spending the time needed to write, experiment, and think deeply about these topics!