Latest Test-Time Compute Research Papers
The newest Test-Time Compute papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Test-Time Compute so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- Safety and accuracy follow different scaling laws in clinical large language models
Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia et al. · arXiv · May 5, 2026
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, …
- Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Indraneil Paul, Goran Glavaš, Iryna Gurevych · arXiv · May 1, 2026
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparativ…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- Test-Time Compute Games
Ander Artola Velasco, Dimitrios Rontogiannis, Stratis Tsirtsis, Manuel Gomez Rodriguez · ICLR 2026 Workshop AIMS · Mar 2, 2026
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since…
- Mode-conditioning unlocks superior test-time compute scaling
Chen Henry Wu, Sachin Goyal, Aditi Raghunathan · ICLR 2026 Poster · Jan 26, 2026
Parallel sampling is essential to test-time scaling and reinforcement learning (RL), but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We …
- Provable Scaling Laws for the Test-Time Compute of Large Language Models
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding et al. · NeurIPS 2025 poster · Sep 18, 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first g…
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · COLM 2025 · Jul 8, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
- Rank1: Test-Time Compute for Reranking in Information Retrieval
Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates et al. · COLM 2025 · Jul 8, 2025
We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation…
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall et al. · ICML 2025 poster · May 1, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · SSI-FM Poster · Mar 8, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Lewis Tunstall, Edward Emanuel Beeching et al. · SSI-FM Poster · Mar 8, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall et al. · ICLR 2025 FM-Wild Workshop · Mar 6, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · ICLR 2025 Workshop VerifAI Poster · Mar 6, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
- Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar · ICLR 2025 Workshop VerifAI Oral · Mar 6, 2025
Despite substantial improvements in LLM capabilities by scaling test-time compute, an ongoing debate in the community is how it should be scaled up so as to enable continued and efficient improvements with scaling. There are largely two app…
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · MCDC @ ICLR 2025 · Mar 6, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · Reasoning and Planning for LLMs @ ICLR2025 · Mar 5, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall et al. · Reasoning and Planning for LLMs @ ICLR2025 · Mar 5, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
- Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning
Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar · ICLR 2025 Oral · Jan 22, 2025
Enabling LLMs to improve their outputs by using more test-time compute is a critical step towards building self-improving agents that can operate on open-ended natural language. In this paper, we scale up inference-time computation in LLMs,…