Latest Test-Time Compute Research Papers

The newest Test-Time Compute papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Test-Time Compute so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Test-Time Scaling for Small VLMs on Multilingual Visual MCQ
Spiros Baxevanakis, Peng-Jian Yang · arXiv · Jul 10, 2026
Test-time scaling (TTS) reliably improves reasoning in large language models, but whether it transfers to small open vision-language models remains unclear. We examine this on EXAMS-V, a multilingual visual multiple-choice benchmark, compar…
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
RSS26-W: FM4RoboPlan Oral · Jul 8, 2026
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usa…
TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios
Hong Lyu, Mingru Yang, Qianhua He, Yanxiong Li et al. · arXiv · Jul 7, 2026
There are some datasets of varying scales for audio classification (AC) applied to different tasks. However, annotated data is limited for most scenarios, such as domestic environments. To address this challenge, we propose an $\textbf{A}$u…
DecompRL: Solving Harder Problems by Learning Modular Code Generation
Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen et al. · arXiv · Jul 2, 2026
How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attem…
QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling
Michael Y. Li, Anthony Zhan, Kanishk Gandhi, Noah D. Goodman et al. · arXiv · Jul 1, 2026
Scaling inference compute, by generating many parallel attempts per problem, is a costly but reliable lever for improving language model capabilities. By default these attempts are generated independently, wasting inference compute on redun…
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar et al. · RSS SemRob 2026 Poster · Jun 30, 2026
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usa…
VGB for Masked Diffusion Model: Efficient Test-time Scaling for Reward Satisfaction and Sample Editing
Kijung Jeon, Thuy-Duong Vuong, Molei Tao · arXiv · Jun 26, 2026
Inference-time scaling is a promising paradigm to improve generative models, especially when outputs must satisfy structural constraints or optimize downstream rewards. We consider Masked Diffusion Model (MDM) and introduce MDM-VGB, a discr…
Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents
Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh et al. · arXiv · Jun 24, 2026
Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both…
Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark
Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang et al. · arXiv · Jun 22, 2026
We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10m$^2$ spatial resolution. We combine and harmonize multiple remote sensing data products and…
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
Tianyi Li, Zhiqiang Shen · arXiv · Jun 22, 2026
Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their …
Allocation, Not Volume: Test-Time Compute for Agentic Forecasting
Atin Aboutorabi, Gaetan de Rassenfosse, Nicolas Flammarion, Maksym Andriushchenko · Forecast@ICML26 Oral · Jun 11, 2026
Test-time compute scaling has been studied extensively in verifiable domains such as math and code; how to spend an inference budget for forecasting future events, where no test-time verifier exists, is far less studied. We compare three mu…
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu et al. · arXiv · Jun 10, 2026
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and pred…
OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
Test-Time Compute Games
Ander Artola Velasco, Dimitrios Rontogiannis, Stratis Tsirtsis, Manuel Gomez Rodriguez · ICLR 2026 Workshop AIMS · Mar 2, 2026
Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since…
Mode-conditioning unlocks superior test-time compute scaling
Chen Henry Wu, Sachin Goyal, Aditi Raghunathan · ICLR 2026 Poster · Jan 26, 2026
Parallel sampling is essential to test-time scaling and reinforcement learning (RL), but its effectiveness is sharply limited by diversity collapse, where models concentrate on a few modes and repeated samples produce the same mistakes. We …
Provable Scaling Laws for the Test-Time Compute of Large Language Models
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding et al. · NeurIPS 2025 poster · Sep 18, 2025
We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first g…
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu et al. · arXiv.org · Aug 30, 2025
Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant b…
Inverse Scaling in Test-Time Compute
Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi et al. · Trans. Mach. Learn. Res. · Jul 19, 2025
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four …
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · COLM 2025 · Jul 8, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
Rank1: Test-Time Compute for Reranking in Information Retrieval
Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates et al. · COLM 2025 · Jul 8, 2025
We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation…
Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs
Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun et al. · arXiv.org · Jul 2, 2025
Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. However, current models remain inefficient at reasoning: they apply fixed inference-time compute regardless of ta…
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax, Aili Chen, Aonian Li, Bangwei Gong, Binyan Jiang et al. · arXiv.org · Jun 16, 2025
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is develo…
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Rajagopal Setlur, Matthew Y. R. Yang, C. Snell, Jeremy Greer et al. · arXiv.org · Jun 10, 2025
Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep…
ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation
Chenhui Deng, Yun-Da Tsai, Guan-Ting Liu, Zhongzhi Yu et al. · Workshop on Machine Learning for CAD · Jun 5, 2025
Recent advances in large language models (LLMs) have enabled near-human performance on software coding benchmarks, but their effectiveness in RTL code generation remains limited due to the scarcity of high-quality training data. While prior…
Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall et al. · ICML 2025 poster · May 1, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou et al. · AAAI Conference on Artificial Intelligence · Apr 1, 2025
Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited proces…
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Rajagopal Setlur, Lewis Tunstall et al. · International Conference on Machine Learning · Mar 10, 2025
Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches ef…
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · SSI-FM Poster · Mar 8, 2025
By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses *verifiers* to evaluate candidate outputs. In this work, we propose a novel scaling dimen…
Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Lewis Tunstall, Edward Emanuel Beeching et al. · SSI-FM Poster · Mar 8, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
Optimizing Test-Time Compute via Meta Reinforcement Finetuning
Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall et al. · ICLR 2025 FM-Wild Workshop · Mar 6, 2025
Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these appr…
