Latest Benchmarks & Evaluation Research Papers
The newest Benchmarks & Evaluation papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Benchmarks & Evaluation so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning
Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu et al. · arXiv · Jun 4, 2026
Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces system…
- Resolution Diagnostics for Paired LLM Evaluation
Anany Kotawala · arXiv · May 28, 2026
Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-P…
- Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta · arXiv · May 7, 2026
Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit glo…
- Raising the Ceiling: Better Empirical Fixation Densities for Saliency Benchmarking
Susmit Agrawal, Jannis Hollman, Matthias Kümmerer · arXiv · May 5, 2026
Empirical fixation densities, spatial distributions estimated from human eye-tracking data, are foundational to saliency benchmarking. They directly shape benchmark conclusions, leaderboard rankings, failure case analyses, and scientific cl…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…