Latest Benchmarks & Evaluation Research Papers

The newest Benchmarks & Evaluation papers from across the field — arXiv, NeurIPS, CVPR, Nature, and more — refreshed daily and ranked by relevance. Distill AI tracks Benchmarks & Evaluation so you don’t have to: get the standout work delivered to your inbox every morning, with 2-sentence summaries and the option to chat with any paper.

Recent papers

Enterprise LLM Router: Learning Quality–Capacity–Capability Trade-offs from 2026 Model Metadata
Lily Peng · Journal of Computer Science... · Jul 19, 2026
Enterprise language-model selection is a constrained decision problem, not a single leaderboard lookup. This study integrated three 2026 metadata tables covering 22 models from eight providers and evaluated benchmark quality, capability req…
HELMify: A Hybrid Rule- and LLM-Based Generator of Peptide Monomer HELM Names
Robert P. Sheridan, Rajvi Shah, Kathryn Mcgarty, Michael Garrigou et al. · Journal of Chemical Informa... · Jul 16, 2026
HELM is a hierarchical notation system for biopolymers that is an increasingly popular choice for representing peptides. In this system, each monomer name must uniquely identify a single monomer, and until now, chemists have named peptide m…
Evaluating large language model compression: a comparative analysis on state-of-the-art models across diverse hardware platforms
Dominik Hildebrand, Benjamin Kiefer, Andreas Zell · Artificial Intelligence Review · Jul 12, 2026
This work presents a systematic, empirical comparison of contemporary compression techniques for large language models (LLMs), namely quantization, pruning, and parameter-efficient fine-tuning (PEFT) using a representative set of o…
TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios
Hong Lyu, Mingru Yang, Qianhua He, Yanxiong Li et al. · arXiv · Jul 7, 2026
There are some datasets of varying scales for audio classification (AC) applied to different tasks. However, annotated data is limited for most scenarios, such as domestic environments. To address this challenge, we propose an Au…
Les benchmarks sont une source de biais des LLM : MMLU, CommonSenseQA et MGSM au microscope [Benchmarks Are a Source of Bias in LLMs: MMLU, CommonSenseQA, and MGSM Under the Microscope]
Fanny Ducel, Lucie Digoin-Caprros, Ibrahim Al Kotob, Shayan Ahmed Shariff et al. · HAL (Le Centre pour la Comm... · Jun 29, 2026
Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark
Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang et al. · arXiv · Jun 22, 2026
We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10m$^2$ spatial resolution. We combine and harmonize multiple remote sensing data products and…
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
Tianyi Li, Zhiqiang Shen · arXiv · Jun 22, 2026
Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their …
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu et al. · arXiv · Jun 10, 2026
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and pred…
OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning
Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu et al. · arXiv · Jun 4, 2026
Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces system…
Resolution Diagnostics for Paired LLM Evaluation
Anany Kotawala · arXiv · May 28, 2026
Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-P…
