Latest Synthetic Data Research Papers

The newest Synthetic Data papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks Synthetic Data so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.

Recent papers

Synthetic data generation framework for quality control automation in gravure printing
Korota Arsène Coulibaly, Mohamed Hamlich, Khalid Hmali, Andrea Trombin · arXiv · Jul 23, 2026
Quality control in printing, particularly in rotogravure printing, still depends on slow, costly, and subjective manual inspection. Automated surface defect detection is critical for maintaining high-quality standards in rotogravure printin…
Requential Coding: Pushing the Limits of Model Compression with Self-Generated Training Data
Shikai Qiu, Marc Finzi, Yujia Zheng, Kun Zhang et al. · arXiv · Jul 13, 2026
Compression is fundamental to intelligence. A model that can represent its training data as a short code has discovered regularities that enable generalization. Large neural networks may learn functions far simpler than their parameter coun…
Entropy-Constrained Machine Learning with Residual Data Augmentation for Modeling Chemical Kinetics
Okezzi Ukorigho, Opeoluwa Owoyele · arXiv · Jul 10, 2026
We present a physics-constrained machine learning framework for accelerating the direct numerical simulation (DNS) of turbulent reacting flows. The model replaces the direct evaluation of detailed chemical source terms with a surrogate that…
Collaborative Synthetic Data Generation for Knowledge Transfer in Federated Learning
Maximilian Andreas Hoefler, Karsten Mueller, Wojciech Samek · arXiv · Jul 8, 2026
One-shot federated learning (OSFL) addresses the communication overhead of federated learning by limiting training to a single round, but doing so without sacrificing model quality is non-trivial, particularly when client data distributions…
Assessing the Operational Impact of Poisoning Attacks over Augmented 3D Point Cloud Public Datasets for Connected and Autonomous Vehicles
Marwan Lazrag, Badis Hammi, Lorena Gonzalez-Manzano, Joaquin Garcia-Alfaro · arXiv · Jul 7, 2026
Poisoning attacks against public datasets lead to major concerns, such as (i) misclassification of perceived objects when the poisoned data is used for training and (ii) embedding of backdoors that may eventually be triggered later on, when…
TriA Pipeline: A Large-Scale Automatic Audio Annotation Pipeline For Audio Classification In Specific Scenarios
Hong Lyu, Mingru Yang, Qianhua He, Yanxiong Li et al. · arXiv · Jul 7, 2026
There are some datasets of varying scales for audio classification (AC) applied to different tasks. However, annotated data is limited for most scenarios, such as domestic environments. To address this challenge, we propose an $\textbf{A}$u…
Autodata: An agentic data scientist to create high quality synthetic data
Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie et al. · arXiv · Jun 24, 2026
We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even…
Hedgementation = Hedgerow Segmentation: A Remote Sensing Benchmark
Nathan Senyard, Salem Hamdani, Astrid Zhang, Derek Wang et al. · arXiv · Jun 22, 2026
We propose Hedgementation: a new benchmark to evaluate machine learning models for hedgerow mapping from remote sensing data at country scale and 10m$^2$ spatial resolution. We combine and harmonize multiple remote sensing data products and…
Scaling Linear Mode Connectivity and Merging to Billion Parameter Pretrained Transformers
Tianyi Li, Zhiqiang Shen · arXiv · Jun 22, 2026
Linear mode connectivity (LMC) provides a promising foundation for understanding and merging independently trained neural networks, but existing methods typically optimize the interpolation path from only one model endpoint, limiting their …
Valid Inference with Synthetic Data via Task Exchangeability
Lezhi Tan, Tijana Zrnic · arXiv · Jun 11, 2026
There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "…
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu et al. · arXiv · Jun 10, 2026
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and pred…
OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
PluRel: Synthetic Data unlocks Scaling Laws for Relational Foundation Models
Vignesh Kothapalli, Rishabh Ranjan, Valter Hudovernik, Vijay Prakash Dwivedi et al. · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Relational Foundation Models (RFMs) facilitate data-driven decision-making by learning from complex multi-table databases. However, the diverse relational databases needed to train such models are rarely public due to privacy constraints. W…
ESDAE: Evaluating Synthetic Data for Agent Evaluation
Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Agent evaluation is often performed on static datasets of execution trajectories, but real traces may be sensitive, proprietary, or too small to support comprehensive testing. Practitioners may therefore replace or augment real datasets wit…
Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė et al. · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data,…
Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations
Firas Darwish, George Nicholson, Aiden Doherty, Hang Yuan · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion,…
SynQuE: Estimating Synthetic Dataset Quality Without Annotations
Arthur Chen, Victor Zhong · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
We introduce and formalize the Synthetic Dataset Quality Estimation (SYNQUE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open chal…
Less is More: Adaptive Coverage Sampling for Synthetic Training Data
Sasan Tavakkol, Max Springer, Mohammadhossein Bateni, Vincent Cohen-Addad et al. · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Large Language Models (LLMs) enable rapid generation of synthetic training data for downstream classifiers, offering a solution when human-labeled data is costly, scarce, or time-sensitive. However, synthetic datasets suffer from systematic…
EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
Amin Banayeeanzade, Qingchuan Yang, Deqing Fu, Spencer Hong et al. · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have …
Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
Synthetic data has been increasingly used to train frontier generative models. However, a recent study raises key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model perfor…
ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata, Murali Emani · ICLR 2026 Workshop DATA-FM · Mar 2, 2026
We develop ImageNet-Think-250K, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from ImageNet-21k dataset, providin…
InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy
Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai et al. · arXiv.org · Nov 20, 2025
Recent works explore how real and synthetic data contribute to Vision-Language-Action (VLA) models' generalization. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not p…
Enhancing pine wilt disease detection with synthetic data and external attention-based transformers
Sareer Ul Amin, Yonghoon Jung, M. Fayaz, Bumsoo Kim et al. · Engineering Applications of Artificial Intelligence · Nov 1, 2025
Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing et al. · arXiv.org · Oct 10, 2025
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing gua…
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad et al. · Conference on Empirical Methods in Natural Language Processing · Oct 2, 2025
Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirica…
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye et al. · arXiv.org · Sep 16, 2025
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as Bro…
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza et al. · arXiv.org · Aug 14, 2025
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a pro…
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha et al. · Annual Meeting of the Association for Computational Linguistics · May 15, 2025
High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical do…
The DCR Delusion: Measuring the Privacy Risk of Synthetic Data
Zexi Yao, Nataša Krčo, Georgi Ganev, Yves-Alexandre de Montjoye · European Symposium on Research in Computer Security · May 2, 2025
Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synth…
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai et al. · arXiv.org · Apr 7, 2025
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks,…

