Latest Synthetic Data Research Papers
The newest synthetic data papers from across the field (arXiv, NeurIPS, CVPR, Nature, and more), refreshed daily and ranked by relevance. Distill AI tracks synthetic data research so you don’t have to: get the standout work delivered to your inbox every morning, with two-sentence summaries and the option to chat with any paper.
Recent papers
- OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Abhijoy Sarkar, Aarchi Singh Thakur · arXiv · Jun 9, 2026
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computatio…
- PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
Srikar Kashyap Pulipaka · arXiv · May 6, 2026
We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Ada…
- Synthetic Computers at Scale for Long-Horizon Productivity Simulation
Tao Ge, Baolin Peng, Hao Cheng, Jianfeng Gao · arXiv · Apr 30, 2026
Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data…
- Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan · arXiv · Apr 29, 2026
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference…
- Synthetic Data Meets Finance: Generative Models for Privacy Preserving Analytics
Yongbin Yang, Jingyun Yang · Journal of Banking and Fina... · Apr 21, 2026
The financial industry faces increasing pressure from privacy regulations, including the General Data Protection Regulation (GDPR) and sector-specific compliance frameworks, which restrict access to sensitive transaction data critical for t…
- Application of Machine Learning for Effective Screening of Enhanced Oil Recovery Methods
Jawad Ali, Ubedullah Ansari, Fateh Ali, Tariq Javed et al. · Reservoir Science · Feb 27, 2026
Selecting the most suitable enhanced oil recovery (EOR) technique remains challenging due to severe class imbalance in historical datasets and the limitations of traditional screening criteria. To address data imbalance while preserving dom…
- Building a Foundational Guardrail for General Agentic Systems via Synthetic Data
Yue Huang, Hang Hua, Yujun Zhou, Pengcheng Jing et al. · arXiv.org · Oct 10, 2025
While LLM agents can plan multi-step tasks, intervening at the planning stage, before any action is executed, is often the safest way to prevent harm, since certain risks can lead to severe consequences once carried out. However, existing gua…
- Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad et al. · Conference on Empirical Methods in Natural Language Processing · Oct 2, 2025
Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is in limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirica…
- WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye et al. · arXiv.org · Sep 16, 2025
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as Bro…
- BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Pratyush Maini, Vineeth Dorna, Parth Doshi, Aldo Carranza et al. · arXiv.org · Aug 14, 2025
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a pro…
- A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha et al. · Annual Meeting of the Association for Computational Linguistics · May 15, 2025
High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical do…
- The DCR Delusion: Measuring the Privacy Risk of Synthetic Data
Zexi Yao, Nataša Krčo, Georgi Ganev, Y. Montjoye · European Symposium on Research in Computer Security · May 2, 2025
Synthetic data has become an increasingly popular way to share data without revealing sensitive information. Though Membership Inference Attacks (MIAs) are widely considered the gold standard for empirically assessing the privacy of a synth…
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
Anna Goldie, Azalia Mirhoseini, Hao Zhou, Irene Cai et al. · arXiv.org · Apr 7, 2025
Reinforcement learning has been shown to improve the performance of large language models. However, traditional approaches like RLHF or RLAIF treat the problem as single-step. As focus shifts toward more complex reasoning and agentic tasks,…
- Scaling Laws of Synthetic Data for Language Models
Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong et al. · arXiv.org · Mar 25, 2025
Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a …
- Synthetic data generation: a privacy-preserving approach to accelerate rare disease research
Jorge M. Mendes, Aziz Barbar, Marwa Refaie · Frontiers Digit. Health · Mar 18, 2025
Rare disease research faces significant challenges due to limited patient data, strict privacy regulations, and the need for diverse datasets to develop accurate AI-driven diagnostics and treatments. Synthetic data—artificially generated da…
- Synthetic Data Generation Using Large Language Models: Advances in Text and Code
Mihai Nadǎş, Laura Dioşan, Andreea Tomescu · IEEE Access · Mar 18, 2025
This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment o…
- Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
Bin Wu, Wuxuan Shi, Jinqiao Wang, Mang Ye · Computer Vision and Pattern Recognition · Mar 6, 2025
Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge p…
- Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10
Ranjan Sapkota, Manoj Karkee · IFAC-PapersOnLine · Feb 26, 2025
This study evaluated the performance of the YOLOv12 object detection model and compared it against the performance of YOLOv11 and YOLOv10 for apple detection in commercial orchards, based on model training completed entirely on synthetic ima…
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu et al. · Annual Meeting of the Association for Computational Linguistics · Feb 12, 2025
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders …
- A scoping review of privacy and utility metrics in medical synthetic data
B. Kaabachi, J. Despraz, T. Meurers, K. Otte et al. · npj Digital Medicine · Jan 27, 2025
The use of synthetic data is a promising solution to facilitate the sharing and reuse of health-related data beyond its initial collection while addressing privacy concerns. However, there is still no consensus on a standardized approach fo…
- User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation
K. Balog, ChengXiang Zhai · arXiv.org · Jan 8, 2025
User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enablin…