Data Closed-Loop Research: Synthetic Data Accounts for Over 50%, Full-process Automated Toolchain Gradually Implemented
Key Points:
From 2023 to 2025, the share of synthetic data in training data rose from 20%-30% to 50%-60%, making it a core resource for covering long-tail scenarios.
A full-process automated toolchain, from collection to deployment, is gradually being put into practice, reducing costs and improving efficiency.
Efficient collaboration within the vehicle-cloud integrated data closed loop is a key factor in achieving faster iterations.
In essence, the autonomous driving data closed loop is a cyclic optimization system of "collection-transmission-processing-training-deployment". In 2025, the industry is moving from the "0→1" stage into a "high-quality, high-efficiency" era, and the core tensions are long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively building their own data closed-loop solutions. Through efficient data collection, processing, and analysis, they continuously improve their autonomous driving algorithms and, with them, the accuracy and stability of intelligent driving systems.
I. From 2023 to 2025, the Proportion of Synthetic Data Increased from 20%-30% to Over 50%
How efficiently high-quality data can be acquired determines how fast intelligent driving evolves. Current data sources in the automotive field include trigger-based data upload from mass-produced vehicles, collection of high-value specific-scenario data by dedicated collection vehicles, reconstruction of the physical world from real roadside data, and data synthesis based on world models. The core path to large-scale application of autonomous driving technology is that real data anchors basic capabilities while synthetic data extends capability boundaries. From 2023 to 2025, the mix of real and synthetic data in autonomous driving training data changed significantly, shifting from an early real-data-dominated model to a hybrid model with a growing share of synthetic data.
2023: Real data dominates, synthetic data starts (synthetic data accounts for 20%-30%): Real data remained the main body and was used mainly for basic scenario training, but it covered long-tail scenarios poorly. For example, Tesla initially relied on real road-test data from more than one million vehicles, yet collecting extreme scenarios (such as a pedestrian suddenly entering the road in heavy rain) was inefficient. Synthetic data accounted for about 20%-30% and was used mainly to supplement long-tail scenarios. Experiments by Applied Intuition show that adding 30% synthetic data in which cyclists appear frequently to the real data significantly improved the perception model's recognition accuracy (mAP score) for cyclists.
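The idea of topping up real data with a fixed share of synthetic samples can be sketched in a few lines. This is a minimal illustration, not the procedure used by Applied Intuition or any OEM: the function name `mix_datasets`, the sample tuples, and the reading of "30%" as the synthetic share of the final training set are all assumptions made for the example.

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, seed=0):
    """Return a shuffled training set in which roughly `synthetic_fraction`
    of the samples are synthetic. All real samples are kept."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    if n_syn > len(synthetic):
        raise ValueError("not enough synthetic samples for the requested share")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = list(real) + rng.sample(list(synthetic), n_syn)
    rng.shuffle(mixed)
    return mixed

# Hypothetical placeholder samples: (source, index) tuples instead of images.
real = [("real", i) for i in range(700)]
synthetic = [("synthetic", i) for i in range(1000)]

train = mix_datasets(real, synthetic, synthetic_fraction=0.30)
share = sum(1 for source, _ in train if source == "synthetic") / len(train)
print(len(train), round(share, 2))
```

Keeping every real sample and drawing only as many synthetic ones as the target share requires mirrors the "real data anchors, synthetic data supplements" framing; a production pipeline would more likely sample per scenario category than uniformly.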
2024: Accelerated penetration of synthetic data (proportion rises to 40%-50%): Synthetic data was upgraded from an "auxiliary tool" to a "core production material". Its penetration rate of 40%-50% marks the entry of intelligent driving into a new data-driven paradigm. At the end of 2024, the Shanghai High-level Autonomous Driving Demonstration Zone launched a plan for 100 data collection vehicles; under its hybrid model of "real data collection + world model-generated virtual data", the proportion of synthetic data is close to 50%. As another example, NVIDIA DRIVE Sim generates synthetic data of distant objects (100-350 meters) to address sparse real annotations: after 92,000 synthetic images were added, the detection accuracy (F1 score) for vehicles 200 meters away improved by 33%.
2025: Synthetic data surpasses real data (accounts for over 50%): The ratio of synthetic to real data moves towards "5:5" or even higher. Academician Wu Hequan pointed out that 90% of the training data for L4/L5 is simulation data, with only 10%-20% real data retained as a "gene pool" to avoid model deviation. Li Auto is one example of innovative use of synthetic data: it uses world models to reconstruct historical scenarios and expand them into variants (for example, rendering an ordinary intersection under rainy-night and foggy conditions), and automatically generates extreme cases for cyclic training. The proportion of synthetic data at Li Auto exceeds 90%, and it replaces real-vehicle testing for reliability verification.
According to Lang Xianpeng of Li Auto, the company's effective real-vehicle test mileage in 2023 was about 1.57 million kilometers, at a cost of 18 yuan per kilometer. By the first half of 2025, a total of 40 million kilometers had been tested, including only 20,000 kilometers of real-vehicle testing and 38 million kilometers of synthetic data, and the test cost had dropped to an average of 0.5 yuan per kilometer. Test quality is also high: one instance can be generalized to a whole family of scenarios, and tests can be fully re-run.
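The scale of that cost drop is easier to see when worked out. The sketch below uses only the per-kilometer costs and mileages quoted above; the derived totals are simple products implied by those figures, not spending numbers reported by Li Auto, and the helper name `percent_reduction` is made up for the example.

```python
def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# Figures quoted in the text.
cost_2023 = 18.0            # yuan per km, real-vehicle testing in 2023
cost_2025 = 0.5             # yuan per km, average by the first half of 2025
km_2023 = 1_570_000         # effective real-vehicle test mileage in 2023
km_2025 = 40_000_000        # total tested mileage by the first half of 2025

reduction = percent_reduction(cost_2023, cost_2025)

# Derived values (implied by the quoted figures, not reported directly).
implied_spend_2023 = km_2023 * cost_2023
implied_spend_2025 = km_2025 * cost_2025
spend_2025_at_2023_rate = km_2025 * cost_2023

print(f"per-km cost reduction: {reduction:.1f}%")
print(f"implied 2023 spend: {implied_spend_2023 / 1e6:.2f} million yuan")
print(f"implied 2025 spend: {implied_spend_2025 / 1e6:.2f} million yuan")
print(f"same mileage at 2023 rate: {spend_2025_at_2023_rate / 1e6:.0f} million yuan")
```

On these numbers the per-kilometer cost falls by about 97%, and roughly 25 times the 2023 mileage is covered for a smaller implied total outlay.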
The advantages of synthetic data lie not only in cost and efficiency but also in a value density that goes beyond human experience. Synthetic data can be generated in batches at extremely low cost, which suits the high-frequency training needs of AI; it can also produce extreme corner cases that "humans have not experienced but comply with physical laws".
II. Full-process Automated Toolchain from Collection to Deployment is Gradually Implemented, Helping Reduce Costs and Improve Efficiency
The autonomous driving data closed loop has shifted from an early focus on a single link (such as improving annotation efficiency) to an end-to-end automated architecture covering "collection-annotation-training-simulation-deployment". The core breakthrough is the removal of data flow barriers through large AI models and cloud-edge collaboration, which enables closed-loop self-evolution.
LiangDao Intelligence's LD Data Factory is a full-link 4D ground truth solution from collection to delivery. The LD Data Factory toolchain has been delivered to more than a dozen automotive OEMs and Tier 1s in China, Germany, and Japan. Its automated 4D annotation software has automatically annotated more than 3,300 hours of road-collected data for customers, producing high-quality 4D continuous-frame ground truth; by mid-2025, LiangDao Intelligence had delivered more than 55 million frames of data to a well-known German luxury car brand.
LD Data Factory integrates "data collection, automated annotation, manual annotation, quality control, and performance evaluation". The toolchain includes AI preprocessing and VLM-assisted collection, an automated annotation module for target detection, a full-process closed loop of automatic quality inspection, and hybrid cloud and private deployment. Data management and task collaboration run through a unified data management platform, and the core modules include time synchronization and spatial calibration, distributed storage and indexing services, the visual annotation platform LDEditor (full-stack annotation), the automated quality control module LD Validator, and the perception performance evaluation module LD KPI.
MindFlow's main products currently include an integrated data annotation platform, a data management platform (including a vector database), and a model training platform, covering the entire value chain from raw data to model implementation. Users can complete the whole algorithm development process in one place without switching between multiple tools or platforms. The technical highlights of its third-generation MindFlow SEED platform include support for 4D point cloud annotation (lane lines, segmentation), RPA automated processes, and AI pre-annotation covering more than 4,000 functional modules.
MindFlow's customers currently include SAIC Group, Changan Automobile, Great Wall Motors, Geely Automobile, FAW Group, Li Auto, Huawei, Bosch, ECARX, MAXIEYE, NavInfo and RoboSense.
III. Efficient Collaboration of the Vehicle-Cloud Integrated Data Closed-Loop is a Key Factor in Achieving Faster Iterations
In essence, the vehicle-cloud integrated data closed loop builds a collaborative system of "vehicle-side lightweight + cloud-side intelligence" that removes data flow barriers and lets intelligent vehicles evolve continuously. The vehicle side collects environmental perception data in real time (such as road conditions and vehicle operation data) and uploads it to the cloud after desensitization, encryption, and compression. The cloud processes massive amounts of data (PB/EB level), performs annotation, model training, and algorithm optimization, and pushes the resulting new capabilities back to the vehicle side as OTA upgrades.
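The vehicle-side preparation steps can be illustrated with a small standard-library sketch. It covers only desensitization and compression; encryption is deliberately left out because it should come from a vetted cryptography library and a proper key-management scheme, neither of which the text specifies. The record fields, the salt, and the function names (`desensitize`, `pack_for_upload`, `unpack_in_cloud`) are hypothetical and do not describe any vendor's format.

```python
import hashlib
import json
import zlib

SALT = b"demo-salt"  # hypothetical; a real system would manage this secret securely

def desensitize(record):
    """Replace the VIN with a salted hash and coarsen the GPS position."""
    clean = dict(record)          # copy so the caller's record is untouched
    vin = clean.pop("vin")
    clean["vehicle_id"] = hashlib.sha256(SALT + vin.encode("utf-8")).hexdigest()[:16]
    clean["lat"] = round(clean["lat"], 3)   # about 100 m resolution
    clean["lon"] = round(clean["lon"], 3)
    return clean

def pack_for_upload(records):
    """Desensitize, serialize to compact JSON, then compress.
    An encryption step would follow here in a real pipeline."""
    payload = json.dumps([desensitize(r) for r in records],
                         separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload, 9)

def unpack_in_cloud(blob):
    """Cloud-side inverse of the compression and serialization steps."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical batch of vehicle-side records.
records = [
    {"vin": "LDEMO000000000001", "lat": 31.230416, "lon": 121.473701,
     "speed_kph": 42.0 + i % 5, "event": "hard_brake"}
    for i in range(200)
]

raw_size = len(json.dumps(records).encode("utf-8"))
blob = pack_for_upload(records)
restored = unpack_in_cloud(blob)
print(raw_size, len(blob), restored[0])
```

Desensitizing before compression means the identifying fields never leave the vehicle in recoverable form, and the repetitive structure of telemetry records is what makes general-purpose compression effective here.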
The ExceedData data closed-loop solution is a vehicle-cloud integrated solution that has been adopted in mass production by more than 15 automotive OEMs and is deployed in more than 30 mainstream models.
The ExceedData solution consists of the vehicle-side edge computing engine (vCompute), edge data engine (vADS), and edge database (vData), together with the cloud-side algorithm development tool (vStudio), cloud computing engine (vAnalyze), and cloud management platform (vCloud). The solution can reduce data transmission costs by 75%, cloud storage costs by 90%, and cloud computing costs by 33%. According to calculations from one OEM case working with ExceedData, total cost can be reduced by 85%.
In terms of OEMs, take Xpeng Motors as an example. Its self-built "cloud-side model factory" has a computing power reserve of 10 EFLOPS in 2025, and the end-to-end iteration cycle is shortened to an average of 5 days, supporting rapid closed-loop from cloud-side pre-training to vehicle-side model deployment.
Xpeng launched China's first 72-billion-parameter multimodal world base model for L4 high-level autonomous driving, which has chain-of-thought (CoT) reasoning capabilities and can simulate human common-sense reasoning and generate control signals. Through model distillation, the capabilities of the base model are transferred to a small vehicle-side model, enabling personalized deployment that is "small in size and high in intelligence".
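Model distillation in its generic textbook form trains the small model to match the large model's softened output distribution. The sketch below shows that standard loss (temperature-scaled softmax plus KL divergence) in plain Python; it is an illustration of the general technique only, not Xpeng's training recipe, and the logits and temperature are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

# Hypothetical logits over three candidate driving actions.
teacher_logits = [4.0, 1.0, 0.5]   # large cloud-side model
student_logits = [2.5, 1.5, 1.0]   # small vehicle-side model

print(distillation_loss(teacher_logits, student_logits))
print(distillation_loss(teacher_logits, teacher_logits))  # identical outputs
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it is usually combined with a task loss on labeled data.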
High-value data (such as corner cases) is first screened by the vehicle-side rule engine. The cloud then uses synthetic data generation technologies (such as GANs and diffusion models) to fill data gaps and improve model generalization. At the same time, end-to-end (E2E) and VLA models integrate multimodal inputs to output control commands directly, relying on cloud-side large model training (such as Xpeng's 72-billion-parameter base model) to achieve lightweight deployment on the vehicle side.
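A vehicle-side rule engine of this kind can be pictured as a set of named predicates evaluated on each frame, with only the frames that trip a rule marked for upload. The sketch below is a minimal assumption-laden example: the `Frame` fields, rule names, and thresholds are hypothetical and do not come from any production trigger set.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float                 # seconds since start of drive
    decel_mps2: float                # braking deceleration, m/s^2
    driver_takeover: bool            # driver overrode the system
    min_detection_confidence: float  # lowest perception confidence in frame

# Hypothetical trigger rules; thresholds are illustrative only.
RULES = {
    "hard_brake": lambda f: f.decel_mps2 >= 6.0,
    "takeover": lambda f: f.driver_takeover,
    "low_confidence": lambda f: f.min_detection_confidence < 0.3,
}

def screen(frames):
    """Return (timestamp, triggered rule names) for frames worth uploading."""
    flagged = []
    for frame in frames:
        hits = [name for name, rule in RULES.items() if rule(frame)]
        if hits:
            flagged.append((frame.timestamp, hits))
    return flagged

frames = [
    Frame(0.0, 1.2, False, 0.9),   # ordinary driving, not uploaded
    Frame(0.1, 7.5, False, 0.8),   # hard braking
    Frame(0.2, 0.5, True, 0.2),    # takeover with uncertain perception
]

flagged = screen(frames)
print(flagged)
```

Keeping the rules as cheap predicates is what makes this first pass suitable for the vehicle side; heavier scoring and the synthetic gap-filling described above stay in the cloud.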
As the entire intelligent driving system becomes model-based, car companies are pursuing "better cost, higher efficiency, and more stable services" in the data closed loop. The delivery method of intelligent driving is shifting from delivering code for single-vehicle deployment to a model centered on subscription-based cloud services. An efficiently collaborating vehicle-cloud integrated data closed loop is the key for intelligent vehicles to achieve faster AI-driven iterations.
     
    
China Autonomous Driving Data Closed Loop Research Report, 2025