Research on Intelligent Driving Large Models: A Critical Period for Technological Competition and Paradigm Integration
As autonomous driving technology rapidly iterates from L2 to L3-L4, intelligent driving systems are shifting profoundly from traditional rule-driven architectures to a new generation of data-driven and cognition-driven architectures. As the underlying core enabler, intelligent driving large models have become the central arena of industry competition. With the accelerated arrival of the Physical AI era, autonomous driving stands as its first large-scale application scenario, pushing automobiles to evolve rapidly into super agents that transcend the nature of traditional transportation tools and become all-scenario intelligent hubs connecting mobility, mobile office, home life, and third-party ecosystems.
From an industrial perspective, Physical AI remains in the early stage of technological fission, and the global autonomous driving market holds massive untapped potential. According to the data, the global fleet comprises about 1.5 billion passenger cars, 280 million commercial vehicles and trucks, and 18 million operating taxis. Total annual global driving mileage reaches 13 trillion kilometers, while autonomous driving mileage is only 700 million kilometers, or roughly 0.005% of the total (700 million / 13 trillion ≈ 0.0054%). The future incremental potential is significant.
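As a quick sanity check on the penetration figure, the share follows directly from the two mileage numbers quoted above. The snippet is only a back-of-the-envelope calculation, and the variable names are illustrative.

```python
# Figures quoted in the text above (annual global mileage, in kilometers).
total_km = 13e12        # 13 trillion km driven per year
autonomous_km = 700e6   # 700 million km driven autonomously

# Fraction of all driving that is currently autonomous.
share = autonomous_km / total_km
print(f"Autonomous share of global mileage: {share:.4%}")
```

On these inputs the share comes out at a little over five thousandths of one percent, which is the basis for the "massive untapped potential" argument.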
Judging further from the pace of technological implementation, intelligent driving large models are entering a critical window for technological iteration. The segmented end-to-end solution entered mass production during 2024-2025, and one-model end-to-end and VLA technologies are being intensively deployed during 2025-2026. Coupled with the continuous upgrading of the intelligent driving experience and the accelerated maturation of L3-L4 high-level autonomous driving technology, the development of Physical AI is accelerating. ResearchInChina predicts three major evolution trends of intelligent driving large models.
Trend 1: The Core Focus of Autonomous Driving Large Model Evolution in 2026 Will Be Competition and Deep Integration of Multiple Technical Routes.
Integration Mode 1: One-model End-to-End + World Model + Reinforcement Learning, Representative Suppliers: WeRide, Bosch and Momenta
Features: The one-model end-to-end model serves as the core neural network of intelligent driving, directly connecting sensor input to driving output with zero information loss and an extremely high performance ceiling; the world model deduces how road conditions will evolve and can generate massive long-tail scenarios at low cost for simulation training; reinforcement learning iterates and optimizes within this deduction space through a reward mechanism, outputs the optimal driving strategy, and copes with sudden situations of all kinds. Together, the three form a powerful closed loop of "data generation (world model) → policy training (reinforcement learning) → decision and execution (end-to-end model)", enabling intelligent driving systems to learn from massive driving data and keep evolving.
Integration Mode 2: E2E + Foundation Model (VLM/VLA) + Reinforcement Learning + World Model, Representative Suppliers: Horizon Robotics and Afari Technology
Features: The vision-language large model acts as the "cerebrum" responsible for cognitive reasoning, and the small end-to-end model acts as the "cerebellum" responsible for rapid execution.
Horizon Robotics adopts one-model E2E + VLM + reinforcement learning + world model. Its “fast thinking + slow thinking” dual-track intelligent driving architecture takes reinforcement learning as the hub. On the one hand, it empowers the end-to-end intuition model through the world model and simulation training, enabling it to respond in milliseconds while adding the ability to handle rare, short-horizon long-tail scenarios. On the other hand, it empowers the VLM cognitive model through reasoning enhancement, strengthening its semantic understanding and logical reasoning for long-horizon complex scenarios. Finally, it migrates VLM capabilities to the on-vehicle model and completes lightweight deployment through quantization and distillation, building a balanced closed loop of "millisecond-level fast response + long-horizon slow reasoning".
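The division of labor between a fast path and a slow path can be illustrated with a toy control loop. This is a minimal sketch, not Horizon Robotics' implementation: `slow_reasoner`, `fast_policy`, the `Guidance` type, the `school_zone` flag, and the update interval `slow_every` are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    """High-level hint produced by the slow (VLM-style) reasoner."""
    target_speed: float  # meters per second


def slow_reasoner(scene: dict) -> Guidance:
    # Stand-in for long-horizon semantic reasoning, e.g. reading a sign.
    return Guidance(target_speed=5.0 if scene.get("school_zone") else 15.0)


def fast_policy(speed: float, guidance: Guidance) -> float:
    # Stand-in for the end-to-end "intuition" model: a proportional
    # controller that nudges speed toward the latest guidance.
    return 0.5 * (guidance.target_speed - speed)


def run(scenes: list, slow_every: int = 10, dt: float = 0.1) -> list:
    speed, guidance, history = 15.0, Guidance(15.0), []
    for tick, scene in enumerate(scenes):
        if tick % slow_every == 0:            # slow path: refreshed occasionally
            guidance = slow_reasoner(scene)
        accel = fast_policy(speed, guidance)  # fast path: runs every tick
        speed += accel * dt
        history.append(speed)
    return history


# The vehicle enters a (hypothetical) school zone at tick 20.
speeds = run([{"school_zone": t >= 20} for t in range(60)])
```

The fast path keeps reacting at every tick using whatever guidance it last received, while the slow path only needs to finish its reasoning once per interval. That decoupling is the point of the dual-track design.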
Afari Technology adopts a VLA + E2E + world model architecture, in which the VLA model handles reasoning, analogous to high-level decision-making by the slow system, and the E2E algorithm handles action mapping, analogous to the fast system. Its training pipeline proceeds in stages: a 32B-parameter large model undergoes large-scale multimodal pre-training (VLM) → it is distilled into a 7B lightweight model to balance performance and deployment (VLM) → perception is aligned with driving actions and driving domain knowledge is introduced (VLA) → supervised fine-tuning teaches high-level driving strategies and behavioral norms → reinforcement learning aligns the model with human driving styles and safety constraints, realizing closed-loop optimization of perception, decision and control.
Integration Mode 3: VLA + World Model, Representative Suppliers: Zhuoyu Technology and XPeng
Features: VLA is responsible for perceiving the current environment, learning historical driving patterns, and determining the next action. The world model is responsible for deducing how each target on the road will interact in the next 5 to 10 seconds. VLA is good at understanding the present but not predicting the future; the world model is good at prediction but does not reflect on and reason about the prediction results. The combination of the two constitutes a complete brain.
Trend 2: The VLA and world model fusion paradigm is expected to become one of the main ways for the implementation of Physical AI.
The core of the future evolution of intelligent driving large models is the fundamental reconstruction of the underlying paradigm from "imitating human driving" to "understanding the physical world". VLA and the world model are not an either-or choice; the future intelligent driving large model will be a fusion of the two. At present, the two routes diverge in that VLA advocates believe "understanding" is the premise of driving, while world model advocates believe "prediction" is the key.
World model advocates believe that changes in the physical world are continuous and high-dimensional. Language is a discrete, low-dimensional symbolic system - the transformation from physics to language is inevitably accompanied by information loss. The world model directly operates physical representations with higher bandwidth. VLA advocates believe that the biggest advantage of VLA is that it can be fine-tuned with the world model or model-based reinforcement learning. It can absorb the advantages of the world model, while the world model cannot utilize the advantages of VLM/VLA. Language brings strong generalization capability for it is a compressed package of human common sense. VLA possesses "common sense reasoning" capability and Chain-of-Thought (CoT) via language, thus gaining self-explanation capability.
Based on the advantages and divergences of the two routes, the industry has begun to explore the fusion path of the two. At present, there are three mainstream fusion modes for VLA and world model: latent space unified fusion, in-depth fusion at the architectural level, and modular collaborative fusion (cloud simulator type).
Fusion Mode 1: Latent Space Unified Fusion, Representatives: Xiaomi OneVL and Huawei DriveVLA-W0
The core is to embed the prediction capability of the world model into the training objectives of VLA, rather than adding additional modules in the reasoning stage. Specifically, it adds a future image prediction task to the training process of the VLA model, allowing the model to not only learn to predict actions, but also the environmental state (i.e., future images) at future moments. This design forces the model to learn the underlying dynamic laws of the driving environment, rather than just fitting sparse action supervision signals.
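The idea of adding future image prediction to the training objective can be written as a weighted sum of two losses. The sketch below is illustrative only: it uses mean squared error for both terms and a hypothetical weight `world_weight`, neither of which is specified in the text, and it treats frames as flat lists of numbers.

```python
def mse(pred, target):
    """Mean squared error over two equal-length flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(pred_action, true_action, pred_frame, true_frame, world_weight=1.0):
    # Sparse supervision: only a handful of action values per sample.
    action_loss = mse(pred_action, true_action)
    # Dense supervision: every value of the predicted future frame counts.
    world_loss = mse(pred_frame, true_frame)
    total = action_loss + world_weight * world_loss
    return total, action_loss, world_loss


# Toy sample: a 2-value action and a 16-value "future frame".
total, a_loss, w_loss = joint_loss(
    pred_action=[0.1, 0.0], true_action=[0.0, 0.0],
    pred_frame=[0.5] * 16, true_frame=[0.0] * 16,
    world_weight=0.5,
)
```

Even in this toy case the frame term contributes eight times as many supervised values as the action term, which is the intuition behind calling the world modeling signal "dense".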
Case 1 of Latent Space Unified Fusion: Xiaomi OneVL Autonomous Driving Model
On May 13, 2026, Xiaomi officially released Xiaomi OneVL, a fully open-sourced autonomous driving model which unifies the three technical routes of VLA, world model and latent space reasoning into the same framework. The core breakthrough of this model is the in-depth unification of multiple technical paradigms through latent space reasoning. Differing from traditional solutions that decompose the reasoning process into human-readable natural language and generate deduction logic word by word, Xiaomi OneVL directly completes end-to-end logical operations in the high-dimensional vectorized latent space. This latent space integrates both the scenario perception and understanding capability of VLA and the environmental time-series prediction capability of the world model, and all reasoning operations are carried out at the vector level rather than the text level, achieving a significant leap in reasoning efficiency compared with traditional VLA solutions.
In terms of implementation mechanism, firstly, two types of latent variables are introduced inside the model: visual latent token and language latent token. The former is responsible for encoding physical relationships and time-series changes in the scene, carrying the prediction capability of the world model. The latter is responsible for expressing driving intentions and semantic logic, carrying the understanding capability of VLA.
Secondly, OneVL introduces two auxiliary decoders, which are only used in the training stage. The language auxiliary decoder is responsible for restoring human-readable CoT text from the language latent token, explaining why the model makes a certain driving decision. The visual auxiliary decoder is responsible for predicting future frame visual tokens (images after 0.5 seconds and 1.0 seconds) from the visual latent token, allowing the model to predict scene changes. During inference, both decoders are removed, and the model directly outputs planning results, realizing one-step reasoning and completely eliminating the delay accumulation caused by autoregression.
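The pattern of auxiliary decoders that exist only at training time can be sketched with a toy class. This is not the OneVL code: the class `LatentPlanner`, its methods, and the arithmetic inside them are hypothetical placeholders that only show which components run in which mode.

```python
class LatentPlanner:
    """Toy model: an encoder yields latent tokens, a planner head maps them to a plan."""

    def encode(self, observation):
        # Hypothetical split into "visual" and "language" latent tokens.
        visual_latent = [x * 0.5 for x in observation]
        language_latent = [sum(observation)]
        return visual_latent, language_latent

    def plan(self, visual_latent, language_latent):
        # Single forward pass: no token-by-token text generation.
        return [v + language_latent[0] * 0.1 for v in visual_latent]

    # Auxiliary decoders: called only from training_step.
    def decode_future_frame(self, visual_latent):
        return [v * 2.0 for v in visual_latent]

    def decode_rationale(self, language_latent):
        return "slow down" if language_latent[0] > 1.0 else "keep lane"

    def training_step(self, observation):
        v, l = self.encode(observation)
        return {
            "plan": self.plan(v, l),
            "future_frame": self.decode_future_frame(v),  # supervises visual latents
            "rationale": self.decode_rationale(l),        # supervises language latents
        }

    def infer(self, observation):
        # Inference skips both auxiliary decoders and returns the plan directly.
        v, l = self.encode(observation)
        return self.plan(v, l)


model = LatentPlanner()
train_out = model.training_step([1.0, 2.0])
plan = model.infer([1.0, 2.0])
```

Because the auxiliary outputs only shape the latent tokens during training, dropping them at inference leaves the planning result unchanged while removing their compute cost.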
Case 2 of Latent Space Unified Fusion: Huawei DriveVLA-W0 Predicts Future Images Through World Modeling Tasks
Traditional VLA models face a fundamental problem: Supervision Deficit. The input of VLA models is high-dimensional multimodal data (front-view image sequences, language instructions, historical actions, etc.), but the supervision signal is only low-dimensional action tokens. Most of the model’s representation capacity is wasted, resulting in its inability to fully learn the complex dynamics of the driving environment, and the huge potential of VLA models cannot be effectively released.
As can be seen from the figure below, as the amount of training data increases from 700,000 frames to 7 million frames and then to 70 million frames, the collision rate trends downward; that is, more training data yields better safety. However, for the traditional VLA paradigm without a world model, the decline in collision rate slows when the data increases from 7 million to 70 million frames, indicating that additional data alone brings diminishing returns for the safety performance of VLA.
To solve VLA pain points such as sparse supervision, the breakdown of the data scaling law, and the lack of physical time-series prediction capability, Huawei proposed the DriveVLA-W0 training paradigm in its paper, introducing a world modeling task that predicts future images as dense self-supervision signals during the training stage, so that the model gains future time-series prediction while retaining its ability to understand environmental dynamics. Compared with traditional VLA, DriveVLA-W0 adds world modeling (predicting future road conditions): the more data there is, the larger its advantage becomes, and the data scaling law is strengthened.
Fusion Mode 2: In-depth Fusion at the Architectural Level, Representative: VLA-World
Differing from pre-training fusion (external reinforcement), where the world model acts as an external tool to generate first and then transmit, in-depth fusion at the architectural level internalizes the world model capability into the native capability of VLA, with planning and generation growing together in the same architecture.
VLA-World, jointly proposed by Shanghai Jiao Tong University and Huawei Central Research Institute in April 2026, is an integrated VLA architecture with deeply embedded world model capabilities. In traditional solutions, the world model and VLA are independent of each other, with the former responsible for generating simulation videos and the latter for perception reasoning and decision output. VLA-World adopts a single VLA backbone network so that visual generation and decision reasoning share features. It integrates trajectory prediction and visual generation as consecutive links of the same decision chain, following the causal logic of predicting the motion trajectory first and then deducing future images based on that trajectory, which yields deep module coupling and a highly coherent reasoning chain.
Working Mechanism:
Trajectory Perception Conditioning: VLA-World predicts the trajectory first, and then generates future frames conditioned on the trajectory: the trajectory prediction result directly serves as the conditioning signal for visual generation to guide the generation process. In this way, the trajectory determines "where to go", and the image presents "what to see when arriving there", forming a causal dependency.
Unified Generation and Reasoning: Differing from the past when the world model and VLA were two independent modules, VLA-World enables the two to share the same VLA backbone, that is, unifying visual generation and reasoning in the same VLA structure.
GRPO End-to-End Alignment: GRPO (Group Relative Policy Optimization) is used to optimize the model during the reinforcement learning stage. The model generates multiple candidate trajectories and corresponding future images, and rewards those results where the "imagined future" is consistent with the "real safe decision". This mechanism makes visual generation no longer an independent task, but always serves the quality of downstream decisions.
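The group-relative step in GRPO can be sketched as follows: each candidate's reward is normalized against the mean and standard deviation of its own sampling group, so no separate value network is needed. This is a minimal sketch of that normalization only; the reward values are hypothetical, and the use of the population standard deviation and the `eps` term are assumptions rather than details from the text.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    # eps avoids division by zero when all candidates score the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate (trajectory, imagined future) pairs,
# where a higher reward means the imagined future agrees with a safe decision.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group average receive positive advantages and are reinforced, while those below it are pushed down. Average candidates contribute no gradient signal.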
Trend 3: The Evolution of Intelligent Driving AI Towards Foundation Models Accelerates, and the Industry Will Enter A Competition Period of General Cognitive and Reasoning Capabilities of Foundation Models.
2026 is the first year of the launch of autonomous driving foundation models. DeepRoute.ai, Afari Technology, Zhuoyu Technology, Li Auto, and XPeng have launched related products. The core of foundation models is to build a universal and reusable cognitive base for the physical world, realizing full-level intelligent driving compatibility and cross-scenario capability migration.
Autonomous driving is essentially a typical scaling problem, and current implementation is mainly restricted by insufficient model capacity and an inefficient data closed loop. First, existing foundation models have limited scale and insufficient generalization capability for long-tail complex scenarios; second, high-value data mining relies on manual screening and review, which is fragmented and poorly automated, limiting long-term iteration.
To address the two bottlenecks of insufficient model capacity and an inefficient data closed loop, DeepRoute.ai proposed a unified 40B-parameter VLA foundation model. The core innovation lies in its "trinity" role design, which allows the same model to play three roles simultaneously: driver (visual input → real-time driving decision), analyst (diagnostic understanding of key scenarios), and critic/referee (evaluating the safety and rationality of driving behavior), upgrading the driving system from a simple execution system to an intelligent system with cognitive capabilities.
In the pre-training stage, DeepRoute.ai abandons the traditional approach of the end-to-end model relying on trajectory supervision (data utilization rate is only 0.001%), and instead adopts the video prediction task, enabling the model to learn the dynamic structure of the real world by predicting video sequences, turning every pixel into a supervision signal and increasing the data utilization rate to nearly 100%.
In the core training stage (Mid-train), the model conducts joint training around three tasks: V+A (vision + action) to learn conventional end-to-end driving, V+A→L (explanation after action) to activate the analyst and critic roles, and V→L+A (multimodal logical reasoning) to train a driver with reasoning capability, using Chain-of-Thought to let the model first output language descriptions and decision logic of key events, and then output specific driving trajectories.
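One way to picture the three joint-training tasks is as different orderings of the same modalities, with the loss applied only to the segments the model must produce. The sketch below is an illustrative assumption, not DeepRoute.ai's data format: the `TASKS` table, the token strings, and the loss-mask convention are all hypothetical.

```python
# Each task lists which modalities are given as context and which are predicted.
TASKS = {
    "V+A":    {"inputs": ["vision"],           "targets": ["action"]},
    "V+A->L": {"inputs": ["vision", "action"], "targets": ["language"]},
    "V->L+A": {"inputs": ["vision"],           "targets": ["language", "action"]},
}


def build_sequence(task, sample):
    """Concatenate modality tokens in task order; mask=1 marks supervised tokens."""
    spec = TASKS[task]
    tokens, loss_mask = [], []
    for name in spec["inputs"]:
        tokens += sample[name]
        loss_mask += [0] * len(sample[name])   # context only, no loss
    for name in spec["targets"]:
        tokens += sample[name]
        loss_mask += [1] * len(sample[name])   # model must predict these
    return tokens, loss_mask


sample = {
    "vision": ["<img0>", "<img1>"],
    "language": ["yield", "to", "pedestrian"],
    "action": ["<traj0>"],
}
cot_tokens, cot_mask = build_sequence("V->L+A", sample)
explain_tokens, explain_mask = build_sequence("V+A->L", sample)
```

In the `V->L+A` ordering the language tokens precede the action tokens, so the trajectory is generated after, and conditioned on, the stated reasoning. In `V+A->L` the action is already given and the model only explains it.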
In terms of engineering implementation, DeepRoute.ai keeps the single-step processing latency for 1,000 visual tokens and dozens of reasoning tokens within 60-85 milliseconds using optimization methods such as KV Cache, Multi-Token Prediction (MTP), model quantization, and a self-developed inference engine, realizing 10-15Hz real-time closed-loop control. Moreover, the foundation model can be flexibly distilled according to the computing power of vehicle chips, deploying a pure driving VA model on a 100 TOPS platform and a VLA model with logical reasoning capability on a 500 TOPS platform.
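The relationship between per-step latency and control frequency, and the choice of distilled variant by compute budget, can be sketched in a few lines. This is a rough illustration: the helper names are hypothetical, and treating 100 and 500 TOPS as minimum thresholds is an assumption, since the text only names those two platforms.

```python
def max_control_rate_hz(latency_ms: float) -> float:
    """Upper bound on loop frequency if one model step takes latency_ms."""
    return 1000.0 / latency_ms


def pick_variant(platform_tops: float):
    # Assumed cutoffs based on the two platforms named in the text.
    if platform_tops >= 500:
        return "VLA"   # driving plus logical reasoning
    if platform_tops >= 100:
        return "VA"    # pure driving model
    return None        # below the smallest platform mentioned


fastest = max_control_rate_hz(60)   # best-case latency
slowest = max_control_rate_hz(85)   # worst-case latency
```

Model latency alone would allow roughly 12 to 17 steps per second, so the quoted 10-15Hz closed-loop rate is consistent with some additional time being spent elsewhere in the pipeline.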
The foundation model learns the physical laws and spatial logic of the real world through pre-training, giving it native zero-shot migration capability. With a universal cognitive base, it adapts to all levels from L2 assisted driving to L4 autonomous driving through model distillation, computing power tailoring, and capability fine-tuning. It is first applied to autonomous driving, and will migrate to multiple tracks such as humanoid robots and industrial robots in the future, realizing "one foundation making all things intelligent".
In 2026, Zhuoyu Technology fully transforms its strategy. Taking the native multimodal foundation model as the technical base, it aims to upgrade from an "intelligent driving Tier 1 supplier" to a "mobile physical AI company", focusing on mass production expansion across all scenarios and vertical domains covering passenger cars, commercial vehicles, L4 products and overseas layout, and extending to the field of embodied robots.
Zhuoyu launched VLA (VLA World Model, a native multimodal foundation model): it uses a unified backbone to process visual, text, and sensor data, completes physical reasoning in the latent space, and directly outputs driving actions. Starting from the pre-training stage, it is jointly trained on image, video, text, driving, and robot data, and performs prediction and reasoning about the physical world in a unified latent space, understanding both semantics and physical laws.
In 2026, a critical year for the technological iteration and paradigm fusion of intelligent driving large models, the competition and integration of multiple technical routes, the collaborative implementation of VLA and world model, and the large-scale launch of foundation models will jointly push the intelligent driving industry from "technological exploration" to "large-scale implementation". Whether through the technological innovation of multi-route integration or the generalized layout of foundation models, the core goal remains "safer, more efficient, and more adaptable to real driving scenarios". The trend of Physical AI implementation will further drive intelligent driving systems to evolve from "imitating humans" to "understanding the world", realizing true intelligent driving.
In the future, with the continuous iteration of technologies and the coordinated improvement of the industry chain, intelligent driving large models will gradually break through existing bottlenecks, become the core support for the large-scale implementation of autonomous driving, reshape the development pattern of the mobility sector, and also facilitate the extension and application of mobile physical AI in more scenarios.