LatentVLA: Transforming Autonomous Driving through Reasoning

Key Takeaways

1. LatentVLA offers an innovative approach to autonomous driving, avoiding natural language for more efficient reasoning.

2. The model uses raw driving data to predict latent actions, simplifying the decision-making process.

3. An encoder-decoder method separates the driver's actions from environmental dynamics, optimizing accuracy.

💡 Why it matters: LatentVLA could make autonomous driving models faster and more accurate without relying on natural language.

Introduction

The AlpamayoR1 (AR1) model is designed for autonomous driving and uses a vision-language model (VLM) as its reasoning foundation. It relies on a meticulously collected dataset of causal chains, which lets AR1 resolve complex driving situations by reasoning in natural language. However, when a quick reaction is crucial, natural language may not be the most effective medium for reasoning. Human drivers often react instinctively in critical situations rather than following detailed verbal reasoning, which motivates an alternative to language-based models.

The LatentVLA architecture proposes an innovative approach that diverges from traditional language-based methods. It performs reasoning in a latent space, without requiring natural language data, and utilizes knowledge distillation to meet real-time constraints.

Learning Latent Actions

The success of AlpamayoR1 largely hinges on its causal chain dataset, the collection of which required considerable industrial effort, featuring a sophisticated labeling pipeline and rigorous validation. In contrast, LatentVLA adopts a radically different approach. The authors of LatentVLA argue that raw driving data already contains the necessary structure to train a high-performing model. They contend that natural language is biased and difficult to align with driving actions, and that natural language reasoning chains can be inefficient, with some words adding no value to the reasoning process.

LatentVLA introduces a self-supervised framework to predict ego-centered latent actions in a constrained latent space. In other words, the model takes unlabeled driving data and infers the actions the driver must have taken to produce that data. These latent actions become the building blocks for reasoning in the latent space.

Representation Learning

To predict latent actions from unlabeled data, the authors draw inspiration from LAPO (Latent Action Policies), the method introduced in "Learning to Act without Actions". This approach employs an encoder-decoder configuration. The encoder, also known as the inverse dynamics model (IDM), takes two consecutive images and predicts a continuous action vector. The decoder, or forward dynamics model (FDM), takes the current image and the predicted action vector and reconstructs the next image.
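
The IDM/FDM pairing can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: frames are flat lists of floats, both models are single untrained random linear maps, and names such as `idm`, `fdm`, `FRAME_DIM`, and `ACTION_DIM` are chosen for illustration only. It shows the data flow, not the actual LatentVLA networks.

```python
import random

def make_linear(n_in, n_out, rng):
    """Random weight matrix with n_out rows and n_in columns (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply_linear(weights, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def idm(weights, frame_t, frame_next):
    """Inverse dynamics: two consecutive frames -> continuous latent action."""
    return apply_linear(weights, frame_t + frame_next)

def fdm(weights, frame_t, action):
    """Forward dynamics: current frame + latent action -> predicted next frame."""
    return apply_linear(weights, frame_t + action)

def reconstruction_error(pred, target):
    """Mean squared error between predicted and observed next frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

rng = random.Random(0)
FRAME_DIM, ACTION_DIM = 8, 2  # toy sizes, assumed for illustration

idm_w = make_linear(2 * FRAME_DIM, ACTION_DIM, rng)
fdm_w = make_linear(FRAME_DIM + ACTION_DIM, FRAME_DIM, rng)

frame_t = [rng.random() for _ in range(FRAME_DIM)]
frame_next = [rng.random() for _ in range(FRAME_DIM)]

action = idm(idm_w, frame_t, frame_next)    # what "happened" between the frames
pred_next = fdm(fdm_w, frame_t, action)     # reconstruction from frame + action
loss = reconstruction_error(pred_next, frame_next)
```

Because the FDM only sees the current frame and the action, the small action vector acts as a bottleneck: it has to carry the information about the transition that the current frame alone does not provide.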

This setup forces the learned action representation to capture whatever is needed to explain the state transitions observed in the dataset. However, a continuous representation is not compatible with the VLMs that LatentVLA aims to use, which operate on discrete tokens. To address this, the authors employ a vector quantized variational autoencoder (VQ-VAE), which maps each continuous vector to its nearest entry in a learned codebook of discrete actions while keeping the whole pipeline trainable by gradient descent. The FDM then uses this discrete action to decode the next image.
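
The quantization step itself is a nearest-neighbor lookup. The sketch below assumes squared Euclidean distance and a tiny hand-written codebook; the function name `quantize` and the codebook values are illustrative, not taken from LatentVLA.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`
    in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

# Toy codebook of four 2-D "discrete actions" (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A continuous encoder output is snapped to its nearest code.
index, code = quantize([0.9, 0.2], codebook)  # index 1, code [1.0, 0.0]
```

The integer `index` is what makes the action usable as a discrete token, while `code` is the vector handed to the decoder.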

By minimizing the reconstruction error of the next image, the IDM and FDM are jointly trained to encode a predictive discrete action representation.
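
For reference, the standard VQ-VAE objective adapted to this setting would look as follows. This is an assumption based on the generic VQ-VAE formulation, not a loss stated for LatentVLA; the symbols $o_t$ (image at step $t$), $z_e$ (continuous IDM output), $z_q$ (quantized action), $e_k$ (codebook entries), $\mathrm{sg}[\cdot]$ (stop-gradient), and $\beta$ (commitment weight) are introduced here for illustration.

```latex
z_e = \mathrm{IDM}(o_t, o_{t+1}), \qquad
z_q = \arg\min_{e_k} \lVert z_e - e_k \rVert_2

\mathcal{L} =
  \lVert o_{t+1} - \mathrm{FDM}(o_t, z_q) \rVert_2^2
  + \lVert \mathrm{sg}[z_e] - z_q \rVert_2^2
  + \beta \, \lVert z_e - \mathrm{sg}[z_q] \rVert_2^2
```

The first term is the next-image reconstruction error, the second pulls the selected codebook entry toward the encoder output, and the third keeps the encoder output close to its chosen entry.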

Distinguishing Ego-Actions from Environmental Noise

A question arises: "Don't the changes between two images also reflect things the driver has no control over, like a bird passing in front of the camera?" Since latent actions are learned from image transitions, they could absorb such changes as well. The authors acknowledge this issue and propose an elegant solution to dissociate the impact of the driver's actions from environmental dynamics.

The solution consists of a two-step encoder-decoder configuration:

First, the encoder, conditioned on the actual trajectory, the ego state, and the previous image, predicts a latent action. Because the vehicle's own motion is already given as input, this action only has to model the environmental dynamics for the decoder to reconstruct the next image. This "environmental action" is quantized, and the codebook used is frozen for the next step.
Next, the encoder, conditioned on the previous image and the environmental action, encodes another latent action. Since the environmental dynamics are known, this second latent action encodes ego-centered dynamics. A new codebook is used to quantize this action into a discrete ego-action.

Finally, both actions are provided to the decoder to reconstruct the next image. This configuration ensures a clear separation between ego-actions and environmental dynamics.
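
The two-step flow can be sketched as a small pipeline. This is a rough illustration under stated assumptions: the encoders are passed in as plain callables, the toy lambdas and codebook values below are invented for the example, and the encoder inputs simply follow the description above. It does not reproduce the actual LatentVLA encoders or decoder.

```python
def quantize(vector, codebook):
    """Nearest codebook entry in squared Euclidean distance."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

def two_stage_latents(prev_image, trajectory, ego_state,
                      env_encoder, ego_encoder,
                      env_codebook, ego_codebook):
    """Return ((env_idx, env_action), (ego_idx, ego_action))."""
    # Step 1: ego motion is given as input, so the latent only has to
    # explain environmental change. Its codebook is frozen afterwards.
    env_latent = env_encoder(trajectory, ego_state, prev_image)
    env_idx, env_action = quantize(env_latent, env_codebook)

    # Step 2: with the environmental action known, the second latent
    # is left to encode ego-centered dynamics, using a separate codebook.
    ego_latent = ego_encoder(prev_image, env_action)
    ego_idx, ego_action = quantize(ego_latent, ego_codebook)
    return (env_idx, env_action), (ego_idx, ego_action)

# Toy stand-ins (hypothetical, untrained) so the sketch runs end to end.
def mean(xs):
    return sum(xs) / len(xs)

env_encoder = lambda traj, ego, img: [mean(img)]
ego_encoder = lambda img, env_action: [env_action[0] - 2 * mean(img)]

env_codebook = [[0.0], [1.0]]
ego_codebook = [[-1.0], [0.0], [1.0]]

(env_idx, env_action), (ego_idx, ego_action) = two_stage_latents(
    prev_image=[0.8, 0.9, 0.7],
    trajectory=[0.0, 0.5, 1.0],
    ego_state=[10.0, 0.0],
    env_encoder=env_encoder,
    ego_encoder=ego_encoder,
    env_codebook=env_codebook,
    ego_codebook=ego_codebook,
)
# Both env_action and ego_action would then be handed to the decoder.
```

Keeping two separate codebooks means each discrete index has a single role: one vocabulary for what the world did, another for what the vehicle did.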

Building on the learned action representation, the authors train a Qwen2.5-VL model to predict the same latent actions as the encoder-decoder model. To this end, the encoder produces a trajectory of 12 latent actions, which serves as the prediction target.