NVIDIA Cosmos Predict 2.5: LoRA and DoRA for Robotics

Key Takeaways

1. NVIDIA Cosmos Predict 2.5 generates realistic videos, but applying it to robotics requires domain-specific fine-tuning.

2. LoRA and DoRA adapt the model without losing its general knowledge and reduce memory requirements.

3. Training uses PyTorch and needs at least one 80 GB GPU; 8 H100 GPUs are recommended for faster iteration.

💡 Why it matters: This approach supports the development of robotic policies by making world-model fine-tuning more accessible and cost-effective.

Introduction to NVIDIA Cosmos Predict 2.5

NVIDIA Cosmos Predict 2.5 is a large-scale world model designed to generate physically plausible videos. These videos can be conditioned on text, images, or even video clips. However, to adapt this model to specific domains, such as robotic manipulation or a particular camera viewpoint, targeted fine-tuning is necessary.

Training robotic policies requires demonstration data, but collecting trajectories from real robots is slow and costly. An alternative is to generate synthetic trajectories with a fine-tuned video world model. Fully fine-tuning a model with 2 billion parameters is expensive, however, and can cause catastrophic forgetting of the general knowledge the model has acquired.

LoRA and DoRA: Innovative Solutions

To address this issue, LoRA and DoRA methods have been developed. These methods allow for the injection of small trainable adapter modules into the frozen base model. This reduces memory requirements while keeping the adapter files small and portable. With this approach, fine-tuning becomes feasible on a single GPU, and it is easy to swap adapters for different domains during inference.

This guide explains how to perform parameter-efficient fine-tuning of Cosmos Predict 2.5 using LoRA and DoRA. It relies on the diffusers and accelerate libraries, which support training on one or multiple GPUs. After fine-tuning, the model can generate synthetic robotic trajectories for downstream robot learning tasks.

Technical Prerequisites

To run this fine-tuning, you need PyTorch 2.5+ with CUDA, as well as the diffusers and accelerate libraries. Installing wandb is optional but recommended for monitoring training. Single-GPU training requires a GPU with at least 80 GB of memory; for faster iteration, 8 H100 GPUs are recommended.

Here’s how to install the dependencies on your machine:

pip install -U "diffusers[torch]" transformers accelerate peft wandb

After installing diffusers, navigate to examples/cosmos to find the sample code. The training dataset contains 92 robotic manipulation videos with text prompts describing pick-and-place tasks, and the test dataset contains 50 (prompt, image) pairs.

To download and preprocess the training and testing datasets, use the following script:

bash download_and_preprocess_datasets.sh

The resulting training dataset folder looks like this:

gr1_dataset/train
└── metadata.csv

The evaluation dataset is a flat directory of paired .txt and .png files for the (prompt, image) pairs:

gr1_dataset/test
├── filename1.txt
├── filename1.png
├── filename2.txt
├── filename2.png

Implementation and Training

In this section, we walk through the implementation in train_cosmos_predict25_lora.py. VideoDataset loads each sample as a (caption, video) pair from args.train_data_dir (gr1_dataset/train in our example). For videos longer than args.num_frames, it samples a random contiguous window of args.num_frames frames at each epoch, which acts as temporal augmentation. Internally, VideoProcessor from diffusers.video_processor resizes and normalizes the raw frames into a tensor of shape (channels, frames, height, width).

train_dataset = VideoDataset(
    dataset_dir=args.train_data_dir,
    num_frames=args.num_frames,
    video_size=[args.height, args.width],
)
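
The window sampling described above can be sketched in a few lines. This is an illustrative stand-in rather than the code inside VideoDataset: the helper name sample_frame_window and the frame counts are assumptions chosen for the example.

```python
import random


def sample_frame_window(total_frames, num_frames, rng=random):
    """Return (start, end) indices of a random contiguous window of num_frames frames."""
    if total_frames < num_frames:
        raise ValueError("video is shorter than the requested window")
    # randint is inclusive on both ends, so the last valid window can be selected.
    start = rng.randint(0, total_frames - num_frames)
    return start, start + num_frames


# Illustrative values only: a 200-frame video and a 93-frame training window.
rng = random.Random(0)
start, end = sample_frame_window(total_frames=200, num_frames=93, rng=rng)
frame_indices = list(range(start, end))
```

Because a new start index is drawn every time a sample is fetched, the model sees different temporal crops of the same video across epochs.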

Cosmos Predict 2.5 consists of three sub-modules:

A VAE that encodes videos into latents
A text encoder that encodes text prompts into prompt embeddings
DiT for diffusion in the latent space

During training, all weights of the VAE, text encoder, and DiT are frozen. The LoRA adapters are injected into the attention projections of DiT (to_q, to_k, to_v, to_out.0) and into the feedforward layers (ff.net.0.proj, ff.net.2). The trainable LoRA parameters are then converted to float32 for numerical stability under bf16 mixed precision.

import torch
from diffusers import Cosmos2_5_PredictBasePipeline
from diffusers.training_utils import cast_training_params
from peft import LoraConfig

pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B",
    revision="diffusers/base/post-trained",
    torch_dtype=torch.bfloat16,
)

dit = pipe.transformer
vae = pipe.vae
text_encoder = pipe.text_encoder
# Freeze all base weights; only the adapters added below are trained.
dit.requires_grad_(False)
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

lora_config = LoraConfig(
    r=args.lora_rank,
    lora_alpha=args.lora_alpha,
    target_modules=['to_q', 'to_k', 'to_v', 'to_out.0', 'ff.net.0.proj', 'ff.net.2'],
    use_dora=args.use_dora,
)

dit.add_adapter(lora_config)
cast_training_params(dit, dtype=torch.float32)

Setting use_dora=True switches to DoRA, which decomposes each weight into magnitude and direction before applying the low-rank update. No other modifications to the training loop are necessary.
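
To make the magnitude/direction split concrete, here is a minimal pure-Python sketch on a 2x2 weight. It is not the peft implementation: the function names are hypothetical, and taking the norm per output row of an (out, in) weight is an assumption about the layout.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dora_weight(w0, lora_b, lora_a, magnitude, scale=1.0):
    """Compute m * (W0 + scale * B A) / ||W0 + scale * B A||, with one norm per output row."""
    delta = matmul(lora_b, lora_a)
    adapted = []
    for w_row, d_row, m in zip(w0, delta, magnitude):
        direction = [w + scale * d for w, d in zip(w_row, d_row)]
        norm = math.sqrt(sum(v * v for v in direction))
        adapted.append([m * v / norm for v in direction])
    return adapted


w0 = [[3.0, 4.0], [0.0, 2.0]]   # frozen base weight, shape (out=2, in=2)
lora_a = [[0.1, -0.2]]          # down-projection, shape (r=1, in=2)
lora_b_init = [[0.0], [0.0]]    # up-projection starts at zero, shape (out=2, r=1)
# The magnitude vector is initialized to the row norms of the base weight.
magnitude = [math.sqrt(sum(v * v for v in row)) for row in w0]

w_init = dora_weight(w0, lora_b_init, lora_a, magnitude)

# After a (made-up) update to B, the direction changes but each row norm stays at m.
lora_b_trained = [[1.0], [0.0]]
w_trained = dora_weight(w0, lora_b_trained, lora_a, magnitude)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in w_trained]
```

With B at zero the adapted weight equals the base weight, so training starts from the pretrained model. The low-rank update then only steers the direction, while the magnitude vector is learned separately.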

Cosmos Predict 2.5 uses rectified flow: the model is trained to predict the velocity that linearly transports the original "clean" data to a noise sample. Specifically, at time t, a noisy interpolation xt is constructed at a sampled noise level σt, and the model learns to predict the target velocity noise - clean under a mean squared error (MSE) loss. The first two frames of the video are used as conditioning, so no noise is added to their latents.

The training loss follows the rectified flow formulation used by Cosmos Predict 2.5:

sigma_t = sample_train_sigma_t(bsz, distribution='logitnormal', device=device)
xt = noise * sigma_t + clean_latent * (1 - sigma_t)
xt = clean_latent * cond_mask + xt * (1 - cond_mask)
in_timestep = cond_indicator * 0.0001 + (1 - cond_indicator) * sigma_t
pred_velocity = dit(
    hidden_states=xt,
    condition_mask=cond_mask,
    timestep=in_timestep,
    encoder_hidden_states=prompt_embeds,
    padding_mask=padding_mask,
    return_dict=False,
)[0]
target_velocity = noise - clean_latent
# Conditioning frames carry no noise, so they are excluded from the loss.
pred_velocity = target_velocity * cond_mask + pred_velocity * (1 - cond_mask)
loss = F.mse_loss(pred_velocity.float(), target_velocity.float())
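
The interpolation and velocity target can be checked on a toy example with plain Python lists. This is a sketch under stated assumptions: sample_logit_normal is a hypothetical stand-in for sample_train_sigma_t, and its mean and standard deviation are illustrative rather than the values used by the training script.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw sigma in (0, 1) by passing a Gaussian sample through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def mse(pred, target):
    """Mean squared error between two equally sized lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]                       # stand-in for clean latents
noise = [rng.gauss(0.0, 1.0) for _ in clean]   # Gaussian noise of the same shape
sigma_t = sample_logit_normal(rng)

# Linear interpolation between data (sigma = 0) and noise (sigma = 1).
x_t = [sigma_t * n + (1.0 - sigma_t) * c for n, c in zip(noise, clean)]
# The velocity is constant along that straight path.
target_velocity = [n - c for n, c in zip(noise, clean)]

# With a perfect velocity prediction, one step of size sigma_t recovers the clean sample.
recovered = [x - sigma_t * v for x, v in zip(x_t, target_velocity)]
perfect_loss = mse(target_velocity, target_velocity)
```

The recovery step shows why the target is noise - clean: subtracting sigma_t times the velocity from x_t cancels the noise term exactly.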

Optimizer and Scheduler

We use torch.optim.AdamW as the optimizer and get_linear_schedule_with_warmup from diffusers.optimization as the scheduler. The scheduler linearly increases the learning rate over scheduler_warm_up_steps, peaks at scheduler_f_max × learning_rate, and then linearly decreases to scheduler_f_min × learning_rate over the remaining training steps.

lora_params = [p for p in dit.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=args.learning_rate, weight_decay=args.weight_decay)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=args.scheduler_warm_up_steps,
    num_training_steps=args.num_training_steps,
    f_min=args.scheduler_f_min,
    f_max=args.scheduler_f_max,
)
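
The shape of this schedule can be sketched as a function that returns the factor applied to the base learning rate at each step. This is an illustration of the behavior described above, not the library code: lr_multiplier is a hypothetical name, and starting the warmup from zero is an assumption.

```python
def lr_multiplier(step, warmup_steps, total_steps, f_min, f_max):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from 0 up to f_max.
        return f_max * step / max(1, warmup_steps)
    # Linear decay from f_max at the end of warmup to f_min at the last step.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return f_max + (f_min - f_max) * progress


# Illustrative settings: 100 warmup steps out of 1000, decaying to 10% of the peak.
steps = (0, 50, 100, 550, 1000)
multipliers = [lr_multiplier(s, 100, 1000, f_min=0.1, f_max=1.0) for s in steps]
```

A function of this form could be passed to torch.optim.lr_scheduler.LambdaLR, which multiplies the optimizer's base learning rate by the returned factor.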

The LoRA weights are saved in the diffusers format every args.checkpointing_epochs epochs:

if (epoch+1) % args.checkpointing_epochs == 0:
    if accelerator.is_main_process:
        save_path = os.path.join(args.output_dir, f"checkpoint-{epoch}")
        accelerator.save_state(save_path)

accelerator.save_state() writes a pytorch_lora_weights.safetensors file in save_path, which is the adapter file you will pass to the pipeline during inference.

Training Command

Use the provided shell script as a starting point:

export MODEL_NAME="nvidia/Cosmos-Predict2.5-2B"
export DATA_DIR="gr1_dataset/train"
export OUT_DIR=YOUR_OUTPUT_DIR
lora_rank=32  # rank used as the starting point in this guide

accelerate launch --mixed_precision="bf16" train_cosmos_predict25_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--revision diffusers/base/post-trained \
--train_data_dir=$DATA_DIR \
--train_batch_size=1 \
--num_train_epochs=500 \
--checkpointing_epochs=100 \
--output_dir=$OUT_DIR \
--report_to=wandb \
--height 432 --width 768 \
--allow_tf32 --gradient_checkpointing \
--lora_rank $lora_rank --lora_alpha $lora_rank

lora_rank controls the rank of the low-rank decomposition. A higher rank means more trainable parameters and greater expressive capacity, at the cost of increased memory consumption and a larger adapter file. We use rank=32 as a starting point, which results in approximately 50 million trainable parameters.
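
To see how the rank drives the parameter count, consider a single linear layer. The sketch below counts the parameters of the two low-rank matrices; the hidden size of 2048 is a hypothetical value for illustration, not the actual width of the Cosmos DiT, and the example does not attempt to reproduce the 50 million total.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) next to a frozen (out x in) weight."""
    return rank * (in_features + out_features)


def full_param_count(in_features, out_features):
    """Number of weights in the frozen linear layer (bias ignored)."""
    return in_features * out_features


hidden = 2048  # hypothetical layer width, for illustration only
adapter_params = lora_param_count(hidden, hidden, rank=32)
frozen_params = full_param_count(hidden, hidden)
reduction = frozen_params // adapter_params
doubled = lora_param_count(hidden, hidden, rank=64)
```

The count grows linearly with the rank, so doubling the rank doubles both the trainable parameters and the adapter file size.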

lora_alpha is a scaling factor applied to the LoRA update: the weight update is scaled by lora_alpha / lora_rank before being added to the frozen base weights. Setting lora_alpha = lora_rank (as here) keeps this scaling factor at 1.0, so the LoRA update is applied at full strength without any additional damping.

To use DoRA instead of LoRA, add --use_dora to the command.

For multi-GPU training, accelerate automatically handles distribution. Empirically, we find that training for 100 epochs already yields decent results on this task, taking 17 hours on a single H100 and 2.5 hours on 8 H100 GPUs.
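
As a quick worked check of these timings, the snippet below derives the speedup, the scaling efficiency, and the total GPU-hours. Only the 17-hour and 2.5-hour figures come from the text above; the rest is arithmetic.

```python
single_gpu_hours = 17.0   # 100 epochs on one H100
eight_gpu_hours = 2.5     # 100 epochs on 8 H100 GPUs
num_gpus = 8

speedup = single_gpu_hours / eight_gpu_hours    # wall-clock speedup
efficiency = speedup / num_gpus                 # fraction of ideal linear scaling
total_gpu_hours = eight_gpu_hours * num_gpus    # compute consumed by the 8-GPU run
```

The 8-GPU run is about 6.8 times faster in wall-clock time, at the cost of 20 GPU-hours instead of 17.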

Running Inference with Your LoRA

Once training is complete, use eval_cosmos_predict25_lora.py to generate videos from the evaluation dataset. The script reads the paired .png and .txt files from gr1_dataset/test, generates a video for each, and writes .mp4 files to --output_dir.

ImageDataset reads the .txt file into a prompt string and uses load_image from diffusers.utils to load the .png as a PIL.Image.Image:

def __getitem__(self, idx):
    img_path, txt_path, stem = self.samples[idx]
    image = load_image(img_path)
    with open(txt_path) as f:
        prompt = f.read().strip()
    return {"image": image, "prompt": prompt, "stem": stem}
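
One way the list in self.samples could be built from the flat test directory is sketched below. The helper collect_samples is hypothetical and may differ from what the evaluation script does; the demo writes placeholder text into the .png files purely to exercise the pairing logic.

```python
import tempfile
from pathlib import Path


def collect_samples(test_dir):
    """Return sorted (image_path, text_path, stem) triples for each .png with a matching .txt."""
    samples = []
    for img_path in sorted(Path(test_dir).glob("*.png")):
        txt_path = img_path.with_suffix(".txt")
        if txt_path.exists():  # skip images that have no prompt file
            samples.append((img_path, txt_path, img_path.stem))
    return samples


# Demo on a temporary directory: filename2.png has no prompt and is skipped.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("filename1.png", "filename1.txt", "filename2.png"):
        (Path(tmp) / name).write_text("a robot picks up a cup")
    demo_stems = [stem for _, _, stem in collect_samples(tmp)]
```

Sorting the paths keeps the sample order stable across runs, which makes generated videos easier to compare between checkpoints.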

Loading the Pipeline and LoRA/DoRA Weights

import torch
from diffusers import Cosmos2_5_PredictBasePipeline

pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
    "nvidia/Cosmos-Predict2.5-2B",
    revision="diffusers/base/post-trained",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

pipe.load_lora_weights("/path/to/lora/checkpoint")
pipe.fuse_lora(lora_scale=1.0)

fuse_lora merges the adapter weights into the base model, eliminating any inference overhead due to the LoRA/DoRA decomposition.
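
The idea behind fusing can be illustrated for plain LoRA on a tiny matrix: folding the scaled low-rank product into the base weight gives the same output as running the base layer and the adapter branch separately. This is a conceptual sketch with hypothetical helper names, not the diffusers implementation, and DoRA fusion additionally involves the magnitude vector.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(w0, lora_b, lora_a, lora_alpha, rank, lora_scale=1.0):
    """Return W0 + lora_scale * (lora_alpha / rank) * B A."""
    scaling = lora_scale * lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]


w0 = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight, shape (out=2, in=2)
lora_a = [[0.5, -1.0]]          # shape (r=1, in=2)
lora_b = [[2.0], [1.0]]         # shape (out=2, r=1)
x = [[1.0], [2.0]]              # input column vector
lora_alpha, rank = 2.0, 1
scaling = lora_alpha / rank

# Unfused path: base layer output plus the scaled adapter branch.
base_out = matmul(w0, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
unfused = [[b[0] + scaling * a[0]] for b, a in zip(base_out, adapter_out)]

# Fused path: a single matrix multiply with the merged weight.
fused = matmul(fuse(w0, lora_b, lora_a, lora_alpha, rank), x)
```

After fusing, inference needs only one matrix multiply per layer, which is where the removal of adapter overhead comes from.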

Generating Initial Latent Noise

To ensure reproducibility, the function arch_invariant_rand generates the initial latent noise via NumPy, making the noise invariant to GPU architectures. If reproducibility is not a concern, users do not need to provide input noise to the pipeline.
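
The principle is that noise drawn from a seeded CPU generator does not depend on which GPU later consumes it. The sketch below shows this with the standard library only; it is not arch_invariant_rand, which uses NumPy, and the helper name cpu_seeded_noise and the small shape are assumptions for illustration.

```python
import random


def cpu_seeded_noise(shape, seed):
    """Nested lists of standard-normal samples drawn from a seeded CPU generator."""
    rng = random.Random(seed)

    def build(dims):
        if len(dims) == 1:
            return [rng.gauss(0.0, 1.0) for _ in range(dims[0])]
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(list(shape))


# The same seed and shape always give the same values, regardless of the GPU in use.
noise_a = cpu_seeded_noise((2, 3, 4), seed=42)
noise_b = cpu_seeded_noise((2, 3, 4), seed=42)
noise_c = cpu_seeded_noise((2, 3, 4), seed=43)
```

In practice, the CPU-generated values would be converted to a tensor and moved to the device before being passed to the pipeline.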

latent_shape = pipe.get_latent_shape_cthw(args.height, args.width, args.num_output_frames)