World Models vs VLAs: Two Paths to Robot Intelligence

This is Part 3 of my World Models series. Part 1 covers foundations and definitions, Part 2 explores V-JEPA2 for robot learning.

In the previous posts, I laid out what world models are and explored V-JEPA2 as a specific approach. Now let’s do a systematic comparison between world models and Vision-Language-Action (VLA) models - the two dominant paradigms for robot learning.

The core distinction:

World Models ask: “If I do action X, what will the world look like?” They simulate the future, then plan.
VLAs ask: “Given what I see and what I’m told, what action should I take?” They directly map observations to actions.

Let me break down the leading approaches in each category.

World Models

V-JEPA 2: Predicting in Representation Space

V-JEPA 2 is a non-generative world model. Instead of predicting pixels, it predicts in an abstract latent space.

Training:

Pre-trained on 1M+ hours of internet video using masked modeling
Action-conditioned variant (V-JEPA 2-AC) trained on ~62 hours of robot data (DROID dataset)
The encoder is frozen; only a lightweight transformer learns action dynamics

Inference:

Uses Model Predictive Control (MPC) with goal images
Samples action sequences, simulates them in latent space, picks the one closest to the goal
Zero-shot: no task-specific fine-tuning needed

Limitations:

Relies on goal images, not language
Sensitive to camera positioning
Planning loop adds computational overhead

DreamZero: Generative World Action Models

DreamZero takes the opposite approach: generate pixels, but do it jointly with actions.

Training:

Built on Wan2.1-14B, a massive video diffusion model
Uses flow matching to jointly denoise video frames AND robot actions
Co-trained on internet video (physics) + robot data (actions)

Inference:

Generates a “chunk” of future video and actions in parallel
The video acts as an implicit visual planner
Runs at ~7Hz despite being 14B parameters

Limitations:

Computationally heavy (14B params)
If video generation hallucinates, actions will fail

Vision-Language-Action Models

OpenVLA: The Accessible Generalist

OpenVLA takes the straightforward approach: fine-tune an LLM to output robot actions as tokens.

Architecture:

Prismatic backbone (DINOv2 + SigLIP) → Llama 2 (7B)
Actions discretized into 256 bins, mapped to vocabulary tokens

Training:

Trained on Open X-Embodiment dataset (970k trajectories)
Designed for LoRA fine-tuning on consumer GPUs

Inference:

Autoregressive: generates action dimensions one at a time
~7Hz inference speed

Limitations:

Discretization causes “jerky” motion
Single-image input (no video history)
Autoregressive generation is slow

Pi-Zero: The Hybrid Approach

Pi-Zero separates high-level reasoning from low-level control.

Architecture:

VLM backbone (PaliGemma, ~2B) for semantic reasoning
Smaller Action Expert (~300M) for motor control
Post-training switches to flow matching for smooth, continuous actions

Inference:

VLM predicts a text sub-task (“open the drawer”)
Action Expert generates 50 continuous actions via flow matching
Up to 50Hz control

Limitations:

Complex hybrid architecture
Flow matching adds latency vs single-step feedforward

Head-to-Head Comparison

Feature	World Models	VLAs
Core question	“What happens if I do X?”	“What should I do?”
Mechanism	Predict future → plan	Direct observation → action mapping
Training data	Can use video-only data (no actions)	Requires robot data with action labels
Physical generalization	Stronger: understands action outcomes	Limited by training distribution
Semantic generalization	Weaker: often goal-image only	Stronger: LLM backbone understands language
Inference speed	Slower: planning loops	Faster: feedforward
Zero-shot capability	Yes (for new tasks with goal images)	Needs fine-tuning

When to Use What

Choose World Models when:

You need to handle novel physical scenarios not seen in training
You can specify goals visually (goal images)
Compute at inference is less constrained
You want to leverage large-scale video pretraining without action labels

Choose VLAs when:

Language-based instruction following is critical
You need fast inference for real-time control
Tasks are well-represented in robot datasets
You want to leverage pre-trained LLM capabilities

Consider Hybrid Approaches when:

You need both semantic understanding and smooth control
You can handle architectural complexity
You want to balance instruction-following with precise manipulation

The Convergence

The line between these paradigms is already blurring:

DreamZero jointly generates video and actions
Pi-Zero uses hierarchical reasoning with continuous action generation
V-JEPA 2 is exploring language conditioning

The most capable systems will likely combine:

Physical reasoning from world models
Semantic understanding from VLAs
Efficient planning from learned policies

My Take

Having built world models with probabilistic graphical models and now working with V-JEPA2, I see the appeal of both paradigms.

World models feel more principled - you learn how the world works, then reason about it. But VLAs are pragmatic - they work, they’re fast, and language is how humans specify tasks.

The future isn’t one or the other. It’s systems that can:

Understand physics (world models)
Follow instructions (VLAs)
Plan efficiently (amortized policies)
Generalize broadly (foundation model pretraining)

We’re in the early days of figuring out how to combine these pieces.

This concludes the World Models series. Part of my research for CMU’s 11-977 Multi-Modal ML course. Reach out with questions or thoughts.