LeWorldModel vs V-JEPA: What Actually Changed
LeWorldModel removes masking, EMA, and stop-gradient from V-JEPA, trains end-to-end from pixels in hours on one GPU, and plans 48x faster. Here’s the full architecture and what it learns differently.
Here, I explore AI, philosophy, economics, the mind, and the journey of learning.
This space is where I think out loud.
Writing crystallizes thought — it transforms the nebulous cloud of half-formed ideas into concrete structures I can examine, refine, and build upon. When I write about a concept, I’m not just explaining it; I’m understanding it more deeply myself.
Two beliefs drive this blog:
First, knowledge should be democratized. The most powerful ideas shouldn’t hide behind academic jargon or intimidating prerequisites. When people understand how things work, they’re empowered to build, create, and innovate.
Second, complexity is often a function of the unknown: learning a language seems difficult not because language is inherently complex, but because you don’t yet know the symbols, the sounds, and the rules that govern it. Strip away the jargon, and the truth reveals itself in its simple, beautiful form.
My posts attempt to dissolve these illusions of complexity. Whether I’m breaking down CUDA kernels, exploring consciousness, or implementing transformers from scratch — the goal remains constant: illuminate the seemingly daunting.
Some posts are technical deep-dives where I work through implementations and examples. Others are philosophical explorations about intelligence and consciousness. All are me trying to understand our universe better, one concept at a time.
If you’re reading this, you’re probably curious about how things work too. Let’s figure it out together.
V-JEPA prevents collapse with heuristics. SIGReg prevents it with a theorem. Here’s what isotropic Gaussian means, why variance isn’t enough, and how the Cramér-Wold trick makes it cheap.
Going back through V-JEPA’s encoder, predictor, and target encoder in detail — the mechanics of masking, EMA, stop-gradient, and why collapse prevention is harder than it looks.
World models have become a buzzword in AI, but the concept is often misunderstood. Let me trace their history and give you a precise definition.
What if robots could learn just by watching? V-JEPA2 suggests that video understanding alone, combined with goal images, might be sufficient for robot manipulation.
A detailed comparison of world model approaches (V-JEPA, DreamZero) and Vision-Language-Action models (Pi-Zero, OpenVLA) for robot learning.
GPU programming often feels intimidating with terminology like threads, blocks, warps, and SMs. But at its core, it’s about one beautiful idea: breaking big problems into thousands of tiny parallel tasks.
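That one idea can be sketched in plain Python (this sketch is mine, not from the post, and the names are illustrative): a GPU "kernel" is a function that computes a single output element, and launching it over every index is what thousands of GPU threads do at once.

```python
# Sketch of the GPU execution model in plain Python (illustrative only).
# A "kernel" computes exactly one output element -- no loops, no sharing.

def add_kernel(i, a, b, out):
    # Each "thread" owns one index i; all threads are independent.
    out[i] = a[i] + b[i]

n = 8
a = list(range(n))   # [0, 1, ..., 7]
b = [10] * n
out = [0] * n

# On a GPU these n calls run in parallel across threads;
# on a CPU we simulate the launch with a sequential loop.
for i in range(n):
    add_kernel(i, a, b, out)

print(out)  # -> [10, 11, 12, 13, 14, 15, 16, 17]
```

Because every element is computed independently, the order of the calls doesn't matter; that independence is exactly what lets a GPU run them all simultaneously.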