SIGReg: The Anti-Collapse Trick That Actually Has a Proof

This is Part 5 of my World Models series. Part 4 covers V-JEPA’s internals and the collapse problem — read that first.


In the previous post I walked through how V-JEPA prevents representation collapse using EMA + stop-gradient — two heuristics that work empirically but have no formal guarantee.

LeWorldModel replaces both with a single regularizer called SIGReg. Before it makes sense, you need a clear picture of what it’s trying to enforce.


What does “isotropic Gaussian” actually mean?

Your encoder takes a frame and outputs a vector — say 256 numbers. Every frame in your dataset becomes one point in 256-dimensional space. Over a full training dataset of thousands of frames, you get thousands of these points scattered somewhere in that space.

Now think about what collapse looks like geometrically. All points pile up at roughly the same location. The “cloud” of points shrinks to a dot. No information is preserved — every frame, regardless of what’s in it, maps to the same place.

What does a good representation cloud look like? The points should be spread out. Different frames should land at different locations. But “spread out” is vague — there are infinitely many ways to be spread out.

Gaussian means the distribution of points follows a bell curve — most points near the center, fewer further out, tapering smoothly.

Isotropic means spread equally in all directions. No direction is special. In 2D, an isotropic Gaussian looks like a circular blob of points centered at zero. In 256D, it’s a spherical cloud — equal spread in every direction simultaneously.

collapsed:          isotropic Gaussian:
all points ≈ [0,0]  points spread like a sphere
     ·               · · ·
                    · · · · ·
                     · · ·
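To make the contrast concrete, here is a small NumPy sketch (sizes are illustrative) comparing the two clouds by per-dimension variance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 10_000   # embedding dimension and number of frames (illustrative)

# Full collapse: every "frame" maps to (almost) the same point.
collapsed = 1e-6 * rng.standard_normal((n, d))

# Isotropic Gaussian: points spread equally in all d directions.
isotropic = rng.standard_normal((n, d))

# Per-dimension variance tells the two clouds apart immediately.
print(collapsed.var(axis=0).mean())   # ~0: no information survives
print(isotropic.var(axis=0).mean())   # ~1: every dimension carries signal
```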

Why is this specifically the right target?

First, it’s provably not collapsed — the points have real variance in every direction. Second, it’s not redundant — no dimension dominates, so all 256 dimensions are used. Third, it’s a well-behaved space for the predictor to work in — smooth and regular, which makes next-state prediction easier to learn.

The alternative failure mode — which SIGReg also catches — is dimensional collapse. This is subtler than full collapse: the encoder uses, say, only 10 of its 256 dimensions and ignores the rest. Points are spread out, but only in a thin 10D subspace. Because an isotropic Gaussian has equal variance in every dimension, the ignored dimensions stand out (variance ≈ 0 there instead of ≈ 1), so SIGReg detects and penalizes this failure mode too.
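A minimal NumPy sketch of what dimensional collapse looks like in the variance profile (the 10-of-256 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, used = 256, 10_000, 10

# Dimensional collapse: the encoder only varies 10 of its 256 dimensions.
z = np.zeros((n, d))
z[:, :used] = rng.standard_normal((n, used))

var = z.var(axis=0)
print((var > 0.5).sum())   # 10 — dimensions actually in use
print(var[used:].max())    # 0.0 — the other 246 dimensions are dead
```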


Why not just measure variance directly?

A reasonable question. You could add a loss term that maximizes the variance of embeddings across the batch — if variance is high, they’re spread out, not collapsed.

The problem: variance alone doesn’t capture the shape of the distribution. You could have high variance but all points lying on a line (1D spread in 256D space). Or high variance but points clustered in two separate clumps. Both would fool a pure variance penalty.
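This failure mode is easy to reproduce. In the NumPy sketch below (scale chosen for illustration), every point lies on a single line, yet the total variance matches a genuinely isotropic cloud:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 10_000

# Every point is a multiple of one fixed direction: a line in 256D space.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)            # unit vector
z = rng.standard_normal((n, 1)) * 16 * direction  # shape (n, d)

# Total variance matches an isotropic Gaussian's (≈ d = 256) ...
total_var = z.var(axis=0).sum()
# ... but the covariance has rank 1: the cloud is a line, not a sphere.
rank = np.linalg.matrix_rank(np.cov(z.T))
print(total_var)   # ≈ 256, so a pure variance penalty is satisfied
print(rank)        # 1
```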

Gaussianness is a much stronger condition. An isotropic Gaussian is spread out and smooth and uses all dimensions equally and has no clusters. It’s the shape that maximally prevents all forms of collapse simultaneously.

VICReg (a regularizer used in other self-supervised methods) tries to enforce this through three separate terms: a variance term, an invariance term, and a covariance term. Each has its own hyperparameter, and the terms often fight each other during training.

SIGReg replaces all of that with one condition: does this batch look Gaussian? If yes, you’ve implicitly satisfied variance, covariance, and non-collapse all at once.


The random projection trick

The remaining problem: directly testing whether a 256-dimensional distribution is Gaussian is computationally expensive. Classical normality tests (Kolmogorov–Smirnov, Shapiro–Wilk, etc.) are one-dimensional and don’t scale to high dimensions.

SIGReg uses a clever workaround based on the Cramér–Wold theorem:

A distribution in d dimensions is Gaussian if and only if every 1D projection of it is Gaussian.
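The easy direction of that equivalence can be sanity-checked numerically. In this NumPy sketch (sizes illustrative), projecting an isotropic Gaussian cloud onto a random unit direction yields scalars with the statistics of a standard 1D Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 20_000

z = rng.standard_normal((n, d))   # isotropic Gaussian cloud in 256D
r = rng.standard_normal(d)
r /= np.linalg.norm(r)            # random unit direction

proj = z @ r                      # one scalar per point
print(proj.mean(), proj.var())    # ≈ 0 and ≈ 1: the projection is N(0, 1)
```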

So instead of testing 256D normality directly, SIGReg:

  1. Samples a random unit vector r (a random direction in 256D space)
  2. Projects every embedding in the batch onto that direction: computes the scalar r · z for each embedding z
  3. Collects the resulting N scalars — one per example in the batch
  4. Tests whether those N scalars follow a 1D normal distribution (fast and cheap)
  5. Repeats with many random directions
  6. Aggregates the penalties

256D embeddings: z₁, z₂, ... zₙ
         ↓  project onto random direction r
N scalars: r·z₁, r·z₂, ... r·zₙ
         ↓  normality test
penalty for this direction
         ↓  repeat for many r's, average
SIGReg score
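The loop above can be sketched in a few lines. The function name and the moment-matching penalty here are my own stand-in — the paper uses a proper univariate normality test per direction — but the structure (project, score 1D Gaussianity, average) is the same. A standard N(0, 1) has mean 0, variance 1, skewness 0, and excess kurtosis 0:

```python
import numpy as np

def sigreg_sketch(z, num_directions=16, rng=None):
    """Illustrative SIGReg-style penalty (not the paper's exact test).

    Projects embeddings onto random unit directions and scores each 1D
    projection by moment matching against N(0, 1).
    """
    if rng is None:
        rng = np.random.default_rng()
    n, d = z.shape
    penalty = 0.0
    for _ in range(num_directions):
        r = rng.standard_normal(d)
        r /= np.linalg.norm(r)            # random unit direction
        s = z @ r                         # N scalars, one per example
        mu, var = s.mean(), s.var()
        std = np.sqrt(var) + 1e-8         # avoid division by zero
        skew = ((s - mu) ** 3).mean() / std ** 3
        kurt = ((s - mu) ** 4).mean() / std ** 4 - 3.0
        penalty += mu ** 2 + (var - 1.0) ** 2 + skew ** 2 + kurt ** 2
    return penalty / num_directions

rng = np.random.default_rng(0)
collapsed = np.zeros((4096, 256))            # full collapse
gaussian = rng.standard_normal((4096, 256))  # the target shape
print(sigreg_sketch(collapsed, rng=rng))     # 10.0: collapse is punished
print(sigreg_sketch(gaussian, rng=rng))      # near zero
```

In a real training loop this penalty would be computed on the current batch of embeddings and added to the prediction loss with a weight λ.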

The number of random directions barely matters empirically — the paper shows performance is stable even with just a handful. That leaves the SIGReg loss weight (λ = 0.1) as the only hyperparameter you actually need to tune, and even that can be found with a simple bisection search.

The result: a collapse penalty that’s cheap to compute, has a mathematical proof behind it, and catches all forms of collapse — full collapse, dimensional collapse, and distributional clustering — in one shot.


What this enables

With SIGReg handling collapse prevention, LeWorldModel can throw away all of V-JEPA’s anti-collapse machinery:

Removed              Why it was needed in V-JEPA                    Why SIGReg replaces it
EMA target encoder   Stable targets prevent collapse                SIGReg directly prevents collapse
Stop-gradient        Blocks the “both sides go to zero” shortcut    SIGReg penalizes zero-variance distributions
2× encoder memory    Needed two copies of the encoder               One encoder, no copy

The next post covers the full LeWorldModel architecture that this enables — 15M parameters, end-to-end from pixels, single GPU.


Part of my World Models series. Previous: Part 1: What is a World Model?, Part 2: World Models for Robot Learning, Part 3: World Models vs VLAs, Part 4: V-JEPA Internals.