Video-GMAE learns video correspondence by encoding clips as moving 3-D Gaussians. Identity-preserving Gaussian updates make long-range consistency part of the pretext task, so tracking emerges zero-shot and improves further with light fine-tuning.
Self-supervised video pretraining for correspondence. The model predicts Gaussians for frame 1 and residual updates for later frames, enforcing identity across time.
Gaussian video tokens
Represent each clip with a fixed set of 3-D Gaussians that move over time, so the rendered video is the 2-D projection of a dynamic 3-D scene.
Correspondence by construction
Predict per-Gaussian deltas (translation + color) for each frame so identities persist; the reconstruction loss must then honor temporal consistency (see the formulation sketch below).
Tracking emerges
Project Gaussian trajectories to the image plane to obtain zero-shot point tracking; fine-tuning yields further gains on TAP-Vid and Kubric.
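Formulation sketch
One way to write the identity-preserving update, in our own notation (the paper may parameterize differently): each Gaussian $i$ carries frame-1 parameters, and every later frame adds residuals to position and color while covariance and opacity stay fixed:
$$G_i^{(1)} = \bigl(\mu_i^{(1)},\ \Sigma_i,\ c_i^{(1)},\ \alpha_i\bigr), \qquad \mu_i^{(t)} = \mu_i^{(t-1)} + \Delta\mu_i^{(t)}, \qquad c_i^{(t)} = c_i^{(t-1)} + \Delta c_i^{(t)}.$$
Frame $t$ is rendered by splatting $\{G_i^{(t)}\}_{i=1}^{N}$. Because the index $i$ is shared across frames, the projected centers $\Pi(\mu_i^{(t)})$ trace out point tracks at no extra cost.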
Abstract
We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach to representation learning that encodes a sequence of images as a set of Gaussian splats moving over time. Representing a video with Gaussians imposes a natural inductive bias: a 2-D video is often the consistent projection of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture: mapping the trajectories of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale fine-tuning, our models achieve a 34.6% improvement on the Kinetics dataset and a 13.1% improvement on Kubric, surpassing existing self-supervised video approaches.
Method (High Level)
Masked video → ViT encoder → Gaussians for frame 1 + deltas for later frames → differentiable splatting reconstruction.
Video masked autoencoding via Gaussian splatting.
Pretraining
High masking ratio; decoder predicts Gaussians for frame 1 plus per-frame deltas; splatting closes the loop.
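Pretraining sketch (PyTorch)
A minimal, runnable sketch of this loop. The module names, token shapes, and the toy 2-D splatting stand-in are our own assumptions, not the authors' code; a real pipeline would use a differentiable 3-D splatting renderer (e.g. the gsplat library) in place of splat2d.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoGMAESketch(nn.Module):
    # Hypothetical module: encoder over visible tokens, decoder heads for
    # frame-1 Gaussians and per-frame residual deltas (names are ours).
    def __init__(self, n_gaussians=128, dim=192, frames=8, p=4):
        super().__init__()
        self.frames, self.n, self.p = frames, n_gaussians, p
        self.encoder = nn.Sequential(nn.LazyLinear(dim), nn.GELU(), nn.Linear(dim, dim))
        self.to_g1 = nn.Linear(dim, n_gaussians * p)                # frame-1 Gaussians
        self.to_deltas = nn.Linear(dim, (frames - 1) * n_gaussians * p)

    def forward(self, visible_tokens):                              # (B, N_vis, D_tok)
        z = self.encoder(visible_tokens).mean(dim=1)                # pooled latent
        g1 = self.to_g1(z).view(-1, 1, self.n, self.p)
        d = self.to_deltas(z).view(-1, self.frames - 1, self.n, self.p)
        # Identity-preserving update: Gaussian i at frame t is frame-1 Gaussian i
        # plus cumulative residuals, so each index i persists through time.
        return torch.cat([g1, g1 + d.cumsum(dim=1)], dim=1)         # (B, T, N, P)

def splat2d(traj, hw=(32, 32)):
    # Toy differentiable splatting: isotropic 2-D gray blobs so the sketch runs.
    # traj[..., :4] = (x, y, scale logit, gray value); real splatting is 3-D.
    H, W = hw
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)                            # (H, W, 2)
    mu = traj[..., :2].sigmoid()                                    # centers in [0, 1]
    sig = traj[..., 2].sigmoid() * 0.1 + 1e-3                       # radii
    col = traj[..., 3].sigmoid()                                    # gray values
    d2 = ((grid[None, None, :, :, None, :] - mu[:, :, None, None]) ** 2).sum(-1)
    w = torch.exp(-0.5 * d2 / sig[:, :, None, None] ** 2)           # (B, T, H, W, N)
    return (w * col[:, :, None, None]).sum(-1) / (w.sum(-1) + 1e-6)

model = VideoGMAESketch()
tokens = torch.randn(2, 40, 768)   # stand-in for the visible tokens after masking
clip = torch.rand(2, 8, 32, 32)    # stand-in grayscale target clip
loss = F.mse_loss(splat2d(model(tokens)), clip)
loss.backward()                    # splatting closes the loop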
Zero-shot tracking
Project Gaussian motion to a flow field and advect points; no tracking labels are needed.
Projecting Gaussian motion to image-plane flow for zero-shot tracking.
Zero-shot tracking recipe
Render Gaussian motion as dense flow, then advect points.
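Flow + advection sketch (PyTorch)
A hedged sketch of this recipe. The Gaussian-weighted averaging of displacements and the nearest-neighbor sampling are our simplifications of "render Gaussian motion as dense flow", not the paper's implementation.

import torch

def gaussian_flow(mu_t, mu_next, sigma, hw):
    # mu_t, mu_next: (N, 2) projected Gaussian centers in (x, y) pixel coords at
    # frames t and t+1; sigma: (N,) splat radii. Returns a dense (H, W, 2) flow
    # field: each pixel averages nearby Gaussians' displacements, weighted by
    # each Gaussian's splatting weight at that pixel.
    H, W = hw
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)                            # (H, W, 2)
    d2 = ((grid[:, :, None, :] - mu_t[None, None]) ** 2).sum(-1)    # (H, W, N)
    w = torch.exp(-0.5 * d2 / sigma[None, None] ** 2)
    disp = mu_next - mu_t                                           # (N, 2)
    return (w[..., None] * disp).sum(-2) / (w.sum(-1, keepdim=True) + 1e-6)

def advect(points, flow):
    # points: (P, 2) in (x, y) pixel coords; move each point by the flow sampled
    # at its location (nearest-neighbor here for brevity; bilinear in practice).
    H, W, _ = flow.shape
    x = points[:, 0].round().long().clamp(0, W - 1)
    y = points[:, 1].round().long().clamp(0, H - 1)
    return points + flow[y, x]

# Example: carry query points through a clip, given toy Gaussian trajectories.
T, N, H, W = 8, 64, 32, 32
mus = torch.rand(T, N, 2) * torch.tensor([W - 1.0, H - 1.0])
sigma = torch.full((N,), 2.0)
pts = torch.tensor([[5.0, 5.0], [20.0, 12.0]])
for t in range(T - 1):
    pts = advect(pts, gaussian_flow(mus[t], mus[t + 1], sigma, (H, W)))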
Placeholder (anonymous) citation; update once arXiv metadata is available.
Copy-ready BibTeX
@misc{videogmae2026,
title = {Tracking by Predicting 3-D Gaussians Over Time},
author = {Anonymous},
year = {2026},
note = {CVPR submission, under review}
}
Acknowledgements: This website template is borrowed from MonST3R, which was in turn inspired by SD+DINO and originally by DreamBooth.