Tracking by Predicting 3-D Gaussians Over Time

Authors are anonymous for CVPR submission.

Video-GMAE learns video correspondence by encoding clips as moving 3-D Gaussians. Identity-preserving Gaussian updates make long-range consistency part of the pretext task, so tracking emerges zero-shot and improves further with light fine-tuning.

Video-GMAE overview figure
Self-supervised video pretraining for correspondence. The model predicts Gaussians for frame 1 and residual updates for later frames, enforcing identity across time.

Gaussian video tokens

Represent each clip with a fixed set of 3-D Gaussians that move in time, matching the 2-D projection of a dynamic 3-D scene.

Correspondence by construction

Predict per-Gaussian deltas (translation and color) for each frame so identities persist; the reconstruction loss can then only be minimized by respecting temporal consistency.

Tracking emerges

Project Gaussian trajectories to the image plane to obtain zero-shot point tracking; fine-tuning yields further gains on TAP-Vid and Kubric.

Abstract

We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectories of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale fine-tuning, our models achieve a 34.6% improvement on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video approaches.

Method (High Level)

Masked video → ViT encoder → Gaussians for frame 1 + deltas for later frames → differentiable splatting reconstruction.

Pretraining pipeline with Gaussian splatting
Video masked autoencoding via Gaussian splatting.

Pretraining

High masking ratio; decoder predicts Gaussians for frame 1 plus per-frame deltas; splatting closes the loop.
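
For concreteness, here is a minimal Python sketch of one pretraining step under these assumptions; encoder, decoder, splat, and apply_delta are hypothetical stand-ins for the actual modules, not the released implementation.

import torch

def pretraining_step(video, mask, encoder, decoder, splat):
    # Illustrative sketch only; encoder, decoder, splat, and apply_delta
    # are hypothetical interfaces, not the paper's code.
    #   encoder: masked video -> latent tokens
    #   decoder: latents -> frame-1 Gaussians + (T-1) per-frame deltas
    #   splat:   Gaussians -> rendered RGB frame (differentiable)
    T = video.shape[0]                        # video: (T, H, W, 3)
    latents = encoder(video, mask)            # encode visible patches only
    g0, deltas = decoder(latents)

    loss, g = 0.0, g0
    for t in range(T):
        if t > 0:
            # identity-preserving update: same Gaussians, shifted means/colors
            g = g.apply_delta(deltas[t - 1])
        recon = splat(g)                      # differentiable splatting
        loss = loss + torch.mean((recon - video[t]) ** 2)
    return loss / T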

Zero-shot tracking

Project Gaussian motion to a flow field and advect points—no tracking labels.

Zero-shot tracking schematic
Projecting Gaussian motion to image-plane flow for zero-shot tracking.

Zero-shot tracking recipe

Render Gaussian motion as dense flow, then advect points.

Projected centers: \(x_i^{(t)} = \Pi(\mu_i^{(t)})\)

Displacement: \(\Delta x_i^{(t)} = x_i^{(t+1)} - x_i^{(t)}\)

Flow: \(F^{(t)}(u) = \sum_i \alpha_i^{(t)}(u)\,\Delta x_i^{(t)}\)

Update: \(p^{(t+1)} = p^{(t)} + F^{(t)}(p^{(t)})\)
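
As a worked example, the NumPy sketch below transcribes this recipe; the projected centers and the alpha weight function are assumed to come from the pretrained model, and both interfaces are placeholders rather than the paper's API.

import numpy as np

def advect_points(points, centers, alpha):
    # points:  (P, 2) query points in frame 0
    # centers: (T, N, 2) projected Gaussian centers x_i^(t) = Pi(mu_i^(t))
    # alpha:   callable alpha(u, t) -> (N,) splatting weights of all
    #          Gaussians at image location u in frame t (placeholder)
    T = centers.shape[0]
    p = points.copy()
    tracks = [p.copy()]
    for t in range(T - 1):
        disp = centers[t + 1] - centers[t]                   # Delta x_i^(t)
        new_p = np.empty_like(p)
        for k, u in enumerate(p):
            w = alpha(u, t)                                  # weights at u
            new_p[k] = u + (w[:, None] * disp).sum(axis=0)   # p + F^(t)(u)
        p = new_p
        tracks.append(p.copy())
    return np.stack(tracks)                                  # (T, P, 2) tracks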

Occlusion-aware variant (optional)

Keep a fixed top‑k anchor set from the first frame, track its visible mass at the query point, blend the flow update with the anchor proposal while the point is visible, and fall back to anchors alone when the visible mass drops below a threshold.

Anchor mass: \(\omega^{(t)} = \sum_{i\in\mathcal{S}} \alpha_i^{(t)}(p^{(t)})\)

Weights over anchors: \(\tilde{\pi}_i^{(t)} = \dfrac{\alpha_i^{(t)}(p^{(t)})}{\sum_{j\in\mathcal{S}} \alpha_j^{(t)}(p^{(t)}) + \varepsilon}\)

Anchor proposal: \(\hat{p}_{\text{anch}}^{(t+1)} = \sum_{i\in\mathcal{S}} \tilde{\pi}_i^{(t)} \left(x_i^{(t)} + \Delta x_i^{(t)}\right)\)

Blend with flow while visible (\(\omega^{(t)} \ge \tau_{\text{vis}}\)): \(p^{(t+1)} = (1-\beta)\big(p^{(t)} + F^{(t)}(p^{(t)})\big) + \beta\,\hat{p}_{\text{anch}}^{(t+1)}\)

Otherwise (\(\omega^{(t)} < \tau_{\text{vis}}\)) use anchors only: \(p^{(t+1)} = \hat{p}_{\text{anch}}^{(t+1)}\)

Hyperparameters (from the paper): \(k = 8\), \(\tau_{\text{vis}} = 0.5\), \(\beta = 0.3\).
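
A corresponding sketch of one occlusion-aware update, transcribing the equations above; as before, centers and alpha are placeholder interfaces, and anchors holds the indices of the top-k Gaussians selected in the first frame.

import numpy as np

def occlusion_aware_step(p, anchors, centers, alpha, t,
                         tau_vis=0.5, beta=0.3, eps=1e-8):
    # p:       (2,) current point estimate p^(t)
    # anchors: indices S of the top-k anchor Gaussians chosen in frame 0
    # centers: (T, N, 2) projected Gaussian centers; alpha as above
    w = alpha(p, t)                                    # (N,) weights at p^(t)
    disp = centers[t + 1] - centers[t]                 # Delta x_i^(t)

    omega = w[anchors].sum()                           # anchor (visible) mass
    pi = w[anchors] / (omega + eps)                    # normalized weights

    # anchor proposal: weighted anchor positions projected into frame t+1
    p_anch = (pi[:, None] * (centers[t, anchors] + disp[anchors])).sum(axis=0)

    if omega < tau_vis:                                # point likely occluded
        return p_anch                                  # anchors only
    flow = (w[:, None] * disp).sum(axis=0)             # F^(t)(p^(t))
    return (1 - beta) * (p + flow) + beta * p_anch     # blend flow and anchors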

Fine-tuning cross-attention readout
Fine-tuning: cross-attention readout over encoder latents improves precision and occlusion handling.
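
As a rough illustration of what such a readout could look like, a minimal PyTorch sketch follows; the dimensions, the two-layer head, and the (dx, dy, visibility) output are assumptions for illustration, not the paper's exact fine-tuning architecture.

import torch.nn as nn

class CrossAttentionReadout(nn.Module):
    # Per-point query tokens attend to (frozen) encoder latents; a small
    # head regresses a position refinement and a visibility logit. Sketch only.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 3))    # (dx, dy, visibility)

    def forward(self, queries, latents):
        # queries: (B, P, dim) one token per tracked point
        # latents: (B, L, dim) encoder tokens for the clip
        out, _ = self.attn(queries, latents, latents)
        return self.head(out)                           # (B, P, 3)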

Results

From zero-shot tracking to fine-tuned precision; videos are the primary evidence.

Zero-shot

Stable tracks emerge from Gaussian motion without labels.

Comparisons

Video-GMAE vs. GMRW‑C: temporal stability versus fidelity on tiny, fast-moving details.

Fine-tuned

Light supervision sharpens trajectories and occlusion handling.

Reconstructions

Pretraining renders show what the Gaussians capture.


Zero-shot Tracking

Correspondence emerges directly from Gaussian motion—no tracking labels.

DAVIS zero-shot (two sequences, looped).

Kinetics zero-shot (two sequences, looped).

Zero-shot failure cases.


Comparisons vs. GMRW‑C

Video-GMAE is more temporally stable; GMRW‑C can better preserve tiny, fast details.

TAP-Vid Kinetics comparison.

TAP-Vid DAVIS comparison.

Static strips (optional)
Comparison strip 1
Comparison strip 2


Fine-tuned Tracking

Light supervision sharpens localization and occlusion handling.

Fine-tuned: TAP-Vid DAVIS (two sequences, looped).

Fine-tuned: TAP-Vid Kinetics (two sequences, looped).


Pretraining Reconstructions

Rendered Gaussian trajectories during pretraining capture coarse structure and motion.

Dynamic reconstructions from Gaussians.


BibTeX

Placeholder (anonymous) citation; update once arXiv metadata is available.

Copy-ready BibTeX
@misc{videogmae2026,
  title        = {Tracking by Predicting 3-D Gaussians Over Time},
  author       = {Anonymous},
  year         = {2026},
  note         = {CVPR submission, under review}
}

Acknowledgements: We borrow this website template from MonST3R, which was in turn inspired by SD+DINO and originally by DreamBooth.