Tracking by Predicting 3-D Gaussians Over Time

Authors are anonymous for CVPR submission.

Video-GMAE learns video correspondence by encoding clips as moving 3-D Gaussians. Identity-preserving Gaussian updates make long-range consistency part of the pretext task, so tracking emerges zero-shot and improves further with light fine-tuning.

Video-GMAE overview figure
Self-supervised video pretraining for correspondence. The model predicts Gaussians for frame 1 and residual updates for later frames, enforcing identity across time.

Gaussian video tokens

Represent each clip with a fixed set of 3-D Gaussians that move in time, matching the 2-D projection of a dynamic 3-D scene.

Correspondence by construction

Predict per-Gaussian deltas (translation and color) for each frame so identities persist; the reconstruction loss can then only be minimized by respecting temporal consistency.

Tracking emerges

Project Gaussian trajectories to the image plane to obtain zero-shot point tracking; fine-tuning yields further gains on TAP-Vid and Kubric.

Abstract

We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectories of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to the state of the art. With small-scale fine-tuning, our models achieve a 34.6% improvement on Kinetics and 13.1% on Kubric, surpassing existing self-supervised video approaches.

Method (High Level)

Masked video → ViT encoder → Gaussians for frame 1 + deltas for later frames → differentiable splatting reconstruction.

Pretraining pipeline with Gaussian splatting
Video masked autoencoding via Gaussian splatting.

Pretraining

High masking ratio; decoder predicts Gaussians for frame 1 plus per-frame deltas; splatting closes the loop.
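
For concreteness, here is a minimal Python sketch of one pretraining step under these assumptions; encoder, decoder, splat, and apply_delta are hypothetical stand-ins for the actual modules, not the released implementation.

import torch

def pretraining_step(video, mask, encoder, decoder, splat):
    # Illustrative sketch only; encoder, decoder, splat, and apply_delta
    # are hypothetical interfaces, not the paper's code.
    #   encoder: masked video -> latent tokens
    #   decoder: latents -> frame-1 Gaussians + (T-1) per-frame deltas
    #   splat:   Gaussians -> rendered RGB frame (differentiable)
    T = video.shape[0]                        # video: (T, H, W, 3)
    latents = encoder(video, mask)            # encode visible patches only
    g0, deltas = decoder(latents)

    loss, g = 0.0, g0
    for t in range(T):
        if t > 0:
            # identity-preserving update: same Gaussians, shifted means/colors
            g = g.apply_delta(deltas[t - 1])
        recon = splat(g)                      # differentiable splatting
        loss = loss + torch.mean((recon - video[t]) ** 2)
    return loss / T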

Zero-shot tracking

Project Gaussian motion to a flow field and advect points—no tracking labels.

Zero-shot tracking schematic
Projecting Gaussian motion to image-plane flow for zero-shot tracking.

Zero-shot tracking recipe

Render Gaussian motion as dense flow, then advect points.

Projected centers: \(x_i^{(t)} = \Pi(\mu_i^{(t)})\)

Displacement: \(\Delta x_i^{(t)} = x_i^{(t+1)} - x_i^{(t)}\)

Flow: \(F^{(t)}(u) = \sum_i \alpha_i^{(t)}(u)\,\Delta x_i^{(t)}\)

Update: \(p^{(t+1)} = p^{(t)} + F^{(t)}(p^{(t)})\)
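
As a worked example, the NumPy sketch below transcribes this recipe; the projected centers and the alpha weight function are assumed to come from the pretrained model, and both interfaces are placeholders rather than the paper's API.

import numpy as np

def advect_points(points, centers, alpha):
    # points:  (P, 2) query points in frame 0
    # centers: (T, N, 2) projected Gaussian centers x_i^(t) = Pi(mu_i^(t))
    # alpha:   callable alpha(u, t) -> (N,) splatting weights of all
    #          Gaussians at image location u in frame t (placeholder)
    T = centers.shape[0]
    p = points.copy()
    tracks = [p.copy()]
    for t in range(T - 1):
        disp = centers[t + 1] - centers[t]                   # Delta x_i^(t)
        new_p = np.empty_like(p)
        for k, u in enumerate(p):
            w = alpha(u, t)                                  # weights at u
            new_p[k] = u + (w[:, None] * disp).sum(axis=0)   # p + F^(t)(u)
        p = new_p
        tracks.append(p.copy())
    return np.stack(tracks)                                  # (T, P, 2) tracks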

Occlusion-aware variant (optional)

Keep a fixed top‑k anchor set from the first frame, track its visible mass at the query point, blend the flow update with the anchor proposal while the point is visible, and fall back to anchors alone when the visible mass drops below a threshold.

Anchor mass: \(\omega^{(t)} = \sum_{i\in\mathcal{S}} \alpha_i^{(t)}(p^{(t)})\)

Weights over anchors: \(\tilde{\pi}_i^{(t)} = \dfrac{\alpha_i^{(t)}(p^{(t)})}{\sum_{j\in\mathcal{S}} \alpha_j^{(t)}(p^{(t)}) + \varepsilon}\)

Anchor proposal: \(\hat{p}_{\text{anch}}^{(t+1)} = \sum_{i\in\mathcal{S}} \tilde{\pi}_i^{(t)} \left(x_i^{(t)} + \Delta x_i^{(t)}\right)\)

Blend with flow while visible (\(\omega^{(t)} \ge \tau_{\text{vis}}\)): \(p^{(t+1)} = (1-\beta)\big(p^{(t)} + F^{(t)}(p^{(t)})\big) + \beta\,\hat{p}_{\text{anch}}^{(t+1)}\)

Otherwise (\(\omega^{(t)} < \tau_{\text{vis}}\)) use anchors only: \(p^{(t+1)} = \hat{p}_{\text{anch}}^{(t+1)}\)

Hyperparameters (from the paper): \(k = 8\), \(\tau_{\text{vis}} = 0.5\), \(\beta = 0.3\).
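
A corresponding sketch of one occlusion-aware update, transcribing the equations above; as before, centers and alpha are placeholder interfaces, and anchors holds the indices of the top-k Gaussians selected in the first frame.

import numpy as np

def occlusion_aware_step(p, anchors, centers, alpha, t,
                         tau_vis=0.5, beta=0.3, eps=1e-8):
    # p:       (2,) current point estimate p^(t)
    # anchors: indices S of the top-k anchor Gaussians chosen in frame 0
    # centers: (T, N, 2) projected Gaussian centers; alpha as above
    w = alpha(p, t)                                    # (N,) weights at p^(t)
    disp = centers[t + 1] - centers[t]                 # Delta x_i^(t)

    omega = w[anchors].sum()                           # anchor (visible) mass
    pi = w[anchors] / (omega + eps)                    # normalized weights

    # anchor proposal: weighted anchor positions projected into frame t+1
    p_anch = (pi[:, None] * (centers[t, anchors] + disp[anchors])).sum(axis=0)

    if omega < tau_vis:                                # point likely occluded
        return p_anch                                  # anchors only
    flow = (w[:, None] * disp).sum(axis=0)             # F^(t)(p^(t))
    return (1 - beta) * (p + flow) + beta * p_anch     # blend flow and anchors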

Fine-tuning cross-attention readout
Fine-tuning: cross-attention readout over encoder latents improves precision and occlusion handling.
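
As a rough illustration of what such a readout could look like, a minimal PyTorch sketch follows; the dimensions, the two-layer head, and the (dx, dy, visibility) output are assumptions for illustration, not the paper's exact fine-tuning architecture.

import torch.nn as nn

class CrossAttentionReadout(nn.Module):
    # Per-point query tokens attend to (frozen) encoder latents; a small
    # head regresses a position refinement and a visibility logit. Sketch only.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, 3))    # (dx, dy, visibility)

    def forward(self, queries, latents):
        # queries: (B, P, dim) one token per tracked point
        # latents: (B, L, dim) encoder tokens for the clip
        out, _ = self.attn(queries, latents, latents)
        return self.head(out)                           # (B, P, 3)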

Results

From zero-shot tracking to fine-tuned precision; videos are the primary evidence.

Zero-shot

Stable tracks emerge from Gaussian motion without labels.

Comparisons

Video-GMAE vs. GMRW‑C: temporal stability versus fidelity on tiny, fast-moving details.

Fine-tuned

Light supervision sharpens trajectories and occlusion handling.

Reconstructions

Pretraining renders show what the Gaussians capture.


Zero-shot Tracking

Correspondence emerges directly from Gaussian motion—no tracking labels.

DAVIS zero-shot (two sequences, looped).

Kinetics zero-shot (two sequences, looped).

Zero-shot failure cases.


Comparisons vs. GMRW‑C

Video-GMAE is more temporally stable; GMRW‑C can better preserve tiny, fast details.

TAP-Vid Kinetics comparison.

TAP-Vid DAVIS comparison.

Static strips (optional)
Comparison strip 1
Comparison strip 2


Fine-tuned Tracking

Light supervision sharpens localization and occlusion handling.

Fine-tuned: TAP-Vid DAVIS (two sequences, looped).

Fine-tuned: TAP-Vid Kinetics (two sequences, looped).


Pretraining Reconstructions

Rendered Gaussian trajectories during pretraining capture coarse structure and motion.

Dynamic reconstructions from Gaussians.


BibTeX

Placeholder (anonymous) citation; update once arXiv metadata is available.

Copy-ready BibTeX
@misc{videogmae2026,
  title        = {Tracking by Predicting 3-D Gaussians Over Time},
  author       = {Anonymous},
  year         = {2026},
  note         = {CVPR submission, under review}
}

Acknowledgements: We borrow this website template from MonST3R, which was in turn inspired by SD+DINO and originally by DreamBooth.