Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera
poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction
methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly
applying them to dynamic scenes leads to inaccurate results. This discrepancy arises
because moving objects violate multi-view geometric constraints, disrupting the reconstruction.
To address this, we introduce C4D, a framework that leverages temporal
Correspondences to extend the existing 3D reconstruction formulation to 4D.
Specifically, in addition to predicting pointmaps, C4D captures two types of Correspondences:
short-term optical flow and long-term point tracking. We train a dynamic-aware
point tracker that provides additional mobility information, which facilitates the
estimation of motion masks that separate moving elements from the static background,
thereby offering more reliable guidance for dynamic scenes.
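To illustrate the idea of turning per-track mobility information into a motion mask, here is a minimal sketch. The function name, the mobility-score range, and the threshold are illustrative assumptions, not C4D's actual interface: tracked points whose mobility score exceeds a threshold are rasterized into a per-pixel mask marking dynamic regions.

```python
# Hypothetical sketch: converting per-track mobility scores into a motion mask.
# Names, score semantics, and threshold are assumptions, not C4D's API.
import numpy as np

def motion_mask_from_tracks(track_xy, mobility, shape, threshold=0.5, radius=4):
    """Rasterize tracked points whose mobility score exceeds `threshold`
    into a boolean per-pixel motion mask of the given (H, W) shape."""
    h, w = shape
    mask = np.zeros((h, w), dtype=bool)
    ys, xs = np.mgrid[0:h, 0:w]  # pixel coordinate grids
    for (x, y), m in zip(track_xy, mobility):
        if m > threshold:  # treat this track as a moving point
            # mark a small disk around the dynamic point
            mask |= (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    return mask

# Toy usage: one static track (score 0.1) and one moving track (score 0.9).
mask = motion_mask_from_tracks([(2, 2), (10, 10)], [0.1, 0.9], (16, 16))
```

In a real pipeline the mask would then gate which pixels contribute to the static multi-view geometric constraints, so moving objects no longer corrupt the reconstruction.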
Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame
3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories
into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show
that our framework achieves complete 4D recovery and demonstrates strong performance across
multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking.