GFlow: Recovering 4D World from Monocular Video

National University of Singapore
AAAI 2025

GFlow represents the video as a 4D explicit representation, entailing a flow of Gaussian splatting through space and time.

*Note: All videos are played at a low FPS to facilitate careful observation.

A) Reconstructed frames

B) 2D tracking

C) 3D tracking

D) Zero-shot segmentation of moving object

E) Center of all Gaussians

F) Center of moving Gaussians

G) Center of still Gaussians

H) Consistent depth


Novel View Synthesis on 4D Scene

Optimization Process

Optimization process of frames (Moving area is masked)

Optimization process of Gaussian centers


Abstract

Recovering a 4D world from monocular video is a crucial yet challenging task. Conventional methods usually rely on the assumptions of multi-view videos, known camera parameters, or static scenes. In this paper, we relax all these constraints and tackle a highly ambitious but practical task: given only one monocular video without camera parameters, we aim to recover the dynamic 3D world alongside the camera poses. To solve this, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video to a 4D scene, represented as a flow of 3D Gaussians through space and time. GFlow starts by segmenting the video into still and moving parts, then alternates between optimizing camera poses and the dynamics of the 3D Gaussian points. This method ensures consistency among adjacent points and smooth transitions between frames. Since dynamic scenes continually introduce new visual content, we present prior-driven initialization and pixel-wise densification strategies for Gaussian points to integrate new content. By combining these techniques, GFlow transcends the boundaries of 4D recovery from casual videos; it naturally enables tracking of points and segmentation of moving objects across frames. Additionally, GFlow estimates the camera pose for each frame, enabling novel view synthesis by changing the camera pose. This capability facilitates extensive scene-level or object-level editing, highlighting GFlow's versatility and effectiveness.
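To make the idea of lifting a frame with a 2D depth prior concrete, the sketch below shows the standard pinhole unprojection \(X = d \cdot K^{-1} [u, v, 1]^T\) and its inverse. It is only an illustration: the function names `unproject` and `project` are hypothetical, and whether GFlow initializes Gaussian centers exactly this way is an assumption, not something stated above.

```python
import numpy as np


def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift every pixel to a 3D point in the camera frame (hypothetical helper).

    depth: (H, W) per-pixel depth map, e.g. from a monocular depth prior.
    K:     (3, 3) pinhole intrinsics matrix.
    Returns an (H*W, 3) array of points, in row-major pixel order.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T  # K^{-1} [u, v, 1]^T for each pixel
    return rays * depth.reshape(-1, 1)  # scale each ray by its depth


def project(points: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project camera-frame 3D points back to pixel coordinates, shape (N, 2)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]
```

Projecting the unprojected points with the same \(K\) returns the original pixel grid, which is a quick sanity check on the intrinsics convention.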


Overview


A) Given a monocular video input consisting of an image sequence \(\{I_t\}\), the associated depth \(\{D_t\}\), optical flow \(\{F_t\}\) and camera intrinsics \(K\) are obtained using off-the-shelf priors. B) For each frame, GFlow first clusters the scene into a still part \(G_t^{s}\) and a moving part \(G_t^{m}\). The optimization process in GFlow then consists of two steps: C1) Only the camera pose \(P_t\) is optimized, by aligning the appearance, depth and optical flow within the still cluster. C2) Under the optimized camera pose \(P_t^*\), the Gaussian points \(\{G_t\}\) are optimized and densified based on appearance, depth, optical flow and the two scene clusters. D) The same procedure of steps B, C1, and C2 is repeated for the next frame. The colorful marks under the dashed line represent the variables involved in the optimization.
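The control flow of steps B, C1, C2 and D can be sketched as a simple per-frame loop. This is a minimal skeleton under stated assumptions, not the authors' implementation: `alternate_per_frame` and the three callables `cluster_still`, `fit_pose` and `fit_gaussians` are hypothetical placeholders for the clustering and the two optimization steps, whose losses and update rules are not specified here.

```python
from typing import Any, Callable, List, Sequence, Tuple


def alternate_per_frame(
    frames: Sequence[Any],
    depths: Sequence[Any],
    flows: Sequence[Any],
    K: Any,
    gaussians: Any,
    cluster_still: Callable[..., Any],
    fit_pose: Callable[..., Any],
    fit_gaussians: Callable[..., Any],
) -> Tuple[List[Any], Any]:
    """Run the B -> C1 -> C2 alternation once per frame (hypothetical skeleton).

    All three callables are assumptions supplied by the caller:
      cluster_still(gaussians, flow)          -> still/moving assignment (step B)
      fit_pose(gaussians, still, I, D, F, K)  -> optimized camera pose (step C1)
      fit_gaussians(gaussians, still, pose, I, D, F, K)
                                              -> updated, densified Gaussians (step C2)
    """
    poses: List[Any] = []
    for frame, depth, flow in zip(frames, depths, flows):
        # B) Split the current Gaussians into still and moving clusters.
        still = cluster_still(gaussians, flow)
        # C1) Optimize only the camera pose, using the still cluster.
        pose = fit_pose(gaussians, still, frame, depth, flow, K)
        # C2) With the pose fixed, optimize and densify the Gaussians.
        gaussians = fit_gaussians(gaussians, still, pose, frame, depth, flow, K)
        poses.append(pose)
        # D) The loop repeats the same three steps for the next frame.
    return poses, gaussians
```

Keeping the pose step and the Gaussian step as separate callables mirrors the alternation described above: the pose is fitted against the still cluster only, and the Gaussians are then updated with that pose held fixed.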



More Results

3D tracking


2D tracking


Center of Gaussians


Consistent Depth


Moving area segmentation in video


BibTeX

@inproceedings{wang2024gflow,
      title={GFlow: Recovering 4D World from Monocular Video},
      author={Wang, Shizun and Yang, Xingyi and Shen, Qiuhong and Jiang, Zhenxiang and Wang, Xinchao},
      booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
      year={2025}
}