GFlow: Recovering 4D World from Monocular Video

National University of Singapore
arXiv 2024

GFlow represents a video as an explicit 4D representation: a flow of Gaussian splatting through space and time.

Teaser panels: A) Reconstructed frames; B) 2D tracking; C) 3D tracking; D) Zero-shot segmentation of moving objects; E) Centers of all Gaussians; F) Centers of moving Gaussians; G) Centers of still Gaussians; H) Consistent depth.


Abstract

Reconstructing 4D scenes from video inputs is a crucial yet challenging task. Conventional methods usually rely on assumptions such as multi-view video inputs, known camera parameters, or static scenes, all of which are typically absent in in-the-wild scenarios. In this paper, we relax all these constraints and tackle a highly ambitious but practical task, which we term AnyV4D: we assume only one monocular video is available without any camera parameters as input, and we aim to recover the dynamic 4D world alongside the camera poses. To this end, we introduce GFlow, a new framework that utilizes only 2D priors (depth and optical flow) to lift a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time. GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process that optimizes camera poses and the dynamics of 3D Gaussian points based on 2D priors and scene clustering, ensuring fidelity among neighboring points and smooth movement across frames. Since dynamic scenes continually introduce new content, we also propose a pixel-wise densification strategy for Gaussian points to integrate this new visual content. Moreover, GFlow transcends the boundaries of mere 4D reconstruction: it enables tracking of any point across frames without prior training and segments moving objects from the scene in an unsupervised way. Additionally, the camera pose of each frame can be derived from GFlow, allowing novel views of a video scene to be rendered by changing the camera pose. By employing an explicit representation, we can readily conduct scene-level or object-level editing as desired, underscoring its versatility and power.
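To make the still/moving clustering concrete, below is a minimal sketch (not the authors' implementation) of one way such a split can be computed: Gaussians whose observed 2D motion is already explained by camera motion alone are labelled still, and the rest moving. The helper name `cluster_gaussians`, its inputs, and the pixel threshold are hypothetical.

```python
# A minimal sketch of flow-residual clustering: Gaussians whose observed 2D
# motion matches the camera-induced flow are "still"; the rest are "moving".
# The threshold and all names here are illustrative, not GFlow's actual code.
import numpy as np

def cluster_gaussians(obs_flow: np.ndarray,
                      cam_flow: np.ndarray,
                      thresh: float = 1.0) -> np.ndarray:
    """Label each Gaussian as moving (True) or still (False).

    obs_flow: (N, 2) observed 2D flow at each Gaussian's projection,
              sampled from an off-the-shelf optical-flow map.
    cam_flow: (N, 2) flow induced purely by the estimated camera motion
              (project each point under the old and new pose, subtract).
    thresh:   residual magnitude (pixels) above which a point is "moving".
    """
    residual = np.linalg.norm(obs_flow - cam_flow, axis=1)
    return residual > thresh

# Toy usage: five points, the last one moves independently of the camera.
obs = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [4, 3]], dtype=float)
cam = np.tile([1.0, 0.0], (5, 1))
print(cluster_gaussians(obs, cam))  # [False False False False  True]
```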


Overview


A) Given a monocular video consisting of an image sequence \(\{I_t\}\), the associated depth maps \(\{D_t\}\), optical flow \(\{F_t\}\), and camera intrinsics \(K\) are obtained from off-the-shelf priors. B) For each frame \(I_t\), GFlow first clusters the scene into a still part \(G_t^{s}\) and a moving part \(G_t^{m}\). The optimization process in GFlow then consists of two steps: C1) only the camera pose \(P_t\) is optimized, by aligning the depth and optical flow within the still cluster; C2) under the optimized camera pose \(P_t^*\), the Gaussian points \(\{G_t\}\) are optimized and densified based on RGB, depth, optical flow, and the two scene clusters. D) The same procedure (steps B, C1, and C2) repeats for the next frame. The colorful graphics under the dashed line represent the variables involved in the optimization.
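As a rough illustration of the two-step loop above, the PyTorch sketch below alternates C1 (fitting the camera pose against reprojection and depth of the still cluster only) and C2 (refining Gaussian positions under the fixed pose). It is a simplified stand-in under stated assumptions, not GFlow's released code: the pose update is translation-only for brevity, the losses are simple reprojection/depth terms rather than the paper's full photometric, depth, and flow objectives, and densification is omitted.

```python
import torch

def project(points, pose, K):
    """Pinhole projection of (N, 3) world points with a 4x4 pose and 3x3 K."""
    cam = (pose[:3, :3] @ points.T).T + pose[:3, 3]   # world -> camera frame
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]        # pixel coords, depths

def optimize_frame(points, still_mask, target_uv, target_depth, K, steps=100):
    # C1: fit a pose update (translation-only here, for brevity) against
    # the reprojection and depth of the *still* Gaussians only.
    t = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([t], lr=1e-2)
    base = torch.eye(4)
    for _ in range(steps):
        opt.zero_grad()
        P = base.clone()
        P[:3, 3] = t
        uv, d = project(points[still_mask], P, K)
        loss = ((uv - target_uv[still_mask]) ** 2).mean() \
             + ((d - target_depth[still_mask]) ** 2).mean()
        loss.backward()
        opt.step()
    P_star = base.clone()
    P_star[:3, 3] = t.detach()

    # C2: with the pose frozen, refine all Gaussian positions (real GFlow
    # also updates appearance attributes and densifies new content).
    pts = points.clone().requires_grad_(True)
    opt = torch.optim.Adam([pts], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        uv, d = project(pts, P_star, K)
        loss = ((uv - target_uv) ** 2).mean() + ((d - target_depth) ** 2).mean()
        loss.backward()
        opt.step()
    return P_star, pts.detach()
```

Optimizing the pose on the still cluster alone is the key design choice: it keeps independently moving objects from corrupting the ego-motion estimate.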


Optimization progress on reconstructed frames

Optimization progress on Gaussian centers


More Results

3D tracking


2D tracking


Zero-shot segmentation


Centers of Gaussians


Consistent Depth