Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

* University of Southern California $ Georgia Institute of Technology
§ Stanford University NVIDIA Research Equal advising


We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations, parameterized by 3D Gaussians and their velocities, in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding. Our code and model will be released soon.

method overview

TL;DR: STORM predicts 3D Gaussians and their motions from sparse observations in a feed-forward manner, outperforming existing methods in speed, accuracy, and generalization while enabling real-time rendering and additional applications.

STORM Feed-forward Reconstruction Examples

STORM reconstructs 3D representations and scene motions in a feed-forward manner. For each example, we present the input frames (Context RGB), the reconstructed RGB, depth maps, predicted scene flows, and motion segmentation. Ground truth scene flows are included for qualitative comparison, though they are not used for supervision.
Observe that STORM sometimes predicts scene motions that are not annotated in the ground truth (e.g., the first example).

Novel View Synthesis Results

We show novel view synthesis results of STORM. These novel views are rendered directly from the 3D scene representation and the scene flows predicted by STORM.
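To make the role of the scene flows concrete, here is a minimal sketch of moving Gaussians from their source frames to a target timestep before rendering. It assumes constant-velocity (linear) motion between timesteps, and the function name `aggregate_to_target` and its arguments are hypothetical, chosen for illustration rather than taken from the STORM code.

```python
import numpy as np

def aggregate_to_target(means, velocities, timestamps, t_target):
    """Move Gaussian centers from their source frames to t_target.

    means:      (N, 3) Gaussian centers, each valid at its own source time
    velocities: (N, 3) per-Gaussian velocity (distance per unit time)
    timestamps: (N,)   source time of each Gaussian
    t_target:   scalar target time

    Assumption: constant-velocity motion, used here only for illustration.
    """
    means = np.asarray(means, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    dt = t_target - np.asarray(timestamps, dtype=float)  # (N,)
    return means + velocities * dt[:, None]

# Two Gaussians observed at t=0 and t=1: the first is static,
# the second moves along the x axis.
means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
velocities = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
timestamps = np.array([0.0, 1.0])
moved = aggregate_to_target(means, velocities, timestamps, t_target=1.5)
```

After this step, Gaussians from all context frames share one timestep and can be rendered together from any viewpoint.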

4D Visualization

Point Trajectory Visualization

We visualize the trajectories of dynamic Gaussians by chaining per-frame scene flows. Specifically, at each frame t, we use the predicted scene flow to transform the Gaussians to their estimated positions in the next frame (t+1). For every Gaussian in frame t+1, we identify its nearest transformed Gaussian from frame t and connect the two to extend the trajectory. This process is applied recursively across all frames to construct the complete trajectories. The color of each trajectory is determined by applying PCA to the motion assignment weights: the weights, of shape (N,16), are projected to (N,3), and the three principal components are used as RGB values. Note that estimating point trajectories is not the primary objective of STORM; the trajectory visualizations are provided solely for qualitative illustration.
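The procedure above can be sketched as follows. This is an illustrative approximation, not the released visualization code: the names `chain_step` and `pca_colors` are hypothetical, the nearest-neighbor search is brute force, and rescaling the PCA output to [0, 1] is an assumption made so the values can serve as RGB.

```python
import numpy as np

def chain_step(pos_t, flow_t, pos_next):
    """For every Gaussian in frame t+1, return the index of the nearest
    flow-transformed Gaussian from frame t (brute-force search)."""
    warped = pos_t + flow_t                                   # (N, 3)
    diff = pos_next[:, None, :] - warped[None, :, :]          # (M, N, 3)
    return (diff ** 2).sum(axis=-1).argmin(axis=1)            # (M,)

def pca_colors(weights):
    """Project (N, K) motion assignment weights onto their first three
    principal components and rescale each channel to [0, 1]."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                                # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-8)

# Toy example: Gaussian 0 moves by +1 along x, Gaussian 1 is static.
pos_t = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
flow_t = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
pos_next = np.array([[5.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
parents = chain_step(pos_t, flow_t, pos_next)  # frame t+1 index -> frame t index

rng = np.random.default_rng(0)
colors = pca_colors(rng.random((100, 16)))     # stand-in for real weights
```

Applying `chain_step` to each consecutive pair of frames and following the returned indices links the Gaussians into trajectories.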

* Currently, these demos are presented as video recordings due to time constraints. We plan to make them interactive in the future.

Latent-STORM Feed-forward Reconstruction Examples

Latent-STORM operates in latent space, initially rendering an 8x downsampled feature image and then upsampling it to RGB-D output using deconvolution layers.
For each example, we show the input frames (Context RGB), the predicted RGB, depth maps, scene flows, and motion segmentation.
Note that the predicted depth maps, flows, opacity maps, and motion segmentation masks are all in the downsampled space, i.e., they are rasterized in the 8x downsampled space.
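As a rough illustration of how deconvolution layers can recover full resolution from an 8x downsampled feature image, the sketch below applies the standard transposed-convolution output-size formula three times. The layer count and the kernel, stride, and padding values are assumptions chosen so that each layer doubles the resolution; the actual Latent-STORM decoder configuration is not specified here.

```python
def deconv_out_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one dimension
    (dilation 1): (size - 1) * stride - 2 * padding + kernel + output_padding."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def upsampled_size(height, width, num_layers=3):
    """Spatial size after num_layers stride-2 deconvolutions (hypothetical decoder)."""
    for _ in range(num_layers):
        height = deconv_out_size(height)
        width = deconv_out_size(width)
    return height, width

# A 20x30 feature image becomes 160x240 after three doubling layers (8x overall).
full = upsampled_size(20, 30)
```

Under these assumed settings each layer exactly doubles height and width, so three layers undo the 8x downsampling.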

Human Modeling with Latent-STORM

We present a side-by-side comparison of Latent-STORM (left) and STORM (right) for human motion modeling.
Modeling leg motion in pixel space is extremely challenging. By operating in latent space and utilizing an additional latent decoder, Latent-STORM reconstructs humans better.
*Footnote: We found that training Latent-STORM with our default perceptual loss weight led to strong perceptual-loss-derived artifacts and caused flickering in the human region. To reduce these artifacts, we post-trained our model with a lower perceptual loss weight for an additional 40k iterations. We also oversampled scenes with humans in the training set to improve human modeling. The results shown here are from the post-trained model, while the numerical results in the paper are from the default model.

Editing with STORM

We show different editing results with STORM. All dynamic instances here are selected by choosing the corresponding motion token, without the need for bounding boxes.
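One way such token-based selection could work is sketched below. It assumes that each Gaussian carries a vector of motion assignment weights and that an instance corresponds to the Gaussians whose largest weight falls on the chosen token; the function names and the argmax rule are illustrative assumptions, not the authors' editing pipeline.

```python
import numpy as np

def select_by_motion_token(assign_weights, token_id):
    """Boolean mask of Gaussians whose strongest motion assignment is token_id."""
    return assign_weights.argmax(axis=1) == token_id

def remove_instance(means, assign_weights, token_id):
    """Drop all Gaussians assigned to the chosen motion token."""
    return means[~select_by_motion_token(assign_weights, token_id)]

def shift_instance(means, assign_weights, token_id, offset):
    """Translate all Gaussians assigned to the chosen motion token by offset."""
    out = means.copy()
    out[select_by_motion_token(assign_weights, token_id)] += np.asarray(offset, dtype=out.dtype)
    return out

# Three Gaussians: the first and third belong to token 2, the second to token 5.
means = np.zeros((3, 3))
weights = np.zeros((3, 16))
weights[0, 2] = weights[2, 2] = 1.0
weights[1, 5] = 1.0

removed = remove_instance(means, weights, token_id=2)
shifted = shift_instance(means, weights, token_id=2, offset=[1.0, 0.0, 0.0])
```

Because the mask comes from the motion assignment alone, no bounding box is needed to isolate an instance.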


STORM is the very first model to reconstruct dynamic scenes in a feed-forward manner. However, it is not perfect, and we show two examples of its limitations here. We believe that addressing these limitations is an interesting direction for future research, and we hope our approach and results will encourage further work on feed-forward dynamic scene reconstruction.

STORM occasionally struggles to account for lighting effects caused by water droplets on the camera lens and predicts noisy velocities in textureless regions, such as roads.

Latent-STORM is sensitive to the perceptual loss weight. Higher weights can introduce artifacts, while lower weights may smooth the output but reduce detail. Improving the decoder and loss designs to address this will be explored in future work.


  author    = {Jiawei Yang and Jiahui Huang and Yuxiao Chen and Yan Wang and Boyi Li and Yurong You and Maximilian Igl and Apoorva Sharma and Peter Karkus and Danfei Xu and Boris Ivanovic and Yue Wang and Marco Pavone},
  title     = {STORM: Spatio-Temporal Reconstruction Model for Large-scale Outdoor Scenes},
  journal   = {arXiv preprint arXiv:2501.00602},
  year      = {2025}

