Detailed Reading
Street Gaussians is built for urban driving data, where treating the whole sequence as one static scene fails. Cars move, the camera moves, and background scale is large. The paper decomposes the scene into a static-ish background plus foreground objects that can be transformed and composed explicitly.
Each object can have its own Gaussian representation, pose trajectory, and appearance model. Semantic logits help distinguish object and background regions, while dynamic spherical harmonics capture view- and time-dependent appearance. Rendering becomes a composition of structured parts rather than a single unstructured splat cloud.
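The time-dependent appearance idea can be sketched in code. Everything below is an illustrative assumption, not the paper's implementation: only the DC color term is made time-varying here, via a low-order Fourier series, standing in for the paper's dynamic spherical harmonics.

```python
import numpy as np

def fourier_basis(t, n_terms):
    """Evaluate [1, cos(2*pi*t), sin(2*pi*t), cos(4*pi*t), ...] at time t in [0, 1]."""
    basis = [1.0]
    for k in range(1, n_terms):
        basis.append(np.cos(2 * np.pi * k * t))
        basis.append(np.sin(2 * np.pi * k * t))
    return np.array(basis)

def dynamic_dc_color(coeffs, t):
    """Time-varying DC color per Gaussian.

    coeffs: (n_gaussians, 2*n_terms - 1, 3) Fourier coefficients per RGB channel.
    Returns an (n_gaussians, 3) color array at time t.
    """
    b = fourier_basis(t, (coeffs.shape[1] + 1) // 2)   # (2*n_terms - 1,)
    return np.einsum("gkc,k->gc", coeffs, b)
```

The same pattern extends to higher-order SH coefficients if view-dependent effects also need to vary over time.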
The paper is important because driving datasets are a stress test for 3DGS: large scale, moving actors, sparse trajectories, and simulation needs. It shows that object-aware structure is not just useful for editing; it is necessary for dynamic real-world scenes.
Concretely, a street scene is not a single static object: it mixes static background, moving cars, pedestrians, sky, road surfaces, and repeated texture. Forcing all of it into one undifferentiated Gaussian cloud discards that semantic structure.
The method models background and foreground actors differently. Static regions can be represented in a scene coordinate system, while dynamic objects need object-centric motion and time-dependent placement. Semantic cues help decide which Gaussians belong to which entity and how they should move.
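One way to picture the semantic assignment, with assumed names and shapes: each Gaussian carries a vector of class logits, and a hard assignment takes the argmax. This is a toy sketch of the idea, not the paper's training-time use of semantic logits.

```python
import numpy as np

def assign_gaussians(logits, class_names):
    """Map each Gaussian to the class with the highest logit.

    logits: (n_gaussians, n_classes) per-Gaussian semantic logits.
    Returns a list of class names, one per Gaussian.
    """
    idx = np.argmax(logits, axis=1)
    return [class_names[i] for i in idx]
```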
Algorithmically, the paper is about factorization. If every moving object is treated as a deformation of the whole scene, optimization becomes entangled and inefficient. By giving objects their own Gaussian sets or transforms, the system can render novel views and times while preserving actor identity.
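The factorized composition above can be sketched as follows. The container names and pose format are assumptions for illustration: each tracked object stores Gaussian centers in its own canonical frame, and a per-timestep rigid pose (R, t) from tracking places them in the scene frame before joint rasterization.

```python
import numpy as np

def compose_scene(bg_means, objects, timestep):
    """Return all Gaussian centers in world coordinates at the given timestep.

    bg_means: (M, 3) background centers, already in the scene frame.
    objects:  list of dicts with keys
      'means' : (N, 3) canonical-frame centers
      'poses' : {timestep: (R, t)} rigid object-to-world transforms
    """
    parts = [bg_means]                        # background needs no transform
    for obj in objects:
        R, t = obj["poses"][timestep]         # tracked pose at this time
        parts.append(obj["means"] @ R.T + t)  # rigid transform into scene frame
    return np.concatenate(parts, axis=0)      # one cloud handed to the rasterizer
```

Because each object only contributes a rigid transform per frame, optimizing its Gaussians stays decoupled from the background and from other actors.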
This paper is valuable for large-scale capture and simulation because it aligns splatting with the structure of driving datasets. Its limitations include reliance on detection, tracking, and reasonably calibrated multi-camera data. It points toward splat-based simulators, but robust handling of rare dynamics and long-range consistency remain hard.