Detailed Reading
The paper starts from the observation that duplicating a full Gaussian scene for every time step would be wasteful. Instead, it keeps a canonical set of Gaussians and learns a deformation field that maps those Gaussians to their state at a given time. This separates stable scene content from motion, which keeps storage and optimization more manageable.
Its deformation model uses spatial-temporal feature planes inspired by HexPlane. Given a Gaussian and a timestamp, the network predicts changes in position, rotation, and scale. The renderer then splats the deformed Gaussians for the requested time. In effect, each frame is a different slice through the same learned 4D representation.
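A minimal sketch of that plane-based deformation query, assuming bilinear interpolation over six HexPlane-style feature grids and simple per-attribute linear heads. The grid resolution, feature width, multiplicative fusion, and the linear heads are illustrative choices, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: six feature grids cover the (x,y), (x,z), (y,z)
# spatial planes and the (x,t), (y,t), (z,t) spatio-temporal planes.
R, F = 16, 8
planes = {name: rng.normal(0, 0.1, (R, R, F))
          for name in ["xy", "xz", "yz", "xt", "yt", "zt"]}

def bilerp(grid, u, v):
    """Bilinearly interpolate an (R, R, F) grid at continuous coords in [0, 1]."""
    x, y = u * (R - 1), v * (R - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * grid[x0, y0] + dx * (1 - dy) * grid[x1, y0]
            + (1 - dx) * dy * grid[x0, y1] + dx * dy * grid[x1, y1])

def query_features(pos, t):
    """Fuse the six plane features for one canonical position and a timestamp."""
    x, y, z = pos
    feats = [bilerp(planes["xy"], x, y), bilerp(planes["xz"], x, z),
             bilerp(planes["yz"], y, z), bilerp(planes["xt"], x, t),
             bilerp(planes["yt"], y, t), bilerp(planes["zt"], z, t)]
    return np.prod(feats, axis=0)  # multiplicative fusion across planes

# Stand-in linear "MLP" heads per attribute (random weights for the sketch).
W_pos, W_rot, W_scale = (rng.normal(0, 0.1, (F, d)) for d in (3, 4, 3))

def deform(pos, t):
    f = query_features(pos, t)
    return f @ W_pos, f @ W_rot, f @ W_scale  # Δposition, Δrotation, Δscale

d_pos, d_rot, d_scale = deform(np.array([0.3, 0.5, 0.7]), t=0.25)
```

The point of the structure is that the expensive learned state lives in shared grids indexed by space and time, while each Gaussian only pays for a cheap interpolation plus small heads at render time.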
The important algorithmic idea is that dynamics are represented at the primitive level rather than by a black-box video model. That makes the method fast and reasonably compact while still allowing high-resolution novel views of moving content. Its limits show up when motion is very large, highly non-rigid, or poorly observed.
The paper takes the static 3DGS representation and asks what should be shared across time. Instead of training an independent Gaussian cloud for every frame, it keeps a canonical set of Gaussians and learns how their positions, rotations, scales, and possibly appearance deform with time. This canonical-plus-deformation pattern became one of the dominant recipes for dynamic Gaussian papers.
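A back-of-the-envelope calculation makes the storage argument concrete. Every number below (Gaussian count, floats per primitive, frame count, deformation-network budget) is an illustrative assumption, not a figure from the paper:

```python
# Rough storage comparison: per-frame Gaussian clouds vs. canonical + deformation.
n_gaussians = 100_000
floats_per_gaussian = 59      # position, rotation, scale, opacity, SH color (assumed)
n_frames = 300
bytes_per_float = 4

# Independent cloud per frame: cost scales linearly with sequence length.
per_frame_clouds = n_gaussians * floats_per_gaussian * bytes_per_float * n_frames

# One canonical cloud plus a fixed deformation-field budget (hypothetical size).
deform_params = 5_000_000
canonical_plus_deform = (n_gaussians * floats_per_gaussian + deform_params) * bytes_per_float

print(f"{per_frame_clouds / 1e9:.2f} GB")       # 7.08 GB
print(f"{canonical_plus_deform / 1e9:.2f} GB")  # 0.04 GB
```

Under these assumptions the canonical-plus-deformation layout is two orders of magnitude smaller, and the gap grows with sequence length because only the per-frame-cloud term depends on the number of frames.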
The deformation field is the algorithmic center. A time-conditioned network predicts offsets that move canonical primitives into each frame, so rendering a novel time and view becomes: query deformation, transform Gaussians, project, sort, and composite. The representation is efficient because the renderer stays close to static 3DGS, while temporal variation is pushed into a relatively compact function.
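The query-transform-project-sort-composite loop can be sketched for a single pixel. The `render_pixel` helper and its scalar-opacity "splat" are drastic simplifications of the real tile-based rasterizer, which evaluates anisotropic 2D Gaussian footprints per pixel; only the ordering of steps matches the description above:

```python
import numpy as np

def render_pixel(canonical, deform_fn, t):
    """canonical: list of dicts with 'pos' (3,), 'color' (3,), 'opacity' (scalar)."""
    # 1) Query the deformation field and transform each canonical Gaussian.
    deformed = [{**g, "pos": g["pos"] + deform_fn(g["pos"], t)} for g in canonical]
    # 2) Sort by depth, front to back (camera assumed to look down +z).
    deformed.sort(key=lambda g: g["pos"][2])
    # 3) Front-to-back alpha compositing, as in static 3DGS.
    color = np.zeros(3)
    transmittance = 1.0
    for g in deformed:
        a = g["opacity"]
        color += transmittance * a * g["color"]
        transmittance *= 1.0 - a
    return color

gaussians = [
    {"pos": np.array([0., 0., 2.]), "color": np.array([1., 0., 0.]), "opacity": 0.6},
    {"pos": np.array([0., 0., 1.]), "color": np.array([0., 0., 1.]), "opacity": 0.5},
]
# Toy deformation: translate along z proportionally to time.
rgb = render_pixel(gaussians, lambda p, t: np.array([0., 0., t]), t=0.5)
```

Note that only step 1 depends on time; steps 2 and 3 are the unchanged static pipeline, which is exactly why the renderer "stays close to static 3DGS."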
Training balances two coupled objectives: match every frame visually, and regularize motion so the model does not fake dynamics with arbitrary opacity or color changes. Good camera coverage and stable initialization matter, because local minima can arise when a moving surface is explained by the wrong canonical primitive. Later dynamic methods often improve on this point with better motion bases, temporal regularizers, lifespan modeling, or deformation grouping.
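One way the coupled objectives could look in code. The `temporal_smoothness` regularizer and the weight `lam` are hypothetical, standing in for the family of temporal regularizers mentioned above, not the paper's exact loss:

```python
import numpy as np

def photometric_l1(rendered, target):
    """Per-frame visual fidelity term (L1 over pixels)."""
    return np.abs(rendered - target).mean()

def temporal_smoothness(deform_fn, positions, t, dt=1e-2):
    """Penalize rapid change of predicted offsets between nearby timestamps,
    discouraging the model from explaining motion with jittery deformation."""
    d_now = np.stack([deform_fn(p, t) for p in positions])
    d_next = np.stack([deform_fn(p, t + dt) for p in positions])
    return np.square((d_next - d_now) / dt).mean()

def total_loss(rendered, target, deform_fn, positions, t, lam=1e-3):
    return photometric_l1(rendered, target) + lam * temporal_smoothness(
        deform_fn, positions, t)

positions = [np.zeros(3), np.ones(3)]
loss = total_loss(np.zeros((4, 4, 3)), np.zeros((4, 4, 3)),
                  lambda p, t: np.array([0., 0., t]), positions, t=0.3)
```

The photometric term alone would happily exploit opacity or color to mimic motion; the regularizer shifts that cost onto the deformation field, which is the balance the paragraph above describes.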
The paper is important because it demonstrated that Gaussian splatting could be more than a static scene format. Its weakness is that canonical deformation can struggle with topology changes, long sequences, and fast occlusion changes, but the conceptual split between persistent scene elements and time-conditioned motion remains the reference point for many 4DGS systems.