Detailed Reading
The paper’s central move is to turn a sparse COLMAP point cloud into a learnable cloud of anisotropic 3D Gaussians, rendered as ellipsoidal splats. Each Gaussian carries position, opacity, scale, rotation, and spherical-harmonic color coefficients. During training, rendered images are compared against the input photos, and gradients update the Gaussian parameters directly rather than flowing through a large neural field.
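To make the state being optimized concrete, here is a minimal initialization sketch in NumPy. The field names, dictionary layout, and constant initial scale are my illustrative assumptions, not the reference implementation; the paper initializes scales from nearest-neighbor distances and the SH DC term from the SfM point color.

```python
import numpy as np

def init_gaussians(points, colors):
    """One Gaussian per SfM point. Field names and the constant initial
    scale are illustrative, not the reference implementation's."""
    n = len(points)
    g = {
        "xyz": points.astype(np.float32),                 # (n, 3) centers
        # log space keeps scales strictly positive under gradient steps;
        # the paper initializes from nearest-neighbor distances instead
        "log_scale": np.log(np.full((n, 3), 0.01, dtype=np.float32)),
        "quat": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)).astype(np.float32),  # identity rotations
        "opacity_logit": np.full(n, -2.0, dtype=np.float32),  # sigmoid maps to (0, 1)
        "sh": np.zeros((n, 16, 3), dtype=np.float32),     # degree-3 SH: 16 coefficients per channel
    }
    g["sh"][:, 0] = colors  # DC band from the point color (up to an SH normalization constant)
    return g
```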
The clever part is density control. The model does not know beforehand how many primitives a scene needs, so it repeatedly clones, splits, and prunes Gaussians. Under-reconstructed or high-error regions receive more primitives; near-transparent primitives, and those that balloon to excessive size, are pruned. This adaptive process is what lets the representation grow from sparse SfM points into a dense visual scene.
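The pruning side fits in a few lines. A sketch with illustrative thresholds, assuming per-Gaussian opacity and world-space scale arrays (the paper removes Gaussians whose opacity falls below a cutoff and periodically those that grow too large):

```python
import numpy as np

def prune_mask(opacity, scale, min_opacity=0.005, max_world_size=0.5):
    """Boolean mask of Gaussians to delete. Thresholds are illustrative,
    not the paper's exact settings."""
    too_faint = opacity < min_opacity              # effectively invisible
    too_big = scale.max(axis=1) > max_world_size   # degenerate, oversized blobs
    return too_faint | too_big
```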
Rendering is also part of the contribution. The method projects 3D covariances into screen space, sorts Gaussians by depth within screen tiles, and alpha-composites the resulting splats efficiently on the GPU. This is why it changed the field: it kept radiance-field visual quality while making real-time interaction, viewers, and consumer-facing tools practical.
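The projection step can be written compactly. Below is a sketch of the local affine (Jacobian) approximation used in EWA-style splatting, assuming the covariance has already been rotated into the camera frame; function and argument names are mine.

```python
import numpy as np

def project_covariance(cov3d_cam, mean_cam, fx, fy):
    """2D screen-space footprint of a 3D covariance already expressed
    in the camera frame, via the Jacobian of perspective projection."""
    x, y, z = mean_cam
    # Jacobian of (u, v) = (fx * x / z, fy * y / z) at the Gaussian center
    J = np.array([
        [fx / z, 0.0,    -fx * x / z**2],
        [0.0,    fy / z, -fy * y / z**2],
    ])
    return J @ cov3d_cam @ J.T  # 2x2 covariance of the screen-space ellipse
```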
Read as an optimization paper, it is less about inventing a new primitive and more about making the primitive trainable at scale. The covariance is parameterized through scale and rotation so it stays positive semi-definite during gradient descent. Opacity and color are optimized jointly, which means geometry and appearance are entangled: a Gaussian can become a visual proxy for a surface patch, a fuzzy volume, or even a view-dependent highlight if the training signal pushes it that way.
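The positive semi-definite guarantee comes directly from the factorization Σ = R S Sᵀ Rᵀ, with R a rotation built from a unit quaternion and S a diagonal scale matrix. A sketch, assuming the log-scale and quaternion parameterization discussed above:

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(log_scale, quat):
    """Sigma = (R S)(R S)^T is positive semi-definite by construction,
    so unconstrained gradient steps on log_scale and quat stay valid."""
    M = quat_to_rotmat(quat) @ np.diag(np.exp(log_scale))
    return M @ M.T
```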
The algorithm alternates between differentiable rendering and representation management. After a warm-up period, Gaussians with large accumulated view-space gradients are either cloned, when they are too small and underfit a local detail, or split, when a large primitive has to cover incompatible image evidence. This densification schedule is one of the paper's key practical ideas: most later methods either change it, constrain it, or compress the result it produces.
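The clone-versus-split decision reduces to two masks over the same high-gradient set. This sketch uses illustrative threshold values, not the paper's exact settings:

```python
import numpy as np

def densify_masks(grad_norm, log_scale, grad_thresh=2e-4, scale_thresh=0.01):
    """Partition high-gradient Gaussians into clone and split candidates.
    Thresholds are illustrative, not the paper's exact settings."""
    hot = grad_norm > grad_thresh                        # under-reconstructed regions
    small = np.exp(log_scale).max(axis=1) <= scale_thresh
    return hot & small, hot & ~small                     # (to_clone, to_split)
```

Clones are duplicated and nudged along the gradient direction; splits replace the parent with two children sampled from it at reduced scale.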
The renderer is engineered around front-to-back alpha compositing of projected ellipses. A 3D covariance is pushed through the camera projection into a 2D footprint; splats are binned into screen tiles, sorted by depth per tile, and blended front to back with early termination. The paper therefore links a continuous radiance-field objective with a rasterization-style implementation, which is why it became useful for viewers and interactive applications rather than only for benchmark reconstruction.
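The blending loop itself is simple once the sort is done. A per-pixel sketch, assuming the alphas already fold in the Gaussian falloff evaluated at the pixel (names are mine):

```python
import numpy as np

def composite_pixel(alphas, colors, stop_T=1e-4):
    """Front-to-back alpha blending over depth-sorted splats for one pixel,
    with early termination once transmittance T is negligible."""
    T, out = 1.0, np.zeros(3)
    for a, c in zip(alphas, colors):   # nearest splat first
        out += T * a * np.asarray(c)
        T *= 1.0 - a
        if T < stop_T:                 # remaining splats contribute ~nothing
            break
    return out
```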
The limitations are also visible in the design. Because the loss is image reconstruction, the optimized Gaussians are not guaranteed to lie on a clean manifold, preserve topology, or separate material from lighting. When a later paper proposes better mesh extraction, anti-aliasing, relighting, compression, semantics, or dynamics, it is usually repairing one consequence of this very flexible but weakly constrained representation.