Moving beyond voxels through positional encoding.

Neural Radiance Fields (NeRFs) provide a substantial step up in our ability to construct interactive, photorealistic 3D objects. The core of this capability comes from a "coordinate-based neural representation of low-dimensional signals". The goal of this post is to gain a strong intuition about what that means, and the trade-offs involved.

Before we dive into the details of how NeRF works, let us first get a better understanding of one way of viewing
3D objects. A "volume" is a dataset spanning three spatial dimensions, where every point in space carries properties such as colour and density.

Creating a 2D image from a volume (known as rendering) consists of computing the *"radiance"* - the
reflection, refraction and emittance of light by an object from a source toward a particular observer - which can then
be used to produce an RGB pixel grid showing what our model looks like.

A common technique for volume rendering (and the one used in NeRF) is **Ray Marching**. Ray Marching constructs
a series of rays, one per output pixel, cast from the observer through the volume; each ray is then sampled at points along its length, and those samples are combined into a single pixel colour.
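To make this concrete, here is a minimal NumPy sketch of how a ray marcher might construct rays for a simple pinhole camera and place uniform samples along them. The function names and the camera model are illustrative, not taken from the original paper.

```python
import numpy as np

def get_rays(height, width, focal, cam_to_world):
    """Construct one ray (origin, direction) per output pixel for a pinhole camera."""
    # Pixel coordinates, centred on the optical axis.
    i, j = np.meshgrid(np.arange(width), np.arange(height), indexing="xy")
    dirs = np.stack([(i - width / 2) / focal,
                     -(j - height / 2) / focal,
                     -np.ones_like(i, dtype=np.float64)], axis=-1)
    # Rotate ray directions into world space; all rays share the camera origin.
    rays_d = dirs @ cam_to_world[:3, :3].T
    rays_o = np.broadcast_to(cam_to_world[:3, 3], rays_d.shape)
    return rays_o, rays_d

def sample_along_rays(rays_o, rays_d, near, far, n_samples):
    """Uniformly place sample points between the near and far bounds of each ray."""
    t = np.linspace(near, far, n_samples)                          # (n_samples,)
    pts = rays_o[..., None, :] + rays_d[..., None, :] * t[:, None]  # (H, W, n_samples, 3)
    return pts, t
```

Each of the sampled points is then fed to the volume representation to obtain a colour and density, which the transfer function below combines into a pixel.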

To form an RGBA sample, a transfer function (from classical volume rendering) is applied to the colour and density samples along each ray:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$

where $\mathbf{c}_i$ and $\sigma_i$ are the colour and density at sample $i$, $\delta_i$ is the distance between adjacent samples, and $T_i$ is the accumulated transmittance - the probability that the ray travels from the observer to sample $i$ without being absorbed.

In essence, the transfer function encodes the common-sense intuition that something behind a dense, "non-transparent" object contributes less to an image than it would behind a less dense, "transparent" object. The key requirement for the rendering component is that it is differentiable, and hence can be optimised through efficiently.
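The compositing step described above can be sketched in a few lines of NumPy. This is a toy, per-ray version of the transfer function; the argument names are illustrative.

```python
import numpy as np

def composite(colors, sigmas, deltas):
    """Alpha-composite per-sample colours and densities along a single ray.

    colors: (n_samples, 3) RGB at each sample
    sigmas: (n_samples,) volume density at each sample
    deltas: (n_samples,) distance to the next sample
    """
    # Probability that the ray "hits" something within each interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Note how the transmittance term implements the intuition above: once an opaque sample has been crossed, every sample behind it receives a near-zero weight.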

With that out of the way: the innovation behind NeRF has little to do with the rendering approach - instead, it is about how you capture a volume. How do you encode the volume so that it can be queried at any arbitrary point in space, while still capturing the view-dependent appearance of non-"matte" surfaces?

Let us first consider a naive approach to encoding volumes - voxel grids. Voxel grids apply an explicit
discretisation to the 3D space, chunking it up into cubes and storing the required metadata in an array (often
referencing a material that includes colour, plus viewpoint-specific features which the ray marcher must take into
account). The challenge with this internal representation is that it takes no advantage of any natural symmetries,
and the storage and rendering requirements explode cubically with resolution.
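A quick back-of-the-envelope calculation shows how badly this scales. The figure of 4 bytes per voxel below is an illustrative assumption (a single RGBA value, with no view-dependent features at all):

```python
# Storage for a dense voxel grid at a given per-side resolution,
# assuming 4 bytes (one RGBA value) per voxel.
def voxel_grid_bytes(resolution, bytes_per_voxel=4):
    return resolution ** 3 * bytes_per_voxel

for res in (128, 512, 1024):
    print(f"{res}^3 grid: {voxel_grid_bytes(res) / 2**30:.4f} GiB")
```

Doubling the resolution multiplies the storage by eight; a 1024-per-side grid already needs 4 GiB before any material or view-dependent data is added.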

What is a volume if not a function mapping positions to appearance? Given that there are clear symmetries in any particular volume, we would like a method that learns to exploit those symmetries. Densely connected neural networks have been found to suit this task well, given sufficient data.

Neural Radiance Fields

Adding these two components together - a compact, continuous volume estimator and a differentiable volume renderer - we have sufficient components to construct a fully-differentiable pipeline. We can then minimise the squared pixel error with respect to our model parameters.
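As a shape-level sketch, the continuous volume estimator is just a small fully connected network mapping a position and viewing direction to a colour and density. Everything below (layer sizes, the 5-dimensional input of position plus two viewing angles, the output activations) is an illustrative stand-in for the real architecture, not the paper's exact network:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(layer_sizes):
    """Random weights for a small fully connected network (a toy stand-in for NeRF's MLP)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def field(params, x):
    """Query the continuous volume at points x: (N, 5) -> RGB (N, 3) and density (N,)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    W, b = params[-1]
    out = h @ W + b                               # (N, 4): raw RGB + density
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))      # sigmoid keeps colours in [0, 1]
    sigma = np.maximum(out[:, 3], 0.0)           # density must be non-negative
    return rgb, sigma
```

Because both this function and the compositing step are differentiable, gradients of the squared pixel error flow all the way back to the network weights.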

One of the most important things to remember is that there is **zero pre-training** involved in basic NeRF models.
There is no prior knowledge about the image the model is trying to recreate, nor any information about common
materials or shapes. Instead, each NeRF model is trained from scratch for each scene we want to synthesise. We can
see the impact that has on both training time and data requirements in our training
simulator.

In addition to the extended training times, and substantial data requirements, there are also a few common failure cases that should be highlighted.

You may notice that the above examples don't have the same clarity as the baked model. This is no accident: beyond just training time, state-of-the-art performance when optimising these densely connected networks requires a few additional techniques.

Rather than adding the viewing direction to the original input of the function, it is best to leave the directions out of the first 4 layers. This reduces the number of view-dependent (often floating) artifacts that occur when a view-dependent feature is optimised prematurely, before the underlying spatial structure has been captured.

In truth, the input to the neural network isn't the raw position and viewing direction; instead, a pre-processing step transforms the position into sinusoidal signals of exponentially increasing frequencies.
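A sketch of this positional encoding is below. The original paper uses 10 frequency bands for positions and 4 for directions; the exact ordering of the sine and cosine features here is a simplification and differs from the paper's interleaving, which does not affect what the network can represent.

```python
import numpy as np

def positional_encoding(p, n_freqs=10):
    """Map each coordinate to sinusoids of exponentially increasing frequency.

    p: (..., d) raw coordinates -> (..., 2 * n_freqs * d) encoded features.
    """
    freqs = 2.0 ** np.arange(n_freqs) * np.pi        # pi, 2*pi, 4*pi, ...
    angles = p[..., None] * freqs                    # (..., d, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)
```

A 3D position thus becomes a 60-dimensional feature vector before it ever reaches the MLP.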

This seemingly "magical" trick leads to substantially better results and removes a tendency to "blur" the resulting
images. Without it, the network architecture for NeRF (a densely connected ReLU MLP) is incapable of modelling signals with fine
detail, and fails to represent a signal's spatial and temporal derivatives, even though these are
essential to the underlying physical signals. This is a similar realisation to the work of Sitzmann et al. on SIRENs, which replace ReLU with periodic activations for the same reason.

A ReLU MLP with Fourier Features (of which positional encoding is one type) can represent these high-frequency functions in low-dimensional domains because the ReLU MLP acts as a dot-product kernel, and the dot product of Fourier Features is stationary (it depends only on the difference between inputs).

One of the first questions that likely came to mind when discussing classical volume rendering is why we use a uniform sampling scheme, when the vast majority of scenes contain large regions of empty space (with low density) and samples behind dense objects contribute diminishing value.

The original NeRF paper addresses this with hierarchical volume sampling: a "coarse" network is evaluated at uniformly spaced points, and its compositing weights are used to importance-sample the locations at which a second, "fine" network is evaluated, concentrating computation where it matters.
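The importance-sampling step amounts to inverting the piecewise-linear CDF built from the coarse weights. A minimal sketch, with illustrative names and no de-duplication or jitter tricks:

```python
import numpy as np

def importance_sample(bin_edges, weights, n_samples, rng=None):
    """Draw new sample depths proportionally to the coarse network's weights.

    bin_edges: (n_bins + 1,) depths bounding each coarse interval
    weights:   (n_bins,) compositing weights from the coarse pass
    """
    if rng is None:
        rng = np.random.default_rng(0)
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)
    # Invert the piecewise-linear CDF: uniform draws land in high-weight bins.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```

Intervals where the coarse pass saw little density receive almost no fine samples, so the fine network spends its capacity near visible surfaces.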

With this article, you should have obtained an overview of the original Neural Radiance Field paper and developed a deeper understanding of how NeRFs work. As we have seen, Neural Radiance Fields offer a flexible framework for encoding 3D scenes for rendering, and while they have several shortcomings, there are significant extensions that both alleviate these and expand on the proposed system.

To hint at the variety of uses, we include a short list of recent papers that build on the NeRF framework, organised by the shortcoming they address:

- Slow (training + rendering)
  - *Neural Sparse Voxel Fields*: organise the scene as a sparse voxel octree (10x render speedup).
  - *NeRF++*: model the background separately.
  - *DeRF*: "soft Voronoi diagrams" to take advantage of accelerator memory architectures.
  - *AutoInt*: learn the volume integral directly.
  - *Learned Initialisations*: meta-learning for initialisation.
  - *JaxNeRF*: days => hours of training with JAX.
  - *FastNeRF*: factorise the NeRF volume rendering equation (efficient caching) => 3000x speedup!
  - *KiloNeRF*: replace the MLP with thousands of tiny MLPs (kind of strange that this works…).
  - *SNeRG*: precompute and bake NeRF into a "Sparse Neural Radiance Grid".
  - *DONeRF*: a depth oracle network predicts ray sample locations for each view ray with a single network evaluation => 48x rendering speedup.

- Static scenes
  - *D-NeRF*: a second MLP applies a translation-only deformation to each frame.
  - *Nerfies* (D-NeRF II): deformations are more general.
  - *Space-Time Neural Irradiance Fields*: add time as an input.
  - *Neural Scene Flow Fields*: use depth predictions as a prior and regularise by scene flow.
  - *NeRFlow*: D-NeRF model with scene flow across time.
  - *DynamicVS*: free-viewpoint video synthesis.
  - Skeleton driven:
    - *NARF*: a local occupancy network per articulation modulates a conditionally trained NeRF.
    - *Animatable NeRF*: mocap + multi-view to create "blend-fields".

- One lighting condition is encoded into the model
  - *NeRF-W*: latent codes for lighting.
  - *Neural Reflectance Fields*: local reflection model (single-point lighting).
  - *NeRV*: a second visibility MLP.
  - *NeRD*: local reflection model.

- Scene Representation
  - *Neural Scene Graphs*: several object-level NeRFs.
  - *GIRAFFE*: object-level NeRFs output feature vectors, composed by averaging, rendered to a 2D feature map, then upscaled to a 2D image.
  - *GANcraft*: a voxel grid of objects composes a scene.
  - *EditNeRF*: category-specific conditional NeRF (latent-space editing / supervision).
  - *ObjectNeRF*: learn a voxel embedding.

- Zero Generalisation
  - *Light Field Networks*: learn the class of objects alongside the particular image to allow fewer-shot generalisation.

- Requires Pose
  - *BARF*: optimise over the scene and the pose jointly.
  - *SCNeRF*: BARF + camera "intrinsics".
  - *GNeRF*: a rough initial pose network.