Moving beyond voxels through positional encoding.
Neural Radiance Fields (NeRFs) provide a substantial step up in our ability to construct interactive, photorealistic 3D objects. The core of this capability comes from a "coordinate-based neural representation of low-dimensional signals". The goal of this post is to gain a strong intuition about what that means, and the trade-offs involved.
Before we dive into the details of how NeRF works, let us first get a better understanding of one way of viewing 3D objects. A "volume" is a dataset spanning three spatial dimensions: every point in space carries properties such as colour and density.
Creating a 2D image from a volume (known as rendering) consists of computing the "radiance" - the light reflected, refracted, and emitted by an object towards a particular observer - which can then be used to produce an RGB pixel grid showing what our model looks like.
A common technique for volume rendering (and the one used in NeRF) is ray marching. Ray marching constructs a series of rays, one per output pixel, each cast from the camera through the scene, and samples colour and density at points along each ray. To form an RGBA sample for the pixel, a transfer function (from classical volume rendering) composites the samples along the ray:

$$C(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big)$$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and colour at the $i$-th sample, $\delta_i$ is the distance between adjacent samples, and $T_i$ is the accumulated transmittance - the probability that the ray travels from the camera to sample $i$ without being absorbed.
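This compositing step can be written in a few lines. Below is a minimal NumPy sketch of the transfer function for a single ray (the function name and array shapes are illustrative, not from the paper's code):

```python
import numpy as np

def composite(sigmas, colours, deltas):
    """Alpha-composite density/colour samples along a single ray.

    sigmas:  (N,)   volume densities at each sample point
    colours: (N, 3) RGB colours at each sample point
    deltas:  (N,)   distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each segment
    # Transmittance: probability the ray reaches sample i unabsorbed.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                           # contribution of each sample
    return (weights[:, None] * colours).sum(axis=0)    # final RGB value
```

Note that everything here is composed of differentiable operations, which is exactly the property the optimisation relies on.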
In essence, the transfer function is designed to translate our common-sense understanding that if something is behind a dense / "non-transparent" object then it will contribute less to an image than if it was behind a less-dense / "transparent" object. The key requirement for the rendering component is that it is differentiable, and hence can be optimised over efficiently.
With that out of the way: the innovation behind NeRF has little to do with the rendering approach - instead, it is about how you capture a volume. How do you encode the volume so that it is queryable at any arbitrary point in space, while also accounting for non-"matte" (view-dependent) surfaces?
Let us first consider a naive approach for encoding volumes - voxel grids. Voxel grids apply an explicit discretisation to the 3D space, chunking it up into cubes and storing the required metadata in an array (often referencing a material that includes colour, plus viewpoint-specific features which the ray marcher must take into account). The challenge with this internal representation is that it takes no advantage of any natural symmetries in the data, and the storage and rendering requirements explode cubically with resolution.
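The cubic blow-up is easy to make concrete. A quick back-of-the-envelope calculation for a dense grid storing one RGBA value per voxel (the 4 bytes/voxel figure is an assumption for illustration):

```python
def voxel_grid_bytes(resolution, bytes_per_voxel=4):
    """Memory for a dense cubic grid storing RGBA (1 byte/channel) per voxel."""
    return resolution ** 3 * bytes_per_voxel

for res in (128, 256, 512, 1024):
    print(f"{res}^3 grid: {voxel_grid_bytes(res) / 2**20:.0f} MiB")
# 128 -> 8 MiB, 256 -> 64 MiB, 512 -> 512 MiB, 1024 -> 4096 MiB
```

Doubling the resolution multiplies the memory by eight, regardless of how much of the scene is empty space.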
What is a volume if not a function mapping? Given that there are clear symmetries in any particular volume, we'd like a method to learn to exploit those symmetries. Densely connected neural networks have been found to suit this task well, when given sufficient data.
Neural Radiance Fields
Adding these two components together - a compact, continuous volume estimator and a differentiable volume renderer - we have sufficient pieces to construct a fully-differentiable pipeline. We can then minimise the squared pixel error with respect to our model parameters.
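The objective itself is nothing exotic - just a mean squared error between rendered and observed pixel colours, computed over a batch of rays. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def photometric_loss(pred_rgb, gt_rgb):
    """Mean squared error between rendered and ground-truth pixel colours.

    pred_rgb, gt_rgb: (num_rays, 3) arrays of RGB values in [0, 1].
    """
    return float(np.mean((pred_rgb - gt_rgb) ** 2))
```

In training, `pred_rgb` comes out of the compositing step above, so gradients flow from this loss all the way back to the network weights.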
One of the most important things to remember is that there is zero pre-training involved in basic NeRF models. There is no prior knowledge about the scene the model is trying to recreate, nor any information about common materials or shapes. Instead, each NeRF model is trained from scratch for each scene we want to synthesise. We can see the impact that has on both training time and data requirements in our training simulator.
In addition to the extended training times, and substantial data requirements, there are also a few common failure cases that should be highlighted.
You may notice that the above examples don't have the same clarity as the baked model. This is no accident: beyond just training time, state-of-the-art performance when optimising these densely connected networks requires a few additional techniques.
Rather than adding the viewing direction to the original input of the function, it is best to leave the directions out of the first 4 layers. This reduces the number of view-dependent (often floating) artifacts that occur when a view-dependent feature is optimised prematurely, before the underlying spatial structure has been learned.
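A tiny forward-pass sketch makes the late direction injection concrete. Everything here - the layer widths, the 4-layer trunk, the single colour head - is illustrative rather than the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def init(n_in, n_out):
    return rng.normal(scale=n_in ** -0.5, size=(n_in, n_out))

# Position-only trunk; the viewing direction joins only at the colour head.
trunk = [init(3, 64), init(64, 64), init(64, 64), init(64, 64)]
w_sigma = init(64, 1)        # density depends on position alone
w_colour = init(64 + 3, 3)   # colour sees trunk features + direction

def nerf_forward(xyz, direction):
    h = xyz
    for w in trunk:              # direction deliberately excluded here
        h = relu(h @ w)
    sigma = relu(h @ w_sigma)    # density must be non-negative
    rgb = 1.0 / (1.0 + np.exp(-np.concatenate([h, direction]) @ w_colour))
    return rgb, sigma
```

Because density never sees the direction, the geometry cannot "cheat" by explaining view-dependent effects with floating opacity.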
In truth, the input to the neural network isn't the raw position and viewing direction; instead, a pre-processing step transforms the position into sinusoidal signals of exponentially increasing frequencies.
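The encoding itself is a handful of lines. This follows the paper's formulation γ(p) = (sin(2⁰πp), cos(2⁰πp), …, sin(2^(L-1)πp), cos(2^(L-1)πp)), though the exact function name and output ordering here are my own:

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map each coordinate to sinusoids of exponentially growing frequency.

    x: (..., D) coordinates (NeRF uses L=10 for positions, L=4 for directions).
    Returns (..., 2 * D * num_freqs) features.
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi   # (L,)
    angles = x[..., None] * freqs                  # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)
```

A 3D position with L=10 becomes a 60-dimensional feature vector, which is what the first layer of the MLP actually consumes.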
This seemingly "magical" trick leads to substantially better results and removes a tendency to "blur" the resulting images. The network architecture for NeRF (a densely connected ReLU MLP) is otherwise incapable of modelling signals with fine detail, and fails to represent a signal's spatial and temporal derivatives, even though these are essential to the physical signals. This is a similar realisation to the work of Sitzmann et al. on sinusoidal representation networks (SIRENs).
A ReLU MLP with Fourier features (of which positional encoding is one type) can represent these high-frequency functions in low-dimensional domains because the ReLU MLP acts as a dot-product kernel, and the dot product of Fourier features is stationary (it depends only on the difference between inputs).
One of the first questions that likely came to mind when talking about classical volume rendering was why we were using a uniform sampling system when the vast majority of models will have large quantities of empty space (with low densities) and diminishing value from samples behind dense objects.
The original NeRF addresses this with hierarchical sampling: a "coarse" network is queried at uniformly-spaced points, and its compositing weights are then used to importance-sample a second set of points for a "fine" network, concentrating computation where it contributes most to the final image.
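The importance-sampling step is inverse-transform sampling from the piecewise-constant distribution defined by the coarse weights. A simplified sketch (the paper's implementation additionally stratifies the uniform draws; names and shapes here are illustrative):

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_samples, seed=0):
    """Draw extra sample depths proportional to the coarse network's weights.

    bin_edges: (N+1,) depths bounding each coarse segment
    weights:   (N,)   compositing weights from the coarse pass
    """
    rng = np.random.default_rng(seed)
    pdf = weights / (weights.sum() + 1e-8)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    # Invert the CDF linearly within the chosen bin.
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    t = (u - cdf[idx]) / denom
    return bin_edges[idx] + t * (bin_edges[idx + 1] - bin_edges[idx])
```

Bins that the coarse pass found empty receive almost no fine samples, so the expensive fine network is evaluated mostly near surfaces.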
With this article, you should have gained an overview of the original Neural Radiance Field paper and developed a deeper understanding of how NeRFs work. As we have seen, Neural Radiance Fields offer a flexible framework for encoding 3D scenes for rendering, and while they have several shortcomings, there are significant extensions that both alleviate these issues and expand on the proposed system.
To hint at the variety of uses, we end with a short list of recent papers that have built on the NeRF framework, alongside the shortcomings they address: