Rendering, a crucial part of any graphics system, is what makes a computerized three-dimensional world appear on our two-dimensional computer screens, as if one of the world’s characters had taken a camera out of their pocket and photographed what they saw.
Over the past year (2020), we’ve learned how to make the rendering process differentiable, and turn it into a deep learning module. This sparks the imagination, because the deep learning motto is: "If it’s differentiable, we can learn through it". Indeed, many deep-learning breakthroughs have come from trying to soften the gap between zero and one, turning the discrete into the continuous, the argmax into the softmax. If we know how to differentially go from 3D to 2D, it means we can use deep learning and backpropagation to go back from 2D to 3D as well.
Suppose I’m holding in my hand a (2D) photograph of a cat sitting inside a window (taken in the real world), and have access to a differentiable renderer, a system that converts a representation of a three-dimensional (computerized) world to a two-dimensional image. Right now, if I ask the system to render a 2D image, I would get some random image that looks nothing like a cat, because the 3D world it would have rendered is just some randomly initialized computerized environment.
BUT because the renderer is differentiable, every single pixel in the resulting image is a differentiable function of the computerized 3D world. This essentially means that our deep learning framework maintains a link from the rendered image back to the 3D world representation. We can therefore ask: "how can we change our current 3D world representation, so that the next time we render it from the same angle, the resulting image looks more like the photograph?" (mathematically, we take the gradient of the difference between the photograph pixels and the image pixels).
If we change our 3D world representation accordingly, and do this over and over, eventually our differentiable renderer will render an image that looks very similar to the original cat photograph.
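To make this concrete, here is a minimal sketch of that optimization loop in PyTorch. The `render` function, the shape of `world`, and the target `photo` are toy stand-ins I made up for illustration; the only property that matters is that rendering is differentiable, so gradients can flow from the pixels back to the world’s parameters.

```python
import torch

# Toy stand-in for a differentiable renderer: it maps a parameter vector
# ("the 3D world") plus a fixed camera term to an 8x8 RGB image. A real
# renderer is far more elaborate; all we rely on here is differentiability.
def render(world, camera):
    return torch.sigmoid(world.view(8, 8, 3) + camera)

photo = torch.rand(8, 8, 3)      # the target photograph (random here, for illustration)
camera = torch.zeros(8, 8, 3)    # a fixed "camera" for this single viewpoint
world = torch.randn(8 * 8 * 3, requires_grad=True)  # randomly initialized 3D world
optimizer = torch.optim.Adam([world], lr=1e-1)

for step in range(500):
    optimizer.zero_grad()
    image = render(world, camera)          # every pixel is a differentiable function of `world`
    loss = ((image - photo) ** 2).mean()   # pixel-wise difference from the photograph
    loss.backward()                        # gradients flow from the pixels back to the 3D world
    optimizer.step()                       # nudge the world so the next render matches better
```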
At that point, we can turn our attention to the computerized 3D world it converged to. What can we make of it? Does it have a 3D cat object sitting inside a 3D window? If we render the same computerized 3D world from a different camera angle (obtaining a different image), would it still look like a cat? Well, unless our 3D representation already has some predefined notion of what a 3D cat looks like, probably not. Using just a single photograph, from a single viewpoint, there is simply not enough information to understand what the cat would look like from a different angle. For all we know, our 3D world might converge to a flat cat printed on a canvas placed just in front of the camera.
In order for our representation of the 3D world to converge to something useful, we’ll need more photographs, taken from more points of view.
That’s precisely what a group of researchers from UC Berkeley (from Ren Ng’s lab) did in a project called NeRF (short for Neural Radiance Fields). They photographed a static real-world scene from no fewer than 40 different angles (obtaining 40 images), with the goal of getting a three-dimensional computerized representation of the scene that can later be rendered correctly from any angle. Their results are stunning.
The idea of using multiple views of the same object or scene in order to get a 3D representation of it is not new (the task is known as view synthesis). However, none of the previous methods were able to generate a 3D structure with such high fidelity to the real world that images rendered from it at new viewpoints were indistinguishable from real photos.
The people who built NeRF nailed it. And the reason they succeeded is that they combined a powerful differentiable renderer with a powerful yet atypical 3D world representation. That 3D world representation was a neural network. And not even one of those fancy neural networks, but a multi-layer perceptron just like in the good old days: 9 fully-connected layers all the way through.
How does the neural network at the center of NeRF represent the entire 3D world? The answer: it classifies points in 3D space! Its input is any coordinate (x, y, z) of a point in space (together with the viewing direction, given as two angles), and its output is the color (r, g, b) and the degree of opacity (α) of the material at that point in space.
This is it. Five numbers in, four numbers out. Once the neural network is fully trained on the scene, we can query it at any point in space and probe the environment. This is nice, but if we want to know more about the actual content of the scene, we’ll have to work hard probing the network. It does not explicitly tell us anything about the number of objects in the scene, their class, or their boundaries. It’s a much more "atomic", low-level way of looking at the world.
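For readers who like code, here is a bare-bones sketch of that interface in PyTorch. This is not the actual NeRF architecture (the real network positional-encodes its inputs and injects the viewing direction only near the output); it just captures the "five numbers in, four numbers out" idea.

```python
import torch
import torch.nn as nn

class TinyRadianceField(nn.Module):
    """A toy radiance field: (x, y, z, theta, phi) in, (r, g, b, alpha) out."""
    def __init__(self, hidden=256, num_layers=9):
        super().__init__()
        layers, d_in = [], 5                    # (x, y, z) plus two viewing-direction angles
        for _ in range(num_layers - 1):
            layers += [nn.Linear(d_in, hidden), nn.ReLU()]
            d_in = hidden
        layers.append(nn.Linear(d_in, 4))       # (r, g, b, alpha)
        self.net = nn.Sequential(*layers)

    def forward(self, points_and_dirs):         # shape (N, 5)
        out = self.net(points_and_dirs)         # shape (N, 4)
        rgb = torch.sigmoid(out[:, :3])         # colors in [0, 1]
        alpha = torch.sigmoid(out[:, 3:])       # opacity in [0, 1]
        return rgb, alpha

# Query the field at any fractional point in space, from any viewing direction:
field = TinyRadianceField()
rgb, alpha = field(torch.tensor([[0.1, -0.4, 2.3, 0.0, 1.57]]))
```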
From a computer graphics researcher’s point of view, this is radical. They are used to representing each object in the scene with its own data structure (called a "mesh"). They are used to processing surfaces, computing depth maps, and tracing rays from light sources. There is none of that here.
Instead, we have something more similar to the voxel grid used to represent volume in medical images, like MRI or CT scans, except:
- The MRI/CT volume has a fixed resolution, using discrete voxels, while we can ask the NeRF network about any fractional (x, y, z) point, at any resolution.
- The MRI/CT volume takes up a huge amount of space (consider 512 x 512 x 512 voxels) compared to the relatively compact 256 x 256 x 9 weights (roughly) of the NeRF multi-layer perceptron.
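As a rough back-of-envelope check of that claim (taking the numbers quoted above at face value):

```python
voxel_grid = 512 ** 3              # ≈ 134 million values for a 512³ volume
nerf_weights = 9 * 256 * 256       # ≈ 0.6 million weights for the MLP (rough count)
print(voxel_grid / nerf_weights)   # ≈ 227: the voxel grid is two orders of magnitude larger
```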
So the NeRF’s neural network is not the most natural way to represent distinct objects, but it is definitely a suitable way to render them, and perhaps this is what makes the resulting images so photorealistic.
How does the NeRF renderer work?
You can imagine the image we’re rendering like a screen placed in front of the camera, perpendicular to the camera’s view angle. To determine any pixel’s color in this image, we send a ray from the camera through the pixel’s position on the screen and onwards. If the ray hits any surface, the pixel’s color is determined to be the color of the surface at the point of impact. If the surface is opaque (α = 1), we’re done here. But if not, we have to keep following the ray to see what other surfaces it hits because their color will also impact that pixel’s color.
In the non-differentiable version of this renderer, only the closest opaque surface determined the color of the pixel. But if we want to make it differentiable, we need again to turn the argmax into a softmax. In other words, all the surfaces hit by the ray need to play some part in the computation, even those that are occluded by the closest one (which will still have the largest impact). This is key to maintaining that necessary link between all the regions of our 3D world and the resulting image pixels.
More concretely, to calculate a particular pixel’s color, we sample 100 points along the above-mentioned ray and ask our neural network for the color (r, g, b) and the degree of opacity (α) of each. The pixel’s final color is a linear combination (weighted average) of the colors of all the sampled points, even those that are in mid-air (which should have α close to 0), or those on surfaces that are "hidden" from the camera (which should have α close to 1). The i-th point along the ray (ordered away from the camera) gets the weight T_i · α_i, where the scalar α_i is calculated directly by the network, and the transmittance T_i = (1 − α_1)(1 − α_2)⋯(1 − α_{i−1}) depends on the opacity values of the previous points. You can see how the closest surface gets the most dominant weight (we normalize the weights so they sum to 1).
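Here is a minimal sketch of that compositing step in PyTorch, under the simplification used throughout this post that the network hands us an opacity α_i for each sample directly (the actual paper derives α_i from a predicted density and the spacing between samples). The function name and shapes are mine, for illustration.

```python
import torch

def composite(colors, alphas):
    """colors: (100, 3) and alphas: (100,), ordered from the camera outwards."""
    # T_i: how much of the ray "survives" past all the previous samples.
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0
    )
    weights = transmittance * alphas               # T_i * alpha_i
    weights = weights / (weights.sum() + 1e-10)    # normalize so the weights sum to 1
    return (weights[:, None] * colors).sum(dim=0)  # the pixel's final color

colors = torch.rand(100, 3)   # (r, g, b) returned by the network for each sampled point
alphas = torch.rand(100)      # opacity returned by the network for each sampled point
pixel_color = composite(colors, alphas)
```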
The Way Forward
I find this very exciting. This is no longer a neural network that predicts physics. This is physics (or optics) plugged in on top of a neural network, inside a PyTorch engine. We now have a differentiable simulation of the real world (harnessing the power of computer graphics) on top of a neural representation of it (harnessing the power of deep learning). No wonder the results look so photorealistic.
This romance between deep learning and graphics is only just beginning. It is going to close the gap between computer simulation and the real world. If you have a VR headset, think about watching movies where you don’t just sit at a static viewpoint, but can actually move inside the scene, and even interact with it.
The latest innovations will allow us to import objects and environments from the real world into the virtual world, relying less on artist-made assets. [In this context, I recommend watching Sanja Fidler’s (Nvidia) lecture on creating three-dimensional content, given at a workshop on the subject at the last NeurIPS.]
And those things I mentioned earlier about the limitations of the NeRF network? That it is too low-level for computer graphics? This is going to change. We’re going to extract from this 3D world representation semantic elements that will allow us to understand it and control it, just like we do with the classic mesh representation, while retaining that new standard of photorealism.
In the next post I will explain how we plan to do that.
Further Reading
[1] Mildenhall et al. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020 [link]
[2] Facebook’s PyTorch3D library. [link] [Tutorial on implicit functions and NeRF] [Dedicated Jupyter notebook for NeRF]
[3] Sanja Fidler. AI for 3D Content Creation. NeurIPS 2020 [link]
[4] Vladlen Koltun: Towards Photorealism [video]