Notes from Industry

Asset2Vec: Turning 3D Objects into Vectors and Back

How we used NeRF to embed our entire 3D object catalogue into a shared latent space, and what it means for the future of graphics

Jonathan Laserson, PhD
Towards Data Science
8 min read · Nov 14, 2021


At Datagen, where I currently work as the Head of AI Research, we create synthetic photorealistic images of common 3D environments for the purpose of training computer vision algorithms. For example, if you want to teach a house robot to navigate through a messy bedroom like the one below, it will take you quite some time to collect real images for a large-enough training set [people don’t usually like it when outsiders enter their bedroom, and they definitely won’t appreciate you taking pictures of their mess].

A synthetic image of a messy bedroom (Image by author).

We generated the above image using graphics software. Once we build (in the software) the 3D models of the environment and everything in it, we can use its ray-tracing renderer to generate an image of the scene from any camera viewpoint we like, under any lighting conditions we want. We have full control.

Now if you think collecting real images is hard, wait till you try to label these images. You’ll need to launch a long and expensive labeling operation to teach humans how to tag those pixels according to your standards. Of course, with synthetic images, we can easily produce pixel-perfect labels for any aspect of the image, including things humans can’t assess, like depth and surface normal maps.

From top-left going clockwise: scene rendered under different light conditions, surface normal-map, depth map, object class labels (Image by author).

Collecting Assets

In order to help our user train her robot, we need to generate tens of thousands of bedrooms like the one above, and we need to fill them up with stuff: furniture, clothes, books, used cups, remote controls, etc. Hence, we maintain a large catalogue of artist-made 3D objects (we call them assets). Our catalogue spans over one million objects, and we take pride in it.

Each asset consists of a detailed 3D mesh (a polygonal structure made of triangles, which defines the shape of the object) and a texture map (an image, which defines the appearance of the object, as if it were a blanket draped over the mesh surface).

An asset (left) is represented by a mesh structure (middle) and a texture map (right). Image by author.

Other than the mesh and the texture map, which define the object visually, the rest of the information stored for each asset is whatever the 3D artist decided to put in a textual meta-data file, like the type of the object (“dining chair”) and whatever tags seemed relevant at the time.

Hence, searching through the catalogue (for example, “give me all tables with 3 legs”) can only be done using their meta-data attributes. If the artist didn’t write the number of legs for each table, then the only way to know which tables are three-legged is to open their 3D model files one by one and look.

Asset-2-Vec

We instead propose to encode each asset in an embedding space that encapsulates the asset’s entire shape and appearance. Much like Word2Vec, which gives each word in the dictionary a “code” vector corresponding to a location in an n-dimensional space, such that semantically similar words reside close to each other, we want to assign a vector to each of our assets, so that we can tell all of an asset’s visual properties just by looking at its vector.

Encoding all our 3D assets into vectors. Image by author.

Ideally we would also want to decouple the shape properties (for example, the number of legs) from the appearance properties (like the color or material). How would we know that the vector indeed captures the entire essence of the object? The ultimate test is to be able to go back from the vector to the 3D model of the asset.

You can imagine a neural network that learns to read the vector as input and output the original mesh and texture map of the asset. However, this is going to be hard. Each asset’s mesh has a completely different topology: a different number of triangles, a different meaning for each vertex, and a different format for the mapping between the texture map and the mesh. We’ll need something else: an alternative 3D representation that fits all assets.

NeRF to The Rescue

This is where NeRF comes in. As you recall, in my previous post I showed how, using 40 or so images of an object taken from various viewpoints, we can train a neural network to learn the whole space around the object. The trained neural network takes as input a point (x, y, z) in space, and returns the color (r, g, b) and the degree of opacity (α) of the material at that point. The network learns the space so well that a renderer can take a “photo” of the object just by querying this network on points along the rays coming from a simulated “camera”.

The neural network at the center of NeRF (image by author).
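To make this concrete, here is a rough PyTorch sketch of what such a network can look like. This is a simplified illustration, not the exact architecture from the NeRF paper or our implementation: the layer sizes are arbitrary, and the positional encoding and the viewing-direction input are omitted.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    # A simplified NeRF-style MLP: maps a 3D point to a color and an opacity.
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),           # 3 color channels + 1 opacity value
        )

    def forward(self, xyz):                 # xyz: (N, 3) points in space
        out = self.mlp(xyz)
        rgb = torch.sigmoid(out[:, :3])     # colors in [0, 1]
        alpha = torch.relu(out[:, 3:])      # non-negative opacity/density
        return rgb, alpha

# Rendering a pixel amounts to querying the network on points sampled along a
# camera ray and compositing the returned colors according to their opacity.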

We can easily get 40 images covering each asset we have in our catalogue. We simply render them using the graphics software from the asset’s original mesh model (so we might as well render 80). Then we can train the NeRF network using these images to encode the space around the object. After the network is trained, we can generate a short movie showing the object from various new angles, where each frame is rendered by querying the trained NeRF network.

Left: A screenshot from Blender, a graphics software, where we render 80 synthetic images of the asset from various directions (each pyramid is a simulated camera), used to train the NeRF network. Right: The object rendered from 80 new view points, spanning 360 degrees, by querying the trained NeRF network. Images by the author.
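For intuition, choosing those viewpoints essentially means scattering camera positions on a sphere around the asset and pointing each camera at its center. A toy NumPy sketch of that sampling (not our actual Blender setup) could look like this:

import numpy as np

def cameras_on_sphere(n=80, radius=2.0, seed=0):
    # Sample n camera positions roughly uniformly on a sphere around the origin.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    positions = radius * v
    # Each camera looks back at the object placed at the origin.
    look_dirs = -v
    return positions, look_dirs

positions, look_dirs = cameras_on_sphere()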

What we’re going to do next is use a single NeRF network to encode not just a single asset (like the chair above), but all the chairs in our catalogue. This network is going to have exactly the same architecture as the NeRF network, with one difference: an additional input, the (latent) vector assigned to each chair, which it will learn during training. We will train this network exactly as we trained the NeRF network, except we will use the images rendered from all the chairs in our catalogue.

The NeRF network, with the latent vector as an additional input. During training, the latent vectors of the assets are also learned together with the weights of the fully-connected layers (image by author).

This training differs from common training in that, in addition to its own weights, the network also learns a special set of variables associated with each individual chair: its latent vector (similar to how an embedding layer is learned). When we backpropagate the error obtained from rendering a chair image, not only are the weights of the NeRF network updated, but so are the values in the latent vector assigned to that specific chair (the vec input in the figure above).
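A minimal sketch of this conditioned network, assuming a PyTorch setup, a 128-dimensional latent code, and illustrative names (this is not our production code):

import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    # Same NeRF-style MLP, but also conditioned on a per-asset latent vector.
    def __init__(self, num_assets, latent_dim=128, hidden=256):
        super().__init__()
        # One learnable latent vector per asset (like an embedding layer).
        self.latents = nn.Embedding(num_assets, latent_dim)
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, xyz, asset_ids):
        vec = self.latents(asset_ids)                     # (N, latent_dim)
        out = self.mlp(torch.cat([xyz, vec], dim=-1))
        rgb = torch.sigmoid(out[:, :3])
        alpha = torch.relu(out[:, 3:])
        return rgb, alpha

model = ConditionedNeRF(num_assets=1000)
# The optimizer updates both the shared MLP weights and the per-asset latents,
# so backpropagating a rendering error on chair k also moves chair k's vector.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)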

Once we have trained this network, we can use it to produce movies of all the chairs in the catalogue! To render any specific chair, we just need to give that chair’s latent code as an additional input to the NeRF network when we query it.

A sample of our chair assets rendered by querying a single NeRF network. Image by author.

Latent Exploration

We don’t need to restrict ourselves to the vectors learned for the assets in our current catalogue. We can also explore what happens when we change the latent vectors arbitrarily, or mix and match latent vectors from two different assets. Perhaps in the near future, we will be able to enrich our catalogue with new assets this way:

Mixing latent vectors: The chair at location (i, j) is produced when the input to the NeRF network combines the shape part of the latent vector of chair i with the appearance part of chair j. Image by the author.
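Assuming the latent vector is split into a shape half and an appearance half (see the appendix), mixing two chairs is just a matter of stitching their codes together before querying the network. A tiny illustrative sketch:

import torch

def mix_latents(z_i, z_j, shape_dim):
    # Take the shape part of chair i and the appearance part of chair j.
    # z_i, z_j: the learned latent vectors of two different chairs.
    return torch.cat([z_i[:shape_dim], z_j[shape_dim:]])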

To get an intuition for what the asset latent space looks like, we can use the t-SNE algorithm to place our assets on a 2D plane (according to their latent vectors):

t-SNE plot of the vectors of 522 of our assets, showing a clear separation between object classes and subclasses. Image by the author.
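A sketch of that projection using scikit-learn, assuming the learned latent vectors have been exported to a NumPy array (the file name here is hypothetical):

import numpy as np
from sklearn.manifold import TSNE

# latents: a (num_assets, latent_dim) array of the learned vectors,
# e.g. model.latents.weight.detach().cpu().numpy() from the sketch above.
latents = np.load("asset_latents.npy")      # hypothetical export of the latent codes
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
# coords_2d[k] is the 2D position of asset k in a plot like the one above.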

We can see that the separation between the different subclasses of chairs is evident, and even within a subclass (e.g. armchairs), we can easily find areas where the chairs share visual attributes (e.g. armchairs with wooden armrests vs. armchairs with padded armrests).

And this is the whole point, isn’t it? It means we can easily train a linear classifier (like an SVM) to identify any visual attribute we care about by feeding it the vectors of only a few positive and negative assets, and then use it to tag the rest of the 1M-asset catalogue without ever loading those assets. This ad-hoc classification can save us a ton of time!
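A sketch of such an ad-hoc classifier with scikit-learn, reusing the latents array from the previous snippet (the asset indices and labels below are hypothetical):

from sklearn.svm import LinearSVC

# A handful of manually labeled examples: 1 = "three-legged table", 0 = otherwise.
labeled_vectors = latents[[12, 45, 78, 101, 230, 512]]
labels = [1, 1, 1, 0, 0, 0]

clf = LinearSVC().fit(labeled_vectors, labels)

# Tag the entire catalogue without ever opening a 3D model file.
is_three_legged = clf.predict(latents)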

Latents are The Future

Finally, this analysis works not just on chairs, but on every other asset family in our catalogue. We can imagine a future where every scene we render is fully described by those latents: a latent for each asset, a latent for the background, a latent for the pose, the camera angle, the lights. And if we dare, we can imagine a future where the traditional mesh and texture maps are no longer used at all to render synthetic yet photorealistic images, like the messy bedroom scene pictured above.

A sample of our table assets rendered by querying a single NeRF network. Image by author.

Appendix: Decoupling Shape and Appearance

The NeRF network has two branches, one for the opacity (α) and the other for the color. To decouple the shape and appearance parts of the latent vector, we divided the input latent code into two, and showed the appearance part only to the color branch, as illustrated below. This trick was adopted from the recent GIRAFFE paper (Niemeyer and Geiger, CVPR 2021).

Architecture of our revised NeRF network (image by author).
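A sketch of this two-branch arrangement, with illustrative dimensions: the opacity branch sees only the point and the shape code, while the appearance code is fed only to the color branch.

import torch
import torch.nn as nn

class DecoupledNeRF(nn.Module):
    def __init__(self, shape_dim=64, app_dim=64, hidden=256):
        super().__init__()
        # Trunk shared with the opacity branch: sees the point and the shape code.
        self.trunk = nn.Sequential(
            nn.Linear(3 + shape_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.alpha_head = nn.Linear(hidden, 1)
        # Color branch: additionally sees the appearance code.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + app_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3),
        )

    def forward(self, xyz, z_shape, z_app):
        h = self.trunk(torch.cat([xyz, z_shape], dim=-1))
        alpha = torch.relu(self.alpha_head(h))        # opacity depends on shape only
        rgb = torch.sigmoid(self.color_head(torch.cat([h, z_app], dim=-1)))
        return rgb, alpha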

To gain a better understanding of the shape and appearance sub-spaces, we did a PCA analysis of the relevant part of the latent vectors:

Illustration of the top 10 PCA components for the shape-space (top) and appearance space (bottom). Images by the author.
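A sketch of that analysis with scikit-learn, again reusing the latents array and assuming a split point for the shape part of the vector:

import numpy as np
from sklearn.decomposition import PCA

shape_dim = 64                                   # assumed split point of the latent vector
shape_latents = latents[:, :shape_dim]

pca = PCA(n_components=10).fit(shape_latents)

# Walk along the top component from the mean latent to visualize what it controls.
steps = np.linspace(-3, 3, 7)[:, None]
walk = pca.mean_[None, :] + steps * pca.components_[0][None, :]
# Each row of `walk` can be fed back (with a fixed appearance code) into the NeRF
# network to render one frame of an animation like the ones illustrated above.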

We can see how these control various aspects of the chair, such as the size of the seat, its width, the height of its back, etc. In the appearance space, we can see how the PCA directions control not only colors but also their saturation, brightness, and the lighting conditions.

