Generating, With Style: The Mechanics Behind NVIDIA’s Highly Realistic GAN Images

Cody Marie Wild
Towards Data Science
11 min read · Jan 21, 2019


If you followed any machine learning news towards the end of last year, you probably saw images like this circulating: the results of a new Generative Adversarial Network (GAN) architecture introduced in a recent paper by a team at NVIDIA.

An apparently randomly selected set of faces produced by NVIDIA’s “Style-Based” GAN

Hearing that jaw-dropping results are being produced by some novel flavor of GAN is hardly a new experience if you follow the field, but even by recently heightened standards, these images are stunning. For the first time, I’m confident I personally wouldn’t be able to differentiate these from real images. Reading between the lines of the paper’s framing, it seems like the primary goal of this approach was actually to create a generator architecture where global and local image features were represented in a more separable way, and could as a result be more easily configured to specify and control the image you want to generate. The fact that the images were also astonishingly realistic appears to have been a pleasant side effect.

(As an aside, this post is going to assume a basic background knowledge of how convolutional GANs work; if you don’t have that background, this article is a good jumping-off point)

But First, The Basics

Traditional convolutional GANs work by sampling a vector z from some distribution, projecting that vector into a low-resolution spatial form, and then performing a sequence of transposed convolutions to upsample a 2x2 feature space to 4x4, then to 8x8, and so on. Although the z vector is just sampled randomly, our ultimate goal is to create a mapping between the distribution of images and our reference distribution Z, such that each vector z drawn from Z corresponds to a plausibly real image. As a result, despite being meaningless at first, each particular z ends up corresponding to and encoding properties of the image that it will produce.

In their simplest form, transposed convolutions work by learning a filter matrix (for example, 3x3), and multiplying that by the value at each pixel to expand its information outward spatially. Each of the four “pixels” in the 2x2 representation influences the values in a 3x3 patch of the 4x4 output; these patches overlap and sum to create the final “blown out” representation.
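To make the shapes concrete, here’s a minimal PyTorch sketch of this kind of upsampling chain. The channel counts and exact kernel/stride settings are illustrative assumptions rather than any specific published architecture; the first layer matches the 2x2-to-4x4 example above, and the second uses striding to double the resolution again.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 2, 2)  # one image, 256 feature maps, on a 2x2 grid

# 3x3 filter, no striding: each of the four input "pixels" paints an
# overlapping 3x3 patch of the 4x4 output
up1 = nn.ConvTranspose2d(256, 128, kernel_size=3, stride=1, padding=0)

# strided version that doubles the spatial resolution again
up2 = nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2,
                         padding=1, output_padding=1)

h = up1(x)      # -> (1, 128, 4, 4)
h = up2(h)      # -> (1, 64, 8, 8)
print(h.shape)
```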

The visual above, while good for building simplified intuition, is a little misleading, since it makes it look like the values of the enlarged patch have to be spun out of a single piece of information from a single pixel. Remember: at this point in the network, each of the four “pixels” contains a whole vector, perhaps 128 or 256 feature maps worth of information.

Since most areas in the bigger “upsampled” image only get information from a certain set of “parent” pixels (for example, the top left pixel in the 4x4 image only contains information from the top left pixel in the 2x2), the information at each level needs to be somewhat spatially diversified, able to lay out where different components sit in space, but also somewhat global. If we’re trying to, say, create an image in black and white, each of the 2x2 pixels needs to contain that information in order to make sure it’s passed down to the next generation, and we don’t end up with a chunk of the final image having not gotten the memo. Starting from the fully global z vector that specifies the full image, each pixel is responsible for conveying to its children the information that they, in turn, will need to convey to their children, and so on until the last layer of “children” is just the RGB pixels of the output image.

This is the fundamental problem that transposed convolutional networks (or “deconvolutional” networks, though that’s a confusing term, and thus dispreferred), like those used in GAN generators, are solving: how to distribute information and instructions to different spatial regions, so as to ultimately orchestrate the production of a globally coherent, high-quality image.

Think Globally, Generate Locally

As the prior section illustrates, in the layers of the generator of a typical GAN, global information (specifying image-wide parameters that all regions need to coordinate on) and local information (where exactly, spatially, should we put the hands, vs the eyes, vs strands of hair) are all mixed together in a shared feature vector, passed on from one layer to the next.

The method proposed by the NVIDIA paper, which they term a “style-based generator,” takes a different approach: instead of feeding a single vector in only at the beginning of the network, they re-inject a global parameter vector, called w, which is the output of a learned transformation applied to the random sample z, at each layer of the network.

Walking through the diagram on the left: first, a draw is made from the Z distribution, which is typically something like a multidimensional Gaussian. Then, a mapping network is used to transform the z vector into a w vector. The idea of this transformation is that by mapping between z and w, we might be able to get a more natural or disentangled representation of concepts. (“Disentangled” here means that each dimension of the vector corresponds to an independent feature axis along which images vary, like hair color or age). The paper’s explanation is that, since we’re typically constrained to distributions we can easily sample from, it might not be possible to represent image concepts along axis-aligned directions while still having every region of the sampling distribution correspond to a coherent image; the learned mapping to w relaxes that constraint.
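In code, the mapping network is just a small stack of fully connected layers; here’s a rough sketch. The paper uses an 8-layer network with 512-dimensional latents, but treat the activation choice and other details below as assumptions.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Transforms a sampled z vector into a style vector w."""
    def __init__(self, latent_dim: int = 512, num_layers: int = 8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

mapping = MappingNetwork()
z = torch.randn(4, 512)  # four draws from the Gaussian Z distribution
w = mapping(z)           # four corresponding style vectors
```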

On the generator side of things, rather than starting with the usual random z vector, we start with a learned constant 4x4 block of feature maps, just to get the right shape. We then see two operations, noising and style injection, which are repeated throughout the network.
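In code, that starting point is nothing more than a trainable tensor; a minimal sketch (with the 512-channel, 4x4 shape assumed from the paper’s configuration) looks like this:

```python
import torch
import torch.nn as nn

# The synthesis network starts from a learned constant block of features,
# rather than a reshaped random vector.
const_input = nn.Parameter(torch.randn(1, 512, 4, 4))

# every image in a batch starts from the same learned features
batch_start = const_input.expand(4, -1, -1, -1)  # -> (4, 512, 4, 4)
```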

Noising

Noising, represented in the diagram as the B and addition operators, is performed first. A noise map is randomly generated (initially of size 4x4, but in subsequent layers matching whatever spatial dimensions the representation has at that point) and passed into the network. Each feature map adds a scaled version of this to its values, with each map’s unique scaling factor learned as a network parameter. This has the effect of randomly perturbing the values of each feature map, and, importantly, perturbing the values at each spatial pixel of the map by a different amount, based on the magnitude of the sampled noise there. I’ll circle back around and try to build some intuition for why this is a valuable thing to do later in the post.
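Here’s a minimal sketch of what that noising step might look like, with one learned scaling factor per feature map; the class name and initialization choices are my own assumptions, not the paper’s reference implementation.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds a single-channel noise map to every feature map, each with
    its own learned scaling factor."""
    def __init__(self, num_channels: int):
        super().__init__()
        # one learned scaling factor per feature map
        self.scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, _, height, width = x.shape
        # fresh noise at the current spatial resolution
        noise = torch.randn(batch, 1, height, width, device=x.device)
        return x + self.scale * noise

features = torch.randn(4, 512, 4, 4)
features = NoiseInjection(512)(features)
```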

Style Injection

After noising, the network performs “style” injection, a form of learned normalization based on global style parameters. Here’s where we use the w vector generated by the mapping network, which captures the global specification of the image. At each place this operation is performed, we use a learned affine transformation (basically a typical single neural network layer) to take in this global vector w, and output a vector containing two values for each feature map present at this point in the network: one scale parameter and one shift (or “bias”) parameter. These parameters are used in what’s called Adaptive Instance Normalization, or AdaIN for short, which is an operation for applying separate per-feature shifting and rescaling.
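Concretely, the AdaIN operation from the paper is:

```latex
\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}
```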

In the equation above, the index i refers to a specific feature map. In words, we take the ith feature map (which, remember, is laid out over some set of spatial dimensions) and calculate its mean value and standard deviation. We then perform a normalization and rescaling operation, reminiscent of batch norm: subtract the mean and divide by the standard deviation, then multiply by the scale parameter and add the shift parameter.
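Putting the learned affine transformation and AdaIN together in code might look roughly like this; the dimensions and module structure are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class StyleInjection(nn.Module):
    """Maps the global style vector w to per-feature-map (scale, bias)
    pairs and applies them via adaptive instance normalization."""
    def __init__(self, num_channels: int, w_dim: int = 512):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * num_channels)
        # per-sample, per-feature-map normalization over spatial dimensions
        self.norm = nn.InstanceNorm2d(num_channels)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        scale, bias = self.affine(w).chunk(2, dim=1)   # (batch, C) each
        scale = scale.unsqueeze(-1).unsqueeze(-1)      # -> (batch, C, 1, 1)
        bias = bias.unsqueeze(-1).unsqueeze(-1)
        return scale * self.norm(x) + bias

features = torch.randn(4, 512, 4, 4)
w = torch.randn(4, 512)
features = StyleInjection(512)(features, w)
```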

These two operations allow us to control the mean and variance of each feature for each image being generated, through our learned per-feature scale and shift parameters. The most important fact about all of these operations is that they are performed globally, across all spatial regions at once. If you think (very roughly) of each of these feature maps as a dial controlling some image feature (skin color, length of hair, distance between eyes), then performing this rescaling operation based on values calculated from our global style parameter w is like moving those dials from the default “average face” position to the values we want for this particular image. In this architecture, it’s no longer the case that each deconvolution operation has to learn how to remember global information, and keep it consistent across all spatial regions: that consistency comes for free.

These two operations, noising and style-driven normalization, are the core of this architecture — the network functions by applying these on either side of deconvolution and upsampling. Together, they appear to have the effect of “factoring” information into more separable and separately controllable components, and facilitating more realistic generation of stochastic image elements.
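To show how these pieces slot together around the upsampling and convolution steps, here’s a compressed, self-contained sketch of one synthesis block. The exact layer ordering, upsampling method, and dimensions are assumptions meant to illustrate the structure rather than reproduce the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisBlock(nn.Module):
    """One block: upsample, convolve, add scaled noise, then renormalize
    with the style-derived scale and bias (AdaIN)."""
    def __init__(self, in_ch: int, out_ch: int, w_dim: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.noise_scale = nn.Parameter(torch.zeros(1, out_ch, 1, 1))
        self.affine = nn.Linear(w_dim, 2 * out_ch)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsample
        x = self.conv(x)                                       # local feature mixing
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3],
                            device=x.device)
        x = x + self.noise_scale * noise                       # noising
        scale, bias = self.affine(w).chunk(2, dim=1)
        x = F.instance_norm(x)                                 # per-feature normalization
        return scale[..., None, None] * x + bias[..., None, None]  # style injection

block = SynthesisBlock(512, 256)
out = block(torch.randn(4, 512, 4, 4), torch.randn(4, 512))  # -> (4, 256, 8, 8)
```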

What’s this Noise All About?

The authors’ rationale for the use of noise injection is that there are aspects of an image that, if not totally random in a rigorous sense, are at least not informationally meaningful, and are easier to generate in convincingly random fashion when noise is given directly to the network. Consider, for example, the curls in the little girl’s hair in the (a) picture below on the left, or the exact placement of the leaves in the same position on the right. Both of these look more realistic to the eye when the positions and orientations of the individual curls or leaves appear random.

This figure shows the same generated image, but with injected noise removed at different levels of the network. (a) shows the default StyleGAN, with noise at all levels. In (b), all noise is removed from the network. In (c), noise is applied at the later layers responsible for fine detail, but not at coarser/lower resolution layers. Finally, in (d), that is swapped, and noise is present for coarse layers but not fine ones. The same orientation of a/b/c/d follows for the boy on the right.

However, convincing randomness is actually relatively difficult for a network to natively generate; since everything inside a network is just deterministic mathematical transformations, often randomness will be approximated by some parametric function that has noticeable patterns or periodicity. Consider, as an example, the boy’s picture in the bottom left rectangle: the leaves are placed at approximately even intervals apart, in a way that doesn’t look particularly realistic.

Another way to think about the role of noise is as giving the network a way to easily sample from a distribution of equivalently plausible variations of a given image. The image on the left shows a generated image, and then shows the differences you get when you generate that image with all global elements the same, but with different instantiations of fine-level noise. The result is a generally very similar image, but with minor variations to details not essential to the image’s core identity. Although obviously none of these people are “real”, you can think of this very generally like taking different pictures of the same person, but with the wind or the lighting a little bit changed, and thus these random elements of their appearance in a somewhat different configuration.

You’ve Got Style

That explains the value of noise, but what benefit are we getting from this style injection setup? As a reminder, styles are configured on a per-image basis through two normalization parameters calculated per feature map, and those parameters are deterministically calculated from a global style vector w. As an aside, the term “style” is used for these global parameters because the way they globally affect an image, impacting a broad array of aesthetic and content features, is reminiscent of style transfer work, where the visual style of one artist is applied to a photo, or to another artist’s work. I think it’s a bit confusing, and prefer to think of styles as configurations, or global parameter settings, but I still use “style” occasionally, because that’s the term the paper uses.

Returning to our metaphor of styles being a way to “turn the dials” of image generation on a per-feature basis, one way to demonstrate how styles control resulting images is to see what happens when you combine feature settings from two different images. Because style injection happens separately at each layer, this can be done by simply feeding Person A’s w vector to some set of layers, and Person B’s w vector to the remaining layers, so that some layers are configured according to Person A’s parameters, and others according to Person B’s. This is what’s being shown in the image above: in each row, we’re taking the leftmost image and swapping out some group of its style parameters for those of the image in the corresponding column. For the first three rows, we’re swapping in the coarse style parameters from the source; for the next two rows, the intermediate ones; and for the last row, only fine style parameters are “imported” from the alternate image. (Note that none of these are real images, simply different artificial draws from the input z distribution, which are then mapped into w vectors using the mapping network.)
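Mechanically, style mixing amounts to nothing more than choosing, per layer, which image’s w vector to feed in. The sketch below assumes a hypothetical generator interface that accepts one w per layer (18 layers, matching StyleGAN at 1024x1024 resolution), which is not the paper’s exact API.

```python
import torch

def mix_styles(w_a: torch.Tensor, w_b: torch.Tensor,
               crossover_layer: int, num_layers: int = 18):
    """Build a per-layer list of style vectors: layers before the crossover
    use Person A's w, layers at and after it use Person B's w."""
    return [w_a if i < crossover_layer else w_b for i in range(num_layers)]

# Usage sketch: coarse layers (0-3) keep A's parameters, the rest come from B.
w_a, w_b = torch.randn(512), torch.randn(512)
per_layer_styles = mix_styles(w_a, w_b, crossover_layer=4)
```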

Reconfiguring the coarse parameters of an image has, predictably, the most dramatic impact, completely changing the age, gender, and facial shape to match the source image, but with ethnicity and overall color palette kept constant from the destination. When we only change fine-level parameters, we get what appears to be recognizably the same person, with the same gender and facial expression, but with minor changes to hair and skin color. By the Goldilocks principle, middle layers are somewhere in between: more of an equal melding between the two images.

All of this demonstrates a useful flexibility: instead of just being able to interpolate between generated images, and have global and local features change jointly, this architecture allows for greater control in merging or modifying images on a scale-specific, or even feature-by-feature, basis. Typically, this level of control over generation would be facilitated by training a class-conditional model, where you can then induce generation of a specified class for each image. We aren’t quite at the point of being able to specify a set of image instructions, and have that be translatable into a blueprint that a generator can follow, but this approach brings us closer to being able to do so without direct class supervision.

Zooming Out

An interesting observation I recall reading someone make about this paper (though unfortunately I forget who) is that it’s not so much an advance in GAN design as it is a lot of quite clever image-specific generator design. I think this is a generally good distinction to keep in mind, both here and more broadly in ML: since so much of modern deep learning research focuses on images, it’s valuable to keep a running awareness of which successes represent generalized progress, and which are domain-specific advancements. Not that there’s anything wrong with clever domain-specific work (on the contrary, I think there’s huge value in it), but we should be careful not to conflate “here is impressive progress on images” with “here is impressive progress on learning tasks more broadly”.

A quick caveat: I dug deep into this paper, and feel relatively confident of my understanding of it, but I haven’t done quite as thorough a review of the related literature as I have for some past posts, so it’s possible I’ve misunderstood the relationship of StyleGAN to the current state of the art. Do let me know if you think that’s the case!
