Please Stop Drawing Neural Networks Wrong

The case for GOOD diagrams


Image by the authors, adapted from https://tikz.net/neural_networks/ (CC BY-SA 4.0)

By Aaron Master and Doron Bergman

If you’re one of the millions of people who have tried to learn neural networks, odds are you’ve seen something like the image above.

There’s just one problem with this diagram: it’s nonsense.

By which we mean confusing, incomplete, and probably wrong. The diagram, inspired by one in a famous online Deep Learning course, excludes all of the bias coefficients and shows data as if it were a function or node. It “probably” shows the inputs incorrectly. We say probably because, even after one of us earned certificates in the very courses that use this kind of diagram, it’s more or less impossible to determine what it’s trying to show.¹

Other neural network diagrams are bad in different ways. Here’s an illustration inspired by one in a TensorFlow course from a certain Mountain View-based advertising company:

Image by the authors, adapted from the previous image.

This one shows the inputs more clearly than the first one, which is an improvement. But it does other weird stuff: it shows the bias by name but doesn’t diagram it visually, and also shows quantities out of the order in which they are used or created.

Why?

It’s hard to guess how these diagrams came to be. The first one looks superficially similar to the flow network diagrams used in graph theory. But it violates a core rule of such flow diagrams, which is that the amount of flow into a node equals the amount of flow out of it (with exceptions that don’t apply here). The second diagram looks as if it started as the first one and was then edited to show both parameters and data, which ended up in the wrong order.² Neither of these diagrams shows the bias visually (and neither do most others we’ve seen), but this choice doesn’t save much space, as we will see below.

We did not cherry-pick these diagrams. An internet image search for “neural network diagrams” reveals that those above are the norm, not the exception.

Bad diagrams are bad for students

The diagrams above would probably be fine if they were being used only among seasoned professionals. But alas, they are being deployed for pedagogical purposes on hapless students of machine learning.

Learners encountering such weirdness must make written or mental notes such as “there is bias here, but they aren’t showing it,” or “the thing they are drawing inside a circle is actually the output of processing shown inside the same circle two slides ago,” or “the inputs don’t actually work the way they are drawn.” The famous (and generally excellent) course mentioned above features lectures where the instructor patiently repeats several times that a given network doesn’t actually work the way a diagram shows it working. In the third week of the course, he valiantly tries to split the difference, alternating between special, accurate depictions which show what happens inside a node, and more typical diagrams which show something else. (If you want to see those better node depictions, this blog post nicely shows them.)

A better way

Learning neural networks should not be an exercise in decoding misleading diagrams. We propose a constructive, novel approach for teaching and learning neural networks: use good diagrams. We want diagrams that succinctly and faithfully represent the math — as seen in Feynman diagrams, Venn diagrams, digital filter diagrams, and circuit diagrams.

Let’s make GOOD diagrams

So, what exactly do we propose? Let’s start with basics. Neural networks involve many simple operations which already have representations in flow diagrams that electrical engineers have used for decades. Specifically, we can depict copying data, multiplying data, adding data, and inputting data to a function which outputs data. We can then assemble abbreviated versions of these symbols into an accurate whole, which we will call Generally Objective Observable Depiction diagrams, or GOOD diagrams for short. (Sorry, backronym haters.)

Let’s look at the building blocks. To start, here’s how you show a total of three copies of data coming from a single source of data. It’s pretty intuitive.

And here is a way to show scaling an input. It’s just a triangle.

The triangle indicates that the input value x₁ going into it is scaled by some number w₁, to produce a result w₁ times x₁. For example, w₁ could be 0.5 or 1.2. Later on it will be easier if we move this triangle to the right end of the diagram (merging it with the arrow) and make it pretty small, so let’s draw it that way.

OK, we admit it: this is just an arrow with a solid triangle tip. The point, as it were, is that the triangle tip multiplies the data on the arrow.

Next, here’s a way to show adding two or more things together. Let’s call the sum z. Also simple.
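In code, the building blocks so far are nothing exotic: copying data is fan-out, the scaling triangle is multiplication, and the sum node is addition. Here is a minimal Python sketch, with made-up values:

```python
# Copying: one data source fans out to three destinations.
x1 = 0.7
copies = [x1, x1, x1]  # three copies of the same data

# Scaling: the triangle multiplies the data on the arrow by a weight.
w1 = 0.5
scaled = w1 * x1  # 0.35

# Adding: the sum node combines scaled inputs into z.
x2, w2 = -1.0, 1.2
z = w1 * x1 + w2 * x2  # z = -0.85
```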

Now, we showed addition and multiplication above with some standard symbols. But what if we have a more general function that takes an input and produces an output? More specifically, when we’re making neural nets, we will use an activation function that is often a Sigmoid or ReLU. Either way, it’s no problem to diagram; we just show this as a box. For example, say the input to our function is called z and the function of z is called g(z) and produces an output a. It looks like this:

Optionally, we can note that g(z) has a given input-output characteristic, which can be placed near the function box. Here’s a diagram including a g(z) plot for ReLU, along with the function box. In practice, there are only a few commonly used activation functions, so it would also be sufficient to note the function name (e.g. ReLU) somewhere near the layer.

Or, we can abbreviate even more, since there will be many, often identical, activation functions in a typical neural network. We propose using a single stylized script letter inside the function box for a specific activation, e.g. R for ReLU:

Similarly, Sigmoid could be represented with a stylized S, and other functions with other specified letters.
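For concreteness, here are both activations as plain Python (a sketch using NumPy; any numerical library would do):

```python
import numpy as np

def relu(z):
    # ReLU: pass positive values through, zero out negatives.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid: squash any real z into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

a = relu(-2.0)    # 0.0
a = sigmoid(0.0)  # 0.5
```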

Before moving on, let’s note a simple but key fact: we show data (and its direction of travel) as arrows, and we show operations (multiplying, adding, general functions) as shapes (triangle, circle, square). This is standard in electrical engineering flow diagrams. But for some reason, perhaps owing to inspiration from early computer science research that physically colocated memory and operations, this convention is ignored or even reversed when drawing neural networks.³ The distinction matters because we do train the function parameters, but we do not train the data, each instance of which is immutable.

OK, back to our story. Since we will soon be constructing a neural network diagram, it will need to depict a lot of “summing then function” processing. To make our diagrams more compact, we will create an abbreviated symbol that combines these two operations. To start, let’s draw them together, again assuming a ReLU for g(z).

Since we are about to abbreviate things, let’s see how they look when placed really close together. We will also drop the internal variable and function symbols from the plot, and add some dotted lines to hint at a proposed shape:

Based on this, let’s introduce a new summary symbol for “sum then function”:

Its special shape serves as a reminder of what it is doing. It also looks different from other symbols on purpose, to help us remember that it is special.⁴
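In code terms, the new symbol is simply a sum followed by a function call. A sketch (the helper name here is our own invention):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sum_then_function(weighted_inputs, g):
    # The combined symbol: add everything up, then apply the activation g.
    z = np.sum(weighted_inputs)
    return g(z)

a = sum_then_function([0.35, -1.2, 0.1], relu)  # relu(-0.75) == 0.0
```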

A GOOD thing

Now, let’s put all the diagrammed operations above together using a simple example of logistic regression. Our example starts with a two-dimensional input, multiplies each input dimension’s value by a unique constant, adds together the result along with a constant b (which we call bias), and passes the sum through a Sigmoid activation function. For reasons that will make sense later, we show the bias as the number 1 times the value b. For completeness (and foreshadowing) we give all these values names which we can show on the diagram. The inputs are x₁ and x₂, and the multiplication factors include the weights w₁ and w₂ as well as the bias b. The sum of the weighted inputs and bias is z, and the output of function g(z) is a.

About that number “1” shown at the lower left of the diagram: the 1 is not an input, but by showing it alongside the inputs, we clarify that each of these values is multiplied by a parameter contributing to the sum. This way we can show both the values of w (input weights) and the value of b (bias) on the same diagram. Bad diagrams usually skip showing the bias, but GOOD ones don’t. Skipping bias in a diagram is especially risky in situations where a network might sometimes intentionally omit the bias; if the bias is not shown, a viewer is left to guess whether it is part of the network or not. So please deliberately include or exclude bias in your diagrams.
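Here is the same logistic regression in Python, with made-up parameter values and the “1 times b” term written out to match the diagram:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.8, -0.3         # the two-dimensional input (made-up values)
w1, w2, b = 0.5, 1.2, 0.1  # weights and bias (made-up values)

z = w1 * x1 + w2 * x2 + 1 * b  # the "1 times b" from the diagram
a = sigmoid(z)                 # the output, squashed into (0, 1)
```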

Now let’s clean this up a bit by using the “sum then function” symbol we defined above. We also show the variable names below the diagram. Note that we indicate a Sigmoid function with the stylized script letter S in the “sum then function” symbol.

That looks pretty simple. It is a GOOD thing.

Now, let’s build something a little more interesting: an actual neural network (NN) with a hidden layer of three units with ReLU activations, and an output layer with a Sigmoid activation. (If you’re not familiar, a hidden layer is any layer except the input or the output.) Note that this is the same architecture used in the Mountain View network diagram above. In this case, each input dimension, and the input layer bias, connects to every node in the hidden layer; then the hidden layer outputs (plus a bias value again) connect to the output node. The output of each function is still called a, but we use bracketed superscripts and subscripts to denote, respectively, the layer and node we are outputting from. Similarly, we use bracketed superscripts to indicate the layer to which the w and b values point. Using the style from the previous example, it looks like this:

Now we are getting somewhere. At this point, we also see that the dimensions of W and b for each layer are specified by the dimensions of the inputs and the number of nodes in each layer. Let’s clean up the above diagram by not labeling every w and b value individually.

All images in this article by the authors.

Ta-dah! We have a GOOD neural network diagram that is also good. The learnable parameters are both shown on the diagram (as triangles) and summarized below it, while the data is shown directly on the diagram as labeled arrows. The architecture and activation functions for the network, typically called hyperparameters, are seen by inspecting the layout and nodes on the diagram itself.
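To tie the picture back to the math, here is the whole forward pass as a NumPy sketch. The layer sizes match the network above; the random parameter values are stand-ins for trained weights:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)  # stand-in (untrained) parameters

# Hidden layer: 3 ReLU nodes fed by 2 inputs -> W1 is [3, 2], b1 is [3, 1].
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))
# Output layer: 1 Sigmoid node fed by 3 hidden outputs -> W2 is [1, 3], b2 is [1, 1].
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))

x = np.array([[0.8], [-0.3]])  # input as a [2, 1] column vector

a1 = relu(W1 @ x + b1)         # each hidden node: weighted sum, then ReLU
a2 = sigmoid(W2 @ a1 + b2)     # output node: weighted sum, then Sigmoid
```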

It’s all GOOD

Let’s consider the benefits of GOOD diagrams, independent of bad ones:

  • It’s easy to see the order of operations for each layer. The multiplications happen first, then the sum, then the activation function.
  • It’s easy to see (immutable) data flowing through the network as separate from (trainable) parameters belonging to the network.
  • It’s easy to see the dimensionality of the W matrix and b vector for each layer. For a layer with N nodes, it’s clear that we need b to be of shape [N,1]. For a layer with N nodes coming after a layer with M nodes (or inputs), it’s clear that W is of shape [N,M]. (However, one still must memorize that the shape is [outputs, inputs], not [inputs, outputs].) A quick shape check appears after this list.
  • Relatedly, we see exactly where the weights and biases live: between layers. Conventionally they are named as belonging to the layer they output to, but attentive students using GOOD diagrams are reminded that this is just a naming convention.
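As promised in the third bullet, a quick NumPy sanity check of the shape rule (the sizes here are hypothetical):

```python
import numpy as np

# A layer with N nodes following M inputs (hypothetical sizes):
M, N = 2, 3
W = np.zeros((N, M))  # [outputs, inputs] -- not [inputs, outputs]
b = np.zeros((N, 1))  # one bias per node
x = np.zeros((M, 1))  # one column of input data

assert (W @ x + b).shape == (N, 1)  # the layer emits one value per node
```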

Let’s also review how GOOD diagrams differ from bad ones:

  • They show the bias at each layer. They do not omit the bias.
  • They show data as data, and functions as functions. They do not confuse the two.
  • They show when data is copied and sent to functions. They do not skip this step.
  • They show all the steps in the correct order. They do not incorrectly reorder or omit steps.
  • They are reasonably clean and concise. OK, the bad ones are a bit more concise.

Do some GOOD

This article has spilled a lot of ink covering what’s wrong with bad diagrams and justifying GOOD ones. But if you are an ML instructor, we encourage you to just start using GOOD diagrams, without fanfare. GOOD diagrams are more self-explanatory than other options. You’ll be covering how neural nets work in your course anyway, so introducing GOOD diagrams at that point is a good idea.

And of course, as a service to your students, it’s a good idea to show some bad diagrams too. It’s important to know how the outside world is drawing things, even when it’s nonsense. In our estimation, it’s much easier to learn from something accurate and then to consider something confusing, than it is to do the reverse.

More GOOD stuff is ahead

This completes the first article in what will become a series if it catches on. In particular, we have our collective eye on Simplified Network diagrams which compactly represent the kinds of fully connected networks shown above, and which could also stand some improvement. Convolutional Network diagrams deserve their own treatment. We are also looking into developing a software package which automates drawing of GOOD diagrams.

Acknowledgements

The authors thank Jeremy Schiff and Mikiko Bazeley for their assistance with this piece.

References and endnotes

1) Based on the other layers, maybe the first diagram is running the inputs into nontrivial activation functions, from which we get values likely different from the inputs. But there have been no examples in the associated courses that work this way, so it wouldn’t make sense to include such a diagram as the only fully connected diagram on the cheat sheet. Or maybe the first layer “a” values shown are identical to the inputs, in which case the activation functions are identity functions which incur trivial and unnecessary processing. Either way, the diagram is ambiguous and therefore bad.

2) Either of the first two diagrams looks like it could be an unfortunate condensation of better, older diagrams, such as those in chapter 6 of the second edition of Pattern Classification by Duda, Hart and Stork (which one of us still has in hard copy from CS229 at Stanford in 2002). That book shows activation functions in circular units (which is better than showing their outputs inside the units), and correctly shows outputs leaving the units before copies are made and split off to the next layer. (It also shows the inputs and bias oddly, though.)

3) If your study has progressed to include Convolutional Nets (CNs), you will see that CN diagrams routinely show data as blocks and processes as annotated arrows. Don’t fret. For now, just remember that there’s an essential difference between data and processes, and, for fully connected neural nets, a good (or GOOD) diagram will make clear which is which.

4) For you logic fans out there who see the “sum then function” symbol as an AND gate operating in reverse, remember that AND gates are irreversible. Therefore, this new symbol must have another meaning, which we define here.
