Debugging your tensorflow code right (without so many painful mistakes)

Halyna Oliinyk
Towards Data Science
9 min read · Feb 9, 2019


Whenever writing code in tensorflow comes up, the conversation turns to comparisons with PyTorch, to how complex the framework is, and to why some parts of tf.contrib work so badly. Moreover, I know a lot of data scientists who interact with tensorflow only as a dependency of a pre-written GitHub repo that can be cloned and then successfully used. The reasons for this attitude towards the framework vary widely and are definitely worth another long-read, but today let’s focus on more pragmatic problems: debugging code written in tensorflow and understanding its main peculiarities.

Core abstractions

  • computational graph. The first abstraction, which lets the framework handle the lazy evaluation paradigm (as opposed to eager execution, which follows the “traditional” imperative style of Python programming), is the computational graph, tf.Graph. Basically, this approach allows the programmer to create tf.Tensor (edges) and tf.Operation (nodes) objects that are not evaluated immediately, but only when the graph is executed. Such an approach to constructing machine learning models is quite common across frameworks (a similar idea is used in Apache Spark, for instance) and has different pros and cons, which become obvious while writing and running the code. The main and most important advantage is that dataflow graphs enable parallelism and distributed execution quite easily, without explicitly using the multiprocessing module. In practice, a well-written tensorflow model uses the resources of all of the cores as soon as it is launched, without any additional configuration.
    However, an obvious disadvantage of this workflow is that as long as you are only constructing the graph and not running it (with or without input provided), you can never be sure that it will not crash. It very well may. Also, unless you have executed the graph, you cannot estimate its running time either.
    The main components of the computational graph worth talking about are graph collections and graph structure. Strictly speaking, the graph structure is the specific set of nodes and edges discussed earlier, and graph collections are sets of variables that can be grouped logically. For instance, the common way to retrieve the trainable variables of a graph is tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES).
  • session. The second abstraction is tightly coupled to the first one and has a slightly more complex interpretation: the tensorflow session, tf.Session, is used to connect the client program to the C++ runtime (as you remember, tensorflow’s core is written in C++). Why C++? The answer is that mathematical operations implemented in that language can be very well optimized and, as a result, the computational graph operations can be processed with great performance.
    If you’re using the low-level tensorflow API (which most Python developers use), the tensorflow session is invoked as a context manager, using the with tf.Session() as sess: syntax. A session with no arguments passed to the constructor (as in the previous example) uses only the resources of the local machine and the default tensorflow graph, but it can also access remote devices via the distributed tensorflow runtime. In practice, a graph is useless without a session (on its own it cannot be executed), and a session always holds a pointer to the global graph.
    Diving deeper into the details of running a session, the main point worth noting is its syntax: tf.Session.run(). It takes a fetch (or a list of fetches) as an argument, which can be a tensor, an operation, or a tensor-like object. In addition, feed_dict can be passed (this optional argument is a mapping, i.e. a dictionary, of tf.placeholder objects to their values), together with a set of options. A minimal sketch of the whole graph-plus-session workflow is given right after this list.
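
To make the two abstractions concrete, here is a minimal sketch of the workflow described above. It assumes the 1.x-style API discussed in this article, and the names x, w and y are purely illustrative.

```python
import tensorflow as tf  # assumes the 1.x-style API used throughout the article

# Building the graph: nothing is computed yet, we only describe operations.
x = tf.placeholder(tf.float32, shape=(None, 3), name='x')
w = tf.get_variable('w', shape=(3, 1))
y = tf.matmul(x, w, name='y')  # an op whose output is a tf.Tensor named 'y:0'

# Only now, inside a session, is the graph actually executed.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
    print(result)

# Trainable variables live in a graph collection:
print(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))
```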

Possible issues you may experience and their most likely solutions

  1. session loading and making predictions with a pre-trained model. This is the bottleneck that took me a few weeks to understand, debug and fix. I would like to concentrate on this issue and describe two possible techniques for re-loading a pre-trained model (its graph and session) and using it.
    First of all, what do we really mean when talking about loading the model? To do this, we of course need to train and save it beforehand. The latter is usually done via the tf.train.Saver.save functionality and, as a result, we get three binary files with .index, .meta and .data-00000-of-00001 extensions, which contain all the data needed to restore the session and graph.
    To load a model saved this way, one needs to restore the graph via tf.train.import_meta_graph() (the argument is the file with the .meta extension). After following the steps described in the previous paragraph, all of the variables (including the so-called “hidden” ones, which will be discussed later) will be ported into the current graph. To retrieve a tensor by its name (remember that it may differ from the name you initialized it with, due to the scope where the tensor was created and the operation it is the result of), graph.get_tensor_by_name() should be executed. This is the first way; a sketch of it is given after this list.
    The second way is more explicit but harder to implement (with the architecture of the model I was working on, I never managed to execute the graph successfully when using it). Its main idea is to save the graph edges (tensors) explicitly into .npy or .npz files and later load them back into the graph (together with assigning them proper names according to the scope where they were created). This approach has two big cons: first of all, when the model architecture becomes significantly complex, it also becomes quite hard to control and keep in place all of the weight matrices. Secondly, there is a category of “hidden” tensors, which are created without being explicitly initialized. For instance, when you create a tf.nn.rnn_cell.BasicLSTMCell, it creates all the required weights and biases to implement an LSTM cell under the hood. Variable names are also assigned automatically.
    This behavior may look okay (these two tensors are weights, after all, and it seems pretty useful not to create them manually, but rather have the framework handle it), but in fact, in many cases, it is not. The main problem with such an approach is that when you look at the graph’s collections and see a bunch of variables whose origin you don’t know, you don’t actually know what you should save or where to load it. To be absolutely honest, it is very hard to put hidden variables into the right place in the graph and operate on them appropriately. Harder than it should be.
  2. creating a tensor with the same name twice without any warning (via automatically appending an _index suffix). I don’t consider this issue to be as important as the previous one, but it definitely bothers me, since it results in a lot of graph execution errors. To explain the problem better, let’s look at an example.
    For instance, you create a tensor using tf.get_variable(name=’char_embeddings’, dtype=…), then save it and load it back in a new session. You have forgotten that this variable was a trainable one and create it once more in the same fashion via the tf.get_variable() functionality. During graph execution, the error will look like: FailedPreconditionError (see above for traceback): Attempting to use uninitialized value char_embeddings_2. The reason is that, of course, you have created a new, empty variable instead of wiring the restored one into the appropriate place in the model, even though the restored one is already contained in the graph and could have been used. A sketch of this situation is shown after this list.
    As you can see, no error or warning was raised even though the developer created a tensor with the same name twice (even Windows would complain about that). Maybe this point is crucial only for me, but it is a peculiarity of tensorflow’s behavior that I don’t really enjoy.
  3. resetting the graph manually when writing unit tests, and other problems with them. Testing code written in tensorflow is always hard, for many reasons. The first and most obvious one is already mentioned in the title of this item and may sound quite silly, but for me it was at least irritating. Because there is only one default tensorflow graph for all of the tensors of all of the modules accessed at runtime, it is impossible to test the same functionality with, for instance, different parameters without resetting the graph. It is only one line of code, tf.reset_default_graph(), but knowing that it has to be written at the top of the majority of test methods, this solution turns into a kind of monkey job and, of course, an obvious case of code duplication (see the test sketch after this list). I haven’t found any way around this issue (except for using the reuse parameter of the scope, which we will discuss later), since all of the tensors are linked to the default graph and there is no way to isolate them (of course, there could be a separate tensorflow graph for each method, but from my point of view that is not the best practice).
    Another thing about unit tests for tensorflow code that bothers me a lot is that when some part of the constructed graph should not be executed (it contains uninitialized tensors because the model hasn’t been trained yet), one doesn’t really know what to test. I mean that the arguments to self.assertEqual() are not clear (should we test the names of the output tensors or their shapes? what if the shapes are None? what if a tensor name or shape is not enough to conclude that the code works properly?). In my case, I simply end up asserting tensor names, shapes, and dimensions, but I’m sure that, when the graph is not executed, checking only this part of the functionality is not a sufficient condition.
  4. confusing tensor names. Many people would say that this complaint about tensorflow is just an elaborate way of whining, but one can’t always tell what the name of the tensor resulting from some operation will be. I mean, is the name bidirectional_rnn/bw/bw/while/Exit_4:0 clear to you? To me, it is absolutely not. I get that this tensor is the result of some operation performed on the backward cell of a dynamic bidirectional RNN, but without explicitly debugging the code one can’t find out which operations were performed and in what order. The index suffixes are not understandable either, since to figure out where the number 4 came from one needs to read the tensorflow docs and dive deep into the computational graph. (One practical way to decode such names is sketched after this list.)
    The situation is the same for the “hidden” variables discussed earlier: why do we end up with names like bias and kernel there? Maybe this is a problem of my qualifications and level of skill, but such debugging situations feel quite unnatural to me.
  5. tf.AUTO_REUSE, trainable variables, re-compiling the library and other naughty stuff. The last point of this list is a brief look at the small details I had to learn by trial and error. The first is the reuse=tf.AUTO_REUSE parameter of the variable scope, which automatically handles already-created variables and doesn’t create them twice if they already exist. In fact, in many cases it can solve the issue described in the second point of this list (see the last sketch after this list). In practice, however, this parameter should be used with care and only when the developer knows that some part of the code needs to be run two or more times.
    The second detail concerns trainable variables, and the most important note here is: all variables are trainable by default. Sometimes this can be a headache, since this behavior is not always the desired one, and it is very easy to forget that they all can be trained.
    The third thing is just an optimization trick, which I recommend everybody apply: in almost every case, when you use the package installed via pip, you receive a warning like: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2. If you see this kind of message, the best idea is to uninstall tensorflow and then re-compile it via bazel with the options you’d like. The main benefit you’ll receive after doing so is increased calculation speed and better overall performance of the framework on your machine.
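
The sketches below illustrate these issues in order. They all assume the 1.x-style API used throughout the article, and every file path, variable name and helper function in them is hypothetical. The first one shows the first re-loading technique from issue 1: saving a toy model with tf.train.Saver and restoring it via tf.train.import_meta_graph() and graph.get_tensor_by_name().

```python
import tensorflow as tf

# --- "Training" script: build a toy graph and save it (hypothetical path). ---
x = tf.placeholder(tf.float32, shape=(None, 3), name='x')
w = tf.get_variable('weights', shape=(3, 1))
y = tf.matmul(x, w, name='prediction')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver().save(sess, './model.ckpt')  # writes .index, .meta, .data-*

# --- "Inference" script: restore the graph structure and the weights. ---
tf.reset_default_graph()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('./model.ckpt.meta')
    saver.restore(sess, './model.ckpt')
    graph = tf.get_default_graph()
    # Tensor names are the op name plus an output index, hence the ':0'.
    x = graph.get_tensor_by_name('x:0')
    y = graph.get_tensor_by_name('prediction:0')
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```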
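
The next sketch reproduces the situation from issue 2: after restoring the graph, an absent-minded second call to tf.get_variable() silently creates a fresh, uninitialized variable with a uniquified name instead of reusing the restored one (the exact numeric suffix depends on how many ops already share the name).

```python
import tensorflow as tf

# First run: create and save the trainable embedding matrix (hypothetical path).
emb = tf.get_variable('char_embeddings', shape=(10, 4))
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver().save(sess, './emb.ckpt')

# Second run: restore the graph, then absent-mindedly recreate the variable.
tf.reset_default_graph()
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('./emb.ckpt.meta')
    saver.restore(sess, './emb.ckpt')

    # The variable store of the new graph knows nothing about the restored
    # variable, so this silently creates a second one; its op name gets a
    # uniquified suffix because 'char_embeddings' is already taken.
    emb = tf.get_variable('char_embeddings', shape=(10, 4))
    print(emb.name)  # e.g. 'char_embeddings_1:0'; no warning is raised

    try:
        sess.run(emb)  # the new variable was never initialized
    except tf.errors.FailedPreconditionError as err:
        print('As expected:', err.message)
```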
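
For issue 3, here is a minimal unit-test sketch showing where the repetitive tf.reset_default_graph() call ends up, and the kind of name/shape assertions mentioned above; the helper function is hypothetical.

```python
import unittest
import tensorflow as tf

def build_char_embeddings(vocab_size, dim):
    """Hypothetical helper under test: adds a variable to the default graph."""
    return tf.get_variable('char_embeddings', shape=(vocab_size, dim))

class CharEmbeddingsTest(unittest.TestCase):

    def setUp(self):
        # Without this line the second test would fail, because both tests
        # try to create 'char_embeddings' in the single default graph.
        tf.reset_default_graph()

    def test_small_vocab_shape(self):
        emb = build_char_embeddings(vocab_size=10, dim=4)
        self.assertEqual(emb.shape.as_list(), [10, 4])

    def test_large_vocab_shape(self):
        emb = build_char_embeddings(vocab_size=100, dim=8)
        self.assertEqual(emb.shape.as_list(), [100, 8])

if __name__ == '__main__':
    unittest.main()
```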
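
For issue 4, one pragmatic way to decode a cryptic machine-generated name is simply to list the operations the framework created under the hood; the sketch below builds a small bidirectional RNN (sizes are arbitrary) and prints the ops in the backward scope.

```python
import tensorflow as tf

cell_fw = tf.nn.rnn_cell.BasicLSTMCell(8)
cell_bw = tf.nn.rnn_cell.BasicLSTMCell(8)
inputs = tf.placeholder(tf.float32, shape=(None, 5, 3))
outputs, states = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs, dtype=tf.float32)

# The final backward state gets a machine-generated name along the lines of
# 'bidirectional_rnn/bw/bw/while/Exit_...:0'.
print(states[1].h.name)

# Listing the ops created under the hood is the quickest way to map such a
# name back to the operation that produced it.
for op in tf.get_default_graph().get_operations():
    if 'bidirectional_rnn/bw' in op.name and 'Exit' in op.name:
        print(op.name, op.type)
```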
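
Finally, for issue 5, a short sketch of reuse=tf.AUTO_REUSE and of opting a variable out of training; scope and variable names are again purely illustrative.

```python
import tensorflow as tf

def char_embedding_lookup(ids):
    # reuse=tf.AUTO_REUSE: the first call creates 'embeddings/char_embeddings',
    # every later call returns the same variable instead of raising the usual
    # "Variable embeddings/char_embeddings already exists" error.
    with tf.variable_scope('embeddings', reuse=tf.AUTO_REUSE):
        table = tf.get_variable('char_embeddings', shape=(10, 4))
        return tf.nn.embedding_lookup(table, ids)

first = char_embedding_lookup(tf.placeholder(tf.int32, shape=(None,)))
second = char_embedding_lookup(tf.placeholder(tf.int32, shape=(None,)))  # no duplicate

# Variables are trainable by default; opt out explicitly when that is not desired.
frozen = tf.get_variable('positional_table', shape=(5, 4), trainable=False)

# Only the shared embedding table shows up among the trainable variables.
print([v.name for v in tf.trainable_variables()])
```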

Conclusion

I hope this long-read will be useful for data scientists who are developing their first tensorflow models and struggling with the non-obvious behavior of some parts of the framework, which can be hard to understand and quite complicated to debug. The main point I wanted to make is that making a lot of mistakes when working with this library is perfectly fine (as it is with anything else), and that asking questions, diving deep into the docs and debugging every goddamn line is very much okay too.
As with dancing or swimming, everything comes with practice, and I hope I was able to make this practice a bit more pleasant and interesting.
