

How to Debug a TensorFlow Training Program Without Losing Your Mind

If debugging is the process of removing software bugs, then programming must be the process of putting them in.
— Edsger Dijkstra (https://www.azquotes.com/quote/561997)

In some of my previous posts (here, here, and here), I told you a bit about how my team at Mobileye (officially known as Mobileye, an Intel Company) uses TensorFlow, Amazon SageMaker, and Amazon S3 to train our deep neural networks on large quantities of data. In this post, I want to talk about debugging in TensorFlow.

It is well known that program debugging is an integral part of software development, and that the time spent debugging often eclipses the time it takes to write the original program.

Debugging is hard, and much has been written about how to design and implement one’s program in order to increase the reproducibility of bugs and ease the process of root-cause analysis.

In machine learning, the task of debugging is complicated by the stochasticity that is inherent in machine learning algorithms, and by the fact that the algorithms are run on dedicated hardware accelerators, often on remote machines.

Debugging in TensorFlow is further complicated by the use of symbolic execution (a.k.a. graph mode), which boosts the runtime performance of the training session but, at the same time, limits the ability to freely read arbitrary tensors in the graph, a capability that is important for debugging.

In this post, I will expand on the difficulties of debugging TensorFlow training programs, and provide some suggestions for how to address those difficulties.

For legal purposes, I want to clarify that despite my carefully chosen subtitle, I provide no guarantees that anything I write here will prevent you from losing your mind. On the contrary, I think that I can all but guarantee that you probably will lose your mind when debugging your TensorFlow program, despite anything I write. But, perhaps, you will lose your mind just a little bit less.

Before we begin, let’s clarify the scope of our discussion.

In the context of this post, debugging refers to the art of identifying a bug, either in your code, or in your data, that causes your training session to abruptly break down.

A different kind of debugging, which is out of the scope of this post, refers to the task of fixing, or tuning, a model that is not converging, or that is producing unsatisfactory predictions on a certain class of inputs (e.g. a vehicle detection model that fails to identify pink cars). That procedure might involve defining and evaluating model metrics, collecting and statistically analyzing model artifacts (such as gradients, activations, and weights) using tools such as TensorBoard and Amazon SageMaker Debugger, hyperparameter tuning, rearchitecting, or modifying your data input using techniques such as augmentation and boosting. Tuning a model can be an extremely challenging, time-consuming, and often frustrating task.

Types of Bugs

Within the realm of solving bugs in one’s code or data, I like to make the distinction between two categories of bugs: bugs and monster bugs.

By bugs I refer to issues that are relatively easy to reproduce. Examples of bugs are a model whose assumption about the sizes of the input tensors doesn’t match the training data, an attempt to concatenate mismatched tensors, or a tf operation performed on an invalid data type. These usually don’t depend on a specific model state or data sample, and are typically relatively easy to reproduce. They aren’t necessarily easy to fix, but they are child’s play compared to monster bugs.

Monster bugs are bugs that occur sporadically and unpredictably. Bugs that reproduce only on a specific state of the model, a specific data sample, or a specific combination of the model state and data input, could pose a serious challenge and might constitute a monster bug.

Here is an example of a scenario, based on true events, that is certain to increase your blood pressure:

It’s Friday afternoon and your model has been training successfully for a couple of days. The loss appears to be converging nicely, and you are starting to picture a relaxing, post-release, weekend vacation, in a getaway location of your choosing. You glance back at your screen for a moment and notice that, all of a sudden, without any warning, your loss has become NaN. “Surely”, you think to yourself, “this must have been due to some totally random, momentary, macrocosmic glitch”, and you immediately resume training from your last valid model checkpoint. A few more hours pass, and it happens again, and then again. Now you start to panic, the dreamy pictures of your weekend paradise now replaced with thoughts of the agonizing effort of hunting down a monster bug.

We will come back to this sorrowful example in a short while. But first, let’s check off some mandatory “debugging” check-boxes.

Much ink has been spilled on the art of debugging and, more importantly, the art of developing debuggable code. In this section, I will mention a few techniques, as they pertain to TensorFlow applications. This list is, by no means, comprehensive.

Saving Model Checkpoints

This is probably the most important thing I will write in this post. Always configure your training session such that it periodically saves snapshots of your model.

Programming bugs are not the only reason why your training might break down… If you are running in the cloud, you might get a spot instance termination, or hit an internal server error. If you are running locally, there might be a power outage, or your GPU might explode. If you have been training for days without storing intermediate checkpoints, the damage could be extreme. If you saved a checkpoint every hour, then all you lost is, at most, an hour. TensorFlow offers utilities for storing checkpoints, such as the Keras ModelCheckpoint callback. All you need to do is decide how frequently to capture such snapshots, weighing the overhead of storing checkpoints against the cost of an unplanned breakdown in the training session.
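
As a minimal sketch, configuring the Keras ModelCheckpoint callback might look like the snippet below. The output path and the save frequency are placeholders you would adapt to your own setup.

import tensorflow as tf

# save a snapshot of the weights every 1000 training batches;
# the path and frequency here are illustrative only
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='/tmp/checkpoints/ckpt-{epoch:02d}',
    save_weights_only=True,
    save_freq=1000)

# model.fit(train_ds, epochs=epochs, callbacks=[checkpoint_cb])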

Contact Tracing

I apologize to my Covid-19 contemporaries for my choice of title for this subsection; I just couldn’t resist. By contact tracing, I am referring to the ability to keep track of the training data that is fed into the training pipeline.

Suppose your training data is divided across 100,000 TFRecord files, and that one of these files has a formatting error that crashes or stalls your program. One way to narrow down your search for the problematic file is to record each file that is entered into the pipeline. Once you hit the crash, you can look back at your log to see which files were the most recent to be entered. As I have mentioned in previous posts, we train using the Amazon SageMaker pipe mode feature. A fairly recent addition to pipe mode is a server-side log that records the files that are fed into the pipe.
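
If you are not using pipe mode, but rather feeding TFRecord files directly into a tf.data pipeline, one way to keep such a record yourself is to log every file name just before it is opened. Here is a rough sketch; the glob pattern and the parse_record parsing step are placeholders for your own code.

import tensorflow as tf

def log_and_read(file_path):
    # tf.print executes inside the data pipeline graph,
    # so each file name is logged as the file is opened
    tf.print('opening file:', file_path)
    return tf.data.TFRecordDataset(file_path)

files = tf.data.Dataset.list_files('/path/to/data/*.tfrecord', shuffle=True)
dataset = files.interleave(log_and_read, cycle_length=1)
# dataset = dataset.map(parse_record)  # your own record-parsing function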

Recording the data that enters the pipeline can also help you reproduce bugs, which brings us to our next point.

(Ir)Reproducibility

The ease with which a bug can be reproduced directly impacts how easily it can be solved. We always want to write our code so as to ensure reproducibility. This is not easy in TensorFlow programs. Machine learning applications often rely heavily on random variables. We randomly initialize model weights, we randomly augment data, we randomly shard our data for distributed training, we randomly apply dropout, we shuffle our input data before each epoch, and then shuffle it again (using tf.data.Dataset.shuffle) before creating batches. We could seed all of the pseudo-random operations with seeds that we record, but keep in mind that there could be many different places that introduce randomization, and keeping track of all of them could easily become a bookkeeping nightmare. I can’t tell you how many times I thought I had removed all elements of randomization, only to find that I had missed one. Additionally, there are some random processes that cannot be seeded. If you use multiple processes to import your training data, you might not have any control over the order in which the data records are actually fed (e.g. if experimental_deterministic is set to False in tf.data.Options()). Of course, you could record each sample as it enters the pipe, but that would come at a steep, and likely prohibitive, overhead.
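
If you do decide to chase full reproducibility, seeding the usual suspects might look something like the sketch below. Just remember that, as noted above, this list is rarely exhaustive.

import random
import numpy as np
import tensorflow as tf

SEED = 42  # record this value along with your other training artifacts

random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG (e.g. used by some augmentations)
tf.random.set_seed(SEED)  # TensorFlow global seed (weight init, dropout, etc.)

# seed the dataset shuffle and request deterministic ordering
options = tf.data.Options()
options.experimental_deterministic = True
# dataset = dataset.shuffle(buffer_size=10000, seed=SEED).with_options(options)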

The bottom line is that while it is definitely possible to build reproducible training programs, I think it’s wiser to embrace the non-determinism, accept the irreproducible nature of training, and find ways to overcome this debugging limitation.

Modular Programming

A key technique in creating debuggable programs is to build your application in a modular fashion. Applied to a TensorFlow training loop, this implies the ability to test different subsets of the training pipeline separately, such as the dataset, the loss function, different model layers, and callbacks. This is not always easy to do, as some of the training modules (such as the loss function) are quite dependent on the other modules. But there is a lot of room for creativity. For example, one can test different functions of the input pipeline by simply iterating over the dataset while applying a subset of the dataset operations. One can test a loss function, or a callback, by creating an application that runs just the loss function or callback. One can neutralize the loss function by replacing it with a dummy loss function. I like to build my models with multiple points of output, i.e. with the ability to easily modify the number of layers in the model so as to test the impact of different layers.
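
For example, a standalone sanity check of the input pipeline, and a dummy loss that neutralizes the real one, might look something like the sketch below. It assumes the dataset yields (features, labels) tuples of tensors; train_ds, model, and optimizer are placeholders for your own objects.

import tensorflow as tf

# 1. Exercise the input pipeline on its own, without a model,
#    by simply iterating over the dataset and inspecting shapes
for step, (x, y) in enumerate(train_ds.take(100)):
    print(step, x.shape, y.shape)

# 2. Neutralize the loss function with a dummy that always returns zero
#    (but keeps a well-defined, zero gradient so training still runs)
def dummy_loss(y_true, y_pred):
    return 0.0 * tf.reduce_mean(y_pred)

# model.compile(loss=dummy_loss, optimizer=optimizer)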

The more thought you put into the modularity and debuggability of your program when you are building it, the less you will suffer later on.

Eager Execution

If you are a regular TensorFlow user, you have probably encountered terms such as “eager execution mode”, “graph mode”, and the “tf.function decorator”. You may have heard some (somewhat misleading) statements such as “debugging in eager execution mode is a piece of cake”, or “TensorFlow 2 runs in eager execution mode”. You may, like me, have ardently dived into the TensorFlow source code trying to make sense of the different execution modes, only to break down in sobs, your self-esteem shattered for life. For a full understanding of how it all works, I refer you to the TensorFlow documentation, and wish you luck. Here we will mention just the gist of it as it pertains to debugging. The most performant way to run TensorFlow training is in graph mode. Graph mode is a symbolic execution mode, which means that we don’t have arbitrary access to the graph tensors. Functions that are wrapped with the tf.function decorator are run in graph mode. When you train with tf.keras.Model.fit(), the training step is, by default, executed in graph mode. Of course, the inability to access arbitrary graph tensors makes debugging in graph mode difficult. In eager execution mode you can access arbitrary tensors, and even debug with a debugger (provided that you place your breakpoint in the appropriate place in the model.call() function). Of course, when you run in eager execution mode, your training will run much slower. To program your model to train in eager execution mode, you need to call the model.compile() function with the run_eagerly flag set to True.
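
In practice, switching to eager execution boils down to a compile flag and, if you want every tf.function in the program to run eagerly as well, a global setting. A minimal sketch (model, loss, and optimizer are placeholders):

import tensorflow as tf

# run the Keras training step eagerly, so that you can place breakpoints
# inside model.call() and inspect intermediate tensors
# model.compile(loss=loss, optimizer=optimizer, run_eagerly=True)

# alternatively, force all tf.function-decorated code to run eagerly
# (in older TF 2.x releases this was tf.config.experimental_run_functions_eagerly)
tf.config.run_functions_eagerly(True)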

The bottom line is, when you are training, run in graph mode; when you are debugging, run in eager execution mode. Unfortunately, it is not uncommon for certain bugs to reproduce only in graph mode and not in eager execution mode, which is a real bummer. Also, eager execution is helpful when you are debugging in a local environment, less so in the cloud. It is often not very useful in debugging monster bugs, unless you first find a way to reproduce the bug in your local environment (more on this below).

TensorFlow Logging and Debugging Utilities

Try to make the most of the TensorFlow logger. When you are debugging an issue, set the logger to the most informative level.

The tf.debugging module offers a bunch of assertion utilities as well as numeric checking functions. In particular, the tf.debugging.enable_check_numerics utility can be helpful in pinpointing problematic functions.

The tf.print function, which enables printing out arbitrary graph tensors, is an additional utility that I have found extremely useful for debugging.

And, last but not least, add your own print logs (in the non-graph portions of the code) to get a better feel for where your program breaks down.
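
A sketch of how these utilities might be wired up at the top of a debug run is shown below; the scale function is just a hypothetical stand-in for your own graph code.

import tensorflow as tf

# raise the verbosity of the TensorFlow logger while debugging
tf.get_logger().setLevel('DEBUG')

# raise an error as soon as any op produces a NaN or Inf
# (note: this adds noticeable runtime overhead)
tf.debugging.enable_check_numerics()

# tf.print works inside graph mode, where Python's print does not
@tf.function
def scale(x):
    tf.print('input min/max:', tf.reduce_min(x), tf.reduce_max(x))
    return x / 255.0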

Deciphering TensorFlow Error Messages

Sometimes, you will be lucky enough to get a TensorFlow error message. Unfortunately, it is not always immediately clear how to make use of it. I often get emails from colleagues with cryptic TensorFlow messages, begging for help. When I see messages such as:

tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [5,229376] vs. shape[2] = [3,1]

or

node DatasetToGraphV2 (defined at main.py:152) (1) Failed precondition: Failed to serialize the input pipeline graph: Conversion to GraphDef is not supported.

or

ValueError: slice index -1 of dimension 0 out of bounds. for 'loss/strided_slice' (op: 'StridedSlice') with input shapes: [0], [1], [1], [1] and with computed input tensors: input[1] = <-1>, input[2] = <0>, input[3] = <1>.

I ask myself (slightly modified to make the post child friendly) “What the bleepitybeep am I supposed to do with that?” or “Why couldn’t the friendly-loving TensorFlow engineers give me something more to work with?”. But I quickly calm myself (sometimes with the help of an alcoholic beverage), and say, “Chaim, stop being so spoiled. Get back to work and be thankful that you got any message at all.” The first thing you should do is try to reproduce the bug in eager execution mode, and/or with a debugger. Unfortunately, as mentioned above, this doesn’t always help.

There is no arguing the fact that messages such as the ones above are not very helpful. But don’t despair. Sometimes, with the help of some investigative work, you will find clues that might lead you in the right direction. Go through the call stack to see if it provides any hints. If the message includes shape sizes, try to match these up against tensors in your graph that might be of the same shape. And, of course, search online to see if others have encountered similar issues and in what scenarios. Don’t despair.

Run in a Local Environment

Naturally, debugging in your local environment is easier than debugging on a remote machine, or in the cloud. This is particularly true when you first create your model. Your goal should be to work through as many issues as possible in your local environment before starting to train remotely. Otherwise, you are likely to end up wasting a lot of time and money.

To increase reproducibility, you should try to make your local environment as similar as possible to the remote environment. If you are using a docker image or virtual environment in your remote environment, try to use the same one locally. (If your remote training is on Amazon SageMaker, you can pull the docker image used.)

Of course, there may be some elements of the remote training environment that cannot be reproduced locally. For example, you might have encountered a bug that only reproduces when using Amazon SageMaker pipe mode, which is currently only supported when running in the cloud. (In this case you might consider alternative methods for accessing your data from S3.)

I wish I could tell you that the techniques described here will solve all your problems. But alas, such is not the case. In the next section we will return to the monster bug scenario we illustrated above, and introduce one last debugging technique.

In the scenario we described above, after days of training, a combination of the particular state of the model and a particular training batch sample, suddenly caused the loss to become NaN.

Let’s evaluate how we can use the debugging techniques above to debug this issue.

  • If we had kept meticulous track of the seeds that were used for all the random operations, and there were no uncontrolled non-deterministic events, we could, in theory, reproduce the bug by training from scratch… but that would take days.
  • Reproducing it in a local environment or in eager execution mode would likely take weeks.
  • We could resume from a recent checkpoint, but we would only be able to reproduce the same model state and batch sample if we could resume from the exact same sample and with the exact same state of all the pseudo-random generators.
  • Adding tf.print calls would help, but would introduce tremendous overhead.
  • Adding tf.debugging.enable_check_numerics would be very helpful in pinpointing the function in which the failure occurs. This might be sufficient if there is an obvious bug in the function, but it does not enable us to reproduce the bug.

Ideally, we would be able to capture the input and model state right before the loss goes bananas. Then we could reproduce the issue in a controlled (local) environment, in eager execution mode and with a debugger.

The problem is that we don’t know the problem is about to happen until it actually happens. By the time the loss is reported as NaN, the model has already been updated with NaN weights, and the batch sample that caused the error has already been iterated over.

The solution I’d like to propose is to customize the training loop such that we record the current sample at every step, and only update the model weights if the gradients are valid. If the gradients are invalid, we halt training and dump out the last batch sample along with the current model snapshot. These can then be carried over to your local environment, where you load the model and feed in the captured data sample in eager execution mode in order to reproduce (and solve) the bug.

We will get to the code in a moment, but first, a few words about the pros and cons of using custom training loops.

Custom Training Loop vs High Level API

There is an age-old dispute among TensorFlow users as to whether to write custom training loops or rely on high level APIs such as tf.keras.Model.fit().

Proponents of the custom training loop herald the ability to have line-by-line control over how the training is performed, and the freedom to be creative. Supporters of the high level API point to the many conveniences it offers, most notably the built-in callback utilities and distributed strategy support. Using the high level API is also presumed to ensure that you are using a bug-free and highly optimized implementation of the training loop.

Starting from version 2.2, TensorFlow introduced the ability to override the train_step and make_train_function routines of the tf.keras.Model class. This enables one to introduce some level of customization while continuing to enjoy the conveniences of model.fit(). We will demonstrate how to override these functions in a way that enables us to capture a problematic sample input and model state for local debugging.

The Custom Capture Loop

In the code block below, we extend the tf.keras.models.Model object with customized implementations of the train_step and make_train_function routines. To get a full understanding of the implementation, I recommend that you compare it to the default implementations of these routines on GitHub. You’ll notice that I have removed all of the logic relating to metrics calculation and to strategy support in order to make the code more readable. The main changes to note are:

  • Before applying the gradients to the model weights, we test the gradients for NaNs. The gradients are applied to the weights only if no NaNs appear. Otherwise, a signal is sent to the training loop that an error was encountered; for example, the signal can be setting the loss to a predetermined value such as zero or NaN.
  • The train loop stores the data features and labels (x and y) at each step. Note that in order to do that, we have moved the dataset traversal (next(iterator) call) outside of the @tf.function scope.
  • The class has a boolean “crash” flag to signal to the main function whether an error was encountered.
import pickle

import tensorflow as tf


class CustomKerasModel(tf.keras.models.Model):
    def __init__(self, **kwargs):
        super(CustomKerasModel, self).__init__(**kwargs)

        # boolean flag that will signal to the main function that
        # an error was encountered
        self.crash = False

    @tf.function
    def train_step(self, data):
        x, y = data

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(
                y, y_pred, regularization_losses=self.losses)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # concatenate the gradients into a single tensor for testing
        concat_grads = tf.concat(
            [tf.reshape(g, [-1]) for g in gradients], 0)

        # In this example, we test for NaNs,
        # but we can include other tests
        if tf.reduce_any(tf.math.is_nan(concat_grads)):
            # if any of the gradients are NaN, send a signal to the
            # outer loop and halt the training. We choose to signal
            # to the outer loop by setting the loss to 0.
            return {'loss': 0.}
        else:
            # Update weights
            self.optimizer.apply_gradients(
                zip(gradients, trainable_vars))
            return {'loss': loss}

    def make_train_function(self):
        if self.train_function is not None:
            return self.train_function

        def train_function(iterator):
            data = next(iterator)
            # record the current sample
            self.x, self.y = data
            res = self.train_step(data)
            if res['loss'] == 0.:
                self.crash = True
                raise Exception()
            return res

        self.train_function = train_function
        return self.train_function


if __name__ == '__main__':

    # train_ds =
    # inputs =
    # outputs =
    # loss =
    # epochs =
    # steps_per_epoch =
    model = CustomKerasModel(inputs=inputs, outputs=outputs)
    optimizer = tf.keras.optimizers.Adadelta(1.0)

    model.compile(loss=loss, optimizer=optimizer)

    try:
        model.fit(train_ds, epochs=epochs,
                  steps_per_epoch=steps_per_epoch)
    except Exception as e:
        # check for the signal from the custom train function
        if model.crash:
            model.save_weights('model_weights.ckpt')
            # pickle dump model.x and model.y
            features_dict = {}
            for n, v in model.x.items():
                features_dict[n] = v.numpy()
            with open('features.pkl', 'wb') as f:
                pickle.dump(features_dict, f)
            labels_dict = {}
            for n, v in model.y.items():
                labels_dict[n] = v.numpy()
            with open('labels.pkl', 'wb') as f:
                pickle.dump(labels_dict, f)
        raise e

It is important to note that there is a small training runtime cost to this technique, which comes from reading the data from the dataset in eager execution mode rather than graph mode. (There are no free lunches.) The precise cost will depend on the size of the model; the larger the model, the less this change will be felt. You should evaluate the overhead of this technique on your own model, and then decide whether, and how, to employ it.
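
Once the training job has dumped the weights and the offending batch, a local reproduction might look something like the sketch below. It assumes the same model-building code and file names used above, that inputs, outputs, loss, and optimizer are defined as in your training script, and that the model's features and labels are dictionaries of named tensors, as in the capture loop.

import pickle
import tensorflow as tf

# rebuild the model exactly as in the training script,
# and load the weights captured at the time of the failure
model = CustomKerasModel(inputs=inputs, outputs=outputs)
model.compile(loss=loss, optimizer=optimizer, run_eagerly=True)
model.load_weights('model_weights.ckpt')

# load the captured batch
with open('features.pkl', 'rb') as f:
    x = pickle.load(f)
with open('labels.pkl', 'rb') as f:
    y = pickle.load(f)

# step through the forward and backward passes in eager execution mode,
# with a debugger attached and breakpoints wherever you need them
with tf.GradientTape() as tape:
    y_pred = model(x, training=True)
    loss_value = model.compiled_loss(y, y_pred)
gradients = tape.gradient(loss_value, model.trainable_variables)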

So long as we humans are involved in the development of AI applications, the prevalence of programming bugs is just about guaranteed. Designing your code with debuggability in mind, and acquiring tools and techniques for solving bugs, may prevent some serious torture down the line.

Most importantly, don’t despair.


