Fight Against Silent Bugs in Deep Learning Libraries

TL;DR: How to find out if your favorite deep learning library is occasionally giving you wrong results? Such bugs happen from time to time, and are extremely difficult to notice, report, and debug.

Three years ago, I wrote an article, "Unawareness of Deep Learning Mistakes": buggy code can still train and appear to work, so it's difficult for users to realize that their code is wrong.

What's even harder to discover is bad code inside the deep learning library we use. What if the library computes wrong results for certain parts of our model during training? Training will probably still work to some extent thanks to the magic of SGD, so how could we ever find such bugs? I'll share some experience and lessons.

Silent Bugs in Deep Learning Libraries

"Bugs" in this article specifically refer to silent bugs that lead to wrong computation results, but no errors.

Such bugs exist in deep learning libraries and will continue to exist, because these libraries are young, and new features such as operators and training paradigms will continue to emerge in them as research develops.

Such bugs in deep learning are very hard to notice. A model typically contains billions of floating point operations (FLOPs) grouped into hundreds of operators. Even with small bugs, it may still train, converge, and appear to work well. Maybe it works slightly worse, or it fails occasionally, but it's extremely difficult for a user to associate a suspicious result with a concrete bug in a library. After all, there are many other explanations of a bad result that need to be ruled out: the model simply does not work, the model is implemented incorrectly, the hyperparameters are bad, the user's training code has a bug, etc.

The situation gets worse when the buggy part of the computation is not even explicitly written by users, but implicitly generated. Auto-generated computation, such as automatic differentiation and graph optimization, is often not well exposed to users at all, making it more difficult to observe the bug. For example, pytorch/5801 is a bug in gradient computation that was found during the development of ELF OpenGO at FAIR. Models can still work to some extent with the bug, which hid it for a long time. It unfortunately wasted many months in the project.
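
One generic way to make an auto-generated gradient observable is to compare it against a finite-difference estimate. The sketch below is a minimal illustration in plain Python; `loss` and `analytic_grad` are hypothetical stand-ins chosen for this example, not code from any library or from the issue above.

```python
import math


def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for each component of x."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def loss(x):
    # Hypothetical scalar function standing in for a model's loss.
    return sum(math.tanh(v) ** 2 for v in x)


def analytic_grad(x):
    # Stand-in for the gradient a library would generate automatically:
    # d/dv tanh(v)^2 = 2 * tanh(v) * (1 - tanh(v)^2)
    return [2 * math.tanh(v) * (1 - math.tanh(v) ** 2) for v in x]


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


x = [0.3, -1.2, 2.0]
err = max_abs_diff(analytic_grad(x), numerical_grad(loss, x))
print(f"max abs difference: {err:.2e}")
```

A check like this only covers the inputs it is run on, so it complements rather than replaces the end-to-end comparisons described later.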

Compared to users' training code, which may also have many silent bugs, deep learning libraries have some advantage in testability. They provide well-defined small building blocks (e.g. operators and their gradients), so they are more testable than an end-to-end training. But I've seen a few limitations of unit tests in the context of deep learning:

  1. A test only covers a tiny input space, but other inputs may cause bugs.

    As an example, pytorch/36485 computes softmax incorrectly only if the number of classes C satisfies (C > 1024) && (C % 4 != 0), which is rare in real applications. It was found during the development of MoCo, which uses 65537 classes. A regression in the model's accuracy was noticed first, and the root cause was later found by bisection.

  2. Behaviors under combinations of context are hard to test exhaustively.

    Deep learning libraries usually separate the definition of computation from its execution. As a result, a computation may run under different combinations of context: graph/eager mode (TensorFlow), eager/tracing/scripting mode (PyTorch), fusion with other computations, the device to run on, the level of parallelism to use, the underlying compute library and algorithm to choose from, etc. Unit tests are often insufficient to cover such a huge space.

    This issue gets worse in higher-level interfaces (e.g. Keras). TensorFlow is well-known for its many high-level ways to do the same thing: users can write a model under graph or eager mode, using either object-oriented or functional style, with either raw TF APIs or the Keras/Estimator interfaces, and Keras has many more modes within itself. Handling these combinations gets more challenging, because a high-level component has much richer semantics (and therefore more side effects), which are often not strictly defined and are harder to test than a simple operator doing pure computation.

    For example, tensorflow/25175 and tensorflow/40638 are two silent bugs in Keras causing models to not train properly. Both are due to unconventional combinations of the ways TensorFlow and Keras interact with each other.

  3. Concurrency bugs that happen nondeterministically.

    Deep learning software and hardware stacks by design have a high degree of parallelism, which provides room for concurrency bugs. Concurrency bugs such as race conditions may happen only on certain models or hardware, or may not be reproducible at all. They are difficult to notice, report, and debug.

    As an example, pytorch/18465 is a use-after-free concurrency bug I found. The only symptom I observed was that some tensor values in my model were unexpectedly modified. Drawing any conclusions beyond that was challenging, because any simplification I applied to the model could cause the bug to disappear. Many hours went into tracking it down and reproducing it with minimal examples. And there is little chance that a unit test can guard against such bugs.
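
To illustrate the first limitation, a test can sweep the input size across likely code-path boundaries instead of checking one convenient shape. The sketch below is a hedged example in plain Python: `softmax_under_test` is a hypothetical stand-in for an optimized kernel (it is not the PyTorch implementation), and the choice of sizes is an assumption about where vectorized kernels tend to switch paths.

```python
import math
import random


def softmax_reference(logits):
    """Slow, simple reference: max-subtraction plus accurate summation."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = math.fsum(exps)
    return [e / total for e in exps]


def softmax_under_test(logits):
    # Hypothetical stand-in for the kernel being checked. It sums in chunks
    # of 4 and handles leftover elements on a separate path, mimicking the
    # structure of a vectorized implementation.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    n4 = len(exps) - len(exps) % 4
    total = 0.0
    for i in range(0, n4, 4):
        total += exps[i] + exps[i + 1] + exps[i + 2] + exps[i + 3]
    for i in range(n4, len(exps)):
        total += exps[i]
    return [e / total for e in exps]


def sizes_to_sweep():
    # Sizes just below and above several thresholds, so that both multiples
    # and non-multiples of 4 are covered, plus one large odd size.
    sizes = set()
    for p in (1, 4, 64, 1024, 4096):
        sizes.update(s for s in range(p - 3, p + 6) if s >= 1)
    sizes.add(65537)
    return sorted(sizes)


def matches_reference(size, rng, tol=1e-9):
    logits = [rng.uniform(-5, 5) for _ in range(size)]
    ref = softmax_reference(logits)
    out = softmax_under_test(logits)
    return max(abs(a - b) for a, b in zip(ref, out)) <= tol


rng = random.Random(0)
failures = [s for s in sizes_to_sweep() if not matches_reference(s, rng)]
print("failing sizes:", failures)
```

Such a sweep still covers only one dimension of the input space; it does nothing for the context-combination and concurrency problems in items 2 and 3.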

Two Debugging Stories

I'll share stories of two more silent bugs that I found. They led to wrong gradients in TensorFlow and PyTorch, respectively. And both bugs stayed in the codebase for a year, presumably because users can hardly blame bad training on wrong gradients, rather than their own models.

Wrong gradients in PyTorch's nn.SyncBatchNorm

  1. Notice the bug

    I started to try out PyTorch's nn.SyncBatchNorm in the summer of 2019 because the MoCo project needed this layer. To gain some trust in this layer, the first thing I did was to try it on some baselines I'm familiar with: a Mask R-CNN in detectron2.

    Luckily, this was before TensorFlow introduced the next bug I would find later. So when I compared it with my TensorFlow implementation of Mask R-CNN that also supports SyncBatchNorm, I could see that most results in detectron2 were a few AP worse.

    I knew every detail of the two implementations, and their gap is very small when not using SyncBatchNorm. So I was relatively confident that such a large gap was caused by a library bug in PyTorch.

  2. Confirm the bug

    Next, we decided to just reimplement a correct SyncBatchNorm. It turned out to be quite easy, and this was later released in detectron2. Comparing the results of the two implementations further confirmed the bug is in nn.SyncBatchNorm.

  3. Narrow down the bug

    From experiments on various models, I noticed that suboptimal results only appeared if SyncBN was added to Mask R-CNN's mask head. Adding it to all other components was OK. Therefore I hypothesized that computation results were wrong when the batch size differs across workers, since that is where the mask head differs from the other components. After we shared our findings with the code owner, the root cause in gradient computation was fixed.
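
To show why a different batch size on each worker is a distinct case worth testing, the sketch below compares two ways of combining per-worker statistics. It is an illustration in plain Python only: it is not the actual root cause of the PyTorch bug (which was in gradient computation), and all function names are hypothetical.

```python
def global_stats(worker_batches):
    """Mean and variance over all samples, built from per-worker sums.

    Counts, sums, and sums of squares are the quantities an all-reduce
    across workers would provide.
    """
    count = sum(len(b) for b in worker_batches)
    total = sum(sum(b) for b in worker_batches)
    total_sq = sum(sum(v * v for v in b) for b in worker_batches)
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


def naive_mean(worker_batches):
    """Unweighted average of per-worker means; only right for equal sizes."""
    means = [sum(b) / len(b) for b in worker_batches]
    return sum(means) / len(means)


def reference_stats(worker_batches):
    """Single-process reference over the concatenated batch."""
    flat = [v for b in worker_batches for v in b]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    return mean, var


even = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
uneven = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0]]

for name, batches in (("even", even), ("uneven", uneven)):
    print(name, "reference:", reference_stats(batches),
          "global:", global_stats(batches),
          "naive mean:", naive_mean(batches))
```

With equal batch sizes the two approaches agree, so a test that only uses equal sizes cannot tell them apart. A test configuration with uneven sizes exercises the weighting.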

Wrong results in TensorFlow's nccl_ops.all_sum

NCCL is widely used to reduce gradients among GPUs. However, it turns out that TensorFlow can do it wrong sometimes. This bug may affect all NCCL-based multi-GPU data-parallel training. Interestingly, it also affects SyncBatchNorm in TensorFlow if using NCCL.

  1. Notice the bug

    In the summer of 2020 I started to use TF v1.15 - a little late to it since it's full of TF1 deprecation messages that I don't like. I planned to just do some basic benchmarks of my code, but a few Mask R-CNN training runs blew up with NaNs after 10~20 minutes of training. This had not happened before.

  2. Confirm the bug

    My first thought was that I broke my Mask R-CNN implementation at some commit. But after trying a few combinations of code versions, it became clear that TensorFlow was to blame, because the same code could train in TF v1.14, even when I made sure they used identical versions of CUDA/cuDNN.

  3. Narrow down the bug

    I knew that no one on the TF team would use my complicated training code to debug, so I had to narrow it down. But this was not easy, because wrong results in any step of the whole training system can lead to NaNs, and there was nowhere to start looking. Moreover, the bug did not happen deterministically, and when I tried to simplify my code, it started to happen less frequently.

    Luckily, there is still a painful but practical way to go: bisection. So I:

    1. Made up a criterion: a "good" TF version must survive 30 minutes of training 3 times.
    2. Figured out where to download historical nightly TF releases, because compiling TF is slow. They are stored in public GCS buckets like
      gs://tensorflow-nightly/prod/tensorflow/release/ubuntu_16/gpu_py37_full/nightly_release/
    3. Performed a bisection between TF v1.14 and v1.15, first with the nightlies, then with my own builds, until I found the offending commit.

    Unfortunately, the offending commit seems correct to me. This means the commit which increases parallelism in NCCL probably triggers a bug that dates back even earlier.

  4. Further narrow down the bug

    After playing with the offending commit a bit, given the non-deterministic behavior of the bug, and the content of the commit, my hypothesis was that the way TensorFlow uses NCCL contains concurrency bugs.

    My original code only used NCCL's all_sum to all-reduce gradients. To add a simple check of its results, I used tf.add_n to all-reduce the gradients again, and added tf.debugging.Assert to ensure that the two results match. Luckily, the results don't always match -- a large discrepancy appears once in a while between the results of tf.add_n and nccl_ops.all_sum.

    This is where the heavy lifting ended: I had turned the silent training bug into an obvious error. The bug is no longer about a failed training which "I think" should succeed, but is now about something that's obviously wrong in TensorFlow: we added tensors in two different ways and the results don't match! No one is obligated to trust the correctness of my training code, but everyone has to admit that nccl_ops.all_sum and tf.add_n must not produce different results.

    The rest was easy: I simplified my training code to better understand the bug, removed all dependencies, and eventually made a small, self-contained, reproducible script and reported the bug. Beyond that, it is no longer my responsibility.
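
The cross-check in step 4 can be sketched in a framework-agnostic way: compute the same reduction through a slow, obvious path and fail loudly on any disagreement. The code below is a minimal illustration in plain Python, not the TensorFlow code used in the story; `fast_all_reduce` and `corrupted_all_reduce` are hypothetical stand-ins, and the tolerances are arbitrary assumptions.

```python
import math


def reference_sum(per_device):
    """Element-wise sum across devices, computed the slow, obvious way."""
    return [math.fsum(vals) for vals in zip(*per_device)]


def checked_all_reduce(all_reduce_fn, per_device, rel_tol=1e-6, abs_tol=1e-9):
    """Run the fast path, then raise if it disagrees with the reference."""
    result = all_reduce_fn(per_device)
    expected = reference_sum(per_device)
    for i, (got, want) in enumerate(zip(result, expected)):
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            raise AssertionError(
                f"all-reduce mismatch at index {i}: got {got}, expected {want}")
    return result


def fast_all_reduce(per_device):
    # Hypothetical stand-in for the optimized collective under suspicion.
    return [sum(vals) for vals in zip(*per_device)]


def corrupted_all_reduce(per_device):
    # Simulates a silent bug: one element comes back wrong, with no error.
    out = fast_all_reduce(per_device)
    out[1] += 1000.0
    return out


# One list of gradient values per device (made-up numbers).
grads = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0], [-0.5, 0.5, 0.0]]

print(checked_all_reduce(fast_all_reduce, grads))
try:
    checked_all_reduce(corrupted_all_reduce, grads)
except AssertionError as e:
    print("caught:", e)
```

The value of such a check is that a failure no longer depends on anyone trusting the surrounding training code: two computations that must agree do not.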

Keys to Fight Silent Bugs

  1. Reproducing known results is the only way to discover silent bugs in model training. This is how we have an "expected output", so that we can notice if anything unexpected is happening.

  2. Narrowing down is necessary, at least in the open source environment. Unless a small enough piece of code clearly demonstrates a bug in the library, it's not the library owners' responsibility to understand and debug user code. After all, a bug often lives in user code rather than the library. The general guidelines about how to ask good questions and write good bug reports apply to deep learning as well.

  3. Bisection is slow, costly, but effective. When there are no obvious clues, and its cost is affordable, do a bisection. If anything can be better than bisection, that would be a trisection or k-section to reduce its latency, because verifying whether a commit works or not may require training a model for quite a while.

    Bisection is not always applicable. If there isn't a known-good historical version to use as a reference, other more creative debugging methods will be needed.

  4. Know the library well, and understand its internals so we can make reasonable hypotheses and investigate them. It's often helpful to dig into library code: a few lines of debugging code at the right place can provide valuable information that cannot be easily obtained in user code.
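
Bisection over a nondeterministic bug needs a criterion like the one used in the TensorFlow story, where a version counts as good only if several trials all survive. The sketch below is a minimal illustration in plain Python; `simulated_trial`, `FIRST_BAD`, and the integer version list are hypothetical stand-ins for real builds and real training runs.

```python
from collections import Counter


def is_good(version, run_trial, trials=3):
    """A version counts as good only if every trial survives."""
    return all(run_trial(version) for _ in range(trials))


def bisect_first_bad(versions, run_trial, trials=3):
    """Return the index of the first bad version.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every version after the first bad one is also bad.
    """
    lo, hi = 0, len(versions) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid], run_trial, trials):
            lo = mid
        else:
            hi = mid
    return hi


FIRST_BAD = 37  # hypothetical ground truth, unknown in a real hunt
calls = Counter()


def simulated_trial(version):
    # Stand-in for "train for a while and see whether it blows up".
    # A bad version fails only on every second attempt, so a single
    # trial can miss it.
    calls[version] += 1
    if version >= FIRST_BAD and calls[version] % 2 == 0:
        return False
    return True


versions = list(range(100))
found = versions[bisect_first_bad(versions, simulated_trial)]
print("first bad version:", found, "after", sum(calls.values()), "trials")
```

In this simulation, running with `trials=1` makes every version look good and the search wrongly blames the last version, which is why the number of trials has to be chosen against how often the bug actually fires.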

Takeaways

  1. As a downstream library owner, do regression tests. Retrain the most important models once in a while, just in case any regression appears in the stack. For example:

    • I re-train a few important models in tensorpack examples whenever I upgrade the TensorFlow version I work with.
    • A few most representative baselines in detectron2 model zoo are trained once a month using PyTorch master. Some smaller ones are trained once a week.
  2. As a researcher, be skeptical. Use more precaution to prevent silent bugs, otherwise we'll draw wrong conclusions from wrong experiments. Some strategies include:

    • During a research project, be careful about upgrading a dependency. After an upgrade, re-train some models to verify correctness.
    • For important research results, reimplement/reproduce them in different frameworks independently by different people to avoid hidden bugs in the stack (including hidden bugs that improve results). For example, both GroupNorm and MoCo were reproduced and released in >1 frameworks. MoCo was even implemented 4 times, by 3 different authors in the team.
  3. As an average user, follow what the experts are using. Silent bugs exist but are hard to find. Without enough confidence in our own ability to always discover such bugs, follow the experts.

    A library without years of battle testing may have many sharp edges or hidden bugs. Using a mature library like PyTorch or TensorFlow, the bug you may run into is more likely to have been discovered by others already. This applies not only to libraries as a whole, but also to different features of a library, modules within a library, extensions of a library, etc.

    This is not saying we should use the most popular thing. On the contrary, high-level frameworks that build over-simplified APIs to gain popularity among non-experts are something we'd rather avoid: they may have serious bugs buried underneath simply because the intended user group is not capable of noticing them.

  4. To make your code/library popular, reproduce known results to increase credibility. "Following the experts" tends to create monopoly. To break that, deep learning training libraries can earn trust by reproducing known results, rather than just providing examples of arbitrary toy models. This is a core principle in tensorpack that I have followed since the beginning, and is probably the most effective way to convince a user that your library/implementation does not have hidden silent bugs.
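
The regression tests in the first takeaway can be reduced to a small gate that compares retrained metrics with recorded baselines. The sketch below is a hedged illustration in plain Python: the function name, metric names, numbers, and tolerance are all made up for the example and do not come from tensorpack or detectron2.

```python
def check_regression(results, baselines, tolerance):
    """Return metrics that are missing or deviate from baseline by more
    than the tolerance, in either direction."""
    flagged = {}
    for name, expected in baselines.items():
        got = results.get(name)
        if got is None or abs(got - expected) > tolerance:
            flagged[name] = (got, expected)
    return flagged


# Made-up numbers, for illustration only.
baselines = {"box_AP": 38.0, "mask_AP": 35.0}
after_upgrade = {"box_AP": 37.9, "mask_AP": 33.1}

print(check_regression(after_upgrade, baselines, tolerance=0.3))
```

Deviations are flagged in both directions because a hidden bug can also improve a result. The tolerance has to be set from the run-to-run noise of each model.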
