Unawareness of Deep Learning Mistakes

TL;DR: People are rarely aware of the deep learning mistakes they make, because things always appear to work, and there is no expectation of how well they should work. The solution is to try to accurately reproduce the settings & performance of high-quality papers & code.

The Dropout Mistake

I'll start with a true story. Years ago I was playing with different ConvNets with a couple of students at Berkeley. There wasn't much great software at that time, so we wrote everything on top of Theano, which was itself quite immature back then. We had just started to learn about neural networks, and I wanted to try this trick called "dropout".

One student said he had tried it already and got no good results. I was curious how to implement it in Theano, so I asked for his code, and immediately noticed something wrong: I saw a random mask generated with numpy.

If you are familiar with symbolic computation frameworks, you can probably guess what happened. Instead of randomly dropping half the neurons at each training step, the code dropped a random half of the neurons once, at graph compilation time, by masking the tensor with a numpy array. It did nothing but make the model smaller.

It's quite common to confuse compile time and run time when you're new to symbolic computation. And dropout in Theano is indeed not trivial; people have written blog posts about it. But the technical details of this mistake aren't important: what's important is the fact that you don't know you've made a mistake.
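The original code was in Theano, but the same compile-time-versus-run-time bug is easy to sketch in plain numpy (an illustrative toy, not the student's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones(8)

# WRONG: the mask is drawn once, "at compile time".
# The same neurons are zeroed at every step -- this is not
# dropout, it just makes the model permanently smaller.
fixed_mask = rng.random(8) >= 0.5

def forward_wrong(x):
    return x * fixed_mask

# RIGHT: a fresh mask is drawn every time the forward pass runs,
# so a different random half of the neurons is dropped each step.
def forward_right(x, p=0.5):
    mask = (rng.random(x.shape) >= p) / (1 - p)  # "inverted" dropout scaling
    return x * mask

# The wrong version always drops the same neurons:
assert np.array_equal(forward_wrong(x), forward_wrong(x))
```

Both versions still train and "appear to work", which is exactly why the bug went unnoticed.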

Why?

Why are people unaware of deep learning mistakes? Because (usually) there are no "correct results" to expect from the code.

In general programming, you become aware of your bugs through wrong results, exceptions, or segfaults. You write unit tests or run experiments to check the correctness of your code. But the study of deep learning is totally different: it is highly based on experiments -- it's almost all about experiments. People draw conclusions from experiments, rather than use experiments to verify conclusions. Therefore it's hard to notice when the experiments themselves are wrong.
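One of the few generic "unit tests" deep learning does offer is comparing analytic gradients against numerical ones. A minimal numpy sketch (the toy loss and all names here are mine, for illustration):

```python
import numpy as np

def f(w):
    # toy "model": squared loss of a linear layer on a fixed input/target
    x, y = np.array([1.0, 2.0]), 3.0
    return (w @ x - y) ** 2

def analytic_grad(w):
    # hand-derived gradient of f
    x, y = np.array([1.0, 2.0]), 3.0
    return 2 * (w @ x - y) * x

def numerical_grad(f, w, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

w = np.array([0.5, -1.0])
assert np.allclose(analytic_grad(w), numerical_grad(f, w), atol=1e-4)
```

A check like this catches wrong backward passes, but it cannot catch the higher-level mistakes discussed below.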

The "dropout" mistake is just a simple example that looks more like a programming bug. Perhaps you could find it through unit tests. But there are many more mistakes that hurt you without being noticed. Here are some other examples I've seen:

  1. You train a model with BatchNorm for a small number of steps, and the BatchNorm statistics' EMA decay parameter is 0.99. You see your model has a much better training score than validation score, and conclude it's overfitting -- when in fact the running statistics simply haven't converged yet, so validation (which uses them) looks artificially bad.

  2. When working with pixel interpolation, it's important to model the pixel value as the color sampled at the center of the pixel square. In fact, libraries all have slightly different behaviors on this. Getting it wrong can hurt tasks that require pixel alignment, e.g. segmentation.

  3. You copy-paste preprocessing / augmentations that are unsuitable or even harmful for a new dataset.

  4. You download a pretrained ResNet and use it without the mean-subtraction it requires.

  5. Your ConvNet training only reaches 60% GPU utilization, so you tell your boss / advisor that TensorFlow is slow, because that's what you've heard from others.
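To see why the BatchNorm example above misleads, note how slowly an EMA with decay 0.99 converges. Assuming the running mean is initialized at zero (as most frameworks do), a tiny simulation:

```python
# EMA of the batch mean with decay 0.99, starting from 0.
decay, true_mean, ema = 0.99, 5.0, 0.0

for step in range(100):
    # assume, generously, that every batch sees the true mean exactly
    ema = decay * ema + (1 - decay) * true_mean

# After 100 steps the running mean has only covered
# 1 - 0.99**100, about 63%, of the distance to the true value.
# Validation uses this stale statistic, so its score looks bad
# even though the model is not overfitting.
print(ema)
```

With only a few hundred steps of training, either lower the decay or train longer before drawing conclusions from the train/validation gap.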

Apart from these very specific problems, you could also have simply misunderstood or missed details in a paper you implement. You won't see any errors from these mistakes. Instead, models with mistakes usually still train, converge, and appear to work, thanks to the magic of gradient descent. Unless an expert comes to check your code carefully, it's unlikely you'll realize it. Deep learning feels so easy in a world without bugs.

Does it Matter?

You can look at this in a positive way: deep learning is so robust! It still works (slightly worse) even with mistakes. And indeed, it doesn't matter if you only play with it, or use it for tasks where "good enough" is enough.

But if you care about being "better than enough" -- being the best in the field, achieving state-of-the-art research, or saving computation for your company or users -- then it certainly matters.

How to be Aware?

People are unaware of mistakes due to the lack of expectations -- so create expectations. The straightforward way is to accurately reproduce the settings & performance of high-quality papers or code before moving on to your own settings. This gives you a clear outcome to expect from your code: if you get worse performance under the same setting, you know something must be wrong.
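In code, "creating an expectation" can be as blunt as a gate that fails when your reproduction misses the published number by more than your tolerance (the helper name and the numbers below are hypothetical):

```python
def check_against_reference(name, got, expected, tol):
    """Fail loudly when a reproduced metric falls short of the reference."""
    gap = expected - got
    if gap > tol:
        raise AssertionError(
            f"{name}: got {got:.2f} vs reference {expected:.2f} "
            f"(gap {gap:.2f} > tol {tol:.2f}) -- something is likely wrong")

# a reproduction within tolerance passes silently:
check_against_reference("top-1 accuracy", got=76.0, expected=76.2, tol=0.3)
```

The point is not the three lines of arithmetic but the discipline: decide the expected number and the acceptable gap before looking at your own results.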

Certainly there are many difficulties in reproducing papers:

  1. Not everyone has the extra time to spend on a setting they don't really need.

  2. Some papers are unclear about the exact settings.

  3. Some are naturally hard to reproduce due to variance in experiments, e.g. RL papers.

  4. Sometimes papers are not trustworthy: people may tune their results heavily or report the best of multiple runs, just to get published.

  5. Some papers have no measurable results, e.g. GANs.

It would make our lives much easier if authors open-sourced their code. When there is a public piece of code that reproduces the method, it's usually not hard to write a new implementation.

Open source also benefits the authors. It validates the paper and makes it easier to follow, which is especially important for junior researchers: others are unlikely to spend time reproducing a method when they don't have enough trust in the author.

How's the community doing?

First, do authors often release their code?

I would say not many do. I guess a major reason is that research code is usually messy -- which is understandable. In research, you don't know whether the code you're writing today will ever be used again, and you often cannot plan (and design) for the code you'll write tomorrow. It's hard to write high-quality code this way.

Second, are papers reproduced by the community?

It's even worse. In fact, even the most famous results are not well covered in the open source world. I work on CV & RL, so I'll mention some of the most famous results in these fields:

  1. ResNet (and variants) on ImageNet. Modern DL frameworks today usually have their official implementations.
  2. Faster R-CNN on COCO. The original code release was based on outdated VGG / Caffe / MATLAB. If you want a modern version (ResNet at least), the only ones I know of that mention matching COCO performance (within 1 mAP -- still not a gap small enough to use the "variance excuse") are 1 2 3 4 5.
  3. Mask R-CNN on COCO. Published 8 months ago, and there is literally no successful reproduction of its performance on GitHub. There is one repo that mentions matching performance on the Cityscapes dataset, though.
  4. DQN / A3C for playing Atari games (Pong excluded -- it's like MNIST). There are many DQN repos, but they often don't mention any comparable scores, and I didn't check further. For A3C the only good ones I know are 1 2 3 4.

In fact, most GitHub repos that claim to "reproduce" these papers only reproduce the method, not the experiments. They do "appear to work", but as said earlier, without a matching experiment you don't know whether they are correct or not.

Also, if you check the DL frameworks in the above links, the world is dominated by TensorFlow, PyTorch and MXNet. Caffe and Theano are not doing well today. Another interesting thing: none of the above results were successfully reproduced in Keras.

What should we do?

ICLR next year has a reproducibility challenge, which is a great signal that the community is addressing this issue. The challenge lets students reproduce experiments in ICLR papers, and its rules basically say what we should do:

  • For authors: you are encouraged to release your code.
  • For users of papers: reproduce some experiments in the paper.

This post is written from the viewpoint of users: students, researchers trying to follow someone's results, engineers working on DL applications. If you're serious about what you're training, don't get too happy about something that merely appears to work. Remember that you could be unaware of your mistakes.
