Here we restrict the discussion to the following usage scenario:
Because of assumption 2, a new feature often requires relatively independent changes to multiple parts of the repository. For example, while implementing feature X we might:
and several other relatively independent steps.
Because of assumption 4, the developer knows that their direction is very likely correct and will (after minor revisions) be approved by the reviewer. To improve efficiency, once part of the changes (e.g. 1-3) is implemented, subsequent development (e.g. 4-5) should not be blocked by code review. A code management system must support this non-blocking development mode well in order to maximize team efficiency. However, we will see that the design of git + github does not encourage this mode of development.
The traditional github/gitlab-based workflow has the following property:
The most important property of the stacked diffs workflow is: the basic unit of code review corresponds to commits in the repository. In Phabricator (the code review system used by Meta, also used by open-source projects such as llvm) this unit is called a "diff", conceptually corresponding to a "PR".
Whether the basic unit of code review is a branch or a commit -- what difference does it actually make?
Readers might think that when reviewing a PR on github, they are also reviewing commits. But the content presented to reviewers in a PR is actually computed from the state of branches: every PR has a target branch (e.g. pytorch:main in the figure below) and a feature branch (e.g. ppwwyyxx:logging). The content under review is the difference between them. In other words, if the content of the feature branch changes (e.g. a new commit is added), the PR changes.
In a commit-based workflow, the concept of a branch is not even needed: all of my work is a series of local commits, which are synced to the code review system as diffs. Since a diff corresponds to a single commit, adding a new commit does not affect the existing content in the code review system; it creates a new diff. To modify the content of a diff, we amend the change into the commit corresponding to that diff.
With these basic concepts covered, we next explain why the git + github PR workflow is not pleasant to use. The overall argument can be summarized as follows:
This section introduces a common engineering practice: code review units should be as small as possible. A complex development task should be split into separate code review units rather than reviewed as a whole, because:
The effort required for code review does not scale linearly with PR size: a large PR is harder to review than the same amount of code split into small PRs.
As a result of being harder to review, large PRs take longer to review and receive lower-quality reviews. Both points are supported by plenty of research, e.g. the paper "Modern Code Review: A Case Study at Google".
Review has latency. Separate, independent code reviews allow earlier changes to be merged as soon as they are accepted, which (i) reduces conflicts; (ii) lets others use the change earlier, exposing potential problems sooner.
For example, suppose the first part of a piece of work can be accepted quickly, while the rest still needs a week of discussion & review. If we wait until the whole thing is accepted and merge everything together, the first part may run into merge conflicts with other changes made during that week -- conflicts that could have been avoided entirely.
Send small, finished pieces of work out for review early, so that if a reviewer spots a problem, the subsequent direction can be adjusted in time. Otherwise, if everything is saved up for one big review, fixing problems discovered at that point costs a lot of extra work.
Changes to different modules may need to be reviewed by different people. Reviewing them together adds mental burden for every reviewer: during review they must figure out "which parts am I supposed to look at?", and every new notification from the code review platform makes them wonder "is this relevant to me?".
The commit history merged into the repository should correspond one-to-one with code review units (rather than many-to-one). When looking back at history, small commits that each solve one independent problem are clearer and make it easier to track down issues.
For these reasons, good engineering practice always encourages splitting a large change into multiple small parts that are reviewed and committed separately. Each part needs to be a logically complete and correct small unit on its own. A change is usually under 100 lines, and as a rule under 300 lines. Google's "Modern Code Review" paper also says:
Developers are strongly encouraged to make small, incremental changes.
An article from the company Graphite says:
The ideal PR is 50 lines long.
Sometimes splitting makes the sum of the parts slightly larger than a single change; sometimes making a large change (e.g. a refactoring) "splittable" even requires considerable extra work (e.g. adding a compatibility layer). But the benefits of "small incremental changes" are worth this extra cost.
Once there are multiple small, interdependent code reviews, tooling is needed to manage their dependencies automatically. At both Meta and Google I used a local mercurial repository together with the companies' internal code review tools. This workflow makes it very convenient to manage dependencies between code reviews.
Below are a few examples showing why Meta's mercurial + Phabricator Diff workflow is better than the git + github PR workflow. In each example, 😞 marks the parts with a poor experience.
Example 1: we start with two changes:
They have the dependency relationship A <- X. At Meta, I would do this:
Both changes are visible in hg log.

If using github, I would have to do this instead:
git log shows both changes, while branchA only shows change A.

Example 2: continuing the previous example, after some review we need to modify the content of change A. At Meta I would do this:
If using github, I would need to:
Example 3: continuing the previous example, after some review we find that function S also needs to be modified in order to better implement feature A, i.e. the dependency becomes S <- A <- X. At Meta I would do this:
If using github, I would need to:
Example 4: continuing the previous example, suppose we have the dependency chain S <- A <- X <- Y, and S and A have both been accepted. We want to merge them as soon as possible and continue developing X and Y on top of the latest mainline after the merge. At Meta I would:
With github, I would need to:
These examples show the fundamental shortcoming of the github workflow: neither git nor github has sufficient information about the dependencies between branches (i.e. between PRs). The main problems this causes are:
When the dependency chain of PRs is long, every time a PR in the middle is modified, merged, or deleted, all subsequent branches that depend on it must be manually rebased one by one and manually pushed to github. Sometimes the github merge target also needs to be changed manually.
When commits are the unit of work, all of this can be automated: when a commit in the middle changes, all commits that need to be rebased/pushed can be found automatically through the dependency relationships.
Besides the missing dependency information, another drawback of git/github is that rebasing between branches is more likely to produce conflicts. This is due to the lack of a commit identifier mechanism.
What is a commit identifier? In a branch-based workflow, a local branch is matched to the remote PR through one identifier: the branch name. In a commit-based workflow, commits and remote diffs also need a matching mechanism so that the tooling knows which diff each commit should update. This is usually implemented by the local tool (such as hg) adding a random unique identifier to the commit metadata. The local tool also maintains this identifier: it stays unchanged across rebase, reorder, amend, etc., and on squash the user is asked which identifier to keep. This commit identifier replaces the role of the "branch name".
Moreover, the commit identifier makes rebasing much smoother. For example, in Example 2 of the previous section, we want to rebase branchX onto the modified branchA:
The rebase in the figure is not as simple as it looks: because git has no idea that commitA and new commitA are related, it will try to apply commitA and commitX one by one onto new commitA, and applying commitA onto new commitA will almost certainly produce a conflict. With a commit identifier, however, the rebase tool knows from the identifier and the commit time that "new commitA" is the newest version of "commitA", so this conflict is avoided entirely.
In addition, a common minor annoyance is that inline comments on github PRs are often lost after a force-push, again because github does not know how the new commits correspond to the old ones.
It is easy to see that the dependencies between PRs/diffs are not necessarily a single linked list; they can form a directed acyclic graph (DAG). Such dependencies are even harder to handle in git, which is another minor drawback.
Unlike a git branch, in which all commits must form a straight line, the local workspace of a mercurial repository can contain branching structure. For example, I can locally create 5 WIP commits with DAG-like dependencies:
Since code review corresponds to commits, these 5 commits become 5 "diffs" available for review. Phabricator's UI can also display the DAG relationship between diffs, for example:
While working at Meta/Google, my mercurial workspace usually contained dozens of in-progress commits, corresponding to diffs on the code review platform (called CLs at Google). They might have complex DAG dependencies or be completely independent. Some were serious development, some were prototypes, some existed only for temporary debugging -- and that was fine, because I could choose which commits to send out for review, unaffected by newly added commits. I could also easily modify commits or their dependencies via amend/rebase, and all modifications could be synced to the code review platform in one step. How to replicate this experience on git remains an open problem.
To achieve a workflow close to stacked diffs without changing git / github, one would need a new git repository management tool responsible for:
git log.

Some tools have already partially implemented these features, for example:
Finally, here are some other references on the topic of stacked diffs:
The two articles above are the most detailed, and this post borrows some of their points. Besides these, there are also:
Note: most of this post was written while working at Cruise, after leaving Meta's Stacked Diff behind, when I suffered from Cruise using github without Stacked Diff. Before I finished the post I moved to Google and had Stacked Diff again. Now I'm back in the git world once more, so I started looking into this problem again.
The design space is complex, so in this article I'll start with a smaller topic: registration in config systems. I'll show why this common pattern, though it works fine for small-scale projects, does not scale well in the long term. I'll also discuss an alternative.
Configs are often constrained to include only primitive types (numbers andstrings), and there are a lot of good reasons to keep this property.
A global registry in a config system is typically a Mapping[str, Any]. Its purpose is to allow users to refer to complex objects through simple strings in a config, overcoming the constraint of primitive types.
Objects can be added to a registry like this:
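The original snippet isn't preserved above; a minimal sketch of what such decorator-based registration commonly looks like (names are illustrative, not any specific library's API):

```python
# A small global registry: name -> object.
MODEL_REGISTRY = {}

def register(name):
    def deco(obj):
        MODEL_REGISTRY[name] = obj
        return obj
    return deco

@register("MyModel")
class MyModel:
    ...
```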
This allows users to select which model / data to use by setting cfg.model.name = "MyModel" or cfg.dataset.train = "my_dataset".
Registration has a clear benefit, but at a larger scale, some of its downsides can become serious.
"Users should only pay (in compute cost and mental cost) for what they use" is a general design philosophy that I find important in almost all aspects of software design.
The registration pattern breaks this philosophy by running unnecessary registration code: users will only provide one (or very few) strings in their config, but they have to pay the overhead of registering the many candidate strings they might need.
To make matters worse, the overhead has to happen very early in a program, typically at import time. Import speed is crucial for developer ergonomics: unlike other code that may run asynchronously with development, import often blocks developers.
The registration overhead includes:
These costs are negligible for small-scale projects, but they can become quite bad when there are hundreds or more objects to register. Bad patterns are guaranteed to appear at larger scale: there will be some users doing non-trivial registration (e.g. registering objects in a for loop) that's slow or even has unintended side effects. I had to work with many projects that take >10s to import, and the most common reason for slow import is registration.
The import overhead is also greatly magnified by Python's multiprocessing module: all subprocesses have to spend the time and RAM to rerun the imports.
Registries are typically defined as a global dictionary, so they share manyinherent problems of using global states.
It's not uncommon that different users register different objects under thesame name -- at a large scale that's guaranteed to happen.
Such conflicts can live in two users' own code for a long time, unnoticed, until one day someone needs to depend on both. The only viable solution is usually to rename one, hence breaking all its users.
To complicate the issue even more, people sometimes decide to resolve name conflicts by overwriting what's already registered. For example, an "overwrite" option is provided in the registry of iopath, mobile_cv, and paxml. Using this option may introduce hard-to-debug problems, because now an innocent import statement may silently change the behavior of user code.
Despite this, note that overwriting is actually necessary when working in notebooks, where it's common to "reload" code (and therefore re-register objects) on the fly. Here is some code I use to always enable overwrite during reload.
When running a function using a multiprocessing.Process created with a safe start_method like "spawn", the child process receives a pickled closure from its parent, so it knows what to run. However, this pickle does not include any global state. This implies that if a function accesses global state, it may behave differently depending on whether it runs in the subprocess or the parent process. Python's documentation has a clear warning about this:
if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.
The ray framework can run a pickledPython function remotely, and therefore it has similar (and even more counter-intuitive) issues.
Since the registration is globally accessible, it's not easy to find where in the code an object is registered (or modified, if overwrite is allowed) just by reading the code. When a user sees cfg.dataset.name = 'dataset_X' and is curious what "dataset_X" is, a global string search is almost the only way to find out without running the code. And the search does not always work: if the name is programmatically generated, the string cannot be found directly in source code, e.g.:
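The original snippet isn't preserved; a sketch of the kind of loop-based registration that defeats string search (the factory and names are made up):

```python
REGISTRY = {}

def make_dataset(suffix):  # hypothetical factory
    return {"name": suffix}

# The string "dataset_X" never appears literally in the source code,
# so a plain text search cannot find where it was registered.
for suffix in ["X", "Y", "Z"]:
    REGISTRY["dataset_" + suffix] = make_dataset(suffix)
```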
In this case, users will have to be more creative about what strings to search.
In C++, registries cause more trouble because construction and destruction of global objects are very tricky. In Safe Static Initialization, No Destruction I talked about a few of PyTorch's C++ bugs related to this. Luckily, in Python, there are better alternatives.
If the only goal of registration is to provide a name → object mapping, then a simple alternative in Python is to use obj.__module__ + '.' + {variable name} as the name, which may look like some_library.some_module.MyClass. The "variable name" can be obj.__qualname__ for classes & functions.
Given this string, one can then call a simple function such as the builtin pydoc.locate to obtain the object it refers to: modules will be imported on-demand by importlib.
Use registration: | Use full qualname:
---|---
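The code in the comparison above isn't preserved; a rough sketch of the difference, using a real stdlib class as a stand-in for "some_library.some_module.MyClass":

```python
from pydoc import locate

# Registration: cfg.model.name = "MyModel" is looked up in a global registry
# that had to be populated at import time.
# Full qualname: the string itself tells us where the object lives,
# and locate() imports the module on demand.
cls = locate("collections.OrderedDict")
obj = cls()
```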
This pattern has some obvious advantages over registration:
There are some common concerns of this pattern, but they are not hard to address.
It's slightly harder to dynamically create candidates: there is no "registry" to add objects to, and the only equivalent is to edit the globals() dictionary directly.
I don't consider this a big issue because it's actually discouraging bad practice: in a proper config system (e.g. one that's based on recursive instantiation) there should be no need to dynamically generate candidates like above. I hope to get to this in a future article.
The names in config have to match the names of classes/functions in code.
This has the benefit of clarity on one hand. But on the other hand, code owners have more responsibility to maintain backward compatibility, especially after renaming their classes and files. The standard good practice suffices to address this: distinguish private vs. public symbols; keep an alias from the deprecated name to the new name; etc.
The names are too long.
This is a real problem. Here are some possible ways to address it:
A $PATH-like mechanism can be used to specify which modules to search for names.The search path can include common prefixes like "my_lib.my_module" so that users only have to provide "MyModel".
There can be a registry-like Mapping[str, str] that maps from "MyModel" to "my_lib.my_module.MyModel" so that users don't have to write long strings. This mapping doesn't have to be global and doesn't introduce import overhead. This can help with problem (2) as well.
This is just a UI-level issue. Having a better config frontend, e.g. using Python code as the config language, can make this issue disappear! Let me save this for a future article.
Among those challenges, there are a few tricky bugs related to the static initialization order fiasco (SIOF) and destruction of static objects. This time I was forced to learn a lot more details than I'd like to know about these topics, so it's good to write them down before I forget.
"Static initialization" is an ambiguous term because "static" is very overloaded in C++.In our context, it is supposed to mean "initialization of objects that have static storage duration",i.e. objects that live through the lifetime of a program.The word "static" actually talks about the object lifetime, not about initialization.
Meanwhile, initialization of such objects can have two steps:
Objects with static storage duration can be categorized into following two types, based on when their "dynamic initialization" happen:
- Non-local objects: their dynamic initialization happens before the program enters main().
- Function-local static objects: their dynamic initialization happens the first time control passes through their declaration.
SIOF typically refers to the problem that the dynamic initialization order of objects from different translation units is undefined, e.g.:
(Code example: two globals a and b defined in two different translation units.)
If a and b have non-trivial constructors, and the constructor of b somehow needs to access a, the program may crash or behave unexpectedly because a may be initialized after b.
PyTorch heavily uses registrations, which all have static storage duration. A few SIOF bugs were found when I tried to build PyTorch in Google. As an example, when an ATen operator has many overloads, initialization order affects which overload is called, because an overload that's initialized earlier will be preferred over those initialized later.
Standard ways to avoid SIOF problems are:
Avoid dynamic initialization: change the object type to something that can be zero/const-initialized. totw/140 shows a few examples on how to replace std::string with non-dynamic counterparts.
Use well-defined initialization order: there is a guarantee that objects within the same translation unit are dynamically initialized according to the well-defined program order. So we can sometimes just move code into the same translation unit. In another PyTorch bug where one global depends on another, I simply merged two files so that their constructors are properly sequenced.
Construct on first use: it's often not practical to merge files. A better solution is the "construct on first use" idiom:
❌ Don't use globals | ✅ Use function-local static
---|---
By doing this, anyone that needs to access a will have to call get_a(). Because a function-local static is guaranteed to initialize on first use, we can rest assured that a will not be used before initialization.
The "construct on first use" idiom may look differently, because sometimes we don't need to use a
directly but do need to observe the side effects of its constructor. In such cases we just manually call get_a
to make sure a
is constructed. I used this to fix another PyTorch bug .
There are more ways things can go wrong in the destruction of objects with static storage duration.
In general, we have to carefully avoid use-after-free, i.e. accessing a global/function-local variable after it's destructed. This is normally prevented by this rule:
Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor.
Given this rule, we can deduce that:
b. This should be discouraged, but it means that technically ANY object could access b in its destructor. If any of these objects are destructed after b, we're doomed.
Given the above issues, the Google C++ style guide bluntly forbids such destructions:
Objects with static storage duration are forbidden unless they are trivially destructible.
This "no destruction" rule implies that the following code is illegal
(Code example: declaring a function-local static Object instance.)
if Object is not trivially destructible. The C++ FAQ advises the same.
Writing static Object* a = new Object; return *a; is safe as long as we never call delete, but this introduces a heap-allocation overhead. The last trick is to use a NoDestructor wrapper class to bypass RAII (the trick is the placement new operator):
Safe, but has heap allocation overhead | Safe and low overhead
---|---
Finally, as an alternative to "no destruction", another way to safely run destructors is to ref-count all such objects, but it's perhaps not worth the complexity. "No destruction" is usually a good enough solution.
In conclusion, to safely construct and destruct objects with static storage duration + dynamic initialization, follow these rules of thumb:
Terminal escape sequences are strings with special meanings that terminal applications print to stdout. When a terminal sees such a string, it does not display it; instead, it performs the advanced terminal feature the string corresponds to.
The most common escape sequence changes text color. For example, this command prints "Hello World" in red:
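The original shell command isn't preserved above; a minimal Python equivalent that prints the same SGR color sequence:

```python
# "\x1b[31m" switches the foreground color to red; "\x1b[0m" resets it.
print("\x1b[31m Hello World \x1b[0m")
```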
Terminals originally had only 8 colors, while most terminals nowadays already support 24-bit true color:
|
However, most applications still use only the 8 basic colors. Rich colors are mainly useful for syntax highlighting: in vim, use set termguicolors to turn on true color support, after which truecolor values can be used to configure the guifg and guibg of each highlight group.
This useful script prints the various colors a terminal supports, as well as other rendering features -- unfortunately most of them are rarely used by applications. The output in the Kitty terminal looks like this:
In a supporting terminal, running the following command copies "Hello World" to the clipboard.
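The original command isn't preserved; a minimal Python sketch that emits the same kind of sequence (the payload is base64-encoded):

```python
import base64

text = "Hello World"
payload = base64.b64encode(text.encode()).decode()
# OSC 52: ESC ] 52 ; c ; <base64 data> BEL -- "c" selects the system clipboard.
print(f"\x1b]52;c;{payload}\x07", end="")
```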
This escape sequence is commonly called "OSC52", where OSC stands for "operating system command". OSC52 properly solves a problem that troubled me for over a decade: how to copy text from the terminal to the local clipboard when using ssh + vim/tmux?
The terminal's built-in select + copy does not work well with terminal applications that have "windows", such as vim/tmux, because:
If vim/tmux run locally, these problems are easy to solve: each provides its own select + copy functionality, and both can read and write the local system clipboard. But when they run inside ssh, I had to rely on hacks:
With OSC52 this problem is gone: as long as an application inside ssh prints the OSC52 control sequence and the local terminal sees it, it can write to the local clipboard. A concrete setup can look like this:
In tmux, enable the set-clipboard on option. This option does two things at once (I don't think the official wiki explains this clearly).
Use the yank script on the command line: $ run_some_command | yank.

In a supporting terminal, the following command outputs "This is a link", and clicking the printed text with the mouse opens "example.com":
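The original command isn't preserved; a Python sketch of the OSC8 hyperlink sequence:

```python
url = "http://example.com"
text = "This is a link"
# OSC 8: ESC ] 8 ; ; <url> ST <text> ESC ] 8 ; ; ST   (ST is "\x1b\\")
print(f"\x1b]8;;{url}\x1b\\{text}\x1b]8;;\x1b\\")
```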
Together with others, I maintain a document that records terminals that support OSC8 hyperlinks and applications that use them.
Since most terminals can already match URLs in text with regexes, the hyperlink feature is not strictly necessary, and it may take more imagination from application developers. The only scenarios where I use it so far are:
ls --hyperlink=auto: with this alias, file names printed in the terminal can be clicked to open the files.
Doing something similar for git log.

The Kitty terminal invented its own escape sequence for displaying images in the terminal. Images displayed this way are not blurry pictures pieced together from colored unicode characters, but normal high-resolution images. timg is an image viewer that supports the Kitty protocol; with it, you can view remote images over ssh.
Note that tmux does not support this non-standard protocol and will swallow the corresponding escape sequences. Fortunately, tmux provides a "passthrough" feature: after enabling allow-passthrough on, a special passthrough escape sequence lets tmux forward escape sequences printed by an application to the outer terminal. Because tmux itself doesn't support the protocol, viewing images under tmux still has image-positioning glitches. I worked around them with a few hacks that I won't explain here.
In a supporting terminal, these two commands pop up a "Hello World" notification:
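The original commands aren't preserved; a Python sketch of the simpler of the two, OSC9 (kitty's richer OSC99 variant is not shown):

```python
# OSC 9: ESC ] 9 ; <message> BEL
print("\x1b]9;Hello World\x07", end="")
```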
The main use is to let programs on the remote side of ssh send notifications to the local machine. tmux likewise does not support this sequence, so it needs to be combined with passthrough.
OSC9 appeared earlier and has better compatibility. OSC99 is kitty's own invention and supports richer notification formats.
This command asks the layer above to set the current window title. What exactly happens is up to that layer's implementation (tmux or the terminal):
|
Not hugely useful. The main use is to have the shell automatically set the title to the PWD or the currently running command, which makes it easy to tell apart multiple tmux tabs or terminal tabs. In zsh it can be done like this:
|
Finally, some complaints.
Quite a few useful terminal features exist in only one or two terminals: reading the clipboard, transferring files, viewing images and videos, progress bars, tooltips on mouse hover...
Terminal features have long lacked standardization: many escape codes have no detailed spec -- a new terminal developer basically has to read other terminals' code to understand their behavior. Moreover, some escape codes invented by individual terminals even conflict with each other -- reusing characters that others already use.
Each terminal implements only the subset of features it considers valuable. As a result, for compatibility, most applications simply avoid advanced features.
Even if an application wants to use an advanced feature, there is no good way to check whether the terminal supports it. Here I found a story similar to the browser User-Agent saga:
Applications check whether the $TERM environment variable contains "xterm" to decide whether to use these features, and whether $TERM contains "256color" to decide whether to use 256-color output. To stay compatible, many terminals keep their $TERM name as "xterm" or "xterm-256color"; the kitty terminal's name is "xterm-kitty".

terminfo allows applications to query whether the terminal supports a specific feature, but since features lack standardization, terminfo does not solve this problem well either.
In this chaotic situation, terminal developers also struggle to reach consensus. A few terminal developers once organized a terminal working group to discuss proposals for various features, but it ended in discord. This thread records the organizer's complaints.
For these reasons, the evolution of terminal features has largely stalled. Only a few terminals keep inventing new features on their own, e.g. iTerm2's home-grown features and kitty's home-grown features, but due to the lack of standardization they haven't had much impact on the community.
The terminal is my main working tool, and I hope these problems get solved.
All code examples and experiment results are available on github at ppwwyyxx/RAM-multiprocess-dataloader. The content is not specific to PyTorch: it applies to any user of Python's multiprocessing library on Linux.
Datasets for machine learning are usually not stored in RAM. But it's common to store their "metadata" in RAM, and this may still cause nontrivial RAM usage. The metadata could be:
As a concrete case, loading the metadata of COCO training set into Python takes ~2.4G of RAM:
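The measurement code isn't preserved above; a sketch of the kind of loading being measured (the annotation file path is an assumption):

```python
import json

# Loading COCO instance annotations as plain Python lists/dicts keeps
# millions of small objects (and their refcounts) in RAM.
with open("instances_train2017.json") as f:
    coco = json.load(f)
print(len(coco["annotations"]))
```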
We obviously don't want to replicate this 2.4G of RAM across all processes.
We acknowledge that there are ways to offload these metadata to disk. For example, people sometimes do:
By doing these, the RAM usage of a dataset becomes negligible. However, these methods will sacrifice flexibility and capabilities, such as random access, perfect shuffle, merging datasets arbitrarily, custom subsampling support, etc. Notably, PyTorch's commonly used map-style datasets support random access & sampling. All of these capabilities require certain metadata in RAM.
This article ignores any of these offloading methods. Instead, we'll discuss how to reduce the RAM usage without moving these data out of RAM. The idea is simple: we'll try to let all processes share a single copy of the dataset.
First let's build tools to measure RAM usage - which is not as easy as it sounds.
Common tools like top -p PID or psutil.Process(PID).memory_info() obtain memory statistics from /proc/{PID}/statm or /proc/{PID}/status, but they are insufficient for our analysis. Instead, we'll use the information provided in:

- /proc/{PID}/smaps: per-memory-mapping RAM usage information, documented in this man page
- /proc/{PID}/smaps_rollup: aggregation of data from smaps
We'll derive the following important measurements from it:
- USS (Unique Set Size): RAM that is private to this process, computed from smaps.
- Shared: RAM shared with other processes, computed from smaps.
- PSS (Proportional Set Size): USS plus this process's proportional share of the shared RAM; note that USS + Shared is the RSS reported by top/htop.

To obtain these measurements, we use psutil.Process(PID).memory_maps(), which parses smaps under the hood:
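A sketch of how such measurements can be derived with psutil (field names follow psutil's Linux memory_maps() output; the real utility in the repo may differ):

```python
import psutil

def get_mem_info(pid: int) -> dict:
    res = {"uss": 0, "shared": 0, "pss": 0}
    # memory_maps() parses /proc/{pid}/smaps under the hood.
    for m in psutil.Process(pid).memory_maps():
        res["uss"] += m.private_clean + m.private_dirty
        res["shared"] += m.shared_clean + m.shared_dirty
        res["pss"] += m.pss
    return res
```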
Then we create a MemoryMonitor utility to measure and print the results for a list of PIDs. The code is straightforward and can be found here.
We start with a naive implementation of a dataset that produces items from a list:
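A sketch of such a naive list-backed dataset (the actual code in the repo may differ in details):

```python
class DatasetFromList:
    """Wrap a list of arbitrary Python objects and serve them by index."""

    def __init__(self, lst):
        self.lst = lst

    def __len__(self):
        return len(self.lst)

    def __getitem__(self, idx):
        return self.lst[idx]
```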
Then we launch subprocesses to read from this dataset with the list of COCO data. To make a cleaner demo, we don't use PyTorch's dataloader, but just launch 4 subprocesses by ourselves:
|
We then added our MemoryMonitor to it. The full code and its output logs are available on github. Each segment in the log contains memory measurements for the main process + 4 workers:
|
The code looks completely innocent. However, if we plot the memory usage of any dataloader worker over time, we seem to find a memory leak! This is the notorious "dataloader leaks memory" issue that is discussed in multiple places, e.g. this PyTorch issue and Edward's podcast.
In fact, the growth of RAM usage does stop in the end, so this issue is not a memory leak. But in reality, users often do not see the end before the system OOMs, and they may wrongly conclude this as a "memory leak".
The root cause of this issue is "copy-on-read" of forked CPython objects.
Linux has a copy-on-write mechanism: when a process forks, the child process will share its entire memory space with the parent, and only copy the relevant pages when necessary, i.e. when the child process needs to write to the page. This mechanism allows read-only pages to be shared to reduce total memory usage.
The copy-on-write behavior can be clearly observed in the above figure:at time=0, the worker has 2.6G of shared RAM, 0 USS, and
However, this mechanism did not help us when we read our dataset. The problem is that our dataset is a large nested data structure that contains many small Python objects. Even though the dataset is "read-only" in theory, accessing any Python object will increment its refcount - causing a lot of memory writes. With these writes, memory can no longer be shared among parent and child processes. In other words, objects are not only copy-on-write, but also copy-on-read. Therefore, in the figure we see that the "Shared" RAM decreases and "USS" increases, since many pages are copied from shared memory into each process.
The end game is that each child process has to replicate all the pages that contain object refcounts in the dataset. For a dataset with many objects, this is almost the size of the dataset itself. In the output log, we see that this program uses 10G total PSS in the end, where each child process replicates 1.8G of USS.
The copy-on-read issue is due to CPython's reference counting. There are ways to change CPython's behavior, e.g. gc.freeze, but it has far-reaching consequences and I failed to make it work for the example here. However, there is a simple and transparent way to solve the issue: store the dataset with a very small number of Python objects, so there are very few refcounts! Below is a minimal implementation that stores a list using 2 numpy arrays:
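A minimal sketch of this idea, similar in spirit to the implementation used in detectron2 (details such as the pickle protocol are illustrative):

```python
import pickle
import numpy as np

class NumpySerializedList:
    def __init__(self, lst):
        # Each item becomes a compact byte buffer; the whole list is stored
        # in just two numpy arrays, so there are very few refcounted objects.
        buffers = [np.frombuffer(pickle.dumps(x, -1), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(b) for b in buffers])
        self._data = np.concatenate(buffers)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(self._data[start:end].tobytes())
```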
Detectron2 enables this type of serialization by default (since this commit by Yanghan). To compare different serialization mechanisms,we borrow its code into a serialization util, and use it here:
|
Just by this simple one-line change, the RAM usage greatly reduces. The end of the output log file is shown below.
|
We can see that:
The RAM usage no longer grows with #processes, and the stored data is smaller than the original dataset because pickle.dumps not only serializes but also compresses the data. We benefit from both sharing and compression by applying this optimization, at the cost of a tiny pickle.loads overhead in each access.

Actually, after compression, the dataset only takes ~500M (printed at the beginning of the log). So a question arises: why does the main process use 1.6G RAM before starting subprocesses?
This is in fact just an artifact of modern memory allocators: it does not always release memory back to the OS. In fact, if we run this simple serialization/compression code:
|
We see that we seem to "lose" ~700MB of RAM even after we've deleted everything:
|
Using a better allocator, e.g. by export LD_PRELOAD=libjemalloc.so
, can make this issue largely disappear.
This artifact is typically not a big concern, since allocators will find opportunities to reuse these free buffers. (Well, they may be concerning with start_method="fork", because reusing these free buffers may trigger copy-on-write! But I'm not going to talk more about that.)
In our code above, we launched subprocesses using a start_method="fork" argument. "fork, spawn, forkserver" are the 3 "start methods" of Python's multiprocessing library. This article is a good reference that explains their differences.

Since start_method="fork" is unsafe (in practice, it causes various crashes & deadlocks) and might no longer be the default in the future, we want to rerun our code above with start_method="spawn" or "forkserver". Sadly, the serialized array is no longer shared among workers. Each worker has a large USS:
|
The reason why our trick no longer works is that "spawn" and "forkserver" don't benefit from the copy-on-write mechanism. They will start a "fresh" subprocess with fresh memory space, instead of sharing with the parent. Everything the child process needs to access is pickled in the parent process and sent to the child. This ensures safe behavior, but is bad for start-up speed and memory usage.
In our case, the entire dataset will be pickled and sent to child processes. This is why each child process consumes a large USS.
torch.Tensor

It turns out there is a simple fix to this problem: just store the serialized dataset in a torch.Tensor instead of a numpy array. The reason why it works is that multiprocessing uses a customizable pickle implementation called ForkingPickler, and PyTorch customizes how torch.Tensor should be pickled by it: the tensor data will not be serialized to bytes. Instead, during pickling, the tensor will be moved to shared memory files (typically under /dev/shm) to be accessed by other processes directly.
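A sketch of the same container with its storage moved into torch tensors, mirroring the approach just described (exact details are illustrative):

```python
import pickle
import numpy as np
import torch

class TorchSerializedList:
    def __init__(self, lst):
        buffers = [np.frombuffer(pickle.dumps(x, -1), dtype=np.uint8) for x in lst]
        # torch.Tensor is pickled by multiprocessing's ForkingPickler via
        # shared memory, so child processes receive a handle instead of a copy.
        self._addr = torch.from_numpy(np.cumsum([len(b) for b in buffers]))
        self._data = torch.from_numpy(np.concatenate(buffers))

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(bytes(self._data[start:end].numpy()))
```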
To test tensor-based serialization, we run ./main-torchserialize.py spawn using the code here, and observe the following memory usage in workers (raw log is here):

- The shared RAM grows gradually, because a worker only touches pages of the shared torch.Tensor as needed. This is different from start_method="fork", where the entire memory space is shared at the beginning.
- Each worker keeps some USS, mainly from the libraries that import torch needs to load, such as libtorch.so. This can be easily verified by printing the measurements after import torch.

After applying tensor-based serialization, the total PSS usage in the end is 2.2G -- still worse than our earlier number using start_method="fork". The next section will optimize it further.
The last culprit in the above experiment is the 160MB per-worker USS in the above figure: this is just the memory footprint of import torch, mainly for PyTorch's global variables, etc. Since every child process launched by "spawn / forkserver" is a "fresh" one, they all need to import torch independently, hence each has 160MB of USS.
Luckily, "forkserver" provides a way to share the import torch
RAM usage through copy-on-write. By calling the undocumented Python API multiprocessing.set_forkserver_preload(["torch"])
before launching processes, each child process will be "less fresh": the torch library is preloaded (and shared), and don't need to be imported by each process independently.
Below are the experiment results. Code and full logs are on github:
|
start_method="fork"
.(Note that this optimization may be unsafe if import torch
creates any threads.My observation is that threads are indeed created due to import numpy
inside torch, but they can be disabled with environment variables.)
So far we've only looked at a single dataloader (with 4 workers). In reality, the only scalable way to use PyTorch on multiple GPUs is to use one process per GPU, each with its own dataloader and dataloader workers. This gives a total of #GPUs x (#DL workers + 1) processes, organized like below:
We modified the previous experiment slightly into this code to run on 2 GPUs. The memory usage looks like this:
|
Our previous optimization on dataloader workers is still effective - dataloader workers have a tiny USS. However, RAM usage is now replicated by #GPUs times because we let each GPU worker read the dataset independently.
An inconvenient solution to this problem is to load and serialize the dataset before launching GPU workers. By doing this, all GPU workers share the dataset just like what dataloader workers do. However, this limits flexibility and often requires significant refactoring, due to reasons such as:
Another simple solution to this problem is again to use torch.Tensor and ForkingPickler to share the dataset among GPU workers, except that now we need to manage the sharing explicitly like this:
|
This logic is implemented as another serialization utilhere.When using it as a drop-in replacement (full code here),the dataset is no longer replicated by GPU workers:
|
GPU worker 1 still has a small amount of extra USS, and that's just the footprint of import torch that we saw earlier, which can be avoided using set_forkserver_preload.
Note that the multiprocessing library itself also provides shared memory support. This PR contains an implementation of our serialization util without using PyTorch.
We've successfully reduced the total RAM usage by (approximately) a factor of
The essence of the solution is to let all processes share memory through a single torch.Tensor object, which needs to be moved to Linux shared memory by PyTorch's custom pickling routine. The TLDR on how to achieve sharing is:

- Don't let dataloader workers access many Python objects in their parent. Serialize all objects into a single torch.Tensor (but not a numpy array) for workers to access.
- Don't let all GPU workers load data independently. Load in one GPU worker, and share with others through a torch.Tensor.
For list-like data, all of these can be implemented transparently using the serialization routines developed in this article.
Multi-processing is often the only way to achieve trueparallelism in Python(until PEP703),but it comes with many tricky problems.This article hopefully provides an in-depth view of the problem of RAM usage.
]]>"Loss function" may mean different things in different systems.The version I'm going to criticize is the most common one that looks like below:
Bad | Worse
---|---
The key property of the bad "loss function" abstraction is: users are asked to provide a "loss function" that's executed separately after the "model / forward logic". Such an abstraction appears in a few open source systems: Keras model.compile(loss=), fast.ai Learner(loss_func=), Lingvo BaseModel.ComputeLoss.
The main problem is not with the function itself, but that the users' algorithm logic is forced to be separated into two parts: model and loss_func.
As an alternative, trainer_good below no longer separates "loss_func" from the model, and has equal functionality with trainer_bad.
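The original code isn't preserved above; a sketch of the two trainers being contrasted (names follow the text; the loop details are illustrative):

```python
def trainer_bad(model, loss_func, data):
    for inputs, targets in data:
        outputs = model(inputs)
        loss = loss_func(outputs, targets)
        loss.backward()

def trainer_good(model, data):
    for inputs, targets in data:
        # The model computes its own losses; the trainer needs no loss_func.
        loss = model(inputs, targets)
        loss.backward()
```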
In this article, I want to argue that this is a better design because:
Users can still split their model into two parts if they like, but they don't have to. (Apparently, trainer_good == partial(trainer_bad, loss_func=lambda x, y: x). So trainer_bad can still be used - we just set loss_func to a no-op if we don't like it. But trainer_good is cleaner.)
It's true that the separation can be useful to certain types of models.But it's not always the case, and enforcing it can be harmful instead.
The separation is not convenient for a model with many optional losses.Take a multi-task model for example:
Separation | No Separation
---|---
The right one is simpler in that it does not duplicate the branches that enable different tasks/losses. In reality, these conditions can be more complex than a simple if, and branching is generally less straightforward to maintain. So it's beneficial to not have to repeat the logic.
Note: if you think a wrapper like multi_loss_func({"task1": loss_func1, "task2": loss_func2}) will help (like what Keras supports), it is not going to work well, because it doesn't know how to route the inputs/outputs to loss functions.
One may argue that separating "loss" from "model" is nice because then we can easily switch different loss functions independent of "model". That is indeed useful in many cases. However, in many algorithms, loss computation is simply not independent of the model and should not be switched arbitrarily. This could be due to:
Loss computation depends on internal states computed during model.forward, e.g. values that only exist inside forward. In these cases, forcing a separation of "loss" and "model" will require "model" to return its internal states, causing an abstraction leak.
Different loss functions expect different representations of model's predictions. For example, these representations could be:
Since conversion between representations may be expensive or lossy, we'd like the model to produce the exact representation needed by loss computation. Therefore, a separation would not make the model independent of losses. On the contrary, it's even worse, because loss-related logic will be unnaturally split like this:
Separation | No Separation
---|---
We can see in the above snippet that the model is in fact not independent of losses. It also makes loss_func a bad abstraction because the semantics of its prediction argument is complex: it should be in different formats depending on which of loss{1,2} is used. In the version with no separation, it's very clear that the losses are computed using the right representation.
One may argue that the separation is helpful because it's nice to let the "model" return the same data in training and inference. This makes sense for simple models where training and inference share most of the logic. For example, in a standard classification model shown below, we can let the "model" object return logits, which will be useful in both training and inference.
But many models don't have a clean separation like this. In theory, training and inference only have to share (some) trained weights, but don't necessarily have to share any logic. Many object detection models, for example, do not compute "predictions" in training and do not compute losses in inference. A simplified diagram of the Region-Proposal Network (RPN) of a two-stage detector looks like this during training:
Any attempt to split a complicated algorithm like this into "model" and "loss function" will:
Therefore, it's unrealistic to expect that there is a nice separation, or that "model" can produce a consistent format in both training and inference. A better design is to include loss computation in the model's training-mode forward, i.e., let the model output losses in training, but predictions in inference.
Separation | No Separation
---|---
In the "no separation" design, users provide a "model" that returns losses.This model internally can still use separation of "loss function" and "forward logic"as long as it makes sense for this model.However, trainer is no longer aware of the separation,and the trainer can no longer obtain the "outputs".
Will this become a limitation of the "no separation" design? What if we'd like to do something with "outputs"? My answer is: users can do it inside the model, e.g. by calling write_summary(outputs) in their model.

Design is always a trade-off. Adding assumptions to a system might result in some benefits, but at the same time can cause trouble when the assumption isn't true. Finding a balance in between is difficult and often subjective.
The assumption that models have to come together with a separate "loss function", in my opinion, brings more trouble than it's worth.
In deep learning libraries, these variants can be a different implementation of a layer, a change in optimization algorithm, or a small modification to the training logic, etc.
Designing and maintaining these "research APIs" is difficult thanks to how frequently users want to change their behaviors. Such changes are often implemented by simply adding features to the target API they want to modify, e.g. by adding a new flag to the API, or by adding a new abstraction that generalizes the target API towards the users’ use case.
However, when maintaining a generic, core library meant to be adopted by diverse use cases for a long term, the above approach does not scale and poses many problems (discussed more below).
This note lists a few principles when working with "research APIs" that should help answer:
Researchers' job is about doing things in new ways. Hence their needs are so diverse that a core library should not aim to include or implement features for all possible use cases. The library should aim to only include the most popular and standardized features (more on the criteria later).
For features not included in the core, ideally there should be a way for users to implementthem out-of-core as extensions, without too much overhead / repetition.
This requires a continuous design evolution to make the core more modular and composable,so that core code can be reused in users’ new implementation.
A good sanity check for library maintainers is to ask the following question:
For any feature currently in the core library, suppose we remove it today, how much effort would it takefor users to reimplement it out-of-core?
A well-designed library should be decoupled such that most of its features are just extensions of itself, and they can be implemented out-of-core the same way as they are in the core.
There are 3 criteria for feature inclusion in core, ordered by their importance.
To understand the criteria more, let’s ask: what if the feature is —
Popular but not standardized: sometimes a feature is popular, but its users don’t yet align on the proper parameterization, its API, or the subtle implementation details. Including such features is risky, as it may create unclear semantics or impede its standardization in the future. It’s still OK to include it if it’s very popular (popularity is the #1 most important criterion), but try to do it in a composable way and with warning signs.
As a negative example, "Transformer" is a popular but not standardized feature. It's included in Pytorch, but received many complaints, and many projects (e.g. fairseq, detr) eventually had to fork and reimplement their own Transformer.
Simple but not popular/standardized: simplicity alone is not sufficient for inclusion, no matter how simple the feature is, because if everyone adds a simple feature they need, together it becomes complex.
Popular, standardized but not simple: simplicity is the #3 most important factor. If something is complex but very popular & standardized (e.g. BatchNorm being a headache for DL library developers), it should be included. In fact this is where a library could provide a lot of value to users.
When a user wants to change the behavior of a "research API" def func() defined in core, adding new arguments is often the quickest way to get things done. But it may introduce a number of maintenance problems.
New flag | New argument
---|---
Adding a simple argument to control the behavior like above is OK, if we think that the new option is very clear and popular. But as a "research API", many users will want to add their own customizations. This could lead to the following problems:
Poor code health: the library may gradually accumulate too many features that are:
Confusing behaviors: more and more features added over time may not interact with each other in a clear way, causing confusing or silently wrong behaviors.
"More general" may mean "less general":A common argument for adding options like this, is thatit doesn't change existing behavior and"makes the function more general".
However, keep in mind that when a function becomes more general in one aspect, it's often less general in other aspects. Generalizing towards one direction may not be a net win, because research code has too many possible directions to generalize towards, and picking one direction may affect its eligibility to pick others in the future. We will show what this means shortly.
New behaviors can also be encapsulated inside an argument:
Inject custom behaviors through callbacks: | Use object.method as callbacks:
---|---
This appears useful, since the custom logic is not implemented in core, but in a user-provided callback. For example, given the original code below (left), a researcher who wants to compute y differently may propose a compute_y_fn argument like below (right).
Original: | With callbacks:
---|---
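The original snippets aren't preserved; a sketch of the two versions being compared (do_something etc. are placeholders consistent with the surrounding text):

```python
def do_something(a):        # placeholder
    return a + 1

def do_something_else(x):   # placeholder
    return x * 2

# Original:
def func(a):
    x = do_something(a)
    y = do_something_else(x)
    return y

# With a callback argument, as proposed by the hypothetical researcher:
def func_with_callback(a, compute_y_fn=do_something_else):
    x = do_something(a)
    y = compute_y_fn(x)
    return y
```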
However, this design may be even more problematic:
Premature abstractions: assumptions/constraints are implicitly created about where the callback is triggered, what arguments it needs, and what it returns. These assumptions may be bad.
For example, a 2nd researcher may want to compute y using both x and a; a 3rd researcher may want to compute y, z in one function compute_y_z_fn because it's more efficient. These variants conflict with the 1st researcher's design.
In the future, after seeing enough use cases, we might realize that xyz = compute_xyz(a) is a truly good abstraction. However, at that time the premature abstraction of compute_y_fn will get in our way when implementing compute_xyz. In other words, although the current design makes the computation of y "more general", the abstraction limits our ability to generalize the function in other ways. That's why we said earlier that "more general" may mean "less general".
Obscure logic: readers can't easily figure out what this function does: they needto look at the caller of this function to see which callback is supplied, and thenlook at the implementation of the callback function. The aforementioned issue of "confusing behaviors" also applies here.
Sometimes callbacks are good and useful abstractions. But because they are so powerful, I've frequently seen them abused to alter a behavior into something that's strongly overfitted to a small number of use cases. In code reviews, I usually frown upon APIs that require callbacks/user-defined functions.
To customize a "research API" def func()
defined in core, we have the following options:
def func_v2()
in user code.(Or a class ClassV2
for classes).def func_v2()
in core.def func(option)
.The best choice is heavily subjective and should be evaluated case-by-case.Due to the concern of new arguments,in general we recommend methods (1) and (2), i.e. prefer forking func()
over changing func()
.
This also echoesFlax design philosophy thatsays "prefer duplication over adding options / bad abstractions".
Users/developers may find that the core design is not good enough yet, and recreating a variantof func()
without touching it may lead to too much code duplication.For example, ...
is duplicated between the two functions below.
Existing API in core | New variant
---|---
Such duplication is acceptable for a short term.We do NOT mean to encourage users to heavily fork core code.Instead, users and core developers should engage and aim to evolve the core design to reduce duplication— but design change takes time to happen, and duplication is preferred before a good design is found.
The most risk-free way to reduce duplications is by moving them into shared reusable code:
Existing API in core | New variant
---|---
This should be the preferred way to reduce duplications. The benefits are:
func()
, hence little risk.However, there are also challenges:
_reusable_parts()
) to maintain.The above challenges are less significant if _reusable_parts()
is private. Therefore:
func_v2()
is in core, make _reusable_parts()
private.func_v2()
must be out-of-core, consider _reusable_parts()
as "internal/experimental APIs".Inheritance, e.g. class ModuleV2(ModuleCore)
may also reduce duplication between two variants.However, this is generally less preferable than composition like above. The reason is similar towhy callbacks are not preferred: overriding methods is like passing callbacks - they are both user-definedfunctions and suffer from the same limitations: users are constrained by the assumption ofwhen/where/how the methods/callbacks are triggered.
We generally prefer adding a new implementation over adding new conditional branches to the existing implementation,but branches probably will happen somewhere anyway – after all, the new feature variant probably ends up as a new option/argument in the end-users' config.
If branching has to happen, we prefer it at earlier, shallower code path:
Branch earlier | Branch later
---|---
By branching earlier, we keep a clean func()
unaffected by the new variant.This recommendation is consistent with the preference to fork func_v2()
, not to add flag
to func()
.
Low-level components of these systems often use a plain list of values/tensors as inputs & outputs. However, end-users that develop models often want to work with more complicated data structures: Dict[str, Any], List[Any], custom classes, and their nested combinations. Therefore, we need bidirectional conversion between nested structures and a plain list of tensors. I found that different libraries invent similar approaches to solve this problem, and it's interesting to list them here.
Though many simple deep learning models just need a few input/output tensors, nested containers are useful abstractions in advanced models. This is because many concepts are naturally represented by more than one tensor, e.g.:
When a frequently-used concept has natural complexity like above, representing it in a flat structure (e.g. Dict[str, Tensor]) consisting of only regular tensors may result in ugly code. A multi-level nested structure sometimes becomes helpful. Take sparse tensor as a simple example:
 | Use nested containers | Use a flat Dict[str, Tensor]
---|---|---
Representation | {"a": SparseTensor, ...}; SparseTensor can be a namedtuple/dataclass, or a new class. | {"a_values": Tensor, "a_indices": Tensor, ...}
Sanity check | SparseTensor class can guarantee both tensors exist and follow certain contracts (e.g. their shapes match) | Need to check a_{values,indices} co-exist in the dict
Pass to another function | Pass x["a"] directly | Extract x["a_values"], x["a_indices"] and pass both
Operations | SparseTensor class can have methods that work like regular tensors, e.g. y = x["a"] + 1 | Need to implement many new functions, e.g. y = add_sparse(x["a_values"], x["a_indices"], 1)
Despite the benefits, lower-level stacks often ignore these abstractions and choose to use a "flat" interface: their inputs & outputs are a flat list of values / Tensors. This is because: (i) the abstraction may no longer be useful at the lower level; (ii) a simple structure simplifies their implementation; (iii) a flat list is a data structure available even in lower-level languages & systems.
Therefore, conversion from a nested structure to a plain list of values is important. This is often referred to as "flatten". It is pretty straightforward to flatten a container recursively -- like the following flatten function:
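The original snippet isn't preserved; a sketch of such a recursive flatten for common Python containers:

```python
def flatten(container):
    """Recursively flatten dicts/lists/tuples into a flat list of leaf values."""
    if isinstance(container, dict):
        return [leaf for v in container.values() for leaf in flatten(v)]
    if isinstance(container, (list, tuple)):
        return [leaf for v in container for leaf in flatten(v)]
    return [container]

obj = {"a": [1, 2], "b": (3,)}
assert flatten(obj) == [1, 2, 3]
```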
The inverse of flatten is also important: given new values [x2, y2, z2], we want the unflatten function below to construct obj2 that has the same structure as obj.
obj2 = unflatten([x2, y2, z2], ???)
unflatten is a very handy utility. For example, to create a clone of obj on a different device, we simply do this:
|
Without unflatten, every such functionality needs to be reimplemented as a recursive function, like PyTorch's pin_memory.
unflatten

How do we implement unflatten? Apparently, we need to give it a representation of structure (noted as a placeholder ??? in the above code). There are two high-level approaches to solve this problem:
Schema-based: when flattening a container, explicitly record its structure/schema to be used for unflatten.Its API may look like this:
|
Examples: Detectron2's flatten_to_tuple, TensorFlow's FetchMapper, JAX's pytree.
Schema-less: use the entire nested container as an implicit representation of structure. Its interface looks like this:
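The original interface snippet isn't preserved; a small usage sketch in the style of tf.nest (a schema-less implementation named just below):

```python
import tensorflow as tf

obj = {"a": [1, 2], "b": 3}
values = tf.nest.flatten(obj)                       # [1, 2, 3]
obj2 = tf.nest.pack_sequence_as(obj, [10, 20, 30])  # obj itself acts as the schema
```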
Examples: TensorFlow's tf.nest, DeepMind's dm-tree.
The two approaches have some pros and cons:
JAX's low-level components accept/return flat tensors, so functions can be transformed and optimized more easily. Since end-users need nested containers, JAX transformations support pytree containers, which by default include flattening & unflattening for common Python containers. It further allows users to register custom classes via register_pytree_node.
Pytree uses a schema-based implementation that we already show-cased above.
When we need to independently process each leaf of the container, JAX provides another handy function, tree_map:
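The original snippet isn't preserved; a small usage sketch:

```python
import jax

obj = {"a": [1.0, 2.0], "b": (3.0,)}
doubled = jax.tree_util.tree_map(lambda x: x * 2, obj)
# -> {"a": [2.0, 4.0], "b": (6.0,)}
```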
PyTorch also adds a similar implementation of pytree at herethat is used in its FX tracing.
TracingAdapter

torch.jit.trace(model, inputs) executes the model with given inputs, and returns a graph representation of the model's execution. This is one of the most common methods (and the best IMO) by which PyTorch models are exported today. However, it limits the model's input & output format.
In order to trace models with more complicated inputs & outputs, I created the TracingAdapter tool in detectron2, which flattens/unflattens a model's inputs and outputs into a simple Tuple[Tensor] to make it traceable. A minimal implementation of it may look like this:
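Since the original snippet isn't preserved, here is a rough sketch of the idea (the real detectron2 implementation differs in details; flatten here is the schema-based helper described just below):

```python
import torch

class TracingAdapter(torch.nn.Module):
    def __init__(self, model, inputs):
        super().__init__()
        self.model = model
        # Flatten the example inputs once, remembering their structure.
        self.flattened_inputs, self.inputs_schema = flatten(inputs)

    def forward(self, *flat_inputs):
        # Rebuild structured inputs, run the model, and flatten its outputs,
        # so torch.jit.trace only ever sees tuples of tensors.
        structured = self.inputs_schema.unflatten(flat_inputs)
        outputs = self.model(*structured)
        flat_outputs, self.outputs_schema = flatten(outputs)
        return tuple(flat_outputs)
```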
where flatten uses a schema-based implementation that can be found in this file. Coincidentally, its interface looks like JAX's pytree:
|
Perception models in Meta accept a wide range of input/output formats: they may take any number of images plus auxiliary data as inputs, and predict boxes, masks, keypoints, or any other interesting attributes as outputs. But deployment prefers a flat interface for optimizability and interoperability. TracingAdapter's automatic flattening and unflattening mechanism has freed engineers from writing format-conversion glue code when deploying these models.
In addition to deployment, TracingAdapter is also useful in a few other places to smooth the experience of torch.jit.trace:
- TracingAdapter is the easiest way.
- Tensorboard's add_graph method, which visualizes the graph structure in tensorboard, requires flattened inputs, therefore TracingAdapter can be used like this.
- TracingAdapter is useful as well, e.g. here.
tf.nest

tf.nest.flatten and tf.nest.pack_sequence_as implement schema-less flattening and unflattening. The unflatten function requires a container, and it will flatten this container on the fly while simultaneously "packing" flat values into the structure of this container. Here is an official example (note that dict values are ordered by keys):
|
tf.nest.{flatten,pack_sequence_as} are widely used in TensorFlow because many low-level components have a flat interface, especially for interop with C APIs.
|
tf.nest.map_structure has the same functionality as JAX's tree_map.
FetchMapper

TFv1's session.run(fetches) supports fetching nested containers. This is demonstrated in an example from the official documentation:
|
This powerful interface exists in TF's Python client only. The client interacts with the C API's TF_SessionRun, which only accepts a plain array of inputs/outputs. Therefore, the client needs to:
The flatten/unflatten logic uses a schema-based implementation in the client's FetchMapper. This implementation is a bit more complicated due to an extra guarantee that the flattened tensors are unique. (This is to ensure the client won't fetch the same tensor twice in one call; this cannot be done by using tf.nest.)
In addition to builtin Python containers, FetchMapper supports a few other TF containers (such as SparseTensor) and can be extended to new containers by registering conversion functions.
The tree library

DeepMind has a tree library as a standalone alternative to tf.nest:
deepmind/tree | tf.nest | jax.tree_util |
---|---|---|
tree.flatten | tf.nest.flatten | jax.tree_util.tree_flatten |
tree.unflatten_as | tf.nest.pack_sequence_as | jax.tree_util.tree_unflatten |
tree.map_structure | tf.nest.map_structure | jax.tree_util.tree_map |
An nn.Module can be converted into a graph represented in TorchScript format in two ways: tracing and scripting. This article will compare them and argue that torch.jit.trace should be preferred over torch.jit.script for deployment of non-trivial models.

The second point might be an uncommon opinion: if I Google "tracing vs scripting", the first article recommends scripting as default. But tracing has many advantages. In fact, by the time I left, "tracing as default, scripting only when necessary" was the strategy by which all detection & segmentation models in Facebook/Meta products were deployed.
Why is tracing better? TL;DR: (i) it will not damage code quality; (ii) its main limitations can be addressed by mixing in scripting.
We start by disambiguating some common terminology:
Export: refers to the process that turns a model written in eager-mode Pythoncode into a graph that describes the computation.
Tracing: An export method. It runs a model with certain inputs, and "traces / records" all the operationsthat are executed into a graph.
torch.jit.trace is an export API that uses tracing, used like torch.jit.trace(model, input). See its tutorial and API.
Scripting: Another export method. It parses the Python source code of the model, and compiles the code into agraph.
torch.jit.script is an export API that uses scripting, used like torch.jit.script(model). See its tutorial and API.
TorchScript: This is an overloaded term
To avoid confusion, I'll never use "TorchScript" alone in this article.I'll use "TS-format" to refer to the format, and "scripting" to refer to the export method.
Because this term is used with ambiguity, it may have caused the impression that "scripting" is the"official / preferred" way to create a TS-format model. But that's not necessarily true.
(Torch)Scriptable: a model is "scriptable" if torch.jit.script(model) succeeds, i.e. it can be exported by scripting.
Traceable: a model is "traceable" if torch.jit.trace(model, input) succeeds for a typical input.
Generalize: a traced model (the object returned by trace()) "generalizes" to other inputs (different from the inputs given during tracing) if it can run inference correctly when given other inputs. Scripted models always generalize.
Dynamic control flow or data-dependent control flow: control flow where the operators to be executed depend on the input data, e.g. for a Tensor x, if x[0] == 4: x += 1 is a dynamic control flow.
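The original examples aren't preserved; a sketch of the distinction:

```python
import torch

def static_flow(x: torch.Tensor):
    # Not data-dependent: the same ops run regardless of x's values.
    return x * 2 + 1

def dynamic_flow(x: torch.Tensor):
    # Data-dependent: which branch executes depends on the tensor's content,
    # so a trace only captures the branch taken for the example input.
    if x.sum() > 0:
        return x * 2
    return x - 1
```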
If anyone says "we'll make Python better by writing a compiler for it", you should immediatelybe alarmed and know that this is extremely difficult.Python is too big and too dynamic. A compiler can only support a subset of its syntax features and builtins, at best --the scripting compiler in PyTorch is no exception.
What subset of Python does this compiler support? A rough answer is: the compiler has good support for the most basic syntax, but medium to no support for anything more complicated (classes, builtins like range and zip, dynamic types, etc.). But there is no clear answer: even the developers of the compiler usually need to run the code to see if it can be compiled or not.
The incomplete Python compiler limits how users can write code. Though there isn't a clear list of constraints, I can tell from my experience what impact they have had on large projects: code quality is the cost of scriptability.
To make their code scriptable / compilable by the scripting compiler, most projects choose to stay on the "safe side" and use only basic Python syntax: no/few custom structures, no builtins, no inheritance, no Union, no **kwargs, no lambda, no dynamic types, etc.
This is because these "advanced" compiler features are either not supported at all, or supported only "partially", which is not robust enough: they may work in some cases but fail in others. And because there is no clear spec of what is supported, users are unable to reason about or work around the failures. Therefore, eventually users move to and stay on the safe side.
The terrible consequence is that developers stop making abstractions / exploring useful language features due to concerns about scriptability.
A related hack that many projects do is to rewrite part of the code for scripting: create a separate, inference-only forward codepath that makes the compiler happy. This also makes the project harder to maintain.
Detectron2 supports scripting, but the story was a bit different: its code quality, which we value a lot in research, did not go downhill. Instead, with some creativity and direct support from the PyTorch team (and some volunteered help from Alibaba engineers), we managed to make most models scriptable without removing any abstractions.
However, it is not an easy task: we had to add dozens of syntax fixes to the compiler, find creative workarounds, and develop some hacky patches in detectron2 that live in this file (which honestly could affect maintainability in the long term). I would not recommend that other large projects aim for "scriptability without losing abstractions" unless they are also closely supported by the PyTorch team.
If you think "scripting seems to work for my project"so let's embrace it, I might advise against it for the following reasons,based on my past experiences with a few projects that support scripting:
What "works" might be more brittle than you think (unless you limit yourself to the basic syntax):Your code might happen to compile now, but one day you'll add a few innocent changes to your modeland find that the compiler refuses it.
Basic syntax is not enough: even if more complex abstractions don't appear necessary to your project at the moment, if the project is expected to grow, it will require more language features in the future.
Take a multi-task detector for example: as tasks are added, its interfaces may need Union or more dynamic types. Large, growing projects definitely need evolving abstractions to stay healthy.
Code quality could severely deteriorate: ugly code starts to accumulate, because clean code sometimes just doesn't compile. Also, due to syntax limitations of the compiler, abstractions cannot easily be made to clean up the ugliness. The health of the project gradually goes downhill.
Below is a complaint in PyTorch issues. The issue itself is just one small papercut of scripting, but similar complaints were heard many times. The status quo is: scripting forces you to write ugly code, so only use it when necessary.
What it takes to make a model traceable is very clear, and has a much smaller impact on code health.
First, neither scripting nor tracing works if the model is not even a proper single-device, connected graph representable in TS-format. For example, if the model has DataParallel submodules, or if the model converts tensors to numpy arrays and calls OpenCV functions, etc., you'll have to refactor it.
Apart from this obvious constraint, there are only two extra requirements for traceability.
Input/output format: the model's inputs/outputs have to be Union[Tensor, Tuple[Tensor], Dict[str, Tensor]] or their nested combinations. Note that all values in a dict have to be of the same type.
Similar constraints exist for scripting as well. However, in tracing the constraint does not apply to submodules: submodules can use any input/output format -- dicts of Any, classes, kwargs, anything that Python supports. Only the top-level model is required to use the constrained format.
This makes the constraint very easy to satisfy. If the model uses richer formats, just create a simple wrapper around it that converts to/from Tuple[Tensor]. Detectron2 even automates this for all its models with a universal wrapper like this:
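The original post links to detectron2's actual wrapper; the snippet below is only a rough illustration of the idea (the rich input/output format here is made up):

```python
import torch

class TupleWrapper(torch.nn.Module):
    """Adapt a model with rich I/O to the Tuple[Tensor] format required at the top level."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, image: torch.Tensor, boxes: torch.Tensor):
        # convert flat tensors into the rich format the wrapped model expects
        outputs = self.model({"image": image, "proposals": boxes})
        # convert the rich outputs back into a flat tuple of tensors
        return outputs["scores"], outputs["boxes"]

# traced = torch.jit.trace(TupleWrapper(model), (image, boxes))
```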
Symbolic shapes: expressions like tensor.size(0), tensor.size()[1], tensor.shape[2] are integers in eager mode, but Tensors in tracing mode. This difference is necessary so that during tracing, shape computation can be captured as symbolic operations in the graph. An example is given in the next section about generalization.
Due to the different return types, a model may be untraceable if parts of it assume shapes are integers. This can usually be fixed quite easily by handling both types in the code. A helpful function is torch.jit.is_tracing, which checks whether the code is executed in tracing mode.
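A minimal sketch of such a guard (the function below is made up for illustration):

```python
import torch

def flatten(x: torch.Tensor) -> torch.Tensor:
    n = x.size(0)                             # int in eager mode, Tensor while tracing
    if not torch.jit.is_tracing():
        assert n > 0, f"empty batch ({n})"    # int-only logic, skipped during tracing
    return x.reshape(n, -1)                   # works with both int and symbolic shapes
```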
That's all it takes for traceability -- most importantly, any Python syntax is allowed in the model implementation, because tracing does not care about syntax at all.
Just being "traceable" is not sufficient.The biggest problem with tracing, is that it may not generalize to other inputs.This problem happens in the following cases:
Dynamic control flow:
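The original post shows a concrete snippet here; a minimal sketch of the same failure mode:

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:          # dynamic control flow: the branch depends on the data
        return x
    return -x

m = torch.jit.trace(f, torch.ones(3))   # only the "positive" branch gets recorded
print(m(-torch.ones(3)))                # wrong: returns the input unchanged instead of -x
```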
In this example, due to dynamic control flow, the trace only keeps one branch of the condition, and will not generalize to certain (negative) inputs.
Capture variables as constants:
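Again a minimal sketch of this failure (made up for illustration):

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    return x.reshape(len(x), -1)        # len() returns a plain Python int

m = torch.jit.trace(f, torch.ones(4, 6))
print(m(torch.ones(8, 6)).shape)        # the 4 was baked in as a constant: wrong shape (4, 12)
```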
Intermediate computation results of a non-Tensor type (in this case, an int) may be captured as constants, using the value observed during tracing. This causes the trace to not generalize. In addition to len(), this issue can also appear in .item(), which converts tensors to int/float.
Capture device:
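A sketch of how a device gets captured (made up for illustration):

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    mask = torch.ones(1, device=x.device)   # the device becomes a constant in the trace
    return x * mask

m = torch.jit.trace(f, torch.zeros(4))      # traced on CPU
print(m.code)                               # the traced code hard-codes device="cpu"
```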
Similarly, operators that accept a device argument will remember the device used during tracing (this can be seen in m.code). So the trace may not generalize to inputs on a different device. Such generalization is almost never needed, because deployment usually has a fixed target device.
The above problems are annoying and often silent (warnings, but no errors), but they can be successfully addressed by good practices and tools:
Pay attention to TracerWarning: in the first two examples above, torch.jit.trace actually emits warnings. The first example prints:
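The original post quotes the warning here. The exact wording depends on the PyTorch version, but it looks roughly like this:

```
TracerWarning: Converting a tensor to a Python boolean might cause the trace to be
incorrect. We can't record the data flow of Python values, so this value will be
treated as a constant in the future. This means that the trace might not generalize
to other inputs!
```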
Paying attention to these warnings (or even better, catching them) will expose most generalization problems of tracing.
Note that the "capture device" case does not print warnings because tracing was not designed to support such generalization at all.
Unittests for parity: unittests should be run after export and before deployment, to verify that the exported model produces the same outputs as the original eager-mode model, i.e.
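a check along these lines (a minimal sketch; model, input1 and input2 are placeholders, and the tolerances may need tuning):

```python
import torch

traced = torch.jit.trace(model, (input1,))
assert torch.allclose(model(input1), traced(input1))   # parity on the tracing input
assert torch.allclose(model(input2), traced(input2))   # parity on a different input
```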
If generalization across shapes is needed (it is not always needed), input2 should have different shapes from input1.
Detectron2 has many generalization tests, e.g. this and this. Once a gap is found, inspecting the code of the exported TS-format model can uncover the place where it fails to generalize.
Avoid unnecessary "special case" conditions:Avoid conditions like
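A representative sketch of such a condition (self.layers and the returned shape are placeholders):

```python
if x.numel() == 0:
    # special-case branch for empty inputs
    return torch.zeros((0, self.out_channels), device=x.device)
return self.layers(x)
```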
Instead, improve self.layers or its underlying kernel so it supports empty inputs. This results in cleaner code and also improves tracing. This is why I was involved in many PyTorch issues that improve support for empty inputs, such as #12013, #36530, #56998. Most PyTorch operations work perfectly with empty inputs, so such branching is hardly needed.
Use symbolic shapes: as mentioned earlier, tensor.size() returns Tensors during tracing, so that shape computations are captured in the graph. Users should avoid accidentally turning tensor shapes into constants:
Use tensor.size(0) instead of len(tensor), because the latter is an int. For custom classes, implement a .size method or use .__len__() instead of len(), e.g. like here.
Avoid int() or torch.as_tensor when manipulating shapes, because they will capture constants. This helper function is useful to convert sizes into a tensor in a way that works in both tracing and eager mode.
Mix tracing and scripting: they can be mixed together, so you can use scripting on the small portion of code where tracing does not work correctly. This can fix almost all problems of tracing. More on this below.
Tracing and scripting both have their own problems, and the best solution is usually to mix them together. This gives us the best of both worlds.
To minimize the negative impact on code quality, we should use tracing for the majority of the logic, and use scripting only when necessary.
Use @script_if_tracing: inside torch.jit.trace, the @script_if_tracing decorator can compile functions by scripting. Typically, this only requires a small refactor of the forward logic to separate the parts that need to be compiled (the parts with control flow):
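A minimal sketch of this pattern (the module and function below are made up):

```python
import torch
import torch.nn.functional as F

@torch.jit.script_if_tracing
def pad_to_multiple(x: torch.Tensor, k: int) -> torch.Tensor:
    # shape-dependent control flow: compiled by scripting so the trace generalizes
    if x.shape[0] % k != 0:
        pad = k - x.shape[0] % k
        x = F.pad(x, [0, 0, 0, pad])
    return x

class Head(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.relu()                      # plain tensor ops: handled fine by tracing
        return pad_to_multiple(x, 32)     # the tricky part is scripted during tracing
```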
By scripting only the parts that need it, the code quality damage is strictly smaller than making the entire model scriptable, and it does not affect the module's forward interface at all.
The function decorated by @script_if_tracing has to be a pure function that does not contain modules. Therefore, sometimes a bit more refactoring is needed:
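The original post shows the code before and after such a refactoring side by side. A sketch of what it might look like (everything below is made up for illustration):

```python
import torch

# Before: the data-dependent loop lives inside forward(), next to module calls,
# so it cannot be wrapped in @script_if_tracing as-is.
class HeadBefore(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        while bool(x.abs().max() > 1.0):   # data-dependent loop
            x = x * 0.5
        return x

# After: the loop is extracted into a pure function (tensors in, tensors out),
# which can be compiled by scripting while everything else is traced.
@torch.jit.script_if_tracing
def rescale(x: torch.Tensor) -> torch.Tensor:
    while bool(x.abs().max() > 1.0):
        x = x * 0.5
    return x

class HeadAfter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return rescale(self.proj(x))
```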
In fact, for most vision models, dynamic control flow is needed only in a few submodules where it's easy to make it scriptable. To show how rarely it is needed: the entire detectron2 has only two functions decorated with @script_if_tracing due to control flow, paste_masks and heatmaps_to_keypoints, both for post-processing only. A few other functions are also decorated to generalize across devices (a very rare requirement).
Use scripted / traced submodules:
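The original post shows a snippet here; a minimal sketch of the idea (the names are illustrative):

```python
model.submodule = torch.jit.script(model.submodule)   # script the problematic submodule first
traced = torch.jit.trace(model, (inputs,))            # then trace the whole model
```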
In this example, suppose submodule cannot be traced correctly; we can script it before tracing. However, I do not recommend this. If possible, I suggest using @script_if_tracing inside submodule.forward instead, so that scripting is limited to the internals of the submodule, without affecting the module's interface.
And similarly,
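a sketch of the reverse direction (again with illustrative names):

```python
model.submodule = torch.jit.trace(model.submodule, (submodule_inputs,))  # trace the submodule
scripted = torch.jit.script(model)                                       # then script the parent
```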
this uses a traced submodule during scripting. It looks nice, but is not so useful in practice: it affects the interface of submodule, requiring it to only accept/return Tuple[Tensor] -- a big constraint that might hurt code quality even more than scripting.
A rare scenario where "tracing a submodule" is useful is the following:
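a sketch consistent with the description below (submodule1 / submodule2 are placeholders):

```python
import torch

class A(torch.nn.Module):
    def __init__(self, submodule1: torch.nn.Module, submodule2: torch.nn.Module):
        super().__init__()
        self.submodule1 = submodule1
        self.submodule2 = submodule2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if bool(x.sum() > 0):        # dynamic control flow between two complex submodules
            return self.submodule1(x)
        return self.submodule2(x)

# submodule{1,2} are traced individually, then the parent is scripted:
# a = torch.jit.script(A(traced_submodule1, traced_submodule2))
```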
@script_if_tracing cannot compile such control flow because it only supports pure functions. If submodule{1,2} are complex and cannot be scripted, using traced submodules in a scripted parent A is the best option.
Merge multiple traces: scripted models support two more features that traced models don't: (i) a traced module only supports forward(), but a scripted module can have multiple methods; (ii) a scripted module can change its behavior based on mutable attributes. Actually, both features are doing the same thing: they allow an exported model to be used in different ways, i.e. to execute different sequences of operators as requested by the caller.
Below is an example scenario where such features are useful: if Detector is scripted, the caller can mutate its do_keypoint attribute to control its behavior, or call the predict_keypoint method directly if needed.
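The original post defines a real Detector here. A minimal stand-in with the same do_keypoint / predict_keypoint interface (the actual computation is a placeholder):

```python
import torch

class Detector(torch.nn.Module):
    do_keypoint: bool

    def __init__(self):
        super().__init__()
        self.do_keypoint = True

    @torch.jit.export                        # callable directly on the scripted module
    def predict_keypoint(self, feats: torch.Tensor) -> torch.Tensor:
        return feats * 2                     # placeholder for the real keypoint head

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = img.relu()                   # placeholder for the real detector
        if self.do_keypoint:                 # behavior controlled by a mutable attribute
            feats = self.predict_keypoint(feats)
        return feats

m = torch.jit.script(Detector())
m.do_keypoint = False                        # the caller changes the behavior after export
```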
This requirement is not seen very often. But if needed, how do we achieve it with tracing? I have a solution that's not very clean:
Tracing can only capture one sequence of operators, so the natural way is to trace the model twice:
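e.g. something like this (a sketch, reusing the Detector stand-in above):

```python
import torch

img = torch.randn(8)
det = Detector()

det.do_keypoint = True
trace_kpt = torch.jit.trace(det, (img,))     # trace once with keypoints on
det.do_keypoint = False
trace_nokpt = torch.jit.trace(det, (img,))   # trace again with keypoints off
```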
We can then alias their weights (to avoid duplicating the storage), and merge the two traces into one module to script:
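a sketch of the merged module (not the author's exact code):

```python
import torch

class MergedDetector(torch.nn.Module):
    def __init__(self, with_kpt: torch.nn.Module, without_kpt: torch.nn.Module):
        super().__init__()
        self.with_kpt = with_kpt
        self.without_kpt = without_kpt

    def forward(self, img: torch.Tensor, do_keypoint: bool) -> torch.Tensor:
        if do_keypoint:              # plain control flow is fine: this module is scripted
            return self.with_kpt(img)
        return self.without_kpt(img)

merged = torch.jit.script(MergedDetector(trace_kpt, trace_nokpt))
```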
If a model is both traceable and scriptable, tracing always generates the same or a simpler graph (and is therefore likely faster).
Why? Because scripting tries to faithfully represent your Python code, even the parts of it that are unnecessary. For example, it is not always smart enough to realize that some loops or data structures in the Python code are actually static and can be removed:
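a toy sketch of such a static loop:

```python
import torch

class Net(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scales = [1.0, 2.0, 3.0]     # a static Python list
        for s in scales:             # tracing unrolls this loop; scripting may keep the
            x = x * s                # list and the loop in the graph
        return x
```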
This example is very simple, so it actually has workarounds for scripting (use a tuple instead of a list), or the loop might get optimized away in a later optimization pass. But the point is: the graph compiler is not always smart enough. For complicated models, scripting might generate a graph with unnecessary complexity that's hard to optimize.
Tracing has clear limitations: I spent most of this article talking about the limitations of tracing and how to fix them. I actually think this is the advantage of tracing: it has clear limitations (and solutions), so you can reason about whether it works.
On the contrary, scripting is more like a black box: no one knows whether it works before trying. I didn't mention a single trick about how to fix scripting: there are many of them, but it's not worth your time to probe and fix a black box.
Tracing has a small blast radius: both tracing and scripting affect how code can be written, but tracing has a much smaller blast radius and causes much less damage: it only constrains the input/output format of the top-level model and how shapes are handled, as discussed above. Scripting, on the other hand, has an impact on the whole model: every submodule and helper it calls must stay within the compiler's supported subset of Python. Having a large blast radius is why scripting can do great harm to code quality.
Control flow vs. other Python syntax: PyTorch is loved by its users because they can "just write Python", and most importantly write Python control flow. But other Python syntax is important as well. If being able to write Python control flow (scripting) means losing other great syntax, I'd rather give up the ability to write Python control flow.
In fact, if PyTorch were less obsessed with Python control flow, and offered me symbolic control flow such as a torch.cond like this (similar to the API of tf.cond):
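(the original post sketches the wished-for API here; such an operator did not exist in PyTorch at the time of writing, so the snippet below is purely hypothetical:)

```python
def f(x: torch.Tensor) -> torch.Tensor:
    # hypothetical symbolic conditional, analogous to tf.cond(pred, true_fn, false_fn)
    return torch.cond(x.sum() > 0,
                      lambda: x + 1,
                      lambda: x - 1)
```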
Then f could be traced correctly, and I would be happy to use this, no longer having to worry about scripting. TensorFlow AutoGraph is a great example that automates this idea.
这篇文章说说用户怎么提出好的 feature request / pull request, 以及维护者如何对待它们.
这里, 我们忽略那种特别简单的 (例如 10 行代码以内可以实现的) request, 只考虑 non-trivial 的 feature request 和 pull request.
首先, 一个残忍的事实是, 开源项目中大多数的 feature requests 不会得到 maintainer 的回应. 理由也很简单: 项目的资源是有限的, 而修 bug, 维护现有 feature 的优先级自然会更高. 当项目有额外的开发资源时, 一般也会优先推进团队自己原有的开发计划 / roadmap, 或优先为项目的赞助方 (如背后的公司) 实现 feature. 路人的 feature request 优先级可以说是最低的, 排在所有这些之后.
下图是 vscode 社区处理 feature request 的流程: (来源)
Vscode 是一个非常注重社区的项目, 因为编辑器必须要有好的生态才能成功. 因此我们才能看到 vscode 把用户的 "upvote" 也考虑在内. 绝大多数项目并没有这最后一步: 和项目 roadmap 不 align 的 feature request, 一般就直接进入 backlog 了.
在这种情况下, 要想提出一个 "好的 feature request", 并得到 maintainer 的重视, 当然不是那么容易. 一个好的 feature request 一般至少要在以下某一点中比较突出:
要做到这些, 有时候确实需要用户对项目有一定的深入了解, 能够把握住项目的 direction. 毕竟想要项目的 developer 改变原定的计划, 自己没两把刷子是不行的.
反过来, 一个 "平凡的 / 不好的 feature request" 可能会有如下特征:
当然, 一个平凡的 feature request 照样值得提出, 虽然它可能会进入 backlog 暂时无人问津, 但是也许在沉寂一段时间之后会引发更有价值的讨论和实现.
Pull request 是社区向项目贡献代码, 因此一般更受 maintainer 欢迎, 但也不全是. 围绕 pull request 的主要矛盾是 可维护性 : 当 maintainer 同意接受一个 PR 时, 就意味着 maintainer 同意负责维护这段别人写的代码, 这对代码的可维护性是有要求的.
因此, 用户应该认识到, maintainer 关注的绝不仅仅是一个 PR 是否 "work", 而是会考虑更多的因素:
Jeff Geerling 的 Why I Close PRs 和 The Burden of an Open Source Maintainer 也介绍了什么样的 PR 是 maintainer 更乐于见到的. 文章写的很好, 且另外提到了一条重要的沟通原则:
Maintainer 应在CONTRIBUTING.md
或 .github/pull_request_template.md
里为 contributor 提供引导, 包括介绍提交 PR 的注意事项, PR 被接收的原则, 项目的 coding style, 如何使用 linter, 如何测试, 如何更新 documentation, 等等. 例如 detectron2 的 contributing.md 和 pull_request_template.md.
开源社区中, 用户会有无数不同的需求. 即使 maintainer 有时间 (大部分 maintainer 没有) 去处理 feature request / pull request, 也会有很多人的需求无法满足.
在这种现实下, 面对没有精力实现 / 维护的 {feature, pull} request, maintainer / contributor 可以采取的一个好的策略是: 通过一些改动让项目变得更 extensible, 使得 feature 可以被用户以扩展 / extensions 的方式独立实现, 而不是在项目中实现.
具体要怎么做到这一点, 是一个系统设计问题, 这篇文章就不跑题多说了. 采用这种方式的好处是:
很多成功的开源项目都是靠着可扩展性创建了优秀的生态.
Tensor
subclass, 自己的 device 等非常夸张的扩展. 最近的torch.fx
也是在给用户实现 graph transformation 扩展的机会.PyTorch 团队会使用 "extension points" 这个词, 指系统中可以由用户实现扩展的部位.Detectron2 也从最初就尽量走这条路, 把 "尽量让所有模块都可扩展 / 可替换" 作为一个设计目标.Facebook 与之相关的 research project 就都以 detectron2 扩展的形式开源. 除此之外也有不少来自社区的优秀扩展, 例如 AdelaiDet, YOLOv7 等.
如果 pull request 并不容易被接受, 那么开发者是不是应该干脆自己 fork 项目, 来实现自己想要的改动呢? 要回答这个问题, 要先想清楚将这些改动开源的目的是什么:
如果只是一个 proof-of-concept, 为了公开的展示这个改动的内容, 那么 fork 是没问题甚至更合适的:
开发者也要意识到, 如果认为自己的工作不只是一个 proof-of-concept/toy, 想要让自己的 fork 真的被人严肃的使用的话, 就不得不自己承担维护的责任. 而维护的负担是很重的, 挑几个点来说:
Do not remove a fence until you know why it was put up in the first place
因此, 虽然一个成功的 pull request 要付出额外的交流, 但它换来的是项目维护者的维护工作. 如果开发者想加入新 feature, 又没有自信能胜任整个项目的维护, 与其另起炉灶, 不如多参与交流, 与维护者讨论一个更可维护的方案 (pull request 或 extension).
]]>我听过不少人凭借爱好开源了自己的项目后, 却对 issue 太乱感到困扰, 甚至想干脆直接禁用 issue. 其实, 任何项目达到一定规模后, 如果不对 issue 进行适当管理, 都会使 issue 信噪比过低, 失去原本的功能.
这篇文章主要从 maintainer 的角度说说, 在具备规模的项目中管理 issue 的一些方法和原则.
任何具备一定规模的项目都应该使用 issue template.Issue template 位于项目的.github/ISSUE_TEMPLATE
目录, 包含两种文件:
每个 template 有一个 markdown 文件, 对应一类 issue. 其中描述需要用户提供的信息.
还可以为这个 issue template 自动配置 issue label. 然而由于 template 是用户选择的, 这种方式得到的 issue label 噪音较大, 可能还需要 maintainer 纠正. (我的策略是仅对 "feature request" 和 "documentation issue" 自动 label)
可选的config.yml
全局配置文件. 有用的配置包括:
blank_issues_enabled
: 是否允许用户不使用 template 自己写 issue.contact_links
: maintainer 用它将用户引导到其他地方 (论坛, discussions 等).Github 近期在测试 issue form, 是 issue template 的升级版, 有了更好看的 UI 和丰富的输入类型. 可惜我一直还没有测试机会.
常见的 issue 有如下两大类:
除了这些之外, 用户常常还想问各种其他问题, 譬如 "怎么用 XXX", "我这样做对不对", "项目里这段 code 是干嘛用的" 等等. 暂且将它们称为 "question". 我认为, 大的 (issue 很多的) 开源项目中 issue 里不应包含这些 "question", issue 应当 不超过上面的两类.
为什么? 当 issue 很多的时候, "question" 与两大类 "issue" 有些本质的不同, 会导致 issue 难以管理:
总而言之, question 大多以用户为中心, 处理它们的沟通成本更高, 而对项目的 contribution 却更低. 混杂在以项目为中心的另两类更重要的 issue 中会分散 maintainer 的精力. 因此很多大的项目都希望将 question 剥离出 issue.
然而, 用户确实有问问题或进行其他交流的需求, 这样的需求可以用 github discussions / 论坛来满足.
Github 近两年推出了 "discussions" 版块.Discussions 在功能 / UI 上与 issues 有所区别, 各方面都更像传统的论坛: 例如没有 open/close/assign 的状态, 可以 "顶帖", 可以 "mark as answer", 等等. 简单来说, github discussions 就是提供一个 简化版的论坛.
在内容上, github 并没有给 discussions 和 issues 定义明确的边界, 这个边界由每个项目自己定义:Maintainer 应通过 issue category 和 issue template 来 声明自己愿意支持解决的 issue 有哪些 (例如 bug report, feature request), 并告知用户 "其他" 讨论 / Question 可以发到 discussions 中. 如果发错了地方, maintainer 可以通过 github 提供的按钮一键在 issue/discussion 之间转换.
我们以 PyTorch 为例. 在 PyTorch 的 issue 列表点击 "new issue" 后, 进入 PyTorch 的 issue 类别 页面.
可以看到:
PyTorch issue 就只包含上文提到的两大类: bug 与 feature (只是细分成了更多类).
实践上把 documentation 细分出一类是很有用的. 因为 documentation 的勘误到底是属于 "bug" 还是 "enhancement" 可能会有歧义.Documentation 被细分后, maintainer 就可以将 "bug" 定义为狭义的代码 bug, 将 "enhancement" 定义为 "feature request", 使得类别的定义更清晰.
所有 "其他讨论" 都通过最后一行的按钮被引导到 PyTorch 的官方 Discourse 论坛上. 曾经, PyTorch 甚至专门有一个 "question" issue template 的内容就是 "不要发 question, 请用论坛". 由于避免了 question, PyTorch issue 始终维持了高质量的技术讨论, 也达到了管理开发任务的 "tracker" 功能.
Github discussions 的定位就是一个项目自带的简易论坛, 毕竟不是所有项目都有资源自己搭建一个论坛.
再以 TensorFlow 做个反面教材: 我由于曾经是深度 TF1 用户, 在早期还是很喜欢看它的 github. 然而 TensorFlow 长期没有对 issue 进行分流. 可以观察到大约在 18 年前后, 估计由于 issue 的噪声太大, 性价比太低, TensorFlow issues 里已经很少再有 core developer 回复, 导致真正有价值的 issue 也更难以得到重视了. 我就多次需要靠手动 at 对应领域我认识的 developer 才能有人回应我报的 bug. 直到 2021 年, TensorFlow 才终于开始在 issue template 里把用户引导至自建 Discourse 论坛.
最后还是要提醒: discussions / 论坛仅适用于规模较大, 问题较多的项目. 对小项目, 额外一个讨论平台引入的 overhead 可能得不偿失.
在第一篇文章中说到,maintainer 自己决定自己有哪些义务, 决定自己的 commitment, 也即自己愿意对用户提供哪些 "support". 很多 maintainer 与用户沟通上的问题, 源于没有划清自己的义务范围. 一旦这条线划清了, maintainer 就无需为乱七八糟的 issue 头疼: 项目不 support 的问题不必操心, 关闭或者移至 discussions 都可以.
Maintainer 应该通过 issue template 的选项表明哪些类 issue 是允许的. 可以通过blank_issues_enabled: false
来禁用 "无 template" 的 issue. 可以通过contact_links
引导 "其他问题" 到别的地方. 如果用户依然发了不支持的 issue, 可以以 "不支持" 为由关闭 / 移至 discussions.
Issue template 的内容里可以更清楚的声明哪些常见情形是不支持的, 例如:
用户应该认识到 "支持 / support" 到底是什么意思:
对于 maintainer 职责之外的 issue, 即使 maintainer 个人愿意帮助, 也可以立刻关闭 / 移至 discussion, 再进行评论. 这样的情况下, 我一般会关闭 issue 并说:
Because of ABC, this issue is unsupported/unrelated, therefore closing the issue.
I think doing XYZ might solve/help the issue.
在这里, "close issue" 表明了 issue 不被支持, 这样提前避免用户由于 "得到了评论" 而对于 support 有不切实际的预期. 也避免了 (其他) maintainer 在下次处理 issue 列表时再看一次.
同时, 也在不需要花自己太多时间的前提下给了简单的建议, 但至于是否能解决问题我就不再管了.
这一节说说对于 bugs/unexpected issues 的常见处理流程和注意事项.
使用 Issue Template: 上篇文章中说了用户报告 unexpected issues 时需要提供的几类信息: expectation, unexpected observation, environment, reproducible example.Maintainer 应该使用 issue template 来告知 / 引导用户提供这些信息.
Detectron2 的 "unexpected problems" issue template 可以作为参考. Facebook AI Research 的其他一些 project 也参考了这个 template (如 pytorch3d,vissl).
检查必要的信息: 还是有不少用户不尊重 issue template, 不提供需要的信息. 以下几个方案可能有帮助:
If you need help to solve an unexpected issue you observed, please include details following theXXX issue template (link).
分析, 解决 issue: 任何一个有足够信息的 unexpected issue, 应该 有且仅有 如下几种结果:
可以看到, 以上几种结果基本都是对项目有 contribution 的. 甚至即使 issue 最终不存在, maintainer 也可能从 unexpected issues 中看到提升用户体验的机会. 因此 unexpected issues / bugs 对项目有很大价值.
介绍一些管理 issue 的 bot:
上面提到过的检查 issue 是否包含必要信息的 bot. 然而为了用户体验, 这个 bot 是 precision-driven 的, 只检测最明显的情况, recall 并不高.
自动关闭 "needs-more-info" 的 issue: 如果 issue 有了 "needs-more-info" 的标签, 等待用户提供必要的信息, 却长时间没有 update, 就会被 bot 自动关闭. 当有了 update 时, 标签会被这个 workflow 自动移除.
自动锁定古老 issue: 如果项目一直在活跃开发, 那么一个古老的, 已解决的 bug 很可能没有任何值得 follow up 的信息: 即使类似的 bug 又出现了, 大概率也和旧的 bug 没什么关系. 那么可以对此类 issue 设定为静默一年后自动锁定 (禁止评论).
自动 label:Github 支持按照 issue template 来自动 label, 但是那样的粒度太粗. 如果对于特定类的 issue 能够根据内容来精准匹配的话, 也可以用这个 bot 添加 label. 但是需要注意自然语言处理是很困难的, 给这个 bot 写规则并不容易.
自动订阅 label: 巨型项目中, 开发者想要自动 subscribe 特定模块相关的 issue. 这个 bot 按照 issue 的 label 自动添加 "@username" 来 subscribe 感兴趣的开发者.
Stale bot: 自动关闭一段时间没有 activity 的 issue. 这个 bot 很常见, 但 不应该被使用, 因为没有 activity 不代表 issue 解决了. 参考:
注意这里假设了 issue 和 question 是被区分开的. 如果 question 也被包括在 issue 里, 自动关闭 question 是可以接受的.
报告错误 / 报 bug 是用户与开发者间最常见的一类交流, 也是常见的 github issue. 但是很多用户并不会科学的报 bug, maintainer 对此也缺乏引导. 因此这篇文章讨论如何科学的报 bug.
如何报 bug, 不仅适用于开源社区, 也适用于任何软件开发. 上一篇提到, 开源社区的交流难度比一般的团队合作更大. 如果掌握了在开源社区中报 bug / 修 bug 的交流方式, 在公司里处理类似的事情也会更轻松.
首先, "报 bug" 是一个较为狭义的说法.
在有的项目里, 用户容易确定一个问题是不是 "bug". 但在有些项目里, 用户未必有能力判断问题到底是不是由于项目的 bug 产生的. 程序的错误可能来自于用户自己, 用户的环境, 或其他依赖.
这时候, 报告 "unexpected issues" 是个更合适的说法: 用户报告的是未预期的行为 (unexpected observations/behaviors, 不一定是 error), 然后由更了解情况的人判断它们是不是 bug.
要报告 unexpected issue, 用户应首先一定 确保对方明白自己的 expectation.
Expectation 有时候是很显然的, 比如 expect 程序正常运行但是它崩溃了. 然而, 很多时候, expectation 也许对问题的报告者显然, 对别人却未必.
例如: 一个常见情况是用户写了一大段文字描述自己做了什么, 程序做了什么输出了什么, 看完根本不明白到底哪里是 unexpected. 通过反复询问才了解到, 用户的 expectation 是 "程序不输出 XXX". 这样的 expectation, 未必那么显然.
人类语言往往是模糊的. 要确保对方明白你的 expectation, 以 "我 expect ..." 为开头造句最清楚. 上面的例子里, 如果用户能在流水帐的信息之外, 清楚的说出 "我 expect ...", 则避免了低效的交流.
因为用户的误解, expectation 本身可能是 错误的, 没有根据的, 或不被支持的. 例如:
由误解产生的 expectation 可能就更不显然了. 只有清楚的说出来才能尽早澄清这类误解.
要说清楚 expectation, 一般要包含两个部分:
用户应描述自己看到了什么 现象 (observations) , 而不 (仅) 是自己以为程序做了什么 (presumed behaviors). 因为用户未必理解程序到底做了什么, 也未必有能力描述好程序的行为.
作为一个用户, 你 expect 程序做 X, 但是程序好像没做 X / 做了 Y, 因此你想报告 unexpected issue. 这时候, 不要下结论说程序做了 / 没做什么, 因为:
如果你觉得程序做了错误的事情, 当然可以提供自己的判断和分析, 但最需要提供的是能够支持你的判断的 observations, 例如原始的 logs (如果 observation 与图片有关, 截图).
相比描述 "behavior" 来说, 提供 observation 有这些好处:
更简单: 你只要复制粘贴. 不需要了解这个程序
无歧义: 复制粘贴可以更完整的还原你的 observation, 避免了人类语言的歧义性.
提供 完整的 observations 的话, 其他人就可以跳过用户的判断, 独立判断 到底发生了什么. 这对分析 unexpected issue 是至关重要的. 用户自己的判断可能是错的, 举几个例子:
feature_A=True
之后触发了 failure X, 因此判断feature_A
导致了 X. 但事实可能是, feature_A=False
也会触发 failure X, 只是由于其他原因 X 没有暴露出来.与此相对的, maintainer 不要过度相信用户声称的 behavior. 应该从用户提供的信息中判断用户声称的 unexpected behavior 是否真的发生了.
我一般都会在 issue template 里要求用户提供 完整的 log . 这是性价比最高的信息: 不仅能够用来判断程序的行为, 还能够帮助 debug, 用户也很容易提供. 但还是总有人在报告 error 的时候只给一行 error message, 连 stack trace 都没有, 让人很头疼. 希望未来的 github issue form 能够通过强制必填的表单来更好的教育用户.
重要的事情再说一遍: maintainer 需要 全部的, 完整的 log, 而不仅仅是 error 发生前的 log. 在用户看来没有用的信息对 maintainer 可能是有用的, 不要省略它们.
另外, 既然在报告 unexpected issue, 用户提供的 observation 当然应该清楚的包含 "unexpected" 的部分. 用户需要让 maintainer 能够从 observations 中看到这个 unexpected issue 确实发生了.
Stackoverflow 的 "How to ask a good question" 里有提到 "Minimal Reproducible Example (MRE)" 的概念, 建议阅读.
在开源社区的场景下, 报告一个 unexpected issue 的时候, 用户也应该尽量以代码, 命令, 数据的形式提供 minimal reproducible example. 其意义在于:
反过来:
为了提供一个高质量的 MRE:
用户应提供 maintainer 要求的环境信息 (项目的 version, 依赖的 version, 系统软硬件等等). 它的重要性在于:
Maintainer 最清楚哪些环境信息是需要的, 因此 maintainer 应当以 issue template 等形式告知用户如何提供环境信息. 例如, 在 detectron2 中我提供了一个collect_env.py
脚本, 运行后会输出如下的结果, 比用户自己能想到的信息要详细得多.
(collect_env.py 的输出示例从略)
Maintainer 实现这样的脚本时, 需要注意:
collect_env.py
里使用{conda,pip} list
就是不科学的做法.有时候, 用户仅仅提供自己的环境信息还不足以复现问题, 因为难以确定是环境中的哪个因素导致了 issue. 为了保证 issue 的 reproducibility, 可以考虑使用 docker 或 Colab notebook 提供更完整的环境. 这种情况并不少见: 我在 PyTorch 里有 4 个 bug report 是自带 docker 来 reproduce 的.Maintainer 也应提供官方的 docker/Colab, 方便用户在报 issue 时排除环境问题: 用户可以把自己的 MRE 在官方的环境中测试.
这篇文章更多从用户的角度说了如何报告 unexpected issues. 用户最好应提供:
在 maintainer 给予了足够的引导的情况下, 1-3 的代价都很小, 用户应尽可能提供.4 有时会有一定难度, 文中已介绍.
在第一篇文章中说到, maintainer 自己决定自己的义务 / commitment 有哪些, 那么也就可以要求 unexpected issue 必须包含特定信息, 并决定对于缺少信息的 issue 不予处理. 一个很有趣的极端例子是, you-get
项目直接禁用了 issue 功能, 要求所有的 bug report 必须以 "失败的单元测试" 的 PR 形式报告, 直接满足了以上四点. 对于这种接口简单的工具来说, 不失为一个好办法.
大多数具备规模的项目会通过 issue 类别和 issue template 表明什么样的 issue 是 maintainer 愿意支持的. 为了高效管理, 往往都会对用户提供的信息有硬性要求. 如果项目有 issue template, 而你又没有自信到觉得自己提供的信息比 template 更好, 那么请务必 follow issue template -- 要获得 maintainer 的帮助, 应该首先尊重 maintainer 的要求, 提供必要的信息. 下一篇文章会更详细的说 maintainer 的管理方式.
]]>相比传统的邮件列表 / bugzilla/sourceforge 等开源平台, github 把开源社区交流的成本 / 门槛降的很低, 因此交流的质量也常常随之下降.
我计划写几篇文章, 从 用户 (User) 和 维护者 (Maintainer) 两者的角度写写开源社区中如何使用 issue/PR 进行沟通, 希望能够:
作为主要开发者和维护者, 我曾经管理过 detectron2 和 tensorpack 等项目.2016-2021 年里我一个人处理过这两个项目里约 5000 个 issue/PR,作为用户, 我也参与了 PyTorch / TensorFlow 等不少项目的社区讨论. 在这个过程中, 看到了开源项目中各种不同的沟通, 管理方式. 现在我已经基本离开了这些项目, 于是想把这些经验总结一下.
这篇文章作为第一篇, 只讨论一些基本的原则.
在一个项目中, maintainer 和用户的目的常常并不是完全一致的. 有效交流的基础, 是要理解对方与自己 Priority 上的相同和不同.
大多数开源项目的资源都很有限, 用爱发电. 因此, maintainer 自己决定自己的义务 有哪些.
通常, maintainer 不以满足某个用户为目标, 不当 "客服". 这是因为, 相比于其他可以做的事情而言, 给网上的路人提供个人化的 support 对一个项目能够带来的贡献是非常非常小的. 相反, maintainer 通过做其他事情 (例如修 bug) 让项目发展得更好, 来 间接 的帮助所有用户. 通常来说, maintainer 的 priority 是围绕 项目 为中心, 而不是特定用户.
但是, 用户的诉求很多时候就是要解决自己的问题. 这时候用户一定要认识到: maintainer 对解决你的问题并不一定有兴趣.maintainer 愿意与用户交流, 本质是因为用户的 feedbacks 可能让项目变得更好.
让项目变得更好 是用户与 maintainer 的 common interest, 基于这一点的交流才是最有效的, 二者才能有效合作.
让项目变得更好, 换句话说就是 "make contribution to the project".
"Contribution" 这个词在 github 上主要出现于 "contribution calendar", 这是一个记录用户每天的 "contribution activity" 的日历:
在 contribution calendar 上, 不仅与代码相关的行为 (commits, PR, reviews) 算作 "contributions", 有些奇怪的是, 创建 issue 也算作 "contributions". 这可能正是因为, 在 github 设计者的眼里, issue 理应是为了让项目变得更好, 而不仅是解决自己的需求.
用户如果能够理解这一点, 将自己的个人需求转化为对项目的 contribution, 才能把交流变得更有效. 概括来说的话:
以上三点在实践中意味着什么, 应该怎么做, 会在后面几篇中再说明.
开源社区的交流和公司同事间的开发交流在很多方面是相似的, 开源社区中 Maintainer/User 的关系, 也与公司内部 Code Owner/User 的关系类似. 但是, 开源社区里的沟通难度一般会更大:
因此, 开源社区中的交流方式, 对公司内的交流有参考意义, 但不一定完全适用. 例如下一篇"如何报bug"就更通用.
人类语言是模糊, 容易歧义的, 上面提到的开源社区中交流的障碍, 会把人类语言的歧义放大.
为了能够在消息中传达更多的有用信息, 在交流中要意识到人类语言的局限性. 交流中的每一方如果可以花少量额外时间, 使用代码, 复制粘贴等方式, 将信息尽量组织的更客观, 消除歧义, 就会使交流更有效. 毕竟交流的延迟很大 (至少以小时为单位), 如果更精确的表述能够为双方节省一次 round trip, 就已经赚了.
例如, 在开源社区的交流中:
./main --mode=xx
. 后者更准确类似的例子还有很多. 尽量使用更准确的语言来交流技术问题是个重要的好习惯.
]]>logging
module,with the aim of:Loggers are globally identified by a dot-separated name given to logging.getLogger(name)
, such as library.module.submodule
.The logger named by an empty string is the "root logger".
Libraries must not call logging.basicConfig
or configure the root logger in any way, unless requested by users.
Configuration of the root logger affects all logs, which is beyond the responsibility of anysingle library. Only application developers, i.e. those who create the program that interacts withusers, should determine how to configure the root logger.
Never call functions like logging.{info,error}
from within a library, because they write to theroot logger. I've added this advice into CPython's official documentation.
When a library writes to the root logger, applications that use the library lose control over thelibrary's logging behavior: they cannot turn on/off the logs from the library, apply custom filter/formatter, or redirect thelogs from the library.
Instead, a library should write to a logger with an easily and uniquely identifiable name, using
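the standard pattern (a minimal example):

```python
import logging

logger = logging.getLogger(__name__)    # e.g. "my_lib.submodule"

def do_work():
    logger.info("doing work")           # goes to the library's own logger, not the root logger
```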
This way, the caller of the library will be able to reconfigure the library's logger using the same name.
__name__, i.e. the current module name, is often a good logger name. Occasionally, __name__ is not good enough:
Parts of __name__ may be uninformative and can be removed, e.g. my_lib.submodule._internal._impl. Note that there is a trade-off between name simplicity and the granularity of control.
Modules in a codebase may share a long common prefix in __name__, e.g. company.company3.organization. Removing such a common prefix can simplify the logger names while still keeping them unique.
The "current function/class name" is often a bad logger name.
I wrote a simple script that processes Python source files to automatically replace all logging.xxx by logging.getLogger(__name__).xxx. This script has created PRs in a few projects that misuse the root logger, such as pytorch/72649 and tensorboardX/662.
I hope someone could create a linter that performs this check.
A Handler can be attached to loggers to decide where/how to log a record.
Unless requested by users, a library should not add a handler anywhere (not even to its own logger) if the handler has a visible effect on users. This is because the application developer should make the final call on how each library's logs are processed. Pre-existing handlers may cause issues such as duplicated logs. This suggestion is present in CPython's documentation here.
Examples of invisible handlers that libraries may add to their loggers include logging.NullHandler.
Libraries should try to be good citizens by reducing the amount of duplicate/unwanted/useless logs they print. Some tips include:
Use DEBUG rather than INFO for logs that are only useful for debugging.
Avoid logging in code that may run many times, e.g. in __init__.
Guard noisy logs behind a condition, e.g. if valid(): log(...).
Use helpers such as log_first_n (log only for the first n occurrences), log_every_n_seconds (limit the frequency of certain logs to at most once every n seconds), and log_every_n (log once every n occurrences), instead of calling the logging module directly.
module directly.Logs are not only strings. logging.LogRecord
is a rich structure with useful attributes, and users can even tag logs with custom attributes through the extra=
argument.
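For example (a minimal sketch):

```python
import logging

logger = logging.getLogger(__name__)
# "request_id" and "latency_ms" become attributes on the LogRecord,
# available to any Formatter/Handler downstream.
logger.info("request finished", extra={"request_id": "abc123", "latency_ms": 31})
```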
Large, distributed systems should not rely on printing as the sole method of logging. Whenever logs are printed (to a terminal or to files), they have to be converted to strings. A lot of useful attributes, such as the stack trace and line number, are often lost. The lack of structure also makes it difficult to parse and analyze the logs.
In addition to printing, we can also use an additional Handler to send structured logs to a logging service, such as Google Cloud Logging or Humio. The advantage is that the logs stay structured and searchable, instead of being flattened into strings.
In an MPI-like distributed job (e.g. data-parallel deep learning training with many workers), workers often print almost identical logs. We should avoid printing them all to the terminal.
A good strategy could be: (1) only let the master (rank-0) process print to the terminal; (2) additionally let every process write its full logs to its own file for debugging.
Detectron2's setup_logger implements (1) and (2).
When logs are printed to a terminal, they are more readable if severity is represented by colors rather than strings. I often use a formatter along the lines of the following:
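(not the exact formatter from the post, but a minimal sketch of the idea:)

```python
import logging

_COLORS = {
    logging.WARNING: "\033[33m",     # yellow
    logging.ERROR: "\033[31m",       # red
    logging.CRITICAL: "\033[1;31m",  # bold red
}

class ColorfulFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        msg = super().format(record)
        color = _COLORS.get(record.levelno)
        return f"{color}{msg}\033[0m" if color else msg
```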
Attach this formatter only when the handler writes to a terminal (check sys.stdout.isatty()), and we'll get colored outputs in the terminal.
When the logging module is not enough
Be aware that it's insufficient to rely only on the logging module. A Python program may produce useful logs that bypass the logging module, e.g.:
print statements: they should be avoided, but may still exist.
Output written directly to stdout/stderr by C/C++ extensions or subprocesses, which never goes through Python's logging.
To not miss important logs, a comprehensive logging solution needs to integrate both the structured logs from Python and the less common unstructured logs from the above sources.
]]>延续 上一篇文章, 再说一说怎么科学的在 paper 里做 ablations.
一组理想的 ablation 实验, 应当所有实验尽量使用一份代码实现, 和相同的实验 recipe, 这样才算是真的 ablation. 其中尤其不要忽视实现的重要性, 因为同一个 feature 在不同的实现里可能会有重要的区别. 例如, 一个 TensorFlow 跑的实验和一个 PyTorch 跑的实验就不能放到一组 ablation 里. 我的 Where Are Pixels? -- a Deep Learning Perspective 也说了很多底层实现细节对模型的影响.
反例: EfficientNet 和类似的不少文章设计了新的网络, 却没有跟已有网络结构的 ablations, 只有在不同 recipe 下的 system-level 结果.Revisiting ResNets: Improved Training and Scaling Strategies 一文就说, 其实 ResNet 在加强的 recipe 下仍然很 competitive, 而那些看似很厉害的新模型, 很大程度上受益于它们使用的 recipe. 这篇文章的开头和结尾写的很好, 摘抄一下:
Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies.
...
We hope our work encourages further scrutiny in maintaining consistent methodology for both proposed innovations and baselines alike.
开头吐槽只想 claim 大新闻; 结尾吐槽别人实验做的没有 scrutiny.
反例: FCOS 是一个 system-level 效果很好的 detector, 然而它并没有充分的 ablation 来说明它的效果 为什么好. ATSS 一文就把 FCOS 和 RetinaNet 之间的所有区别进行了 ablation, 发现 FCOS 的性能提升有不少得益于与其中心思想无关的改动.
"控制变量" 并没有看上去的那么美好: 深度学习作为没有太多理论的科学, 不同变量之间常常存在潜在的, 未知的相关性. (其他缺乏理论的学科, 例如医学, 心理学, 社会学也有类似问题). 这种相关性会带来如下一些后果:
相关性让 ablation 的结论更可疑. 虽然一个实验支持了 claim, 但是这个 claim 可能跟实验里被控制的变量相关, 那么也许换一组变量后, claim 就不再成立了. 对于深度学习 paper 的读者, 这也是一个常见的的 concern: ablation 证明了在这个 baseline 下你的方法有用, 可是换个 baseline 呢?
要缓解这个 concern, 应当选择 常用的, 有代表性的 实验 recipe (包括 baseline, hyperparameter, evaluation protocol 等). 一个好的 baseline 并不需要是 SOTA, 但是需要是一个领域内大家公认具有代表性的结果. 如果实验并不是在一个读者熟悉的条件和设定下, 读者更容易怀疑 ablation 的结论是否换个 recipe 仍然通用, 是否有意或无意中选择了对 baseline 不利的设定, 是不是拿着锤子找钉子. 这些都会弱化结论的可信度. 标新立异的选择往往是需要 justify 的, 要说明 "这个锤子为什么适合这种钉子" (见最后一节).
反例: 某 paper 发明了一个新的 layer, 然后: "在一个 (自己设计的) 30 层 ResNet 上做了实验, 实验设定和参数见附录".
反例: 某 paper 发明了新的优化方法, 然后实验是卫星图像分类或医疗图像分类这种小众领域.
相关性使得不同变量带来的效果常常不可叠加: A 变量和 B 变量可能各自能够将结果提高 1%, 但是合在一起也只能提高 1%.
举个直观的例子: 假如某 paper 发明了一个新的 loss function 能够提高结果, 但也许这个 loss function 的主要原理是改变了 gradient magnitude. 这时候, 把旧的 loss function 的系数调一调也能得到一样的效果. 在这里, "loss function" 和它的系数就是两个相关的变量, 他们带来的效果是可以互相替代的. 如果研究者不注意, 写了这样一篇 loss function 的 paper, 被人发现跟调系数没区别, 那 paper 的价值就消失了.
这是另一个我们要使用 常用 recipe 的重要原因: 一个常用的 recipe 往往是已经被 well-tuned, well-studied 的. 这意味着如果能在这个 recipe 上做出 improvement, 这个 improvement 没法通过简单的 tuning 得到. 这也会让结论更强. 即使在 ResNet 早已不是 SOTA 的今天, 我如果要做 CNN 结构相关的实验, 可能仍然会选择从 ResNet 出发.
相关性导致新的方法需要改变 (而不是控制) 变量才有效. 例如, 很多新的模型可能需要找一个新的 learning rate. 这时候, "learning rate" 这个变量就没有控制. 在改变了这个变量的同时还要 convince 读者这个实验是有效的, 是需要做额外的工作的. 下一节会详细解释.
前面说 ablations 要使用常用的 recipe. 但是, recipe 也要与时俱进: 一个曾经不常用的 trick 可能在未来会进化成大家都在用的标准 recipe, 一个新的方法可能需要一个新的 recipe. 如果每篇文章都严格 "控制变量", 只使用旧的 recipe, 领域可能会陷入 local optimum. 那么, recipe 的进化要如何发生呢?
假设 A, B, C, ... 是一些与 ablation 的主要 claim 没有紧密关联的 recipe (例如 hyperparameter / tricks. 为方便理解, 可以把它们当作几个不同的 learning rate), 且 baseline + A 是 baseline 的 "标准" recipe. 当我们在开发一个新的方法 "proposed method" 时, 也许会发现用 B 来做实验比 A 更好 (proposed + B > proposed + A
). 这时, 作者可以展示下面这些实验:
proposed + B > baseline + A
. 这是不足以 claim proposed > baseline
的.
当然, 作者也可以选择将 "B" claim 为 "proposed method" 的一部分 -- 但是这会弱化文章的价值, 因为它让 "proposed method" 更复杂了. 读者也会疑惑: 也许只有 B 就够了, "proposed method" 里剩下的部分也许价值不大.
proposed + B > max(baseline + A, baseline + B)
: 这样来 claim proposed > baseline
, 读者一般是接受的.
在此基础上, 读者会好奇 proposed + A
表现如何. 如果proposed + A < baseline + A
, 则说明 proposed 依赖 B. 如果 B 是某个复杂的 trick 的话, 这种依赖也会降低 proposed 的价值.
要注意到, 以上的结果无法排除下面这种可能性: 存在一个 C, 使得 max(proposed + B, proposed + C) < baseline + C
. C 的存在会使得 baseline 看上去比 proposed 更好. 但是, 由于我们假设 baseline + A 已经是一个常用的标准 recipe 了, 如果存在这样的 C, 那 C 大概率是 nontrivial 的, 不太可能是简单的调参. 这也是为什么要尽量使用标准 recipe.
为了尽量降低 C 存在的可能, 在计算资源允许的情况下应当对 recipe 进行公平的搜索: 如果 proposed 使用的 hyperparameter B 是 grid search 找出来的, 那么也应对 baseline 的 hyperparameter 进行类似的 grid search, 看看是否能找到一个更好的 C.
例子: DETR 的训练代价比主流 detection 模型都大得多 (100-500 epochs), 这点在技术上难以避免. 这个区别导致公平的实验不容易做, 因为主流模型 (Faster R-CNN) 还没有一个常用的, 训练这么长时间的 recipe. 据我了解, 作者们当时尝试了不少方法提高 Faster R-CNN 在这个训练长度下的性能, 尽量让 baseline 更强. 这是很负责任的做法.
前面两节都提到了, 使用一个常用的, 有代表性的, 标准的 baseline 是很重要的. 这样的 baseline 在文中的结果应该至少与别人 paper 相同实验的结果接近. 如果 baseline 比别人差, 说明 baseline 里一定有某些因素与那个常用的 recipe 不同, 因此会弱化结论的可信度.
反例: 某 paper 提出了 ResNet 的小改动, 但是文中的 ResNet baseline 比 pytorch 官方样例显著的差. 在这个 baseline 上有 1% 的提升, 并不意味着在常用的 baseline 上能有提升: 因为如前文所说, 不同的因素常常是不可叠加的. 事实就是, 有许多方法只在弱的 baseline 上有效.
如果 baseline 确实无法 reproduce 怎么办? 这种处境很遗憾. 一个 research topic 如果没有大家公认的 reproducible 的代码和 baseline 设定, 就容易陷入乱象. 例如 A Metric Learning Reality Check,Deep Reinforcement Learning that Matters 都是在吐槽各自领域里的问题. 这正是为什么要做开源高质量 codebase.
前面说到, 实验设定一般使用 "常用, 标准" 的 recipe, 否则有拿着锤子找钉子的嫌疑. 而有的时候, 如果我们恰好要 claim"我的锤子适合特定的钉子", 那么巧妙的改变 recipe 也许会有更好的效果. 下面举几个正 / 反面例子.
正例: ResNet paper 多次 report 了 training error (Fig. 4, 6), 这也许会显得奇怪, 毕竟 training error 不是一个大家常用的 metric. 这是因为文章的大 claim 是关于 residual connection 对训练 / 优化有帮助, 而 deep plain network 难以优化.Training error 才是跟这个 claim 直接相关的 metric, validation error 变好只是 training error 变好的一个副产品.
正 / 反例: Optimizer 的根本目标是降低 training loss, 所以比较不同的 optimizer (例如 SGD/Adam) 的时候不能不看 training loss. 这篇 paper Sec. 5 就吐了这个槽: 有的 optimizer 跑出来的 validation error 更低就声称自己更好, 但是实际上发现它的 training loss 更高.
反例: 我 review 过的某 paper claim 一个方法能够提高模型的 capacity 或表达能力, 但是实验是拿 ResNet 在 Cifar10 上看 validation error. 虽然 validation error 是一个常见的 metric, 但是 ResNet 在 Cifar10 上严重 overfit (training error = 0), validation error 跟模型 capacity 没什么关系.
正例: detection 里有很多可用 metric, 如大小物体的 AP, 不同 IoU 的 AP, 等等. 当有合适的 justification 的时候 (例如模型设计上对大物体更友好), 比较其中某个特定的 metric 能够帮助文章的 claim. PointRend paper 里为了证明 "边界结果更准确" 这个 claim, 设计了一个新 metric: 拿 COCO 训练的模型在 LVIS 的高质量标注下算 AP. 这样得到的结果比使用标准的 metric 更有说服力.
正例: Mask R-CNN 的 Table 2 (d) 使用了一个很少见的 recipe: 基本没人用的 ResNet-Conv5 backbone. 这是为了证明关于 RoIAlign vs. RoIPool 的 claim: RoIPool 的 feature map 不对齐, stride 越大, 影响越大. 通过 Conv5 (stride=32) 上的实验更加强化了这个 claim. 当初之所以在 detectron2 里保留 Conv4 这些性能并不好的模型, 就是因为它们在许多实验中仍然有研究价值.
正例: 我的 Rethinking “Batch” in BatchNorm 实验很多, 里面做了对 BatchNorm 的各种魔改. 这些魔改里, 大多数的目的不是为了 propose 一种新方法, 而是通过改变 BatchNorm 的行为来验证某个 claim. 如何找一个好钉子, 设计一个实验来巧妙的突出 claim, 是一项技术活.
]]>array[H][W]
, where each elementarray[i][j]
is a pixel.How does discretization work? How does a discrete pixel relate to the abstract notion of the underlying continuous image?These basic questions play an important role in computer graphics & computer vision algorithms.
This article discusses these low-level details, and how they affect our CNN models and deep learning libraries. If you ever wonder which resize function to use or whether you should add/subtract 0.5 or 1 to some pixel coordinates, you may find answers here. Interestingly, these details have contributed to many accuracy improvements in Detectron and Detectron2.
Sampling theory tells us how a continuous 2D signal is turned into a discrete array by sampling and filtering.
We choose a sampling grid: a set of point locations at which the signal is sampled.
Values on these sampled points are not directly retrieved from the original signal, but come from a filtering step that removes high-frequency components. A bad choice of filters can lead to aliasing effects.
Sampling and filtering are both important in basic image processing operations, such as resize.Resize operation takes a discrete image, resamples it, and creates a new image.The choice of sampling grid and sampling filter will then affect how such a basic operation is implemented.
For example, the paper On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation studies the filtering issues, and shows that the resize operations in many libraries (OpenCV, PyTorch, TensorFlow) don't take the low-pass filtering into account. This then leads to incorrect deep learning evaluation.
In this article, we ignore the issue of sampling filters, and only study the coordinates of the sampling grid. We'll see that this choice is also inconsistent among libraries, and can affect the design and performance of CNN models.
Pixels are located on a sampling grid we choose. Naturally, we would like to only consider rectangular grids where pixels are spaced evenly. But there are many other factors to be concerned with:
(These terminologies may have a different meaning elsewhere, but this is how I define them in this article.)
For simplicity, we look at the one-dimensional case instead.We want to answer this question: for a 1D signal defined on
In this figure, the green bars represent the 1D signal of length
They (or at least the first two) are all valid interpretations when we are given an array of pixels. The interpretation we choose affects how we implement operations and models, because each of them has some unique weird properties. To understand them more, let's check how a 2x resize operation should be implemented under each interpretation.
We'll now see that a simple "2x resize" operation has many possible implementations.
A unique undesired property of ① is that stride is not the inverse of resolution. So a 2x resize is ambiguous: we have to be clear about whether we want half the stride, or twice more pixels. The new grids after resize look like these:
Resize for grid ② & ③ aren't ambiguous:
You can easily verify that the 4 different resized grids still match thecorresponding definition in our table above.
For the 2D case, the 2x resize in ① (twice more pixels) and ② looks like this (image credit: here), from which you can see why ① (twice more pixels) is also called align_corners:
These 4 different versions of 2x resize have some issues:
Extrapolation: ② and ③ both need extrapolation outside the border of the original grid to perform resize, but ① only needs interpolation. Extrapolation is sometimes undesirable.
Asymmetry: ③ is asymmetric, and that's probably a good reason to never use it. One consequence is that resize(flip(x)) != flip(resize(x)). All the others are symmetric.
Information loss: in ① (half of stride) and ③, about half of the points on the new grid exist in the old grid. By not having to interpolate their values, we minimize the loss of information. However, in ① (twice more pixels) and ②, most or all of the new pixels need to be recomputed.
For resize with other arbitrary scale factors, all versions have information loss. But 2x/0.5x resize are the most common in deep learning.
The DeepLab series of segmentation models is famous for using grid ① (half of stride) for all of its 2x resizes. See here for words from its author. This matches the inconvenient image shapes they use, such as 321x513. I've heard opinions that the benefits of "no information loss" and "no extrapolation" may let it outperform ② in segmentation, but I have yet to see more evidence.
What do libraries use? The situation is a bit messy. I'll list what I know, and look forward to your help to add more. There is no guarantee they are all correct, since I didn't check the source code of all of them.
Library & Operation | Pixel Grid Convention |
---|---|
OpenCV cv2.resize | interpolation=LINEAR/CUBIC : ② interpolation=NEAREST : buggy, none of the above. issue interpolation=NEAREST_EXACT : ② |
Pillow Image.resize | ② |
scikit-image transform.resize | ② |
PyTorch F.interpolate | mode=linear/cubic, align_corners=False : ② mode=linear/cubic, align_corners=True : ① mode=nearest : buggy like OpenCV. issue mode=nearest_exact : ② |
PyTorch F.grid_sample | align_corners=False which I requested: ② align_corners=True : ① |
TensorFlow tf.image.resize | TFv1 method=BILINEAR/NEAREST, align_corners=False : ③ TFv1 method=BILINEAR/NEAREST, align_corners=True : ① TFv2 method=BILINEAR/NEAREST : ② (In TFv2, align_corners option was removed) |
TensorFlow tf.image.crop_and_resize | none of the above. issue I reported |
It seems this mess is unique to the deep learning world. How come? From what I can tell, the history looks like this:
TensorFlow is the first place that introduced ③, in its initial open source release. This was later considered a bug and fixed in v1.14 using a new option named half_pixel_centers=True that follows grid ②.
align_corners=True (①) appeared in TensorFlow 0.7 in 2016. I guess this was probably intended for DeepLab development and not for general use.
In TensorFlow v2, grid ② became the only version of resize, but it was too late. During all these years, the uncommon version (①) and the wrong version (③) have propagated into people's models and other libraries.
PyTorch's interpolate comes originally from the upsample operation. Nearest upsample was buggy when it was first added in LuaTorch in 2014. Bilinear upsample was first added in LuaTorch in 2016 and used grid ①. Grid ② was added to PyTorch in 2018 under an align_corners=False option, and has been the default since then.
Due to this mess, the resize operator in ONNX has to support 5 versions of coordinate transforms! Kudos to the ONNX maintainers.
Many computer graphics textbooks and papers talk about this topic and choose ②, for example:
(Note that some of them use ② but define the continuous signal on the range [-0.5, N-0.5] instead.)
Given all the graphics literature, computer vision and deep learning libraries promoting grid ②, we use ② as the convention.
We pick ② as the convention for grid locations, but this is not the end of the story! We now know the grid locations relative to the beginning of the signal are 0.5, 1.5, ..., N-0.5; we still have to decide where the origin of the coordinate system sits relative to the signal.
This is just a choice of convention and has no substantial effect on any algorithm. Two of the graphics references listed above put the origin on the first pixel. This has the benefit that all pixel locations have integer coordinates, but then it's weird that the signal lies on the interval [-0.5, N-0.5].
Another convention, "integer corners", or "half-integer centers", puts the origin at the beginning of the signal, so the first pixel is centered at (0.5, 0.5).The two conventions are demonstrated in this figure:
We choose "integer corners", and then willhave the following relationship between continuous coordinates and discrete pixel indices:
The choice doesn't matter for resize, because absolute coordinates are not part of its API. However, for functions that accept or return absolute coordinates, we should be aware of their convention. For example:
cv2.findContours returns integer polygons represented by indices. So we always add 0.5 pixel to its results to obtain coordinates that match our convention.
cv2.warpAffine uses "integer centers", and this is complained about in this issue. In fact, most OpenCV functions use the "integer centers" convention.
pycocotools.mask.frPyObjects renders polygons as masks. It accepts polygons in the "integer corners" convention. The same is true for PIL.ImageDraw.polygon, but its results are 0.5 pixel "fatter" due to how it's implemented. This has affected cityscapes annotations.
RoIAlign in torchvision takes a box in absolute coordinates that match our "integer corners" convention.
scipy.ndimage.map_coordinates takes coordinates in the "integer centers" convention.
If a dataset is annotated with coordinates, we also need to know its choice of coordinate system. This information is often not provided by the dataset owner, so we make guesses. For example, in COCO it appears that polygon annotations match our convention, but keypoint annotations do not and should be incremented by 0.5.
Now that we have a convention for the coordinate system, it's good practice in computer vision systems to always use coordinates rather than indices to represent geometries such as boxes and polygons. This is because indices are integers, and can easily lose precision during geometric operations. Using indices for bounding boxes has caused some issues in Detectron.
Models in Detectron / Detectron2 all involve localization of objects in images, so the convention of pixels and coordinates matters a lot. Various improvements and bugfixes in the two libraries are related to pixels.
In detection models, bounding box regression typically predicts "deltas" between the ground truth (GT) box and a reference box (e.g. an anchor). In training, the GT box is encoded to deltas as the training target. In inference, the predicted deltas are decoded to become output boxes.
Boxes in Detectron often use integer indices, instead of coordinates. So the width of a box is given by width = x1 - x0 + 1, and deltas are encoded/decoded with functions like the following:
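(a simplified 1D sketch of the pattern, not Detectron's exact code:)

```python
# "+1" width convention used with integer box indices (x0, x1 are inclusive)
def encode(x0, x1):
    w = x1 - x0 + 1
    center = x0 + 0.5 * w
    return center, w

def decode(center, w):
    x0 = center - 0.5 * w
    x1 = center + 0.5 * w    # bug: should be center + 0.5 * w - 1
    return x0, x1
```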
As innocent as the code seems, the two functions are not inverses of each other: decode(encode(x0, x1)) != (x0, x1). x1 is incorrectly decoded: it should be center + 0.5 * w - 1 instead.
This bug appeared in the py-faster-rcnn project around 2015, and is still there today. It was carried into Detectron and negatively affected results in the Mask R-CNN paper. It was then fixed in late 2017 after I found it, and the fix contributed an improvement of 0.4~0.7 box AP. Detectron went open source in 2018 with this fix. In Detectron2, we adopted the rule of always using floating-point coordinates for boxes, so the issue no longer exists.
How to horizontally flip a geometry? Although pixel indices should be flipped by i → W - 1 - i, coordinates should be flipped by x → W - x. Detectron isn't so rigorous about this and applies the index-style formula (with the extra "-1") to coordinates as well; Detectron2 flips coordinates by W - x.
The augmentation library "imgaug" also made this fix.
COCO's instance segmentation data is annotated with polygons that have sub-pixel precision. Converting polygons to binary masks loses this precision due to quantization, and the loss may become more severe during augmentations. Therefore it's preferable to keep the polygon representation and delay the conversion as much as possible.
In both Detectron and Detectron2, polygon representations are kept during flipping, scaling, and RoI cropping. Masks are not created until the second stage's box predictions are made, where the boxes are used to crop the ground truth polygons and generate the mask training target.
On the contrary, in TensorFlow's detection code here and here, polygons are turned into binary masks immediately at dataset creation time.
The code to generate anchors in Detectron is quite long, because it tries to generate integer-valued anchor boxes. By adopting coordinates for all boxes in Detectron2, integer boxes are not needed. This simplifies all the logic to just a few lines of code.
This does not affect accuracy, because the exact values of anchors are not that important, as long as the same values are used in training & testing.
The RoIAlign operation crops a region from an image and resizes it to a certain shape. It's easy to make mistakes because two images and two coordinate systems are involved. Let's derive how to perform RoIAlign.
Given an image and a region (the green box), we want to resample a K [i,j]
issampling_ratio=1
).We show the 4 neighboring input pixels of output[0,0]
in the figure.The indices of 4 nearest pixels of
The original implementation of RoIAlign in Detectron doesn't subtract 0.5 in the end, so it's actually not very aligned. It turns out this detail does not affect the accuracy of R-CNNs, because RoIAlign is applied on CNN features, and a CNN is believed to be able to fit slightly misaligned features.
However, we have new use cases of RoIAlign in other places, e.g. to crop mask head training targets from the ground truth mask, so I fixed it in the detectron2 / torchvision RoIAlign with an aligned=True option. Its unittest demonstrates how the old version is misaligned.
Btw, once we figured out the coordinate transform formula, it's easy to implement RoIAlign using grid_sample. This shows that RoIAlign is nothing more than a fused bilinear sampling + averaging. Using grid_sample is about 10%-50% slower than the RoIAlign CUDA kernel.
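A rough sketch of the idea, for one CHW image and one box given in "integer corners" coordinates, with sampling_ratio=1 and no averaging of sub-samples (an illustration, not the production implementation):

```python
import torch
import torch.nn.functional as F

def roi_align_via_grid_sample(img: torch.Tensor, box, K: int) -> torch.Tensor:
    C, H, W = img.shape
    x0, y0, x1, y1 = box
    # centers of the K x K output bins, as continuous coordinates in the input image
    xs = x0 + (torch.arange(K, dtype=torch.float32) + 0.5) * (x1 - x0) / K
    ys = y0 + (torch.arange(K, dtype=torch.float32) + 0.5) * (y1 - y0) / K
    # map continuous coordinates in [0, W] x [0, H] to grid_sample's [-1, 1] range;
    # align_corners=False matches the "integer corners" convention
    gx = xs * 2 / W - 1
    gy = ys * 2 / H - 1
    yy, xx = torch.meshgrid(gy, gx, indexing="ij")
    grid = torch.stack([xx, yy], dim=-1)              # (K, K, 2), last dim is (x, y)
    return F.grid_sample(img[None], grid[None], align_corners=False)[0]
```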
Mask R-CNN is trained to predict masks of a fixed resolution (e.g. 28x28) restrained inside given boxes (we call them "RoIMasks"). But in the end, we often want to obtain full-image masks. A "paste mask" operation is needed to paste the small RoIMask into the given region of the image.
This operation is an inverse of RoIAlign, so it should be implemented similarly to our derivation above. In Detectron, it was implemented with some magic rounding & resize that are not exactly the inverse of RoIAlign. Fixing it in detectron2 increases the mask AP by 0.1~0.4.
Obviously, the paste mask operation can introduce aliasing in the results due to the low-resolution RoIMask. This is the motivation behind our work on PointRend.
PointRend is a segmentation method that focuses on point-wise features, where a "point" is not necessarily a pixel, but any real-valued coordinate. Pointly-Supervised Instance Segmentation, also from our team, uses point-wise annotations to train segmentation models. Both projects involve heavy use of point sampling and coordinate transforms. Having a clear and consistent convention of pixels and coordinates was important to their success.
Due to some sloppy code in the early days of deep learning libraries, today we're facing multiple versions of resize functions. Together with the two different coordinate-system conventions, they easily cause hidden bugs in computer vision code.
This article revisits these historical technical debts and shows how these fun details matter in modeling and training. I hope they will help you make proper choices.
]]>这几年来, 从 FAIR 的几位大佬身边学习到的最多的是对待 research 的态度. 因此说说写 paper 和做实验的体会.
实验是为了证明或强化文章里给出的 claim/hypothesis 的.
Ross ICCV 2019 tutorial 最后谈了谈怎么写 paper. 第 126 页说, 文章中所有的 claim, 理想情况下都应该要么是文献中已有的 claim, 要么是有实验能够证明的 claim.
举个例子, BatchNorm paper 的实验可以 claim 很多东西, 包括 BatchNorm 让结果很好, 对初始化不敏感, 大 learning rate 也不炸. 但是文章说 BatchNorm "reduce internal covariate shift", 就遭到了一些人的质疑. 如著名的 Ali Rahimi Neurips 2017 test-of-time award presentation(B 站) 里的五连问 (第 17 分钟).BatchNorm paper 中把 "internal covariate shift" 粗略定义为 feature distribution 的变化, 但唯一一个与此相关的实验是 Fig.1 (b)(c). 虽然实验是合理的, 但结果确实不算不太显著. 甚至 GroupNorm paper 的 Fig.6 展示的结果可能都更强.
Ross 在 tutorial 中的建议是: 如果一个 claim 没有得到实验的有力支持, 在表述上可以弱化一些, 例如写 "intuitively/hypothetically, 如何如何...".
类似于这个 BatchNorm 的例子, 很多 paper 中一个常见的问题是, 实验只证明了结果好, 文字里却讲了个故事并 overclaim 了结果好的原理.
这里的 claim 并不仅仅指明显的 "We claim ...". 还可以有其他表现形式:
Troubling Trends in Machine Learning Scholarship 的 talk 里也指出了这些问题.
由于整个领域还是以实验驱动的, 对原理的研究不深, 所以原理常常都是 speculation / intuition, 在写作时容易 overclaim. 因此要注意弱化表述.
从结果深入到原理, 对应的实验大约有这样的几类:
System-level 实验, 就是直接跟已有的结果比. 这类实验可以没有控制变量: 两个非常不同的方法 (e.g. SVM vs. CNN) 照样可以在尽量相似的设定下比较结果. 它证明的 claim 是 " 这整个系统能够达到好的性能 ".
Ablations, 也即控制变量, 用来证明 " 系统由于这个方法 (而不是其他因素) 性能得到提高 ". 这就需要严格控制其他因素.
深入分析, 试图解释 " 为什么某种方法能够提高性能 ". 但是在 deep learning 中, 由于理论工具的缺乏, 这类实验往往不容易设计.
如果一篇论文提出了全新的系统, 结果还特别好 (例如 AlexNet, Bert, AlphaGo 这种 breakthrough 级别), 那么即使仅有 system-level 的实验也没关系: 以后总会有别人去更深入的研究的.Yann LeCun 针对 Ali Rahimi 的 presentation 曾经说过, 历史上 engineering 往往比 science 快一步, 很多科技的发展过程都是先做 work 了再去研究为什么的.
另一个极端是, 如果一篇论文对结果毫无提高, 但是通过详细的分析帮助读者理解了更多原理, 或提供了理解原理的视角和工具, 那同样也是很好的工作.
当然, 大部分工作既没有 breakthrough 级别的结果, 也很难给出令人信服的分析 (毕竟炼丹), 因此往往需要多种类型的实验结合: 有什么样的实验, 决定了论文能做出什么 claim, 进一步才能 justify 论文的价值.
举个例子, 每个人都熟悉的 ResNet paper, 同样的模型, 文章可能有下面几种不同写法, 对应不同的实验和 claims:
我们设计了一类 VGG 的变种, 有五个模型叫做 "MSRANet {18,34,50,101,152}", claim 它们达到了 SOTA. 实验内容是跟以前的 SOTA 比一比.
其实 AlexNet 文章就是这么写的, 但是 AlexNet 的 SOTA 结果本身就是一个巨大的 breakthrough. 如果 ResNet 也这么写, 影响力会小得多, 毕竟 SOTA 是短暂的. 说的不好听一点, 这个模型被下一个人拿过去改一改, 再换个名字成了新的 SOTA, 可能就没有人记得 "MSRANet" 了.2013 年的 ZF-Net 大概就是这么一个地位.
我们设计了包含 residual connection 的 "bottleneck block / basic block" 方法, 能够提高模型性能. 实验做一些 ablations, 确认了这个 claim.
这个方法有一些能够自圆其说的 intuition, ablations 证明有效, 同时也能达到 SOTA. 这就类似于大多数的好 paper.
而 ResNet 原文的层次就更高一些了: 文章标题说的是 "residual learning", 内容强调的是 residual connection 对优化的好处. 其他的各种 block, ResNet-50, 只是测试这个 idea/claim 的手段.
这个大 claim 对实验和分析的要求就更高了, 以至于 ResNet 没分析完, 到了 ResNet-v2 (pre-activation) 继续把这个故事讲了下去, 专门分析 residual connection 的重要性.
最后, 实践证明 residual connection 确实是 deep learning 今天为止最重要的发明之一, 几乎统治了所有领域, CNN 和 Transformer 都离不开它. 这远比 ResNet-50 到底长什么样, 里面的 block 到底是什么要重要得多.
人们一般希望探索结果背后的原理, 因为这是科学研究的本质, 也让科研工作更有价值. 这导致了上面提到的那种 overclaim 现象.
还有另一种现象, 俗称马后炮, 或 "Harking" (Hypothesizing After Results are Known). 也即先做实验, 在有结果之后 "看图说话 / 强行解释", 找一个可以被这个实验证明的结论, 或可以解释实验结果的猜想 / 原理. 然而在写作时, 先 "We hypothesize/claim ...", 再 "设计实验" 证明自己的 hypothesis/claim.
"Harking" 这个词最早出现在心理学研究里. 在 deep learning 中也有对此的批评: HARK Side of Deep Learning. 它之所以是一种不太好的 research practice, 是因为它背离了科学研究的目标. 科学与迷信的一大区别, 是科学应当不仅能够解释已知, 还能够预测未知. 如果一个工作仅仅追求找到一个解释, 来与现有的少量实验结果兼容的话, 这个解释未必在其他实验中适用, 因此可能不是一个科学的结论.
然而, 如今 deep learning 是以实验为基础的科学. 其研究过程确实经常要先做实验, 看到结果, 才能提出猜想或决定下一步的实验. 因此马后炮行为一般都存在. 但是, 一个科学的马后炮研究者应当在制造出一个猜想或结论之后, 再去新的实验里尝试预测一下未知, 打一个马前炮.
一个我自己深有感触的例子, 是我参与的这篇 Feature Denoising for Improving Adversarial Robustness. 起初我们猜测 non-local 会帮助 adversarial training, 并很快得到了实验验证. 这时候, 一个中规中矩的 paper 写法就是 claim "non-local 对 adversarial training 有用", 实验就是有 / 无 non-local 的 ablations.
有意思的是, 我们的 claim 是 "denoising layers 对 adversarial training 有用". 这是一个更大的 claim, 而且其实是一个马后炮的结论, 因为它仅仅是 intuitively 能解释已有的实验结果:non-local 在传统 vision 里用作 denoising, 而对抗样本的扰动可以看做为 noise.
为了支持这个更大的 claim, 我们要打几个马前炮, 用它去预测更多的实验:
这些实验是在有了 claim 之后, 专门为了验证这个 claim 而设计的实验. 它们的正面结果让我们对这个 claim 更有信心, 即便它最初是靠马后炮和直觉猜出来的. 从马后炮到马前炮, 是 researcher 的自我要求.
好的研究应当能经受时间和实践的检验, 因此一个好的研究者应自己先审视自己的 claim, 并真心的尝试用实验检验它们. 有机会再详细写写怎么设计科学的实验.
]]>STB_GNU_UNIQUE
就是 ELF 中一个不太好的设计, 带来了不少语义冲突. 拥有 STB_GNU_UNIQUE
binding 的符号, 即使在被用 RTLD_LOCAL
方式装载的时候, 也会拥有 global linkage. 另外它还会导致 dlclose 无效. 网上对此有很多吐槽, 例如这里, 这里.
这个 binding 最初的引入似乎是由于一些全局符号的内在状态不能重复多次, 因此把这些符号标记为 unique, 即使从多个 plugins 里装载了多次, 符号也只有一个定义. 但是另一方面, 程序也会有一些全局符号的状态必须是 local 的. 到底哪种行为是用户需要的, 编译器是不知道的. 结果是, gcc "聪明" 的自动把 template function & inline function 里的 static variable 标记为了 unique
其实 C++ 标准确实规定了这样的 variable 必须是 "single entity". 理论上说 gcc 没做错, 但这并不总是用户的预期行为, 而 C++ 标准也没提供别的办法. 如果要禁用 unique binding, 可以使用 -fno-gnu-unique
重编译, 或者暴力 patch 编译好的 ELF binary.
STB_GNU_UNIQUE
导致了 PyTorch 1.8.0 最近的一个严重 bug, 影响了所有 R-CNN 模型, torchvision / detectron2 / mmdetection 里都有用户报告. 重新编译 PyTorch 太麻烦了, 为了以后更快验证此类问题, 我就写了一个暴力 patch ELF 的脚本:
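原脚本在此处没有保留下来. 下面是一个大致等价的示意 (并非原脚本; 依赖 pyelftools, 只考虑 64-bit ELF):

```python
# 把所有 STB_GNU_UNIQUE 符号的 binding 改成 STB_WEAK, visibility 改成 STV_HIDDEN.
# Elf64_Sym 布局: st_name(4) st_info(1) st_other(1) st_shndx(2) st_value(8) st_size(8).
import sys
from elftools.elf.elffile import ELFFile   # pip install pyelftools

STB_GNU_UNIQUE, STB_WEAK, STV_HIDDEN = 10, 2, 2

with open(sys.argv[1], "r+b") as f:
    elf = ELFFile(f)
    for sec in elf.iter_sections():
        if sec.name not in (".symtab", ".dynsym"):
            continue
        for i in range(sec.num_symbols()):
            off = sec["sh_offset"] + i * sec["sh_entsize"]
            f.seek(off + 4)
            st_info, st_other = f.read(2)
            if st_info >> 4 != STB_GNU_UNIQUE:
                continue
            new_info = (STB_WEAK << 4) | (st_info & 0xF)    # 保留 symbol type
            new_other = (st_other & ~0x3) | STV_HIDDEN      # 低 2 bit 是 visibility
            f.seek(off + 4)
            f.write(bytes([new_info, new_other]))
```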
以上脚本把所有 STB_GNU_UNIQUE
符号的 binding 改成了 WEAK
, visibility 改成了 HIDDEN
. 符号表的 entry 结构可参考 /usr/include/elf.h::Elf64_Sym
.
用这个脚本 patch 了一下 libtorch_cuda_{cpp,cu}.so
之后, 以上 bug 就消失了. 同时, 这样我也能够方便的确认另一个看似相关的 bug 还是跟 STB_GNU_UNIQUE
有关系.
然而, PyTorch 就可以写 IfElse 了?
最近 detectron2 遇到的产品 / 部署的需求越来越多, 看看 PyTorch 五花八门的部署 / 加速方案里对 IfElse 都有什么限制吧:
虽然 researcher 用 PyTorch 写 IfElse 很开心, 欠的技术债终究要换一种方式还回来的. 坚持对 researcher 友好的后果就是对产品不友好.
本质上说, 如果用户写了 IfElse, 就意味着这段代码只能在单一进程的 Python 解释器里运行 -- 这本身就是一个巨大无比的 limitation. 需要用各种方式来 workaround:
虽然 TensorFlow 的 autograph 没深入用过, 但是从原理上看比以上方案都更合理. 当然, autograph 的实现是建立在 TensorFlow 已经有了足够多的 control flop operator 的前提下, 可以把 IfElse 变成 tf.cond
. 而 PyTorch 在面向用户的 API 里仍然一个这样的 operator 都没有 (虽然 torchscript IR 里有), 并且可能以后也不会有.
UPDATE: 关于这件事写了一篇详细的文章: TorchScript: Tracing vs. Scripting
Three years ago, I wrote an article Unawareness of Deep Learning Mistakes: buggy code can still train and appear to work, so it's difficult for users to realize that their code is wrong.
What's even more difficult to find out is when the bug comes from the deep learning library we use. Imagine: what if the library unfortunately computes wrong results for certain parts of our model during training? The training will probably still work to some extent thanks to the magic of SGD, so how could we ever possibly find out such bugs? I'll share some experience and lessons.
"Bugs" in this article specifically refer to silent bugs that lead to wrong computation results,but no errors.
Such bugs exist in deep learning libraries and will continue to exist, because these libraries are young, and new features such as operators and training paradigms will continue to emerge in them as research develops.
Such bugs in deep learning are very hard to notice. A model typically contains billions of floating point operations (FLOPs) grouped into hundreds of operators. Even with small bugs, it may still train, converge, and appear to work well. Maybe it works slightly worse, or it fails occasionally, but it's extremely difficult for a user to associate a suspicious result with a concrete bug in libraries. After all, there are many other explanations of a bad result that need to be ruled out: the model simply does not work; incorrect model implementation; bad hyperparameters; bugs in users' training code, etc.
The situation gets worse when the buggy part of computation is not even explicitly written by users, but implicitly generated. Auto-generated computation such as auto differentiation and graph optimization is often not well exposed to users at all, making it more difficult to observe the bug. For example, pytorch/5801 is a bug in gradient computation that was found during the development of ELF OpenGO at FAIR. Models can still work to some extent with the bug, which hid the bug for a long time. It has unfortunately wasted many months in the project.
PyTorch has a "silent correctness" issue label, which shows many bugs of this kind. Most of these issues are also labeled as "high priority", which says a lot about the severity of such bugs.
Compared to users' training code, which may also have many silent bugs, deep learning libraries have some advantage in testability. They provide well-defined small building blocks (e.g. operators and their gradients), so they are more testable than end-to-end training. But I've seen a few limitations of unittests in the context of deep learning:
A test only covers a tiny input space, but other inputs may cause bugs.
As an example, pytorch/36485 computes softmax incorrectly only if the number of classes C satisfies (C > 1024) && (C % 4 != 0), which is rare in real applications. It was found in the development of MoCo, which uses 65537 classes. After noticing a regression in the model's accuracy, the root cause was later found by bisection.
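A sketch of the kind of check that widens coverage over the input space: sweep over unusual shapes and compare the CUDA result against a float64 CPU reference (the shapes and tolerance here are only illustrative, and pytorch/36485 itself has long been fixed).

```python
# Sketch: sweep shapes and compare a CUDA softmax against a float64 CPU reference.
import torch
import torch.nn.functional as F

for num_classes in (1024, 1025, 2048, 65537):
    x = torch.randn(8, num_classes)
    ref = F.softmax(x.double(), dim=1)                  # high-precision reference
    out = F.softmax(x.cuda(), dim=1).double().cpu()     # implementation under test
    print(num_classes, (out - ref).abs().max().item())  # a large error flags a bad shape
```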
Behaviors under combinations of contexts are hard to test exhaustively.
Deep learning libraries usually separate the definition of computation from its execution. As a result, a computation may run under different combinations of runtime context: graph/eager mode (TensorFlow), eager/tracing/scripting mode (PyTorch), eager/jit/pjit mode (JAX), fusion with other computations, the device to run on, the level of parallelism to use, the underlying compute library and algorithm to choose from, etc. Unittests are often insufficient to cover such a huge space.
This issue gets worse in higher-level interfaces (e.g. Keras). TensorFlow is well-known for its many high-level ways to do the same thing: users can write a model under graph or eager mode, using either object-oriented or functional style, with either raw TF APIs or the Keras/Estimator interface, and Keras has many more modes within itself. Handling these combinations gets more challenging, because a high-level component has much richer semantics (and therefore more side effects) that are often not strictly defined and are harder to test than a pure-math operator.
For example, tensorflow/25175 and tensorflow/40638 are two silent bugs in Keras causing models to not train properly. Both are due to unconventional combinations in the ways TensorFlow / Keras interact with each other.
Concurrency bugs that happen nondeterministically.
Deep learning software and hardware stacks by design have a high degree of parallelism, which provides room for concurrency bugs. Concurrency bugs such as race conditions may happen only on certain programs or hardware, or may not be reproducible at all. They are difficult to notice, report, and debug.
As an example, pytorch/18465 is a use-after-free concurrency bug I found. The only symptom I observed was that some tensor values in my model were unexpectedly modified. Drawing any conclusions beyond that was challenging, because any simplification I applied to the model could cause the bug to disappear. A lot of hours were put into tracking it down and reproducing it with minimal examples. And there is little chance that a unittest could guard against such bugs.
I'll share stories of two more silent bugs that I found in TensorFlow and PyTorch, where they both compute wrong gradients for some operators. Both bugs stayed unnoticed for more than a year waiting to be discovered by me, presumably because users can hardly blame bad training on wrong gradients rather than on their own models.
nn.SyncBatchNorm
Notice the bug
I started to try out PyTorch's nn.SyncBatchNorm in the summer of 2019 because the MoCo project needed this layer. To gain some trust in this layer (I knew that BatchNorm is often implemented wrong; see this later paper of mine), the first thing I did was to try it on some baselines I'm familiar with: a Mask R-CNN in detectron2.
Luckily, this was before TensorFlow introduced the next bug I would find later. So when I compared it with my TensorFlow implementation of Mask R-CNN, which also supports SyncBatchNorm, I could see that most results in detectron2 were a few AP (average precision) worse.
I know every detail of the two implementations since I wrote both of them, and their gap is negligible when not using SyncBatchNorm. So I was relatively confident that such a large gap was a library bug in PyTorch.
Confirm the bug
Next, we decided to just reimplement a correct SyncBatchNorm. It turned out to be quite easy, and this was later released in detectron2. Comparing the results of the two implementations further confirmed that the bug is related to nn.SyncBatchNorm.
Narrow down the bug
From the experiments on various models, I noticed that suboptimal results only appear if SyncBN is added to Mask R-CNN's mask head; adding it to all other components is OK. Therefore I hypothesized that the computation results are wrong when the batch size differs across workers, since that's how the mask head differs from the other components. This hypothesis can be verified quite easily. After sharing our findings with the code owner, the root cause in gradient computation was fixed.
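For illustration, a sketch of such a verification (not the original experiment): give each worker a different batch size and compare nn.SyncBatchNorm's outputs and input gradients against a single-process BatchNorm on the concatenated batch. It assumes two GPUs and a torchrun launch.

```python
# Sketch: check SyncBatchNorm with uneven per-worker batch sizes against a
# single-process BatchNorm reference. Assumes 2 GPUs; launch with
#   torchrun --nproc_per_node=2 check_syncbn.py
import torch
import torch.distributed as dist
from torch import nn

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
torch.manual_seed(0)

sizes = [2, 5]                                   # different batch size per worker
chunks = [torch.randn(n, 8) for n in sizes]      # same data generated on every rank
x = chunks[rank].cuda().requires_grad_()

sync_bn = nn.SyncBatchNorm(8).cuda()
out = sync_bn(x)
out.square().sum().backward()                    # every rank participates in backward

# Reference: ordinary BatchNorm over the whole concatenated batch.
full = torch.cat(chunks).cuda().requires_grad_()
ref_out = nn.BatchNorm1d(8).cuda()(full)
ref_out.square().sum().backward()

start = sum(sizes[:rank])
ok_fwd = torch.allclose(out, ref_out[start:start + sizes[rank]], atol=1e-5)
ok_bwd = torch.allclose(x.grad, full.grad[start:start + sizes[rank]], atol=1e-4)
print(f"rank {rank}: forward match={ok_fwd}, input-grad match={ok_bwd}")
```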
nccl_ops.all_sum
NCCL is widely used to reduce gradients among GPUs. However, it turns out that TensorFlow can do it wrong sometimes. This bug may affect all NCCL-based multi-GPU data-parallel training. Interestingly, it also affects SyncBatchNorm in TensorFlow if using NCCL.
Notice the bug
In the summer of 2020 I gave TF v1.15 a try. I planned to just do some basic benchmarks of my code, but a few Mask R-CNN trainings blew up with NaNs after 10~20 minutes of training. This had not happened before.
Confirm the bug
My first thought was that I had broken my Mask R-CNN implementation at some commit. But after trying a few combinations of code versions, it became clear that TensorFlow was to blame, because the same code could train in TF v1.14, even when I made sure both used identical versions of CUDA/CuDNN.
Narrow down the bug
I know that no one in the TF team would use my entire training code to debug, so I have to narrow it down myself. But this was never easy, because wrong results in any step of the whole training system can lead to NaNs, and there is nowhere to start looking. Moreover, the bug does not happen deterministically, and when I tried to simplify my code, it started to happen less frequently.
Luckily, there is still a painful but practical way to go: bisection. So I bisected over TensorFlow commits between v1.14 and v1.15, building TensorFlow and rerunning the training at each step, until the offending commit was found.
Unfortunately, the offending commit seems correct to me. This means the commit, which increases parallelism in NCCL, probably triggers a bug that dates back even earlier.
Further narrow down the bug
After playing with the offending commit a bit, given the non-deterministic behavior of the bug and the content of the commit, my hypothesis was that the way TensorFlow uses NCCL contains concurrency bugs.
My original code only uses NCCL's all_sum to all-reduce gradients. To add a simple check of its results, I used tf.add_n to all-reduce the gradients again, and added tf.debugging.Assert to ensure that the two results match. Unsurprisingly, the results don't always match -- a large discrepancy appears once in a while between the results of tf.add_n and nccl_ops.all_sum.
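A sketch of what such a check can look like in TF 1.x graph mode (assuming at least two GPUs; this illustrates the structure of the check rather than a guaranteed reproducer of the bug):

```python
# Sketch: sanity-check nccl_ops.all_sum against tf.add_n (TF 1.x, >= 2 GPUs).
import tensorflow as tf
from tensorflow.python.ops import nccl_ops

# Stand-ins for per-GPU gradients of the same variable.
grads = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        grads.append(tf.random.normal([1 << 20]))

nccl_sums = nccl_ops.all_sum(grads)      # one summed copy per GPU
ref_sum = tf.add_n(grads)                # reference reduction
# Consume every NCCL output so that all collective ops actually run.
diffs = [tf.reduce_max(tf.abs(s - ref_sum)) for s in nccl_sums]
check = tf.debugging.Assert(tf.reduce_max(tf.stack(diffs)) < 1e-3, data=diffs)

with tf.Session() as sess:
    sess.run([check] + diffs)            # raises InvalidArgumentError on mismatch
```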
This is where the heavy lifting ended: I had turned the silent training bug into an obvious error. The bug is no longer about a failed training which "I think" should succeed, but is now about something that's obviously wrong in TensorFlow: we added tensors in two different ways and the results don't match! No one is obligated to trust the correctness of my training code, but everyone has to admit that nccl_ops.all_sum and tf.add_n must not produce different results.
The rest is easy: I started to simplify my training code for a better understanding of the bug, removed all dependencies, and eventually made a small enough self-contained reproducible script and reported a bug. Beyond that, it is no longer my responsibility.
Summarizing from my own experience, the following are important to fight silent bugs in deep learning libraries:
Reproducing known results is the only way to discover silent bugs in model training. This is how we have an "expected output", so that we can notice if anything unexpected is happening.
Narrowing down is necessary, at least in the open source environment. Unless a small enough piece of code clearly demonstrates a bug in the library, it's not the library owners' responsibility to understand and debug user code. After all, a bug often lives in user code rather than in the library. The general guidelines about how to ask good questions and write good bug reports apply to deep learning as well.
Bisection is slow and costly, but effective. When there are no obvious clues and its cost is affordable, do a bisection. If anything can be better than bisection, it would be a trisection or k-section to reduce its latency, because verifying whether a commit works or not may require training a model for quite a while.
Bisection is not always applicable. If there isn't a known-good historical version as a reference, other more creative debugging methods will be needed.
Know the library well and understand its internals, so we can make reasonable hypotheses and investigate them. It's often helpful to dig into library code: a few lines of debugging code at the right place can provide valuable information that cannot be easily obtained in user code.
Silent bugs exist in deep learning libraries, and are extremely hard to find. What does this mean for everyone working on deep learning?
As an average user, follow what the experts are using. Silent bugs exist but are hard to find. Without enough confidence in our own ability to always discover such bugs, follow the experts.
A library without years of battle testing may have many sharp edges or hidden bugs. With a mature library like PyTorch or TensorFlow, any bug you run into is more likely to have been discovered by others already. This applies not only to libraries as a whole, but also to different features of a library, modules within a library, extensions of a library, etc.
This is not saying we should use the most popular thing. On the contrary, high-level frameworks that build over-simplified APIs to gain popularity among non-experts (e.g. Keras) are something a serious researcher would rather avoid: they may have silent bugs buried underneath, simply because the intended user group is not capable of noticing them.
To make your code/library popular, reproduce known results to increase credibility. "Following the experts" tends to create a monopoly. To break that, deep learning training libraries can earn trust by reproducing known results, rather than just providing examples of arbitrary toy models. This is a core principle in tensorpack that I have followed since the beginning, and it is probably the most effective way to convince a user that your library/implementation does not have hidden silent bugs.