Here we restrict the discussion to the following usage scenario:
Because of assumption 2, a new feature often requires relatively independent changes to multiple parts of the repository. For example, while implementing feature X we might:
and several other relatively independent steps.
Because of assumption 4, the developer knows that their direction is very likely correct and will (after minor revisions) be approved by the reviewer. To improve efficiency, once part of the changes (e.g. 1-3) is implemented, subsequent development (e.g. 4-5) should not be blocked by code review. A code management system must support this non-blocking development mode well in order to maximize team efficiency. However, we will see that the design of git + github does not encourage this mode of development.
The traditional github/gitlab-based workflow has the following property:
The most important property of the stacked diffs workflow is: the basic unit of code review corresponds to commits in the repository. In Phabricator (the code review system used by Meta, also used by open-source projects such as llvm) this unit is called a "diff", conceptually corresponding to a "PR".
Whether the basic unit of code review is a branch or a commit -- what difference does it actually make?
Readers might think that when reviewing a PR on github, they are also reviewing commits. But the content presented to reviewers in a PR is actually computed from the state of branches: every PR has a target branch (e.g. pytorch:main in the figure below) and a feature branch (e.g. ppwwyyxx:logging). The content under review is the difference between them. In other words, if the content of the feature branch changes (e.g. a new commit is added), the PR changes.
In a commit-based workflow, the concept of a branch is not even needed: all of my work is a series of local commits, which are synced to the code review system as diffs. Since a diff corresponds to a single commit, adding a new commit does not affect the existing content in the code review system; it creates a new diff. To modify the content of a diff, we amend the change into the commit corresponding to that diff.
With these basic concepts covered, we next explain why the git + github PR workflow is not pleasant to use. The overall argument can be summarized as follows:
This section introduces a common engineering practice: code review units should be as small as possible. A complex development task should be split into separate code review units rather than reviewed as a whole, because:
The effort required for code review does not scale linearly with PR size: a large PR is harder to review than the same amount of code split into small PRs.
As a result of being harder to review, large PRs take longer to review and receive lower-quality reviews. Both points are supported by plenty of research, e.g. the paper "Modern Code Review: A Case Study at Google".
Review has latency. Separate, independent code reviews allow earlier changes to be merged as soon as they are accepted, which (i) reduces conflicts; (ii) lets others use the change earlier, exposing potential problems sooner.
For example, suppose the first part of a piece of work can be accepted quickly, while the rest still needs a week of discussion & review. If we wait until the whole thing is accepted and merge everything together, the first part may run into merge conflicts with other changes made during that week -- conflicts that could have been avoided entirely.
Send small, finished pieces of work out for review early, so that if a reviewer spots a problem, the subsequent direction can be adjusted in time. Otherwise, if everything is saved up for one big review, fixing problems discovered at that point costs a lot of extra work.
Changes to different modules may need to be reviewed by different people. Reviewing them together adds mental burden for every reviewer: during review they must figure out "which parts am I supposed to look at?", and every new notification from the code review platform makes them wonder "is this relevant to me?".
The commit history merged into the repository should correspond one-to-one with code review units (rather than many-to-one). When looking back at history, small commits that each solve one independent problem are clearer and make it easier to track down issues.
For these reasons, good engineering practice always encourages splitting a large change into multiple small parts that are reviewed and committed separately. Each part needs to be a logically complete and correct small unit on its own. A change is usually under 100 lines, and as a rule under 300 lines. Google's "Modern Code Review" paper also says:
Developers are strongly encouraged to make small, incremental changes.
An article from the company Graphite says:
The ideal PR is 50 lines long.
Sometimes splitting makes the sum of the parts slightly larger than a single change; sometimes making a large change (e.g. a refactoring) "splittable" even requires considerable extra work (e.g. adding a compatibility layer). But the benefits of "small incremental changes" are worth this extra cost.
Once there are multiple small, interdependent code reviews, tooling is needed to manage their dependencies automatically. At both Meta and Google I used a local mercurial repository together with the companies' internal code review tools. This workflow makes it very convenient to manage dependencies between code reviews.
Below are a few examples showing why Meta's mercurial + Phabricator Diff workflow is better than the git + github PR workflow. In each example, 😞 marks the parts with a poor experience.
Example 1: we start with two changes:
They have the dependency relationship A <- X. At Meta, I would do this:
Both changes are visible in hg log.

If using github, I would have to do this instead:
git log shows both changes, while branchA only shows change A.

Example 2: continuing the previous example, after some review we need to modify the content of change A. At Meta I would do this:
If using github, I would need to:
Example 3: continuing the previous example, after some review we find that function S also needs to be modified in order to better implement feature A, i.e. the dependency becomes S <- A <- X. At Meta I would do this:
If using github, I would need to:
Example 4: continuing the previous example, suppose we have the dependency chain S <- A <- X <- Y, and S and A have both been accepted. We want to merge them as soon as possible and continue developing X and Y on top of the latest mainline after the merge. At Meta I would:
With github, I would need to:
These examples show the fundamental shortcoming of the github workflow: neither git nor github has sufficient information about the dependencies between branches (i.e. between PRs). The main problems this causes are:
When the dependency chain of PRs is long, every time a PR in the middle is modified, merged, or deleted, all subsequent branches that depend on it must be manually rebased one by one and manually pushed to github. Sometimes the github merge target also needs to be changed manually.
When commits are the unit of work, all of this can be automated: when a commit in the middle changes, all commits that need to be rebased/pushed can be found automatically through the dependency relationships.
Besides the missing dependency information, another drawback of git/github is that rebasing between branches is more likely to produce conflicts. This is due to the lack of a commit identifier mechanism.
What is a commit identifier? In a branch-based workflow, a local branch is matched to the remote PR through one identifier: the branch name. In a commit-based workflow, commits and remote diffs also need a matching mechanism so that the tooling knows which diff each commit should update. This is usually implemented by the local tool (such as hg) adding a random unique identifier to the commit metadata. The local tool also maintains this identifier: it stays unchanged across rebase, reorder, amend, etc., and on squash the user is asked which identifier to keep. This commit identifier replaces the role of the "branch name".
Moreover, the commit identifier makes rebasing much smoother. For example, in Example 2 of the previous section, we want to rebase branchX onto the modified branchA:
The rebase in the figure is not as simple as it looks: because git has no idea that commitA and new commitA are related, it will try to apply commitA and commitX one by one onto new commitA, and applying commitA onto new commitA will almost certainly produce a conflict. With a commit identifier, however, the rebase tool knows from the identifier and the commit time that "new commitA" is the newest version of "commitA", so this conflict is avoided entirely.
In addition, a common minor annoyance is that inline comments on github PRs are often lost after a force-push, again because github does not know how the new commits correspond to the old ones.
It is easy to see that the dependencies between PRs/diffs are not necessarily a single linked list; they can form a directed acyclic graph (DAG). Such dependencies are even harder to handle in git, which is another minor drawback.
Unlike a git branch, in which all commits must form a straight line, the local workspace of a mercurial repository can contain branching structure. For example, I can locally create 5 WIP commits with DAG-like dependencies:
Since code review corresponds to commits, these 5 commits become 5 "diffs" available for review. Phabricator's UI can also display the DAG relationship between diffs, for example:
While working at Meta/Google, my mercurial workspace usually contained dozens of in-progress commits, corresponding to diffs on the code review platform (called CLs at Google). They might have complex DAG dependencies or be completely independent. Some were serious development, some were prototypes, some existed only for temporary debugging -- and that was fine, because I could choose which commits to send out for review, unaffected by newly added commits. I could also easily modify commits or their dependencies via amend/rebase, and all modifications could be synced to the code review platform in one step. How to replicate this experience on git remains an open problem.
To achieve a workflow close to stacked diffs without changing git / github, one would need a new git repository management tool responsible for:
git log.

Some tools have already partially implemented these features, for example:
Finally, here are some other references on the topic of stacked diffs:
The two articles above are the most detailed, and this post borrows some of their points. Besides these, there are also:
Note: most of this post was written while working at Cruise, after leaving Meta's Stacked Diff behind, when I suffered from Cruise using github without Stacked Diff. Before I finished the post I moved to Google and had Stacked Diff again. Now I'm back in the git world once more, so I started looking into this problem again.
The design space is complex, so in this article I'll start with a smaller topic: registration in config systems. I'll show why this common pattern, though it works fine for small-scale projects, does not scale well in the long term. I'll also discuss an alternative.
Configs are often constrained to include only primitive types (numbers andstrings), and there are a lot of good reasons to keep this property.
A global registry in a config system is typically a Mapping[str, Any]. Its purpose is to allow users to refer to complex objects through simple strings in a config, overcoming the constraint of primitive types.
Objects can be added to a registry like this:
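The original snippet isn't preserved above; a minimal sketch of what such decorator-based registration commonly looks like (names are illustrative, not any specific library's API):

```python
# A small global registry: name -> object.
MODEL_REGISTRY = {}

def register(name):
    def deco(obj):
        MODEL_REGISTRY[name] = obj
        return obj
    return deco

@register("MyModel")
class MyModel:
    ...
```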
This allows users to select which model / data to use by setting cfg.model.name = "MyModel" or cfg.dataset.train = "my_dataset".
Registration has a clear benefit, but at a larger scale, some of its downsides can become serious.
"Users should only pay (in compute cost and mental cost) for what they use" is a general design philosophy that I find important in almost all aspects of software design.
The registration pattern breaks this philosophy by running unnecessary registration code: users will only provide one (or very few) strings in their config, but they have to pay the overhead of registering the many candidate strings they might need.
To make matters worse, the overhead has to happen very early in a program, typically at import time. Import speed is crucial for developer ergonomics: unlike other code that may run asynchronously with development, import often blocks developers.
The registration overhead includes:
These costs are negligible for small-scale projects, but they can become quite bad when there are hundreds or more objects to register. Bad patterns are guaranteed to appear at larger scale: there will be some users doing non-trivial registration (e.g. registering objects in a for loop) that's slow or even has unintended side effects. I had to work with many projects that take >10s to import, and the most common reason for slow import is registration.
The import overhead is also greatly magnified by Python's multiprocessing module: all subprocesses have to spend the time and RAM to rerun the imports.
Registries are typically defined as a global dictionary, so they share manyinherent problems of using global states.
It's not uncommon that different users register different objects under thesame name -- at a large scale that's guaranteed to happen.
Such conflicts can live in two users' own code for a long time, unnoticed, until one day someone needs to depend on both. The only viable solution is usually to rename one, hence breaking all its users.
To complicate the issue even more, people sometimes decide to resolve name conflicts by overwriting what's already registered. For example, an "overwrite" option is provided in the registry of iopath, mobile_cv, and paxml. Using this option may introduce hard-to-debug problems, because now an innocent import statement may silently change the behavior of user code.
Despite this, note that overwriting is actually necessary when working in notebooks, where it's common to "reload" code (and therefore re-register objects) on the fly. Here is some code I use to always enable overwrite during reload.
When running a function using a multiprocessing.Process created with a safe start_method like "spawn", the child process receives a pickled closure from its parent, so it knows what to run. However, this pickle does not include any global state. This implies that if a function accesses global state, it may behave differently depending on whether it runs in the subprocess or the parent process. Python's documentation has a clear warning about this:
if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.
The ray framework can run a pickledPython function remotely, and therefore it has similar (and even more counter-intuitive) issues.
Since the registration is globally accessible, it's not easy to find where in the code an object is registered (or modified, if overwrite is allowed) just by reading the code. When a user sees cfg.dataset.name = 'dataset_X' and is curious what "dataset_X" is, a global string search is almost the only way to find out without running the code. And the search does not always work: if the name is programmatically generated, the string cannot be found directly in source code, e.g.:
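The original snippet isn't preserved; a sketch of the kind of loop-based registration that defeats string search (the factory and names are made up):

```python
REGISTRY = {}

def make_dataset(suffix):  # hypothetical factory
    return {"name": suffix}

# The string "dataset_X" never appears literally in the source code,
# so a plain text search cannot find where it was registered.
for suffix in ["X", "Y", "Z"]:
    REGISTRY["dataset_" + suffix] = make_dataset(suffix)
```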
In this case, users will have to be more creative about what strings to search.
In C++, registries cause more trouble because construction and destruction of global objects are very tricky. In Safe Static Initialization, No Destruction I talked about a few of PyTorch's C++ bugs related to this. Luckily, in Python, there are better alternatives.
If the only goal of registration is to provide a name → object mapping, then a simple alternative in Python is to use obj.__module__ + '.' + {variable name} as the name, which may look like some_library.some_module.MyClass. The "variable name" can be obj.__qualname__ for classes & functions.
Given this string, one can then call a simple function such as the builtin pydoc.locate to obtain the object it refers to: modules will be imported on-demand by importlib.
Use registration: | Use full qualname:
---|---
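The code in the comparison above isn't preserved; a rough sketch of the difference, using a real stdlib class as a stand-in for "some_library.some_module.MyClass":

```python
from pydoc import locate

# Registration: cfg.model.name = "MyModel" is looked up in a global registry
# that had to be populated at import time.
# Full qualname: the string itself tells us where the object lives,
# and locate() imports the module on demand.
cls = locate("collections.OrderedDict")
obj = cls()
```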
This pattern has some obvious advantages over registration:
There are some common concerns of this pattern, but they are not hard to address.
It's slightly harder to dynamically create candidates: there is no "registry" to add objects to, and the only equivalent is to edit the globals() dictionary directly.
I don't consider this a big issue because it's actually discouraging bad practice: in a proper config system (e.g. one that's based on recursive instantiation) there should be no need to dynamically generate candidates like above. I hope to get to this in a future article.
The names in config have to match the names of classes/functions in code.
This has the benefit of clarity on one hand. But on the other hand, code owners have more responsibility to maintain backward compatibility, especially after renaming their classes and files. The standard good practice suffices to address this: distinguish private vs. public symbols; keep an alias from the deprecated name to the new name; etc.
The names are too long.
This is a real problem. Here are some possible ways to address it:
A $PATH-like mechanism can be used to specify which modules to search for names.The search path can include common prefixes like "my_lib.my_module" so that users only have to provide "MyModel".
There can be a registry-like Mapping[str, str] that maps from "MyModel" to "my_lib.my_module.MyModel" so that users don't have to write long strings. This mapping doesn't have to be global and doesn't introduce import overhead. This can help with problem (2) as well.
This is just a UI-level issue. Having a better config frontend, e.g. using Python code as the config language, can make this issue disappear! Let me save this for a future article.
Among those challenges, there are a few tricky bugs related to the static initialization order fiasco (SIOF) and destruction of static objects. This time I was forced to learn a lot more details than I'd like to know about these topics, so it's good to write them down before I forget.
"Static initialization" is an ambiguous term because "static" is very overloaded in C++.In our context, it is supposed to mean "initialization of objects that have static storage duration",i.e. objects that live through the lifetime of a program.The word "static" actually talks about the object lifetime, not about initialization.
Meanwhile, initialization of such objects can have two steps:
Objects with static storage duration can be categorized into following two types, based on when their "dynamic initialization" happen:
- Non-local objects: their dynamic initialization happens before the program enters main().
- Function-local static objects: their dynamic initialization happens the first time control passes through their declaration.
SIOF typically refers to the problem that the dynamic initialization order of objects from different translation units is undefined, e.g.:
(Code example: two globals a and b defined in two different translation units.)
If a and b have non-trivial constructors, and the constructor of b somehow needs to access a, the program may crash or behave unexpectedly because a may be initialized after b.
PyTorch heavily uses registrations, which all have static storage duration. A few SIOF bugs were found when I tried to build PyTorch in Google. As an example, when an ATen operator has many overloads, initialization order affects which overload is called, because an overload that's initialized earlier will be preferred over those initialized later.
Standard ways to avoid SIOF problems are:
Avoid dynamic initialization: change the object type to something that can be zero/const-initialized. totw/140 shows a few examples on how to replace std::string with non-dynamic counterparts.
Use well-defined initialization order: there is a guarantee that objects within the same translation unit are dynamically initialized according to the well-defined program order. So we can sometimes just move code into the same translation unit. In another PyTorch bug where one global depends on another, I simply merged two files so that their constructors are properly sequenced.
Construct on first use: it's often not practical to merge files. A better solution is the "construct on first use" idiom:
❌ Don't use globals | ✅ Use function-local static
---|---
By doing this, anyone that needs to access a will have to call get_a(). Because a function-local static is guaranteed to initialize on first use, we can rest assured that a will not be used before initialization.
The "construct on first use" idiom may look differently, because sometimes we don't need to use a
directly but do need to observe the side effects of its constructor. In such cases we just manually call get_a
to make sure a
is constructed. I used this to fix another PyTorch bug .
There are more ways things can go wrong in the destruction of objects with static storage duration.
In general, we have to carefully avoid use-after-free, i.e. accessing a global/function-local variable after it's destructed. This is normally prevented by this rule:
Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor.
Given this rule, we can deduce that:
b. This should be discouraged, but it means that technically ANY object could access b in its destructor. If any of these objects are destructed after b, we're doomed.
Given the above issues, the Google C++ style guide bluntly forbids such destructions:
Objects with static storage duration are forbidden unless they are trivially destructible.
This "no destruction" rule implies that the following code is illegal
(Code example: declaring a function-local static Object instance.)
if Object is not trivially destructible. The C++ FAQ advises the same.
Writing static Object* a = new Object; return *a; is safe as long as we never call delete, but this introduces a heap-allocation overhead. The last trick is to use a NoDestructor wrapper class to bypass RAII (the trick is the placement new operator):
Safe, but has heap allocation overhead | Safe and low overhead
---|---
Finally, as an alternative to "no destruction", another way to safely run destructors is to ref-count all such objects, but it's perhaps not worth the complexity. "No destruction" is usually a good enough solution.
In conclusion, to safely construct and destruct objects with static storage duration + dynamic initialization, follow these rules of thumb:
Terminal escape sequences are strings with special meanings that terminal applications print to stdout. When a terminal sees such a string, it does not display it; instead, it performs the advanced terminal feature the string corresponds to.
The most common escape sequence changes text color. For example, this command prints "Hello World" in red:
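The original shell command isn't preserved above; a minimal Python equivalent that prints the same SGR color sequence:

```python
# "\x1b[31m" switches the foreground color to red; "\x1b[0m" resets it.
print("\x1b[31m Hello World \x1b[0m")
```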
Terminals originally had only 8 colors, while most terminals nowadays already support 24-bit true color:
|
However, most applications still use only the 8 basic colors. Rich colors are mainly useful for syntax highlighting: in vim, use set termguicolors to turn on true color support, after which truecolor values can be used to configure the guifg and guibg of each highlight group.
This useful script prints the various colors a terminal supports, as well as other rendering features -- unfortunately most of them are rarely used by applications. The output in the Kitty terminal looks like this:
In a supporting terminal, running the following command copies "Hello World" to the clipboard.
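The original command isn't preserved; a minimal Python sketch that emits the same kind of sequence (the payload is base64-encoded):

```python
import base64

text = "Hello World"
payload = base64.b64encode(text.encode()).decode()
# OSC 52: ESC ] 52 ; c ; <base64 data> BEL -- "c" selects the system clipboard.
print(f"\x1b]52;c;{payload}\x07", end="")
```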
This escape sequence is commonly called "OSC52", where OSC stands for "operating system command". OSC52 properly solves a problem that troubled me for over a decade: how to copy text from the terminal to the local clipboard when using ssh + vim/tmux?
The terminal's built-in select + copy does not work well with terminal applications that have "windows", such as vim/tmux, because:
If vim/tmux run locally, these problems are easy to solve: each provides its own select + copy functionality, and both can read and write the local system clipboard. But when they run inside ssh, I had to rely on hacks:
With OSC52 this problem is gone: as long as an application inside ssh prints the OSC52 control sequence and the local terminal sees it, it can write to the local clipboard. A concrete setup can look like this:
In tmux, enable the set-clipboard on option. This option does two things at once (I don't think the official wiki explains this clearly).
Use the yank script on the command line: $ run_some_command | yank.

In a supporting terminal, the following command outputs "This is a link", and clicking the printed text with the mouse opens "example.com":
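The original command isn't preserved; a Python sketch of the OSC8 hyperlink sequence:

```python
url = "http://example.com"
text = "This is a link"
# OSC 8: ESC ] 8 ; ; <url> ST <text> ESC ] 8 ; ; ST   (ST is "\x1b\\")
print(f"\x1b]8;;{url}\x1b\\{text}\x1b]8;;\x1b\\")
```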
Together with others, I maintain a document that records terminals that support OSC8 hyperlinks and applications that use them.
Since most terminals can already match URLs in text with regexes, the hyperlink feature is not strictly necessary, and it may take more imagination from application developers. The only scenarios where I use it so far are:
ls --hyperlink=auto: with this alias, file names printed in the terminal can be clicked to open the files.
Doing something similar for git log.

The Kitty terminal invented its own escape sequence for displaying images in the terminal. Images displayed this way are not blurry pictures pieced together from colored unicode characters, but normal high-resolution images. timg is an image viewer that supports the Kitty protocol; with it, you can view remote images over ssh.
Note that tmux does not support this non-standard protocol and will swallow the corresponding escape sequences. Fortunately, tmux provides a "passthrough" feature: after enabling allow-passthrough on, a special passthrough escape sequence lets tmux forward escape sequences printed by an application to the outer terminal. Because tmux itself doesn't support the protocol, viewing images under tmux still has image-positioning glitches. I worked around them with a few hacks that I won't explain here.
In a supporting terminal, these two commands pop up a "Hello World" notification:
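The original commands aren't preserved; a Python sketch of the simpler of the two, OSC9 (kitty's richer OSC99 variant is not shown):

```python
# OSC 9: ESC ] 9 ; <message> BEL
print("\x1b]9;Hello World\x07", end="")
```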
The main use is to let programs on the remote side of ssh send notifications to the local machine. tmux likewise does not support this sequence, so it needs to be combined with passthrough.
OSC9 appeared earlier and has better compatibility. OSC99 is kitty's own invention and supports richer notification formats.
This command asks the layer above to set the current window title. What exactly happens is up to that layer's implementation (tmux or the terminal):
|
Not hugely useful. The main use is to have the shell automatically set the title to the PWD or the currently running command, which makes it easy to tell apart multiple tmux tabs or terminal tabs. In zsh it can be done like this:
|
Finally, some complaints.
Quite a few useful terminal features exist in only one or two terminals: reading the clipboard, transferring files, viewing images and videos, progress bars, tooltips on mouse hover...
Terminal features have long lacked standardization: many escape codes have no detailed spec -- a new terminal developer basically has to read other terminals' code to understand their behavior. Moreover, some escape codes invented by individual terminals even conflict with each other -- reusing characters that others already use.
Each terminal implements only the subset of features it considers valuable. As a result, for compatibility, most applications simply avoid advanced features.
Even if an application wants to use an advanced feature, there is no good way to check whether the terminal supports it. Here I found a story similar to the browser User-Agent saga:
Applications check whether the $TERM environment variable contains "xterm" to decide whether to use these features, and whether $TERM contains "256color" to decide whether to use 256-color output. To stay compatible, many terminals keep their $TERM name as "xterm" or "xterm-256color"; the kitty terminal's name is "xterm-kitty".

terminfo allows applications to query whether the terminal supports a specific feature, but since features lack standardization, terminfo does not solve this problem well either.
In this chaotic situation, terminal developers also struggle to reach consensus. A few terminal developers once organized a terminal working group to discuss proposals for various features, but it ended in discord. This thread records the organizer's complaints.
For these reasons, the evolution of terminal features has largely stalled. Only a few terminals keep inventing new features on their own, e.g. iTerm2's home-grown features and kitty's home-grown features, but due to the lack of standardization they haven't had much impact on the community.
The terminal is my main working tool, and I hope these problems get solved.
All code examples and experiment results are available on github at ppwwyyxx/RAM-multiprocess-dataloader. The content is not specific to PyTorch: it applies to any user of Python's multiprocessing library on Linux.
Datasets for machine learning are usually not stored in RAM. But it's common to store their "metadata" in RAM, and this may still cause nontrivial RAM usage. The metadata could be:
As a concrete case, loading the metadata of COCO training set into Python takes ~2.4G of RAM:
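The measurement code isn't preserved above; a sketch of the kind of loading being measured (the annotation file path is an assumption):

```python
import json

# Loading COCO instance annotations as plain Python lists/dicts keeps
# millions of small objects (and their refcounts) in RAM.
with open("instances_train2017.json") as f:
    coco = json.load(f)
print(len(coco["annotations"]))
```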
We obviously don't want to replicate this 2.4G of RAM across all processes.
We acknowledge that there are ways to offload these metadata to disk. For example, people sometimes do:
By doing these, the RAM usage of a dataset becomes negligible. However, these methods will sacrifice flexibility and capabilities, such as random access, perfect shuffle, merging datasets arbitrarily, custom subsampling support, etc. Notably, PyTorch's commonly used map-style datasets support random access & sampling. All of these capabilities require certain metadata in RAM.
This article ignores any of these offloading methods. Instead, we'll discuss how to reduce the RAM usage without moving these data out of RAM. The idea is simple: we'll try to let all processes share a single copy of the dataset.
First let's build tools to measure RAM usage - which is not as easy as it sounds.
Common tools like top -p PID or psutil.Process(PID).memory_info() obtain memory statistics from /proc/{PID}/statm or /proc/{PID}/status, but they are insufficient for our analysis. Instead, we'll use the information provided in:

- /proc/{PID}/smaps: per-memory-mapping RAM usage information, documented in this man page
- /proc/{PID}/smaps_rollup: aggregation of data from smaps
We'll derive the following important measurements from it:
- USS (Unique Set Size): RAM that is private to this process, computed from smaps.
- Shared: RAM shared with other processes, computed from smaps.
- PSS (Proportional Set Size): USS plus this process's proportional share of the shared RAM; note that USS + Shared is the RSS reported by top/htop.

To obtain these measurements, we use psutil.Process(PID).memory_maps(), which parses smaps under the hood:
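A sketch of how such measurements can be derived with psutil (field names follow psutil's Linux memory_maps() output; the real utility in the repo may differ):

```python
import psutil

def get_mem_info(pid: int) -> dict:
    res = {"uss": 0, "shared": 0, "pss": 0}
    # memory_maps() parses /proc/{pid}/smaps under the hood.
    for m in psutil.Process(pid).memory_maps():
        res["uss"] += m.private_clean + m.private_dirty
        res["shared"] += m.shared_clean + m.shared_dirty
        res["pss"] += m.pss
    return res
```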
Then we create a MemoryMonitor utility to measure and print the results for a list of PIDs. The code is straightforward and can be found here.
We start with a naive implementation of a dataset that produces items from a list:
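A sketch of such a naive list-backed dataset (the actual code in the repo may differ in details):

```python
class DatasetFromList:
    """Wrap a list of arbitrary Python objects and serve them by index."""

    def __init__(self, lst):
        self.lst = lst

    def __len__(self):
        return len(self.lst)

    def __getitem__(self, idx):
        return self.lst[idx]
```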
Then we launch subprocesses to read from this dataset with the list of COCO data. To make a cleaner demo, we don't use PyTorch's dataloader, but just launch 4 subprocesses by ourselves:
|
We then added our MemoryMonitor to it. The full code and its output logs are available on github. Each segment in the log contains memory measurements for the main process + 4 workers:
|
The code looks completely innocent. However, if we plot the memory usage of any dataloader worker over time, we seem to find a memory leak! This is the notorious "dataloader leaks memory" issue that is discussed in multiple places, e.g. this PyTorch issue and Edward's podcast.
In fact, the growth of RAM usage does stop in the end, so this issue is not a memory leak. But in reality, users often do not see the end before the system OOMs, and they may wrongly conclude this as a "memory leak".
The root cause of this issue is "copy-on-read" of forked CPython objects.
Linux has a copy-on-write mechanism: when a process forks, the child process will share its entire memory space with the parent, and only copy the relevant pages when necessary, i.e. when the child process needs to write to the page. This mechanism allows read-only pages to be shared to reduce total memory usage.
The copy-on-write behavior can be clearly observed in the above figure:at time=0, the worker has 2.6G of shared RAM, 0 USS, and
However, this mechanism did not help us when we read our dataset. The problem is that our dataset is a large nested data structure that contains many small Python objects. Even though the dataset is "read-only" in theory, accessing any Python object will increment its refcount - causing a lot of memory writes. With these writes, memory can no longer be shared among parent and child processes. In other words, objects are not only copy-on-write, but also copy-on-read. Therefore, in the figure we see that the "Shared" RAM decreases and "USS" increases, since many pages are copied from shared memory into each process.
The end game is that each child process has to replicate all the pages that contain object refcounts in the dataset. For a dataset with many objects, this is almost the size of the dataset itself. In the output log, we see that this program uses 10G total PSS in the end, where each child process replicates 1.8G of USS.
The copy-on-read issue is due to CPython's reference counting. There are ways to change CPython's behavior, e.g. gc.freeze, but it has far-reaching consequences and I failed to make it work for the example here. However, there is a simple and transparent way to solve the issue: store the dataset with a very small number of Python objects, so there are very few refcounts! Below is a minimal implementation that stores a list using 2 numpy arrays:
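A minimal sketch of this idea, similar in spirit to the implementation used in detectron2 (details such as the pickle protocol are illustrative):

```python
import pickle
import numpy as np

class NumpySerializedList:
    def __init__(self, lst):
        # Each item becomes a compact byte buffer; the whole list is stored
        # in just two numpy arrays, so there are very few refcounted objects.
        buffers = [np.frombuffer(pickle.dumps(x, -1), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(b) for b in buffers])
        self._data = np.concatenate(buffers)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(self._data[start:end].tobytes())
```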
Detectron2 enables this type of serialization by default (since this commit by Yanghan). To compare different serialization mechanisms,we borrow its code into a serialization util, and use it here:
|
Just by this simple one-line change, the RAM usage greatly reduces. The end of the output log file is shown below.
|
We can see that:
The RAM usage no longer grows with #processes, and the stored data is smaller than the original dataset because pickle.dumps not only serializes but also compresses the data. We benefit from both sharing and compression by applying this optimization, at the cost of a tiny pickle.loads overhead in each access.

Actually, after compression, the dataset only takes ~500M (printed at the beginning of the log). So a question arises: why does the main process use 1.6G RAM before starting subprocesses?
This is in fact just an artifact of modern memory allocators: it does not always release memory back to the OS. In fact, if we run this simple serialization/compression code:
|
We see that we seem to "lose" ~700MB of RAM even after we've deleted everything:
|
Using a better allocator, e.g. by export LD_PRELOAD=libjemalloc.so
, can make this issue largely disappear.
This artifact is typically not a big concern, since allocators will find opportunities to reuse these free buffers. (Well, they may be concerning with start_method="fork", because reusing these free buffers may trigger copy-on-write! But I'm not going to talk more about that.)
In our code above, we launched subprocesses using a start_method="fork" argument. "fork, spawn, forkserver" are the 3 "start methods" of Python's multiprocessing library. This article is a good reference that explains their differences.

Since start_method="fork" is unsafe (in practice, it causes various crashes & deadlocks) and might no longer be the default in the future, we want to rerun our code above with start_method="spawn" or "forkserver". Sadly, the serialized array is no longer shared among workers. Each worker has a large USS:
|
The reason why our trick no longer works is that "spawn" and "forkserver" don't benefit from the copy-on-write mechanism. They will start a "fresh" subprocess with fresh memory space, instead of sharing with the parent. Everything the child process needs to access is pickled in the parent process and sent to the child. This ensures safe behavior, but is bad for start-up speed and memory usage.
In our case, the entire dataset will be pickled and sent to child processes. This is why each child process consumes a large USS.
torch.Tensor

It turns out there is a simple fix to this problem: just store the serialized dataset in a torch.Tensor instead of a numpy array. The reason why it works is that multiprocessing uses a customizable pickle implementation called ForkingPickler, and PyTorch customizes how torch.Tensor should be pickled by it: the tensor data will not be serialized to bytes. Instead, during pickling, the tensor will be moved to shared memory files (typically under /dev/shm) to be accessed by other processes directly.
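A sketch of the same container with its storage moved into torch tensors, mirroring the approach just described (exact details are illustrative):

```python
import pickle
import numpy as np
import torch

class TorchSerializedList:
    def __init__(self, lst):
        buffers = [np.frombuffer(pickle.dumps(x, -1), dtype=np.uint8) for x in lst]
        # torch.Tensor is pickled by multiprocessing's ForkingPickler via
        # shared memory, so child processes receive a handle instead of a copy.
        self._addr = torch.from_numpy(np.cumsum([len(b) for b in buffers]))
        self._data = torch.from_numpy(np.concatenate(buffers))

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        return pickle.loads(bytes(self._data[start:end].numpy()))
```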
To test tensor-based serialization, we run ./main-torchserialize.py spawn using the code here, and observe the following memory usage in workers (raw log is here):

- The shared RAM grows gradually, because a worker only touches pages of the shared torch.Tensor as needed. This is different from start_method="fork", where the entire memory space is shared at the beginning.
- Each worker keeps some USS, mainly from the libraries that import torch needs to load, such as libtorch.so. This can be easily verified by printing the measurements after import torch.

After applying tensor-based serialization, the total PSS usage in the end is 2.2G -- still worse than our earlier number using start_method="fork". The next section will optimize it further.
The last culprit in the above experiment is the 160MB per-worker USS in the above figure: this is just the memory footprint of import torch, mainly for PyTorch's global variables, etc. Since every child process launched by "spawn / forkserver" is a "fresh" one, they all need to import torch independently, hence each has 160MB of USS.
Luckily, "forkserver" provides a way to share the import torch
RAM usage through copy-on-write. By calling the undocumented Python API multiprocessing.set_forkserver_preload(["torch"])
before launching processes, each child process will be "less fresh": the torch library is preloaded (and shared), and don't need to be imported by each process independently.
Below are the experiment results. Code and full logs are on github:
|
start_method="fork"
.(Note that this optimization may be unsafe if import torch
creates any threads.My observation is that threads are indeed created due to import numpy
inside torch, but they can be disabled with environment variables.)
So far we've only looked at a single dataloader (with 4 workers). In reality, the only scalable way to use PyTorch on multiple GPUs is to use one process per GPU, each with its own dataloader and dataloader workers. This gives a total of #GPUs x (#DL workers + 1) processes, organized like below:
We modified the previous experiment slightly into this code to run on 2 GPUs. The memory usage looks like this:
|
Our previous optimization on dataloader workers is still effective - dataloader workers have a tiny USS. However, RAM usage is now replicated by #GPUs times because we let each GPU worker read the dataset independently.
An inconvenient solution to this problem is to load and serialize the dataset before launching GPU workers. By doing this, all GPU workers share the dataset just like what dataloader workers do. However, this limits flexibility and often requires significant refactoring, due to reasons such as:
Another simple solution to this problem is again to use torch.Tensor and ForkingPickler to share the dataset among GPU workers, except that now we need to manage the sharing explicitly like this:
|
This logic is implemented as another serialization utilhere.When using it as a drop-in replacement (full code here),the dataset is no longer replicated by GPU workers:
|
GPU worker 1 still has a small amount of extra USS, and that's just the footprint of import torch that we saw earlier, which can be avoided using set_forkserver_preload.
Note that the multiprocessing library itself also provides shared memory support. This PR contains an implementation of our serialization util without using PyTorch.
We've successfully reduced the total RAM usage by (approximately) a factor of
The essence of the solution is to let all processes share memory through a single torch.Tensor object, which needs to be moved to Linux shared memory by PyTorch's custom pickling routine. The TLDR on how to achieve sharing is:

- Don't let dataloader workers access many Python objects in their parent. Serialize all objects into a single torch.Tensor (but not a numpy array) for workers to access.
- Don't let all GPU workers load data independently. Load in one GPU worker, and share with others through a torch.Tensor.
For list-like data, all of these can be implemented transparently using the serialization routines developed in this article.
Multi-processing is often the only way to achieve trueparallelism in Python(until PEP703),but it comes with many tricky problems.This article hopefully provides an in-depth view of the problem of RAM usage.
]]>"Loss function" may mean different things in different systems.The version I'm going to criticize is the most common one that looks like below:
Bad | Worse
---|---
The key property of the bad "loss function" abstraction is: users are asked to provide a "loss function" that's executed separately after the "model / forward logic". Such an abstraction appears in a few open source systems: Keras model.compile(loss=), fast.ai Learner(loss_func=), Lingvo BaseModel.ComputeLoss.
The main problem is not with the function itself, but that the users' algorithm logic is forced to be separated into two parts: model and loss_func.
As an alternative, trainer_good below no longer separates "loss_func" from the model, and has equal functionality with trainer_bad.
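The original code isn't preserved above; a sketch of the two trainers being contrasted (names follow the text; the loop details are illustrative):

```python
def trainer_bad(model, loss_func, data):
    for inputs, targets in data:
        outputs = model(inputs)
        loss = loss_func(outputs, targets)
        loss.backward()

def trainer_good(model, data):
    for inputs, targets in data:
        # The model computes its own losses; the trainer needs no loss_func.
        loss = model(inputs, targets)
        loss.backward()
```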
In this article, I want to argue that this is a better design because:
Users can still split their model into two parts if they like, but they don't have to. (Apparently, trainer_good == partial(trainer_bad, loss_func=lambda x, y: x). So trainer_bad can still be used - we just set loss_func to a no-op if we don't like it. But trainer_good is cleaner.)
It's true that the separation can be useful to certain types of models.But it's not always the case, and enforcing it can be harmful instead.
The separation is not convenient for a model with many optional losses.Take a multi-task model for example:
Separation | No Separation
---|---
The right one is simpler in that it does not duplicate the branches that enable different tasks/losses. In reality, these conditions can be more complex than a simple if, and branching is generally less straightforward to maintain. So it's beneficial to not have to repeat the logic.
Note: if you think a wrapper like multi_loss_func({"task1": loss_func1, "task2": loss_func2}) will help (like what Keras supports), it is not going to work well, because it doesn't know how to route the inputs/outputs to loss functions.
One may argue that separating "loss" from "model" is nice because then we can easily switch different loss functions independent of "model". That is indeed useful in many cases. However, in many algorithms, loss computation is simply not independent of the model and should not be switched arbitrarily. This could be due to:
Loss computation depends on internal states computed during model.forward, e.g. values that only exist inside forward. In these cases, forcing a separation of "loss" and "model" will require "model" to return its internal states, causing an abstraction leak.
Different loss functions expect different representations of model's predictions. For example, these representations could be:
Since conversion between representations may be expensive or lossy, we'd like the model to produce the exact representation needed by loss computation. Therefore, a separation would not make the model independent of losses. On the contrary, it's even worse, because loss-related logic will be unnaturally split like this:
Separation | No Separation
---|---
We can see in the above snippet that the model is in fact not independent of losses. It also makes loss_func a bad abstraction because the semantics of its prediction argument is complex: it should be in different formats depending on which of loss{1,2} is used. In the version with no separation, it's very clear that the losses are computed using the right representation.
One may argue that the separation is helpful because it's nice to let the "model" return the same data in training and inference. This makes sense for simple models where training and inference share most of the logic. For example, in a standard classification model shown below, we can let the "model" object return logits, which will be useful in both training and inference.
But many models don't have a clean separation like this. In theory, training and inference only have to share (some) trained weights, but don't necessarily have to share any logic. Many object detection models, for example, do not compute "predictions" in training and do not compute losses in inference. A simplified diagram of the Region-Proposal Network (RPN) of a two-stage detector looks like this during training:
Any attempt to split a complicated algorithm like this into "model" and "loss function" will:
Therefore, it's unrealistic to expect that there is a nice separation, or that "model" can produce a consistent format in both training and inference. A better design is to include loss computation in the model's training-mode forward, i.e., let the model output losses in training, but predictions in inference.
Separation | No Separation
---|---
In the "no separation" design, users provide a "model" that returns losses.This model internally can still use separation of "loss function" and "forward logic"as long as it makes sense for this model.However, trainer is no longer aware of the separation,and the trainer can no longer obtain the "outputs".
Will this become a limitation of the "no separation" design? What if we'd like to do something with "outputs"? My answer is: users can do it inside the model, e.g. by calling write_summary(outputs) in their model.

Design is always a trade-off. Adding assumptions to a system might result in some benefits, but at the same time can cause trouble when the assumption isn't true. Finding a balance in between is difficult and often subjective.
The assumption that models have to come together with a separate "loss function", in my opinion, brings more trouble than it's worth.
In deep learning libraries, these variants can be a different implementation of a layer, a change in optimization algorithm, or a small modification to the training logic, etc.
Designing and maintaining these "research APIs" is difficult thanks to how frequently users want to change their behaviors. Such changes are often implemented by simply adding features to the target API they want to modify, e.g. by adding a new flag to the API, or by adding a new abstraction that generalizes the target API towards the users’ use case.
However, when maintaining a generic, core library meant to be adopted by diverse use cases for a long term, the above approach does not scale and poses many problems (discussed more below).
This note lists a few principles when working with "research APIs" that should help answer:
Researchers' job is about doing things in new ways. Hence their needs are so diverse that a core library should not aim to include or implement features for all possible use cases. The library should aim to only include the most popular and standardized features (more on the criteria later).
For features not included in the core, ideally there should be a way for users to implementthem out-of-core as extensions, without too much overhead / repetition.
This requires a continuous design evolution to make the core more modular and composable,so that core code can be reused in users’ new implementation.
A good sanity check for library maintainers is to ask the following question:
For any feature currently in the core library, suppose we remove it today, how much effort would it takefor users to reimplement it out-of-core?
A well-designed library should be decoupled such that most of its features are just extensions of itself, and they can be implemented out-of-core the same way as they are in the core.
There are 3 criteria for feature inclusion in core, ordered by their importance.
To understand the criteria more, let’s ask: what if the feature is —
Popular but not standardized: sometimes a feature is popular, but its users don’t yet align on the proper parameterization, its API, or the subtle implementation details. Including such features is risky, as it may create unclear semantics or impede its standardization in the future. It’s still OK to include it if it’s very popular (popularity is the #1 most important criterion), but try to do it in a composable way and with warning signs.
As a negative example, "Transformer" is a popular but not standardized feature. It's included in Pytorch, but received many complaints, and many projects (e.g. fairseq, detr) eventually had to fork and reimplement their own Transformer.
Simple but not popular/standardized: simplicity alone is not sufficient for inclusion, no matter how simple the feature is, because if everyone adds a simple feature they need, together it becomes complex.
Popular, standardized but not simple: simplicity is the #3 most important factor. If something is complex but very popular & standardized (e.g. BatchNorm being a headache for DL library developers), it should be included. In fact this is where a library could provide a lot of value to users.
When a user wants to change the behavior of a "research API" def func() defined in core, adding new arguments is often the quickest way to get things done. But it may introduce a number of maintenance problems.
New flag | New argument
---|---
Adding a simple argument to control the behavior like above is OK, if we think that the new option is very clear and popular. But as a "research API", many users will want to add their own customizations. This could lead to the following problems:
Poor code health: the library may gradually accumulate too many features that are:
Confusing behaviors: more and more features added over time may not interact with each other in a clear way, causing confusing or silently wrong behaviors.
"More general" may mean "less general":A common argument for adding options like this, is thatit doesn't change existing behavior and"makes the function more general".
However, keep in mind that when a function becomes more general in one aspect, it's often less general in other aspects. Generalizing towards one direction may not be a net win, because research code has too many possible directions to generalize towards, and picking one direction may affect its eligibility to pick others in the future. We will show what this means shortly.
New behaviors can also be encapsulated inside an argument:
Inject custom behaviors through callbacks: | Use object.method as callbacks:
---|---
This appears useful, since the custom logic is not implemented in core, but in a user-provided callback. For example, given the original code below (left), a researcher who wants to compute y differently may propose a compute_y_fn argument like below (right).
Original: | With callbacks:
---|---
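The original snippets aren't preserved; a sketch of the two versions being compared (do_something etc. are placeholders consistent with the surrounding text):

```python
def do_something(a):        # placeholder
    return a + 1

def do_something_else(x):   # placeholder
    return x * 2

# Original:
def func(a):
    x = do_something(a)
    y = do_something_else(x)
    return y

# With a callback argument, as proposed by the hypothetical researcher:
def func_with_callback(a, compute_y_fn=do_something_else):
    x = do_something(a)
    y = compute_y_fn(x)
    return y
```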
However, this design may be even more problematic:
Premature abstractions: assumptions/constraints are implicitly created about where the callback is triggered, what arguments it needs, and what it returns. These assumptions may be bad.
For example, a 2nd researcher may want to compute y using both x and a; a 3rd researcher may want to compute y, z in one function compute_y_z_fn because it's more efficient. These variants conflict with the 1st researcher's design.
In the future, after seeing enough use cases, we might realize that xyz = compute_xyz(a) is a truly good abstraction. However, at that time the premature abstraction of compute_y_fn will get in our way when implementing compute_xyz. In other words, although the current design makes the computation of y "more general", the abstraction limits our ability to generalize the function in other ways. That's why we said earlier that "more general" may mean "less general".
Obscure logic: readers can't easily figure out what this function does: they needto look at the caller of this function to see which callback is supplied, and thenlook at the implementation of the callback function. The aforementioned issue of "confusing behaviors" also applies here.
Sometimes callbacks are good and useful abstractions. But because they are so powerful, I've frequently seen them abused to alter a behavior into something that's strongly overfitted to a small number of use cases. In code reviews, I usually frown upon APIs that require callbacks/user-defined functions.
To customize a "research API" def func()
defined in core, we have the following options:
def func_v2()
in user code.(Or a class ClassV2
for classes).def func_v2()
in core.def func(option)
.The best choice is heavily subjective and should be evaluated case-by-case.Due to the concern of new arguments,in general we recommend methods (1) and (2), i.e. prefer forking func()
over changing func()
.
This also echoesFlax design philosophy thatsays "prefer duplication over adding options / bad abstractions".
Users/developers may find that the core design is not good enough yet, and recreating a variantof func()
without touching it may lead to too much code duplication.For example, ...
is duplicated between the two functions below.
Existing API in core | New variant
---|---
Such duplication is acceptable for a short term.We do NOT mean to encourage users to heavily fork core code.Instead, users and core developers should engage and aim to evolve the core design to reduce duplication— but design change takes time to happen, and duplication is preferred before a good design is found.
The most risk-free way to reduce duplications is by moving them into shared reusable code:
Existing API in core | New variant
---|---
This should be the preferred way to reduce duplications. The benefits are:
func()
, hence little risk.However, there are also challenges:
_reusable_parts()
) to maintain.The above challenges are less significant if _reusable_parts()
is private. Therefore:
func_v2()
is in core, make _reusable_parts()
private.func_v2()
must be out-of-core, consider _reusable_parts()
as "internal/experimental APIs".Inheritance, e.g. class ModuleV2(ModuleCore)
may also reduce duplication between two variants.However, this is generally less preferable than composition like above. The reason is similar towhy callbacks are not preferred: overriding methods is like passing callbacks - they are both user-definedfunctions and suffer from the same limitations: users are constrained by the assumption ofwhen/where/how the methods/callbacks are triggered.
We generally prefer adding a new implementation over adding new conditional branches to the existing implementation,but branches probably will happen somewhere anyway – after all, the new feature variant probably ends up as a new option/argument in the end-users' config.
If branching has to happen, we prefer it at earlier, shallower code path:
Branch earlier | Branch later
---|---
By branching earlier, we keep a clean func()
unaffected by the new variant.This recommendation is consistent with the preference to fork func_v2()
, not to add flag
to func()
.
Low-level components of these systems often use a plain list of values/tensors as inputs & outputs. However, end-users that develop models often want to work with more complicated data structures: Dict[str, Any], List[Any], custom classes, and their nested combinations. Therefore, we need bidirectional conversion between nested structures and a plain list of tensors. I found that different libraries invent similar approaches to solve this problem, and it's interesting to list them here.
Though many simple deep learning models just need a few input/output tensors, nested containers are useful abstractions in advanced models. This is because many concepts are naturally represented by more than one tensor, e.g.:
When a frequently-used concept has natural complexity like above, representing it in a flat structure (e.g. Dict[str, Tensor]) consisting of only regular tensors may result in ugly code. A multi-level nested structure sometimes becomes helpful. Take sparse tensor as a simple example:
 | Use nested containers | Use a flat Dict[str, Tensor]
---|---|---
Representation | {"a": SparseTensor, ...}; SparseTensor can be a namedtuple/dataclass, or a new class. | {"a_values": Tensor, "a_indices": Tensor, ...}
Sanity check | SparseTensor class can guarantee both tensors exist and follow certain contracts (e.g. their shapes match) | Need to check a_{values,indices} co-exist in the dict
Pass to another function | Pass x["a"] directly | Extract x["a_values"], x["a_indices"] and pass both
Operations | SparseTensor class can have methods that work like regular tensors, e.g. y = x["a"] + 1 | Need to implement many new functions, e.g. y = add_sparse(x["a_values"], x["a_indices"], 1)
Despite the benefits, lower-level stacks often ignore these abstractions and choose to use a "flat" interface: their inputs & outputs are a flat list of values / Tensors. This is because: (i) the abstraction may no longer be useful at the lower level; (ii) a simple structure simplifies their implementation; (iii) a flat list is a data structure available even in lower-level languages & systems.
Therefore, conversion from a nested structure to a plain list of values is important. This is often referred to as "flatten". It is pretty straightforward to flatten a container recursively -- like the following flatten function:
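The original snippet isn't preserved; a sketch of such a recursive flatten for common Python containers:

```python
def flatten(container):
    """Recursively flatten dicts/lists/tuples into a flat list of leaf values."""
    if isinstance(container, dict):
        return [leaf for v in container.values() for leaf in flatten(v)]
    if isinstance(container, (list, tuple)):
        return [leaf for v in container for leaf in flatten(v)]
    return [container]

obj = {"a": [1, 2], "b": (3,)}
assert flatten(obj) == [1, 2, 3]
```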
The inverse of flatten is also important: given new values [x2, y2, z2], we want the unflatten function below to construct obj2 that has the same structure as obj.
obj2 = unflatten([x2, y2, z2], ???)
unflatten is a very handy utility. For example, to create a clone of obj on a different device, we simply do this:
|
Without unflatten, every such functionality needs to be reimplemented as a recursive function, like PyTorch's pin_memory.
unflatten

How do we implement unflatten? Apparently, we need to give it a representation of structure (noted as a placeholder ??? in the above code). There are two high-level approaches to solve this problem:
Schema-based: when flattening a container, explicitly record its structure/schema to be used for unflatten.Its API may look like this:
|
Examples: Detectron2's flatten_to_tuple, TensorFlow's FetchMapper, JAX's pytree.
Schema-less: use the entire nested container as an implicit representation of structure. Its interface looks like this:
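The original interface snippet isn't preserved; a small usage sketch in the style of tf.nest (a schema-less implementation named just below):

```python
import tensorflow as tf

obj = {"a": [1, 2], "b": 3}
values = tf.nest.flatten(obj)                       # [1, 2, 3]
obj2 = tf.nest.pack_sequence_as(obj, [10, 20, 30])  # obj itself acts as the schema
```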
Examples: TensorFlow's tf.nest, DeepMind's dm-tree.
The two approaches have some pros and cons:
JAX's low-level components accept/return flat tensors, so functions can be transformed and optimized more easily. Since end-users need nested containers, JAX transformations support pytree containers, which by default include flattening & unflattening for common Python containers. It further allows users to register custom classes via register_pytree_node.
Pytree uses a schema-based implementation that we already show-cased above.
When we need to independently process each leaf of the container, JAX provides another handy function, tree_map:
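The original snippet isn't preserved; a small usage sketch:

```python
import jax

obj = {"a": [1.0, 2.0], "b": (3.0,)}
doubled = jax.tree_util.tree_map(lambda x: x * 2, obj)
# -> {"a": [2.0, 4.0], "b": (6.0,)}
```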
PyTorch also adds a similar implementation of pytree at herethat is used in its FX tracing.
TracingAdapter

torch.jit.trace(model, inputs) executes the model with given inputs, and returns a graph representation of the model's execution. This is one of the most common methods (and the best IMO) by which PyTorch models are exported today. However, it limits the model's input & output format.
In order to trace models with more complicated inputs & outputs, I created the TracingAdapter tool in detectron2, which flattens/unflattens a model's inputs and outputs into a simple Tuple[Tensor] to make it traceable. A minimal implementation of it may look like this:
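Since the original snippet isn't preserved, here is a rough sketch of the idea (the real detectron2 implementation differs in details; flatten here is the schema-based helper described just below):

```python
import torch

class TracingAdapter(torch.nn.Module):
    def __init__(self, model, inputs):
        super().__init__()
        self.model = model
        # Flatten the example inputs once, remembering their structure.
        self.flattened_inputs, self.inputs_schema = flatten(inputs)

    def forward(self, *flat_inputs):
        # Rebuild structured inputs, run the model, and flatten its outputs,
        # so torch.jit.trace only ever sees tuples of tensors.
        structured = self.inputs_schema.unflatten(flat_inputs)
        outputs = self.model(*structured)
        flat_outputs, self.outputs_schema = flatten(outputs)
        return tuple(flat_outputs)
```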
where flatten uses a schema-based implementation that can be found in this file. Coincidentally, its interface looks like JAX's pytree:
|
Perception models in Meta accept a wide range of input/output formats: they may take any number of images plus auxiliary data as inputs, and predict boxes, masks, keypoints, or any other interesting attributes as outputs. But deployment prefers a flat interface for optimizability and interoperability. TracingAdapter's automatic flattening and unflattening mechanism has freed engineers from writing format-conversion glue code when deploying these models.
In addition to deployment, TracingAdapter is also useful in a few other places to smooth the experience of torch.jit.trace:
- TracingAdapter is the easiest way.
- Tensorboard's add_graph method, which visualizes the graph structure in tensorboard, requires flattened inputs, therefore TracingAdapter can be used like this.
- TracingAdapter is useful as well, e.g. here.
tf.nest

tf.nest.flatten and tf.nest.pack_sequence_as implement schema-less flattening and unflattening. The unflatten function requires a container, and it will flatten this container on the fly while simultaneously "packing" flat values into the structure of this container. Here is an official example (note that dict values are ordered by keys):
|
tf.nest.{flatten,pack_sequence_as} are widely used in TensorFlow because many low-level components have a flat interface, especially for interop with C APIs.
|
tf.nest.map_structure has the same functionality as JAX's tree_map.
FetchMapper

TFv1's session.run(fetches) supports fetching nested containers. This is demonstrated in an example from the official documentation:
|
This powerful interface exists in TF's Python client only. The client interacts with the C API's TF_SessionRun, which only accepts a plain array of inputs/outputs. Therefore, the client needs to:
The flatten/unflatten logic uses a schema-based implementation in the client's FetchMapper. This implementation is a bit more complicated due to an extra guarantee that the flattened tensors are unique. (This is to ensure the client won't fetch the same tensor twice in one call; this cannot be done by using tf.nest.)
In addition to builtin Python containers, FetchMapper supports a few other TF containers (such as SparseTensor) and can be extended to new containers by registering conversion functions.
The tree library

DeepMind has a tree library as a standalone alternative to tf.nest:
deepmind/tree | tf.nest | jax.tree_util |
---|---|---|
tree.flatten | tf.nest.flatten | jax.tree_util.tree_flatten |
tree.unflatten_as | tf.nest.pack_sequence_as | jax.tree_util.tree_unflatten |
tree.map_structure | tf.nest.map_structure | jax.tree_util.tree_map |
An nn.Module can be converted into a graph represented in TorchScript format in two ways: tracing and scripting. This article will compare them and argue that torch.jit.trace should be preferred over torch.jit.script for deployment of non-trivial models.

The second point might be an uncommon opinion: if I Google "tracing vs scripting", the first article recommends scripting as default. But tracing has many advantages. In fact, by the time I left, "tracing as default, scripting only when necessary" was the strategy by which all detection & segmentation models in Facebook/Meta products were deployed.
Why is tracing better? TL;DR: (i) it will not damage code quality; (ii) its main limitations can be addressed by mixing in scripting.
We start by disambiguating some common terminology:
Export: refers to the process that turns a model written in eager-mode Pythoncode into a graph that describes the computation.
Tracing: An export method. It runs a model with certain inputs, and "traces / records" all the operationsthat are executed into a graph.
torch.jit.trace is an export API that uses tracing, used like torch.jit.trace(model, input). See its tutorial and API.
Scripting: Another export method. It parses the Python source code of the model, and compiles the code into agraph.
torch.jit.script is an export API that uses scripting, used like torch.jit.script(model). See its tutorial and API.
TorchScript: This is an overloaded term
To avoid confusion, I'll never use "TorchScript" alone in this article.I'll use "TS-format" to refer to the format, and "scripting" to refer to the export method.
Because this term is used with ambiguity, it may have caused the impression that "scripting" is the"official / preferred" way to create a TS-format model. But that's not necessarily true.
(Torch)Scriptable: a model is "scriptable" if torch.jit.script(model) succeeds, i.e. it can be exported by scripting.
Traceable: a model is "traceable" if torch.jit.trace(model, input) succeeds for a typical input.
Generalize: a traced model (the object returned by trace()) "generalizes" to other inputs (different from the inputs given during tracing) if it can run inference correctly when given other inputs. Scripted models always generalize.
Dynamic control flow or data-dependent control flow: control flow where the operators to be executed depend on the input data, e.g. for a Tensor x, if x[0] == 4: x += 1 is a dynamic control flow.
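The original examples aren't preserved; a sketch of the distinction:

```python
import torch

def static_flow(x: torch.Tensor):
    # Not data-dependent: the same ops run regardless of x's values.
    return x * 2 + 1

def dynamic_flow(x: torch.Tensor):
    # Data-dependent: which branch executes depends on the tensor's content,
    # so a trace only captures the branch taken for the example input.
    if x.sum() > 0:
        return x * 2
    return x - 1
```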
If anyone says "we'll make Python better by writing a compiler for it", you should immediatelybe alarmed and know that this is extremely difficult.Python is too big and too dynamic. A compiler can only support a subset of its syntax features and builtins, at best --the scripting compiler in PyTorch is no exception.
What subset of Python does this compiler support? A rough answer is: the compiler has good support for the most basic syntax, but medium to no support for anything more complicated (classes, builtins like range and zip, dynamic types, etc.). But there is no clear answer: even the developers of the compiler usually need to run the code to see if it can be compiled or not.
The incomplete Python compiler limits how users can write code. Though there isn't a clear list of constraints, I can tell from my experience what impact they have had on large projects: code quality is the cost of scriptability.
To make their code scriptable / compilable by the scripting compiler, most projects choose to stay on the "safe side" and use only basic Python syntax: no/few custom structures, no builtins, no inheritance, no Union, no **kwargs, no lambda, no dynamic types, etc.
This is because these "advanced" compiler features are either not supported at all, or supported only "partially", which is not robust enough: they may work in some cases but fail in others. And because there is no clear spec of what is supported, users are unable to reason about or work around the failures. Therefore, eventually users move to and stay on the safe side.
The terrible consequence is that developers stop making abstractions / exploring useful language features due to concerns about scriptability.
A related hack that many projects do is to rewrite part of the code for scripting: create a separate, inference-only forward codepath that makes the compiler happy. This also makes the project harder to maintain.
Detectron2 supports scripting, but the story was a bit different: its code quality, which we value a lot in research, did not go downhill. Instead, with some creativity and direct support from the PyTorch team (and some volunteered help from Alibaba engineers), we managed to make most models scriptable without removing any abstractions.
However, it is not an easy task: we had to add dozens of syntax fixes to the compiler, find creative workarounds, and develop some hacky patches in detectron2 that live in this file (which honestly could affect maintainability in the long term). I would not recommend that other large projects aim for "scriptability without losing abstractions" unless they are also closely supported by the PyTorch team.
If you think "scripting seems to work for my project"so let's embrace it, I might advise against it for the following reasons,based on my past experiences with a few projects that support scripting:
What "works" might be more brittle than you think (unless you limit yourself to the basic syntax):Your code might happen to compile now, but one day you'll add a few innocent changes to your modeland find that the compiler refuses it.
Basic syntax is not enough: even if more complex abstractions don't appear necessary to your project at the moment, if the project is expected to grow, it will require more language features in the future.
Take a multi-task detector for example: as tasks are added, its interfaces may need Union or more dynamic types. Large, growing projects definitely need evolving abstractions to stay healthy.
Code quality could severely deteriorate: ugly code starts to accumulate, because clean code sometimes just doesn't compile. Also, due to syntax limitations of the compiler, abstractions cannot easily be made to clean up the ugliness. The health of the project gradually goes downhill.
Below is a complaint in PyTorch issues. The issue itself is just one small papercut of scripting, but similar complaints were heard many times. The status quo is: scripting forces you to write ugly code, so only use it when necessary.
What it takes to make a model traceable is very clear, and has a much smaller impact on code health.
First, neither scripting nor tracing works if the model is not even a proper single-device, connected graph representable in TS-format. For example, if the model has DataParallel submodules, or if the model converts tensors to numpy arrays and calls OpenCV functions, etc., you'll have to refactor it.
Apart from this obvious constraint, there are only two extra requirements for traceability.
Input/output format: the model's inputs/outputs have to be Union[Tensor, Tuple[Tensor], Dict[str, Tensor]] or their nested combinations. Note that all values in a dict have to be of the same type.
Similar constraints exist for scripting as well. However, in tracing the constraint does not apply to submodules: submodules can use any input/output format -- dicts of Any, classes, kwargs, anything that Python supports. Only the top-level model is required to use the constrained format.
This makes the constraint very easy to satisfy. If the model uses richer formats, just create a simple wrapper around it that converts to/from Tuple[Tensor]. Detectron2 even automates this for all its models with a universal wrapper like this:
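The original post links to detectron2's actual wrapper; the snippet below is only a rough illustration of the idea (the rich input/output format here is made up):

```python
import torch

class TupleWrapper(torch.nn.Module):
    """Adapt a model with rich I/O to the Tuple[Tensor] format required at the top level."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def forward(self, image: torch.Tensor, boxes: torch.Tensor):
        # convert flat tensors into the rich format the wrapped model expects
        outputs = self.model({"image": image, "proposals": boxes})
        # convert the rich outputs back into a flat tuple of tensors
        return outputs["scores"], outputs["boxes"]

# traced = torch.jit.trace(TupleWrapper(model), (image, boxes))
```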
Symbolic shapes: expressions like tensor.size(0), tensor.size()[1], tensor.shape[2] are integers in eager mode, but Tensors in tracing mode. This difference is necessary so that during tracing, shape computation can be captured as symbolic operations in the graph. An example is given in the next section about generalization.
Due to the different return types, a model may be untraceable if parts of it assume shapes are integers. This can usually be fixed quite easily by handling both types in the code. A helpful function is torch.jit.is_tracing, which checks whether the code is executed in tracing mode.
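A minimal sketch of such a guard (the function below is made up for illustration):

```python
import torch

def flatten(x: torch.Tensor) -> torch.Tensor:
    n = x.size(0)                             # int in eager mode, Tensor while tracing
    if not torch.jit.is_tracing():
        assert n > 0, f"empty batch ({n})"    # int-only logic, skipped during tracing
    return x.reshape(n, -1)                   # works with both int and symbolic shapes
```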
That's all it takes for traceability -- most importantly, any Python syntax is allowed in the model implementation, because tracing does not care about syntax at all.
Just being "traceable" is not sufficient.The biggest problem with tracing, is that it may not generalize to other inputs.This problem happens in the following cases:
Dynamic control flow:
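The original post shows a concrete snippet here; a minimal sketch of the same failure mode:

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    if x.sum() > 0:          # dynamic control flow: the branch depends on the data
        return x
    return -x

m = torch.jit.trace(f, torch.ones(3))   # only the "positive" branch gets recorded
print(m(-torch.ones(3)))                # wrong: returns the input unchanged instead of -x
```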
In this example, due to dynamic control flow, the trace only keeps one branch of the condition, and will not generalize to certain (negative) inputs.
Capture variables as constants:
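Again a minimal sketch of this failure (made up for illustration):

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    return x.reshape(len(x), -1)        # len() returns a plain Python int

m = torch.jit.trace(f, torch.ones(4, 6))
print(m(torch.ones(8, 6)).shape)        # the 4 was baked in as a constant: wrong shape (4, 12)
```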
Intermediate computation results of a non-Tensor type (in this case, an int) may be captured as constants, using the value observed during tracing. This causes the trace to not generalize. In addition to len(), this issue can also appear in .item(), which converts tensors to int/float.
Capture device:
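A sketch of how a device gets captured (made up for illustration):

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    mask = torch.ones(1, device=x.device)   # the device becomes a constant in the trace
    return x * mask

m = torch.jit.trace(f, torch.zeros(4))      # traced on CPU
print(m.code)                               # the traced code hard-codes device="cpu"
```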
Similarly, operators that accept a device argument will remember the device used during tracing (this can be seen in m.code). So the trace may not generalize to inputs on a different device. Such generalization is almost never needed, because deployment usually has a fixed target device.
The above problems are annoying and often silent (warnings, but no errors), but they can be successfully addressed by good practices and tools:
Pay attention to TracerWarning: in the first two examples above, torch.jit.trace actually emits warnings. The first example prints:
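The original post quotes the warning here. The exact wording depends on the PyTorch version, but it looks roughly like this:

```
TracerWarning: Converting a tensor to a Python boolean might cause the trace to be
incorrect. We can't record the data flow of Python values, so this value will be
treated as a constant in the future. This means that the trace might not generalize
to other inputs!
```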
Paying attention to these warnings (or even better, catching them) will expose most generalization problems of tracing.
Note that the "capture device" case does not print warnings because tracing was not designed to support such generalization at all.
Unittests for parity: unittests should be run after export and before deployment, to verify that the exported model produces the same outputs as the original eager-mode model, i.e.
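a check along these lines (a minimal sketch; model, input1 and input2 are placeholders, and the tolerances may need tuning):

```python
import torch

traced = torch.jit.trace(model, (input1,))
assert torch.allclose(model(input1), traced(input1))   # parity on the tracing input
assert torch.allclose(model(input2), traced(input2))   # parity on a different input
```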
If generalization across shapes is needed (it is not always needed), input2 should have different shapes from input1.
Detectron2 has many generalization tests, e.g. this and this. Once a gap is found, inspecting the code of the exported TS-format model can uncover the place where it fails to generalize.
Avoid unnecessary "special case" conditions:Avoid conditions like
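A representative sketch of such a condition (self.layers and the returned shape are placeholders):

```python
if x.numel() == 0:
    # special-case branch for empty inputs
    return torch.zeros((0, self.out_channels), device=x.device)
return self.layers(x)
```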
Instead, improve self.layers or its underlying kernel so it supports empty inputs. This results in cleaner code and also improves tracing. This is why I was involved in many PyTorch issues that improve support for empty inputs, such as #12013, #36530, #56998. Most PyTorch operations work perfectly with empty inputs, so such branching is hardly needed.
Use symbolic shapes: as mentioned earlier, tensor.size() returns Tensors during tracing, so that shape computations are captured in the graph. Users should avoid accidentally turning tensor shapes into constants:
Use tensor.size(0) instead of len(tensor), because the latter is an int. For custom classes, implement a .size method or use .__len__() instead of len(), e.g. like here.
Avoid int() or torch.as_tensor when manipulating shapes, because they will capture constants. This helper function is useful to convert sizes into a tensor in a way that works in both tracing and eager mode.
Mix tracing and scripting: they can be mixed together, so you can use scripting on the small portion of code where tracing does not work correctly. This can fix almost all problems of tracing. More on this below.
Tracing and scripting both have their own problems, and the best solution is usually to mix them together. This gives us the best of both worlds.
To minimize the negative impact on code quality, we should use tracing for the majority of the logic, and use scripting only when necessary.
Use @script_if_tracing: inside torch.jit.trace, the @script_if_tracing decorator can compile functions by scripting. Typically, this only requires a small refactor of the forward logic to separate the parts that need to be compiled (the parts with control flow):
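A minimal sketch of this pattern (the module and function below are made up):

```python
import torch
import torch.nn.functional as F

@torch.jit.script_if_tracing
def pad_to_multiple(x: torch.Tensor, k: int) -> torch.Tensor:
    # shape-dependent control flow: compiled by scripting so the trace generalizes
    if x.shape[0] % k != 0:
        pad = k - x.shape[0] % k
        x = F.pad(x, [0, 0, 0, pad])
    return x

class Head(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.relu()                      # plain tensor ops: handled fine by tracing
        return pad_to_multiple(x, 32)     # the tricky part is scripted during tracing
```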
By scripting only the parts that need it, the code quality damage is strictly smaller than making the entire model scriptable, and it does not affect the module's forward interface at all.
The function decorated by @script_if_tracing has to be a pure function that does not contain modules. Therefore, sometimes a bit more refactoring is needed:
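The original post shows the code before and after such a refactoring side by side. A sketch of what it might look like (everything below is made up for illustration):

```python
import torch

# Before: the data-dependent loop lives inside forward(), next to module calls,
# so it cannot be wrapped in @script_if_tracing as-is.
class HeadBefore(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        while bool(x.abs().max() > 1.0):   # data-dependent loop
            x = x * 0.5
        return x

# After: the loop is extracted into a pure function (tensors in, tensors out),
# which can be compiled by scripting while everything else is traced.
@torch.jit.script_if_tracing
def rescale(x: torch.Tensor) -> torch.Tensor:
    while bool(x.abs().max() > 1.0):
        x = x * 0.5
    return x

class HeadAfter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return rescale(self.proj(x))
```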
In fact, for most vision models, dynamic control flow is needed only in a few submodules where it's easy to make it scriptable. To show how rarely it is needed: the entire detectron2 has only two functions decorated with @script_if_tracing due to control flow, paste_masks and heatmaps_to_keypoints, both for post-processing only. A few other functions are also decorated to generalize across devices (a very rare requirement).
Use scripted / traced submodules:
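The original post shows a snippet here; a minimal sketch of the idea (the names are illustrative):

```python
model.submodule = torch.jit.script(model.submodule)   # script the problematic submodule first
traced = torch.jit.trace(model, (inputs,))            # then trace the whole model
```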
In this example, suppose submodule cannot be traced correctly; we can script it before tracing. However, I do not recommend this. If possible, I suggest using @script_if_tracing inside submodule.forward instead, so that scripting is limited to the internals of the submodule, without affecting the module's interface.
And similarly,
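a sketch of the reverse direction (again with illustrative names):

```python
model.submodule = torch.jit.trace(model.submodule, (submodule_inputs,))  # trace the submodule
scripted = torch.jit.script(model)                                       # then script the parent
```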
this uses a traced submodule during scripting. It looks nice, but is not so useful in practice: it affects the interface of submodule, requiring it to only accept/return Tuple[Tensor] -- a big constraint that might hurt code quality even more than scripting.
A rare scenario where "tracing a submodule" is useful is the following:
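a sketch consistent with the description below (submodule1 / submodule2 are placeholders):

```python
import torch

class A(torch.nn.Module):
    def __init__(self, submodule1: torch.nn.Module, submodule2: torch.nn.Module):
        super().__init__()
        self.submodule1 = submodule1
        self.submodule2 = submodule2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if bool(x.sum() > 0):        # dynamic control flow between two complex submodules
            return self.submodule1(x)
        return self.submodule2(x)

# submodule{1,2} are traced individually, then the parent is scripted:
# a = torch.jit.script(A(traced_submodule1, traced_submodule2))
```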
@script_if_tracing cannot compile such control flow because it only supports pure functions. If submodule{1,2} are complex and cannot be scripted, using traced submodules in a scripted parent A is the best option.
Merge multiple traces: scripted models support two more features that traced models don't: (i) a traced module only supports forward(), but a scripted module can have multiple methods; (ii) a scripted module can change its behavior based on mutable attributes. Actually, both features are doing the same thing: they allow an exported model to be used in different ways, i.e. to execute different sequences of operators as requested by the caller.
Below is an example scenario where such features are useful: if Detector is scripted, the caller can mutate its do_keypoint attribute to control its behavior, or call the predict_keypoint method directly if needed.
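The original post defines a real Detector here. A minimal stand-in with the same do_keypoint / predict_keypoint interface (the actual computation is a placeholder):

```python
import torch

class Detector(torch.nn.Module):
    do_keypoint: bool

    def __init__(self):
        super().__init__()
        self.do_keypoint = True

    @torch.jit.export                        # callable directly on the scripted module
    def predict_keypoint(self, feats: torch.Tensor) -> torch.Tensor:
        return feats * 2                     # placeholder for the real keypoint head

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feats = img.relu()                   # placeholder for the real detector
        if self.do_keypoint:                 # behavior controlled by a mutable attribute
            feats = self.predict_keypoint(feats)
        return feats

m = torch.jit.script(Detector())
m.do_keypoint = False                        # the caller changes the behavior after export
```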
This requirement is not seen very often. But if needed, how do we achieve it with tracing? I have a solution that's not very clean:
Tracing can only capture one sequence of operators, so the natural way is to trace the model twice:
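e.g. something like this (a sketch, reusing the Detector stand-in above):

```python
import torch

img = torch.randn(8)
det = Detector()

det.do_keypoint = True
trace_kpt = torch.jit.trace(det, (img,))     # trace once with keypoints on
det.do_keypoint = False
trace_nokpt = torch.jit.trace(det, (img,))   # trace again with keypoints off
```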
We can then alias their weights (to avoid duplicating the storage), and merge the two traces into one module to script:
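a sketch of the merged module (not the author's exact code):

```python
import torch

class MergedDetector(torch.nn.Module):
    def __init__(self, with_kpt: torch.nn.Module, without_kpt: torch.nn.Module):
        super().__init__()
        self.with_kpt = with_kpt
        self.without_kpt = without_kpt

    def forward(self, img: torch.Tensor, do_keypoint: bool) -> torch.Tensor:
        if do_keypoint:              # plain control flow is fine: this module is scripted
            return self.with_kpt(img)
        return self.without_kpt(img)

merged = torch.jit.script(MergedDetector(trace_kpt, trace_nokpt))
```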
If a model is both traceable and scriptable, tracing always generates the same or a simpler graph (and is therefore likely faster).
Why? Because scripting tries to faithfully represent your Python code, even the parts of it that are unnecessary. For example, it is not always smart enough to realize that some loops or data structures in the Python code are actually static and can be removed:
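a toy sketch of such a static loop:

```python
import torch

class Net(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scales = [1.0, 2.0, 3.0]     # a static Python list
        for s in scales:             # tracing unrolls this loop; scripting may keep the
            x = x * s                # list and the loop in the graph
        return x
```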
This example is very simple, so it actually has workarounds for scripting (use a tuple instead of a list), or the loop might get optimized away in a later optimization pass. But the point is: the graph compiler is not always smart enough. For complicated models, scripting might generate a graph with unnecessary complexity that's hard to optimize.
Tracing has clear limitations: I spent most of this article talking about the limitations of tracing and how to fix them. I actually think this is the advantage of tracing: it has clear limitations (and solutions), so you can reason about whether it works.
On the contrary, scripting is more like a black box: no one knows whether it works before trying. I didn't mention a single trick about how to fix scripting: there are many of them, but it's not worth your time to probe and fix a black box.
Tracing has a small blast radius: both tracing and scripting affect how code can be written, but tracing has a much smaller blast radius and causes much less damage: it only constrains the input/output format of the top-level model and how shapes are handled, as discussed above. Scripting, on the other hand, has an impact on the whole model: every submodule and helper it calls must stay within the compiler's supported subset of Python. Having a large blast radius is why scripting can do great harm to code quality.
Control flow vs. other Python syntax: PyTorch is loved by its users because they can "just write Python", and most importantly write Python control flow. But other Python syntax is important as well. If being able to write Python control flow (scripting) means losing other great syntax, I'd rather give up the ability to write Python control flow.
In fact, if PyTorch were less obsessed with Python control flow, and offered me symbolic control flow such as a torch.cond like this (similar to the API of tf.cond):
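(the original post sketches the wished-for API here; such an operator did not exist in PyTorch at the time of writing, so the snippet below is purely hypothetical:)

```python
def f(x: torch.Tensor) -> torch.Tensor:
    # hypothetical symbolic conditional, analogous to tf.cond(pred, true_fn, false_fn)
    return torch.cond(x.sum() > 0,
                      lambda: x + 1,
                      lambda: x - 1)
```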
Then f could be traced correctly, and I would be happy to use this, no longer having to worry about scripting. TensorFlow AutoGraph is a great example that automates this idea.
这篇文章说说用户怎么提出好的 feature request / pull request, 以及维护者如何对待它们.
这里, 我们忽略那种特别简单的 (例如 10 行代码以内可以实现的) request, 只考虑 non-trivial 的 feature request 和 pull request.
首先, 一个残忍的事实是, 开源项目中大多数的 feature requests 不会得到 maintainer 的回应. 理由也很简单: 项目的资源是有限的, 而修 bug, 维护现有 feature 的优先级自然会更高. 当项目有额外的开发资源时, 一般也会优先推进团队自己原有的开发计划 / roadmap, 或优先为项目的赞助方 (如背后的公司) 实现 feature. 路人的 feature request 优先级可以说是最低的, 排在所有这些之后.
下图是 vscode 社区处理 feature request 的流程: (来源)
Vscode 是一个非常注重社区的项目, 因为编辑器必须要有好的生态才能成功. 因此我们才能看到 vscode 把用户的 "upvote" 也考虑在内. 绝大多数项目并没有这最后一步: 和项目 roadmap 不 align 的 feature request, 一般就直接进入 backlog 了.
在这种情况下, 要想提出一个 "好的 feature request", 并得到 maintainer 的重视, 当然不是那么容易. 一个好的 feature request 一般至少要在以下某一点中比较突出:
要做到这些, 有时候确实需要用户对项目有一定的深入了解, 能够把握住项目的 direction. 毕竟想要项目的 developer 改变原定的计划, 自己没两把刷子是不行的.
反过来, 一个 "平凡的 / 不好的 feature request" 可能会有如下特征:
当然, 一个平凡的 feature request 照样值得提出, 虽然它可能会进入 backlog 暂时无人问津, 但是也许在沉寂一段时间之后会引发更有价值的讨论和实现.
Pull request 是社区向项目贡献代码, 因此一般更受 maintainer 欢迎, 但也不全是. 围绕 pull request 的主要矛盾是 可维护性 : 当 maintainer 同意接受一个 PR 时, 就意味着 maintainer 同意负责维护这段别人写的代码, 这对代码的可维护性是有要求的.
因此, 用户应该认识到, maintainer 关注的绝不仅仅是一个 PR 是否 "work", 而是会考虑更多的因素:
Jeff Geerling 的 Why I Close PRs 和 The Burden of an Open Source Maintainer 也介绍了什么样的 PR 是 maintainer 更乐于见到的. 文章写的很好, 且另外提到了一条重要的沟通原则:
Maintainer 应在CONTRIBUTING.md
或 .github/pull_request_template.md
里为 contributor 提供引导, 包括介绍提交 PR 的注意事项, PR 被接收的原则, 项目的 coding style, 如何使用 linter, 如何测试, 如何更新 documentation, 等等. 例如 detectron2 的 contributing.md 和 pull_request_template.md.
开源社区中, 用户会有无数不同的需求. 即使 maintainer 有时间 (大部分 maintainer 没有) 去处理 feature request / pull request, 也会有很多人的需求无法满足.
在这种现实下, 面对没有精力实现 / 维护的 {feature, pull} request, maintainer / contributor 可以采取的一个好的策略是: 通过一些改动让项目变得更 extensible, 使得 feature 可以被用户以扩展 / extensions 的方式独立实现, 而不是在项目中实现.
具体要怎么做到这一点, 是一个系统设计问题, 这篇文章就不跑题多说了. 采用这种方式的好处是:
很多成功的开源项目都是靠着可扩展性创建了优秀的生态.
Tensor
subclass, 自己的 device 等非常夸张的扩展. 最近的torch.fx
也是在给用户实现 graph transformation 扩展的机会.PyTorch 团队会使用 "extension points" 这个词, 指系统中可以由用户实现扩展的部位.Detectron2 也从最初就尽量走这条路, 把 "尽量让所有模块都可扩展 / 可替换" 作为一个设计目标.Facebook 与之相关的 research project 就都以 detectron2 扩展的形式开源. 除此之外也有不少来自社区的优秀扩展, 例如 AdelaiDet, YOLOv7 等.
如果 pull request 并不容易被接受, 那么开发者是不是应该干脆自己 fork 项目, 来实现自己想要的改动呢? 要回答这个问题, 要先想清楚将这些改动开源的目的是什么:
如果只是一个 proof-of-concept, 为了公开的展示这个改动的内容, 那么 fork 是没问题甚至更合适的:
开发者也要意识到, 如果认为自己的工作不只是一个 proof-of-concept/toy, 想要让自己的 fork 真的被人严肃的使用的话, 就不得不自己承担维护的责任. 而维护的负担是很重的, 挑几个点来说:
Do not remove a fence until you know why it was put up in the first place
因此, 虽然一个成功的 pull request 要付出额外的交流, 但它换来的是项目维护者的维护工作. 如果开发者想加入新 feature, 又没有自信能胜任整个项目的维护, 与其另起炉灶, 不如多参与交流, 与维护者讨论一个更可维护的方案 (pull request 或 extension).
]]>我听过不少人凭借爱好开源了自己的项目后, 却对 issue 太乱感到困扰, 甚至想干脆直接禁用 issue. 其实, 任何项目达到一定规模后, 如果不对 issue 进行适当管理, 都会使 issue 信噪比过低, 失去原本的功能.
这篇文章主要从 maintainer 的角度说说, 在具备规模的项目中管理 issue 的一些方法和原则.
任何具备一定规模的项目都应该使用 issue template.Issue template 位于项目的.github/ISSUE_TEMPLATE
目录, 包含两种文件:
每个 template 有一个 markdown 文件, 对应一类 issue. 其中描述需要用户提供的信息.
还可以为这个 issue template 自动配置 issue label. 然而由于 template 是用户选择的, 这种方式得到的 issue label 噪音较大, 可能还需要 maintainer 纠正. (我的策略是仅对 "feature request" 和 "documentation issue" 自动 label)
可选的config.yml
全局配置文件. 有用的配置包括:
blank_issues_enabled
: 是否允许用户不使用 template 自己写 issue.contact_links
: maintainer 用它将用户引导到其他地方 (论坛, discussions 等).Github 近期在测试 issue form, 是 issue template 的升级版, 有了更好看的 UI 和丰富的输入类型. 可惜我一直还没有测试机会.
常见的 issue 有如下两大类:
除了这些之外, 用户常常还想问各种其他问题, 譬如 "怎么用 XXX", "我这样做对不对", "项目里这段 code 是干嘛用的" 等等. 暂且将它们称为 "question". 我认为, 大的 (issue 很多的) 开源项目中 issue 里不应包含这些 "question", issue 应当 不超过上面的两类.
为什么? 当 issue 很多的时候, "question" 与两大类 "issue" 有些本质的不同, 会导致 issue 难以管理:
总而言之, question 大多以用户为中心, 处理它们的沟通成本更高, 而对项目的 contribution 却更低. 混杂在以项目为中心的另两类更重要的 issue 中会分散 maintainer 的精力. 因此很多大的项目都希望将 question 剥离出 issue.
然而, 用户确实有问问题或进行其他交流的需求, 这样的需求可以用 github discussions / 论坛来满足.
Github 近两年推出了 "discussions" 版块.Discussions 在功能 / UI 上与 issues 有所区别, 各方面都更像传统的论坛: 例如没有 open/close/assign 的状态, 可以 "顶帖", 可以 "mark as answer", 等等. 简单来说, github discussions 就是提供一个 简化版的论坛.
在内容上, github 并没有给 discussions 和 issues 定义明确的边界, 这个边界由每个项目自己定义:Maintainer 应通过 issue category 和 issue template 来 声明自己愿意支持解决的 issue 有哪些 (例如 bug report, feature request), 并告知用户 "其他" 讨论 / Question 可以发到 discussions 中. 如果发错了地方, maintainer 可以通过 github 提供的按钮一键在 issue/discussion 之间转换.
我们以 PyTorch 为例. 在 PyTorch 的 issue 列表点击 "new issue" 后, 进入 PyTorch 的 issue 类别 页面.
可以看到:
PyTorch issue 就只包含上文提到的两大类: bug 与 feature (只是细分成了更多类).
实践上把 documentation 细分出一类是很有用的. 因为 documentation 的勘误到底是属于 "bug" 还是 "enhancement" 可能会有歧义.Documentation 被细分后, maintainer 就可以将 "bug" 定义为狭义的代码 bug, 将 "enhancement" 定义为 "feature request", 使得类别的定义更清晰.
所有 "其他讨论" 都通过最后一行的按钮被引导到 PyTorch 的官方 Discourse 论坛上. 曾经, PyTorch 甚至专门有一个 "question" issue template 的内容就是 "不要发 question, 请用论坛". 由于避免了 question, PyTorch issue 始终维持了高质量的技术讨论, 也达到了管理开发任务的 "tracker" 功能.
Github discussions 的定位就是一个项目自带的简易论坛, 毕竟不是所有项目都有资源自己搭建一个论坛.
再以 TensorFlow 做个反面教材: 我由于曾经是深度 TF1 用户, 在早期还是很喜欢看它的 github. 然而 TensorFlow 长期没有对 issue 进行分流. 可以观察到大约在 18 年前后, 估计由于 issue 的噪声太大, 性价比太低, TensorFlow issues 里已经很少再有 core developer 回复, 导致真正有价值的 issue 也更难以得到重视了. 我就多次需要靠手动 at 对应领域我认识的 developer 才能有人回应我报的 bug. 直到 2021 年, TensorFlow 才终于开始在 issue template 里把用户引导至自建 Discourse 论坛.
最后还是要提醒: discussions / 论坛仅适用于规模较大, 问题较多的项目. 对小项目, 额外一个讨论平台引入的 overhead 可能得不偿失.
在第一篇文章中说到,maintainer 自己决定自己有哪些义务, 决定自己的 commitment, 也即自己愿意对用户提供哪些 "support". 很多 maintainer 与用户沟通上的问题, 源于没有划清自己的义务范围. 一旦这条线划清了, maintainer 就无需为乱七八糟的 issue 头疼: 项目不 support 的问题不必操心, 关闭或者移至 discussions 都可以.
Maintainer 应该通过 issue template 的选项表明哪些类 issue 是允许的. 可以通过blank_issues_enabled: false
来禁用 "无 template" 的 issue. 可以通过contact_links
引导 "其他问题" 到别的地方. 如果用户依然发了不支持的 issue, 可以以 "不支持" 为由关闭 / 移至 discussions.
Issue template 的内容里可以更清楚的声明哪些常见情形是不支持的, 例如:
用户应该认识到 "支持 / support" 到底是什么意思:
对于 maintainer 职责之外的 issue, 即使 maintainer 个人愿意帮助, 也可以立刻关闭 / 移至 discussion, 再进行评论. 这样的情况下, 我一般会关闭 issue 并说:
Because of ABC, this issue is unsupported/unrelated, therefore closing the issue.
I think doing XYZ might solve/help the issue.
在这里, "close issue" 表明了 issue 不被支持, 这样提前避免用户由于 "得到了评论" 而对于 support 有不切实际的预期. 也避免了 (其他) maintainer 在下次处理 issue 列表时再看一次.
同时, 也在不需要花自己太多时间的前提下给了简单的建议, 但至于是否能解决问题我就不再管了.
这一节说说对于 bugs/unexpected issues 的常见处理流程和注意事项.
使用 Issue Template: 上篇文章中说了用户报告 unexpected issues 时需要提供的几类信息: expectation, unexpected observation, environment, reproducible example.Maintainer 应该使用 issue template 来告知 / 引导用户提供这些信息.
Detectron2 的 "unexpected problems" issue template 可以作为参考. Facebook AI Research 的其他一些 project 也参考了这个 template (如 pytorch3d,vissl).
检查必要的信息: 还是有不少用户不尊重 issue template, 不提供需要的信息. 以下几个方案可能有帮助:
If you need help to solve an unexpected issue you observed, please include details following theXXX issue template (link).
分析, 解决 issue: 任何一个有足够信息的 unexpected issue, 应该 有且仅有 如下几种结果:
可以看到, 以上几种结果基本都是对项目有 contribution 的. 甚至即使 issue 最终不存在, maintainer 也可能从 unexpected issues 中看到提升用户体验的机会. 因此 unexpected issues / bugs 对项目有很大价值.
介绍一些管理 issue 的 bot:
上面提到过的检查 issue 是否包含必要信息的 bot. 然而为了用户体验, 这个 bot 是 precision-driven 的, 只检测最明显的情况, recall 并不高.
自动关闭 "needs-more-info" 的 issue: 如果 issue 有了 "needs-more-info" 的标签, 等待用户提供必要的信息, 却长时间没有 update, 就会被 bot 自动关闭. 当有了 update 时, 标签会被这个 workflow 自动移除.
自动锁定古老 issue: 如果项目一直在活跃开发, 那么一个古老的, 已解决的 bug 很可能没有任何值得 follow up 的信息: 即使类似的 bug 又出现了, 大概率也和旧的 bug 没什么关系. 那么可以对此类 issue 设定为静默一年后自动锁定 (禁止评论).
自动 label:Github 支持按照 issue template 来自动 label, 但是那样的粒度太粗. 如果对于特定类的 issue 能够根据内容来精准匹配的话, 也可以用这个 bot 添加 label. 但是需要注意自然语言处理是很困难的, 给这个 bot 写规则并不容易.
自动订阅 label: 巨型项目中, 开发者想要自动 subscribe 特定模块相关的 issue. 这个 bot 按照 issue 的 label 自动添加 "@username" 来 subscribe 感兴趣的开发者.
Stale bot: 自动关闭一段时间没有 activity 的 issue. 这个 bot 很常见, 但 不应该被使用, 因为没有 activity 不代表 issue 解决了. 参考:
注意这里假设了 issue 和 question 是被区分开的. 如果 question 也被包括在 issue 里, 自动关闭 question 是可以接受的.
报告错误 / 报 bug 是用户与开发者间最常见的一类交流, 也是常见的 github issue. 但是很多用户并不会科学的报 bug, maintainer 对此也缺乏引导. 因此这篇文章讨论如何科学的报 bug.
如何报 bug, 不仅适用于开源社区, 也适用于任何软件开发. 上一篇提到, 开源社区的交流难度比一般的团队合作更大. 如果掌握了在开源社区中报 bug / 修 bug 的交流方式, 在公司里处理类似的事情也会更轻松.
首先, "报 bug" 是一个较为狭义的说法.
在有的项目里, 用户容易确定一个问题是不是 "bug". 但在有些项目里, 用户未必有能力判断问题到底是不是由于项目的 bug 产生的. 程序的错误可能来自于用户自己, 用户的环境, 或其他依赖.
这时候, 报告 "unexpected issues" 是个更合适的说法: 用户报告的是未预期的行为 (unexpected observations/behaviors, 不一定是 error), 然后由更了解情况的人判断它们是不是 bug.
要报告 unexpected issue, 用户应首先一定 确保对方明白自己的 expectation.
Expectation 有时候是很显然的, 比如 expect 程序正常运行但是它崩溃了. 然而, 很多时候, expectation 也许对问题的报告者显然, 对别人却未必.
例如: 一个常见情况是用户写了一大段文字描述自己做了什么, 程序做了什么输出了什么, 看完根本不明白到底哪里是 unexpected. 通过反复询问才了解到, 用户的 expectation 是 "程序不输出 XXX". 这样的 expectation, 未必那么显然.
人类语言往往是模糊的. 要确保对方明白你的 expectation, 以 "我 expect ..." 为开头造句最清楚. 上面的例子里, 如果用户能在流水帐的信息之外, 清楚的说出 "我 expect ...", 则避免了低效的交流.
因为用户的误解, expectation 本身可能是 错误的, 没有根据的, 或不被支持的. 例如:
由误解产生的 expectation 可能就更不显然了. 只有清楚的说出来才能尽早澄清这类误解.
要说清楚 expectation, 一般要包含两个部分:
用户应描述自己看到了什么 现象 (observations) , 而不 (仅) 是自己以为程序做了什么 (presumed behaviors). 因为用户未必理解程序到底做了什么, 也未必有能力描述好程序的行为.
作为一个用户, 你 expect 程序做 X, 但是程序好像没做 X / 做了 Y, 因此你想报告 unexpected issue. 这时候, 不要下结论说程序做了 / 没做什么, 因为:
如果你觉得程序做了错误的事情, 当然可以提供自己的判断和分析, 但最需要提供的是能够支持你的判断的 observations, 例如原始的 logs (如果 observation 与图片有关, 截图).
相比描述 "behavior" 来说, 提供 observation 有这些好处:
更简单: 你只要复制粘贴. 不需要了解这个程序
无歧义: 复制粘贴可以更完整的还原你的 observation, 避免了人类语言的歧义性.
提供 完整的 observations 的话, 其他人就可以跳过用户的判断, 独立判断 到底发生了什么. 这对分析 unexpected issue 是至关重要的. 用户自己的判断可能是错的, 举几个例子:
feature_A=True
之后触发了 failure X, 因此判断feature_A
导致了 X. 但事实可能是, feature_A=False
也会触发 failure X, 只是由于其他原因 X 没有暴露出来.与此相对的, maintainer 不要过度相信用户声称的 behavior. 应该从用户提供的信息中判断用户声称的 unexpected behavior 是否真的发生了.
我一般都会在 issue template 里要求用户提供 完整的 log . 这是性价比最高的信息: 不仅能够用来判断程序的行为, 还能够帮助 debug, 用户也很容易提供. 但还是总有人在报告 error 的时候只给一行 error message, 连 stack trace 都没有, 让人很头疼. 希望未来的 github issue form 能够通过强制必填的表单来更好的教育用户.
重要的事情再说一遍: maintainer 需要 全部的, 完整的 log, 而不仅仅是 error 发生前的 log. 在用户看来没有用的信息对 maintainer 可能是有用的, 不要省略它们.
另外, 既然在报告 unexpected issue, 用户提供的 observation 当然应该清楚的包含 "unexpected" 的部分. 用户需要让 maintainer 能够从 observations 中看到这个 unexpected issue 确实发生了.
Stackoverflow 的 "How to ask a good question" 里有提到 "Minimal Reproducible Example (MRE)" 的概念, 建议阅读.
在开源社区的场景下, 报告一个 unexpected issue 的时候, 用户也应该尽量以代码, 命令, 数据的形式提供 minimal reproducible example. 其意义在于:
反过来:
为了提供一个高质量的 MRE:
用户应提供 maintainer 要求的环境信息 (项目的 version, 依赖的 version, 系统软硬件等等). 它的重要性在于:
Maintainer 最清楚哪些环境信息是需要的, 因此 maintainer 应当以 issue template 等形式告知用户如何提供环境信息. 例如, 在 detectron2 中我提供了一个collect_env.py
脚本, 运行后会输出如下的结果, 比用户自己能想到的信息要详细得多.
(collect_env.py 的输出示例从略)
Maintainer 实现这样的脚本时, 需要注意:
collect_env.py
里使用{conda,pip} list
就是不科学的做法.有时候, 用户仅仅提供自己的环境信息还不足以复现问题, 因为难以确定是环境中的哪个因素导致了 issue. 为了保证 issue 的 reproducibility, 可以考虑使用 docker 或 Colab notebook 提供更完整的环境. 这种情况并不少见: 我在 PyTorch 里有 4 个 bug report 是自带 docker 来 reproduce 的.Maintainer 也应提供官方的 docker/Colab, 方便用户在报 issue 时排除环境问题: 用户可以把自己的 MRE 在官方的环境中测试.
这篇文章更多从用户的角度说了如何报告 unexpected issues. 用户最好应提供:
在 maintainer 给予了足够的引导的情况下, 1-3 的代价都很小, 用户应尽可能提供.4 有时会有一定难度, 文中已介绍.
在第一篇文章中说到, maintainer 自己决定自己的义务 / commitment 有哪些, 那么也就可以要求 unexpected issue 必须包含特定信息, 并决定对于缺少信息的 issue 不予处理. 一个很有趣的极端例子是, you-get
项目直接禁用了 issue 功能, 要求所有的 bug report 必须以 "失败的单元测试" 的 PR 形式报告, 直接满足了以上四点. 对于这种接口简单的工具来说, 不失为一个好办法.
大多数具备规模的项目会通过 issue 类别和 issue template 表明什么样的 issue 是 maintainer 愿意支持的. 为了高效管理, 往往都会对用户提供的信息有硬性要求. 如果项目有 issue template, 而你又没有自信到觉得自己提供的信息比 template 更好, 那么请务必 follow issue template -- 要获得 maintainer 的帮助, 应该首先尊重 maintainer 的要求, 提供必要的信息. 下一篇文章会更详细的说 maintainer 的管理方式.
]]>相比传统的邮件列表 / bugzilla/sourceforge 等开源平台, github 把开源社区交流的成本 / 门槛降的很低, 因此交流的质量也常常随之下降.
我计划写几篇文章, 从 用户 (User) 和 维护者 (Maintainer) 两者的角度写写开源社区中如何使用 issue/PR 进行沟通, 希望能够:
作为主要开发者和维护者, 我曾经管理过 detectron2 和 tensorpack 等项目.2016-2021 年里我一个人处理过这两个项目里约 5000 个 issue/PR,作为用户, 我也参与了 PyTorch / TensorFlow 等不少项目的社区讨论. 在这个过程中, 看到了开源项目中各种不同的沟通, 管理方式. 现在我已经基本离开了这些项目, 于是想把这些经验总结一下.
这篇文章作为第一篇, 只讨论一些基本的原则.
在一个项目中, maintainer 和用户的目的常常并不是完全一致的. 有效交流的基础, 是要理解对方与自己 Priority 上的相同和不同.
大多数开源项目的资源都很有限, 用爱发电. 因此, maintainer 自己决定自己的义务 有哪些.
通常, maintainer 不以满足某个用户为目标, 不当 "客服". 这是因为, 相比于其他可以做的事情而言, 给网上的路人提供个人化的 support 对一个项目能够带来的贡献是非常非常小的. 相反, maintainer 通过做其他事情 (例如修 bug) 让项目发展得更好, 来 间接 的帮助所有用户. 通常来说, maintainer 的 priority 是围绕 项目 为中心, 而不是特定用户.
但是, 用户的诉求很多时候就是要解决自己的问题. 这时候用户一定要认识到: maintainer 对解决你的问题并不一定有兴趣.maintainer 愿意与用户交流, 本质是因为用户的 feedbacks 可能让项目变得更好.
让项目变得更好 是用户与 maintainer 的 common interest, 基于这一点的交流才是最有效的, 二者才能有效合作.
让项目变得更好, 换句话说就是 "make contribution to the project".
"Contribution" 这个词在 github 上主要出现于 "contribution calendar", 这是一个记录用户每天的 "contribution activity" 的日历:
在 contribution calendar 上, 不仅与代码相关的行为 (commits, PR, reviews) 算作 "contributions", 有些奇怪的是, 创建 issue 也算作 "contributions". 这可能正是因为, 在 github 设计者的眼里, issue 理应是为了让项目变得更好, 而不仅是解决自己的需求.
用户如果能够理解这一点, 将自己的个人需求转化为对项目的 contribution, 才能把交流变得更有效. 概括来说的话:
以上三点在实践中意味着什么, 应该怎么做, 会在后面几篇中再说明.
开源社区的交流和公司同事间的开发交流在很多方面是相似的, 开源社区中 Maintainer/User 的关系, 也与公司内部 Code Owner/User 的关系类似. 但是, 开源社区里的沟通难度一般会更大:
因此, 开源社区中的交流方式, 对公司内的交流有参考意义, 但不一定完全适用. 例如下一篇"如何报bug"就更通用.
人类语言是模糊, 容易歧义的, 上面提到的开源社区中交流的障碍, 会把人类语言的歧义放大.
为了能够在消息中传达更多的有用信息, 在交流中要意识到人类语言的局限性. 交流中的每一方如果可以花少量额外时间, 使用代码, 复制粘贴等方式, 将信息尽量组织的更客观, 消除歧义, 就会使交流更有效. 毕竟交流的延迟很大 (至少以小时为单位), 如果更精确的表述能够为双方节省一次 round trip, 就已经赚了.
例如, 在开源社区的交流中:
./main --mode=xx
. 后者更准确类似的例子还有很多. 尽量使用更准确的语言来交流技术问题是个重要的好习惯.
]]>logging
module,with the aim of:Loggers are globally identified by a dot-separated name given to logging.getLogger(name)
, such as library.module.submodule
.The logger named by an empty string is the "root logger".
Libraries must not call logging.basicConfig
or configure the root logger in any way, unless requested by users.
Configuration of the root logger affects all logs, which is beyond the responsibility of anysingle library. Only application developers, i.e. those who create the program that interacts withusers, should determine how to configure the root logger.
Never call functions like logging.{info,error}
from within a library, because they write to theroot logger. I've added this advice into CPython's official documentation.
When a library writes to the root logger, applications that use the library lose control over thelibrary's logging behavior: they cannot turn on/off the logs from the library, apply custom filter/formatter, or redirect thelogs from the library.
Instead, a library should write to a logger with an easily and uniquely identifiable name, using
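the standard pattern (a minimal example):

```python
import logging

logger = logging.getLogger(__name__)    # e.g. "my_lib.submodule"

def do_work():
    logger.info("doing work")           # goes to the library's own logger, not the root logger
```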
This way, the caller of the library will be able to reconfigure the library's logger using the same name.
__name__, i.e. the current module name, is often a good logger name. Occasionally, __name__ is not good enough:
Parts of __name__ may be uninformative and can be removed, e.g. my_lib.submodule._internal._impl. Note that there is a trade-off between name simplicity and the granularity of control.
Modules in a codebase may share a long common prefix in __name__, e.g. company.company3.organization. Removing such a common prefix can simplify the logger names while still keeping them unique.
The "current function/class name" is often a bad logger name.
I wrote a simple script that processes Python source files to automatically replace all logging.xxx by logging.getLogger(__name__).xxx. This script has created PRs in a few projects that misuse the root logger, such as pytorch/72649 and tensorboardX/662.
I hope someone could create a linter that performs this check.
A Handler can be attached to loggers to decide where/how to log a record.
Unless requested by users, a library should not add a handler anywhere (not even to its own logger) if the handler has a visible effect on users. This is because the application developer should make the final call on how each library's logs are processed. Pre-existing handlers may cause issues such as duplicated logs. This suggestion is present in CPython's documentation here.
Examples of invisible handlers that libraries may add to their loggers include logging.NullHandler.
Libraries should try to be good citizens by reducing the amount of duplicate/unwanted/useless logs they print. Some tips include:
Use DEBUG rather than INFO for logs that are only useful for debugging.
Avoid logging in code that may run many times, e.g. in __init__.
Guard noisy logs behind a condition, e.g. if valid(): log(...).
Use helpers such as log_first_n (log only for the first n occurrences), log_every_n_seconds (limit the frequency of certain logs to at most once every n seconds), and log_every_n (log once every n occurrences), instead of calling the logging module directly.
module directly.Logs are not only strings. logging.LogRecord
is a rich structure with useful attributes, and users can even tag logs with custom attributes through the extra=
argument.
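For example (a minimal sketch):

```python
import logging

logger = logging.getLogger(__name__)
# "request_id" and "latency_ms" become attributes on the LogRecord,
# available to any Formatter/Handler downstream.
logger.info("request finished", extra={"request_id": "abc123", "latency_ms": 31})
```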
Large, distributed systems should not rely on printing as the sole method of logging. Whenever logs are printed (to a terminal or to files), they have to be converted to strings. A lot of useful attributes, such as the stack trace and line number, are often lost. The lack of structure also makes it difficult to parse and analyze the logs.
In addition to printing, we can also use an additional Handler to send structured logs to a logging service, such as Google Cloud Logging or Humio. The advantage is that the logs stay structured and searchable, instead of being flattened into strings.
In an MPI-like distributed job (e.g. data-parallel deep learning training with many workers), workers often print almost identical logs. We should avoid printing them all to the terminal.
A good strategy could be: (1) only let the master (rank-0) process print to the terminal; (2) additionally let every process write its full logs to its own file for debugging.
Detectron2's setup_logger implements (1) and (2).
When logs are printed to a terminal, they are more readable if severity is represented by colors rather than strings. I often use a formatter along the lines of the following:
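(not the exact formatter from the post, but a minimal sketch of the idea:)

```python
import logging

_COLORS = {
    logging.WARNING: "\033[33m",     # yellow
    logging.ERROR: "\033[31m",       # red
    logging.CRITICAL: "\033[1;31m",  # bold red
}

class ColorfulFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        msg = super().format(record)
        color = _COLORS.get(record.levelno)
        return f"{color}{msg}\033[0m" if color else msg
```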
Attach this formatter only when the handler writes to a terminal (check sys.stdout.isatty()), and we'll get colored outputs in the terminal.
When the logging module is not enough
Be aware that it's insufficient to rely only on the logging module. A Python program may produce useful logs that bypass the logging module, e.g.:
print statements: they should be avoided, but may still exist.
Output written directly to stdout/stderr by C/C++ extensions or subprocesses, which never goes through Python's logging.
To not miss important logs, a comprehensive logging solution needs to integrate both the structured logs from Python and the less common unstructured logs from the above sources.
]]>延续 上一篇文章, 再说一说怎么科学的在 paper 里做 ablations.
一组理想的 ablation 实验, 应当所有实验尽量使用一份代码实现, 和相同的实验 recipe, 这样才算是真的 ablation. 其中尤其不要忽视实现的重要性, 因为同一个 feature 在不同的实现里可能会有重要的区别. 例如, 一个 TensorFlow 跑的实验和一个 PyTorch 跑的实验就不能放到一组 ablation 里. 我的 Where Are Pixels? -- a Deep Learning Perspective 也说了很多底层实现细节对模型的影响.
反例: EfficientNet 和类似的不少文章设计了新的网络, 却没有跟已有网络结构的 ablations, 只有在不同 recipe 下的 system-level 结果.Revisiting ResNets: Improved Training and Scaling Strategies 一文就说, 其实 ResNet 在加强的 recipe 下仍然很 competitive, 而那些看似很厉害的新模型, 很大程度上受益于它们使用的 recipe. 这篇文章的开头和结尾写的很好, 摘抄一下:
Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies.
...
We hope our work encourages further scrutiny in maintaining consistent methodology for both proposed innovations and baselines alike.
开头吐槽只想 claim 大新闻; 结尾吐槽别人实验做的没有 scrutiny.
反例: FCOS 是一个 system-level 效果很好的 detector, 然而它并没有充分的 ablation 来说明它的效果 为什么好. ATSS 一文就把 FCOS 和 RetinaNet 之间的所有区别进行了 ablation, 发现 FCOS 的性能提升有不少得益于与其中心思想无关的改动.
"控制变量" 并没有看上去的那么美好: 深度学习作为没有太多理论的科学, 不同变量之间常常存在潜在的, 未知的相关性. (其他缺乏理论的学科, 例如医学, 心理学, 社会学也有类似问题). 这种相关性会带来如下一些后果:
相关性让 ablation 的结论更可疑. 虽然一个实验支持了 claim, 但是这个 claim 可能跟实验里被控制的变量相关, 那么也许换一组变量后, claim 就不再成立了. 对于深度学习 paper 的读者, 这也是一个常见的的 concern: ablation 证明了在这个 baseline 下你的方法有用, 可是换个 baseline 呢?
要缓解这个 concern, 应当选择 常用的, 有代表性的 实验 recipe (包括 baseline, hyperparameter, evaluation protocol 等). 一个好的 baseline 并不需要是 SOTA, 但是需要是一个领域内大家公认具有代表性的结果. 如果实验并不是在一个读者熟悉的条件和设定下, 读者更容易怀疑 ablation 的结论是否换个 recipe 仍然通用, 是否有意或无意中选择了对 baseline 不利的设定, 是不是拿着锤子找钉子. 这些都会弱化结论的可信度. 标新立异的选择往往是需要 justify 的, 要说明 "这个锤子为什么适合这种钉子" (见最后一节).
反例: 某 paper 发明了一个新的 layer, 然后: "在一个 (自己设计的) 30 层 ResNet 上做了实验, 实验设定和参数见附录".
反例: 某 paper 发明了新的优化方法, 然后实验是卫星图像分类或医疗图像分类这种小众领域.
相关性使得不同变量带来的效果常常不可叠加: A 变量和 B 变量可能各自能够将结果提高 1%, 但是合在一起也只能提高 1%.
举个直观的例子: 假如某 paper 发明了一个新的 loss function 能够提高结果, 但也许这个 loss function 的主要原理是改变了 gradient magnitude. 这时候, 把旧的 loss function 的系数调一调也能得到一样的效果. 在这里, "loss function" 和它的系数就是两个相关的变量, 他们带来的效果是可以互相替代的. 如果研究者不注意, 写了这样一篇 loss function 的 paper, 被人发现跟调系数没区别, 那 paper 的价值就消失了.
这是另一个我们要使用 常用 recipe 的重要原因: 一个常用的 recipe 往往是已经被 well-tuned, well-studied 的. 这意味着如果能在这个 recipe 上做出 improvement, 这个 improvement 没法通过简单的 tuning 得到. 这也会让结论更强. 即使在 ResNet 早已不是 SOTA 的今天, 我如果要做 CNN 结构相关的实验, 可能仍然会选择从 ResNet 出发.
相关性导致新的方法需要改变 (而不是控制) 变量才有效. 例如, 很多新的模型可能需要找一个新的 learning rate. 这时候, "learning rate" 这个变量就没有控制. 在改变了这个变量的同时还要 convince 读者这个实验是有效的, 是需要做额外的工作的. 下一节会详细解释.
前面说 ablations 要使用常用的 recipe. 但是, recipe 也要与时俱进: 一个曾经不常用的 trick 可能在未来会进化成大家都在用的标准 recipe, 一个新的方法可能需要一个新的 recipe. 如果每篇文章都严格 "控制变量", 只使用旧的 recipe, 领域可能会陷入 local optimum. 那么, recipe 的进化要如何发生呢?
假设 A, B, C, ... 是一些与 ablation 的主要 claim 没有紧密关联的 recipe (例如 hyperparameter / tricks. 为方便理解, 可以把它们当作几个不同的 learning rate), 且 baseline + A 是 baseline 的 "标准" recipe. 当我们在开发一个新的方法 "proposed method" 时, 也许会发现用 B 来做实验比 A 更好 (proposed + B > proposed + A
). 这时, 作者可以展示下面这些实验:
proposed + B > baseline + A
. 这是不足以 claim proposed > baseline
的.
当然, 作者也可以选择将 "B" claim 为 "proposed method" 的一部分 -- 但是这会弱化文章的价值, 因为它让 "proposed method" 更复杂了. 读者也会疑惑: 也许只有 B 就够了, "proposed method" 里剩下的部分也许价值不大.
proposed + B > max(baseline + A, baseline + B)
: 这样来 claim proposed > baseline
, 读者一般是接受的.
在此基础上, 读者会好奇 proposed + A
表现如何. 如果proposed + A < baseline + A
, 则说明 proposed 依赖 B. 如果 B 是某个复杂的 trick 的话, 这种依赖也会降低 proposed 的价值.
要注意到, 以上的结果无法排除下面这种可能性: 存在一个 C, 使得 max(proposed + B, proposed + C) < baseline + C
. C 的存在会使得 baseline 看上去比 proposed 更好. 但是, 由于我们假设 baseline + A 已经是一个常用的标准 recipe 了, 如果存在这样的 C, 那 C 大概率是 nontrivial 的, 不太可能是简单的调参. 这也是为什么要尽量使用标准 recipe.
为了尽量降低 C 存在的可能, 在计算资源允许的情况下应当对 recipe 进行公平的搜索: 如果 proposed 使用的 hyperparameter B 是 grid search 找出来的, 那么也应对 baseline 的 hyperparameter 进行类似的 grid search, 看看是否能找到一个更好的 C.
例子: DETR 的训练代价比主流 detection 模型都大得多 (100-500 epochs), 这点在技术上难以避免. 这个区别导致公平的实验不容易做, 因为主流模型 (Faster R-CNN) 还没有一个常用的, 训练这么长时间的 recipe. 据我了解, 作者们当时尝试了不少方法提高 Faster R-CNN 在这个训练长度下的性能, 尽量让 baseline 更强. 这是很负责任的做法.
前面两节都提到了, 使用一个常用的, 有代表性的, 标准的 baseline 是很重要的. 这样的 baseline 在文中的结果应该至少与别人 paper 相同实验的结果接近. 如果 baseline 比别人差, 说明 baseline 里一定有某些因素与那个常用的 recipe 不同, 因此会弱化结论的可信度.
反例: 某 paper 提出了 ResNet 的小改动, 但是文中的 ResNet baseline 比 pytorch 官方样例显著的差. 在这个 baseline 上有 1% 的提升, 并不意味着在常用的 baseline 上能有提升: 因为如前文所说, 不同的因素常常是不可叠加的. 事实就是, 有许多方法只在弱的 baseline 上有效.
如果 baseline 确实无法 reproduce 怎么办? 这种处境很遗憾. 一个 research topic 如果没有大家公认的 reproducible 的代码和 baseline 设定, 就容易陷入乱象. 例如 A Metric Learning Reality Check,Deep Reinforcement Learning that Matters 都是在吐槽各自领域里的问题. 这正是为什么要做开源高质量 codebase.
前面说到, 实验设定一般使用 "常用, 标准" 的 recipe, 否则有拿着锤子找钉子的嫌疑. 而有的时候, 如果我们恰好要 claim"我的锤子适合特定的钉子", 那么巧妙的改变 recipe 也许会有更好的效果. 下面举几个正 / 反面例子.
正例: ResNet paper 多次 report 了 training error (Fig. 4, 6), 这也许会显得奇怪, 毕竟 training error 不是一个大家常用的 metric. 这是因为文章的大 claim 是关于 residual connection 对训练 / 优化有帮助, 而 deep plain network 难以优化.Training error 才是跟这个 claim 直接相关的 metric, validation error 变好只是 training error 变好的一个副产品.
正 / 反例: Optimizer 的根本目标是降低 training loss, 所以比较不同的 optimizer (例如 SGD/Adam) 的时候不能不看 training loss. 这篇 paper Sec. 5 就吐了这个槽: 有的 optimizer 跑出来的 validation error 更低就声称自己更好, 但是实际上发现它的 training loss 更高.
反例: 我 review 过的某 paper claim 一个方法能够提高模型的 capacity 或表达能力, 但是实验是拿 ResNet 在 Cifar10 上看 validation error. 虽然 validation error 是一个常见的 metric, 但是 ResNet 在 Cifar10 上严重 overfit (training error = 0), validation error 跟模型 capacity 没什么关系.
正例: detection 里有很多可用 metric, 如大小物体的 AP, 不同 IoU 的 AP, 等等. 当有合适的 justification 的时候 (例如模型设计上对大物体更友好), 比较其中某个特定的 metric 能够帮助文章的 claim. PointRend paper 里为了证明 "边界结果更准确" 这个 claim, 设计了一个新 metric: 拿 COCO 训练的模型在 LVIS 的高质量标注下算 AP. 这样得到的结果比使用标准的 metric 更有说服力.
正例: Mask R-CNN 的 Table 2 (d) 使用了一个很少见的 recipe: 基本没人用的 ResNet-Conv5 backbone. 这是为了证明关于 RoIAlign vs. RoIPool 的 claim: RoIPool 的 feature map 不对齐, stride 越大, 影响越大. 通过 Conv5 (stride=32) 上的实验更加强化了这个 claim. 当初之所以在 detectron2 里保留 Conv4 这些性能并不好的模型, 就是因为它们在许多实验中仍然有研究价值.
正例: 我的 Rethinking “Batch” in BatchNorm 实验很多, 里面做了对 BatchNorm 的各种魔改. 这些魔改里, 大多数的目的不是为了 propose 一种新方法, 而是通过改变 BatchNorm 的行为来验证某个 claim. 如何找一个好钉子, 设计一个实验来巧妙的突出 claim, 是一项技术活.
]]>array[H][W]
, where each elementarray[i][j]
is a pixel.How does discretization work? How does a discrete pixel relate to the abstract notion of the underlying continuous image?These basic questions play an important role in computer graphics & computer vision algorithms.
This article discusses these low-level details, and how they affect our CNN models and deep learning libraries. If you ever wonder which resize function to use or whether you should add/subtract 0.5 or 1 to some pixel coordinates, you may find answers here. Interestingly, these details have contributed to many accuracy improvements in Detectron and Detectron2.
Sampling theory tells us how a continuous 2D signal is turned into a discrete array by sampling and filtering.
We choose a sampling grid: a set of point locations at which the signal is sampled.
Values on these sampled points are not directly retrieved from the original signal, but come from a filtering step that removes high-frequency components. A bad choice of filters can lead to aliasing effects.
Sampling and filtering are both important in basic image processing operations, such as resize.Resize operation takes a discrete image, resamples it, and creates a new image.The choice of sampling grid and sampling filter will then affect how such a basic operation is implemented.
For example, the paper On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation studies the filtering issues, and shows that the resize operations in many libraries (OpenCV, PyTorch, TensorFlow) don't take the low-pass filtering into account. This then leads to incorrect deep learning evaluation.
In this article, we ignore the issue of sampling filters, and only study the coordinates of the sampling grid. We'll see that this choice is also inconsistent among libraries, and can affect the design and performance of CNN models.
Pixels are located on a sampling grid we choose. Naturally, we would like to only consider rectangular grids where pixels are spaced evenly. But there are many other factors to be concerned with:
(These terminologies may have a different meaning elsewhere, but this is how I define them in this article.)
For simplicity, we look at the one-dimensional case instead.We want to answer this question: for a 1D signal defined on
In this figure, the green bars represent the 1D signal of length
They (or at least the first two) are all valid interpretations when we are given an array of pixels. The interpretation we choose affects how we implement operations and models, because each of them has some unique weird properties. To understand them more, let's check how a 2x resize operation should be implemented under each interpretation.
We'll now see that a simple "2x resize" operation has many possible implementations.
A unique undesired property of ① is that stride is not the inverse of resolution. So a 2x resize is ambiguous: we have to be clear about whether we want half the stride, or twice more pixels. The new grids after resize look like these:
Resize for grid ② & ③ aren't ambiguous:
You can easily verify that the 4 different resized grids still match thecorresponding definition in our table above.
For the 2D case, the 2x resize in ① (twice more pixels) and ② looks like this (image credit: here), from which you can see why ① (twice more pixels) is also called align_corners:
These 4 different versions of 2x resize have some issues:
Extrapolation: ② and ③ both need extrapolation outside the border of the original grid to perform resize, but ① only needs interpolation. Extrapolation is sometimes undesirable.
Asymmetry: ③ is asymmetric, and that's probably a good reason to never use it. One consequence is that resize(flip(x)) != flip(resize(x)). All the others are symmetric.
Information loss: in ① (half of stride) and ③, about half of the points on the new grid exist in the old grid. By not having to interpolate their values, we minimize the loss of information. However, in ① (twice more pixels) and ②, most or all of the new pixels need to be recomputed.
For resize with other arbitrary scale factors, all versions have information loss. But 2x/0.5x resize are the most common in deep learning.
The DeepLab series of segmentation models is famous for using grid ① (half of stride) for all of its 2x resizes. See here for words from its author. This matches the inconvenient image shapes they use, such as 321x513. I've heard opinions that the benefits of "no information loss" and "no extrapolation" may let it outperform ② in segmentation, but I have yet to see more evidence.
What do libraries use? The situation is a bit messy. I'll list what I know, and look forward to your help to add more. There is no guarantee they are all correct, since I didn't check the source code of all of them.
Library & Operation | Pixel Grid Convention |
---|---|
OpenCV cv2.resize | interpolation=LINEAR/CUBIC : ② interpolation=NEAREST : buggy, none of the above. issue interpolation=NEAREST_EXACT : ② |
Pillow Image.resize | ② |
scikit-image transform.resize | ② |
PyTorch F.interpolate | mode=linear/cubic, align_corners=False : ② mode=linear/cubic, align_corners=True : ① mode=nearest : buggy like OpenCV. issue mode=nearest_exact : ② |
PyTorch F.grid_sample | align_corners=False which I requested: ② align_corners=True : ① |
TensorFlow tf.image.resize | TFv1 method=BILINEAR/NEAREST, align_corners=False : ③ TFv1 method=BILINEAR/NEAREST, align_corners=True : ① TFv2 method=BILINEAR/NEAREST : ② (In TFv2, align_corners option was removed) |
TensorFlow tf.image.crop_and_resize | none of the above. issue I reported |
It seems this mess is unique to the deep learning world. How come? From what I can tell, the history looks like this:
TensorFlow is the first place that introduced ③, in its initial open source release. This was later considered a bug and fixed in v1.14 using a new option named half_pixel_centers=True that follows grid ②.
align_corners=True (①) appeared in TensorFlow 0.7 in 2016. I guess this was probably intended for DeepLab development and not for general use.
In TensorFlow v2, grid ② became the only version of resize, but it was too late. During all these years, the uncommon version (①) and the wrong version (③) have propagated into people's models and other libraries.
PyTorch's interpolate comes originally from the upsample operation. Nearest upsample was buggy when it was first added in LuaTorch in 2014. Bilinear upsample was first added in LuaTorch in 2016 and used grid ①. Grid ② was added to PyTorch in 2018 under an align_corners=False option, and has been the default since then.
Due to this mess, the resize operator in ONNX has to support 5 versions of coordinate transforms! Kudos to the ONNX maintainers.
Many computer graphics textbooks and papers talk about this topic and choose ②, for example:
(Note that some of them use ② but define the continuous signal on the range [-0.5, N-0.5] instead.)
Given all the graphics literature, computer vision and deep learning libraries promoting grid ②, we use ② as the convention.
We pick ② as the convention for grid locations, but this is not the end of the story! We now know the grid locations relative to the beginning of the signal are 0.5, 1.5, ..., N-0.5; we still have to decide where the origin of the coordinate system sits relative to the signal.
This is just a choice of convention and has no substantial effect on any algorithm. Two of the graphics references listed above put the origin on the first pixel. This has the benefit that all pixel locations have integer coordinates, but then it's weird that the signal lies on the interval [-0.5, N-0.5].
Another convention, "integer corners", or "half-integer centers", puts the origin at the beginning of the signal, so the first pixel is centered at (0.5, 0.5).The two conventions are demonstrated in this figure:
We choose "integer corners", and then willhave the following relationship between continuous coordinates and discrete pixel indices:
The choice doesn't matter for resize, because absolute coordinates are not part of its API. However, for functions that accept or return absolute coordinates, we should be aware of their convention. For example:
cv2.findContours returns integer polygons represented by indices. So we always add 0.5 pixel to its results to obtain coordinates that match our convention.
cv2.warpAffine uses "integer centers", and this is complained about in this issue. In fact, most OpenCV functions use the "integer centers" convention.
pycocotools.mask.frPyObjects renders polygons as masks. It accepts polygons in the "integer corners" convention. The same is true for PIL.ImageDraw.polygon, but its results are 0.5 pixel "fatter" due to how it's implemented. This has affected cityscapes annotations.
RoIAlign in torchvision takes a box in absolute coordinates that match our "integer corners" convention.
scipy.ndimage.map_coordinates takes coordinates in the "integer centers" convention.
If a dataset is annotated with coordinates, we also need to know its choice of coordinate system. This information is often not provided by the dataset owner, so we make guesses. For example, in COCO it appears that polygon annotations match our convention, but keypoint annotations do not and should be incremented by 0.5.
Now that we have a convention for the coordinate system, it's good practice in computer vision systems to always use coordinates rather than indices to represent geometries such as boxes and polygons. This is because indices are integers, and can easily lose precision during geometric operations. Using indices for bounding boxes has caused some issues in Detectron.
Models in Detectron / Detectron2 all involve localization of objects in images, so the convention of pixels and coordinates matters a lot. Various improvements and bugfixes in the two libraries are related to pixels.
In detection models, bounding box regression typically predicts "deltas" between the ground truth (GT) box and a reference box (e.g. an anchor). In training, the GT box is encoded to deltas as the training target. In inference, the predicted deltas are decoded to become output boxes.
Boxes in Detectron often use integer indices, instead of coordinates. So the width of a box is given by width = x1 - x0 + 1, and deltas are encoded/decoded with functions like the following:
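(a simplified 1D sketch of the pattern, not Detectron's exact code:)

```python
# "+1" width convention used with integer box indices (x0, x1 are inclusive)
def encode(x0, x1):
    w = x1 - x0 + 1
    center = x0 + 0.5 * w
    return center, w

def decode(center, w):
    x0 = center - 0.5 * w
    x1 = center + 0.5 * w    # bug: should be center + 0.5 * w - 1
    return x0, x1
```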
As innocent as the code seems, the two functions are not inverses of each other: decode(encode(x0, x1)) != (x0, x1). x1 is incorrectly decoded: it should be center + 0.5 * w - 1 instead.
This bug appeared in the py-faster-rcnn project around 2015, and is still there today. It was carried into Detectron and negatively affected results in the Mask R-CNN paper. It was then fixed in late 2017 after I found it, and the fix contributed an improvement of 0.4~0.7 box AP. Detectron went open source in 2018 with this fix. In Detectron2, we adopted the rule of always using floating-point coordinates for boxes, so the issue no longer exists.
How to horizontally flip a geometry? Although pixel indices should be flipped by i → W - 1 - i, coordinates should be flipped by x → W - x. Detectron isn't so rigorous about this and applies the index-style formula (with the extra "-1") to coordinates as well; Detectron2 flips coordinates by W - x.
The augmentation library "imgaug" also made this fix.
COCO's instance segmentation data is annotated with polygons that have sub-pixel precision. Converting polygons to binary masks loses this precision due to quantization, and the loss may become more severe during augmentations. Therefore it's preferable to keep the polygon representation and delay the conversion as much as possible.
In both Detectron and Detectron2, polygon representations are kept during flipping, scaling, and RoI cropping. Masks are not created until the second stage's box predictions are made, where the boxes are used to crop the ground truth polygons and generate the mask training target.
On the contrary, in TensorFlow's detection code here and here, polygons are turned into binary masks immediately at dataset creation time.
The code to generate anchors in Detectron is quite long, because it tries to generate integer-valued anchor boxes. By adopting coordinates for all boxes in Detectron2, integer boxes are not needed. This simplifies all the logic to just a few lines of code.
This does not affect accuracy, because the exact values of anchors are not that important, as long as the same values are used in training & testing.
The RoIAlign operation crops a region from an image and resizes it to a certain shape. It's easy to make mistakes because two images and two coordinate systems are involved. Let's derive how to perform RoIAlign.
Given an image and a region (the green box), we want to resample a K [i,j]
issampling_ratio=1
).We show the 4 neighboring input pixels of output[0,0]
in the figure.The indices of 4 nearest pixels of
The original implementation of RoIAlign in Detectron doesn't subtract 0.5 in the end, so it's actually not very aligned. It turns out this detail does not affect the accuracy of R-CNNs, because RoIAlign is applied on CNN features, and a CNN is believed to be able to fit slightly misaligned features.
However, we have new use cases of RoIAlign in other places, e.g. to crop mask head training targets from the ground truth mask, so I fixed it in the detectron2 / torchvision RoIAlign with an aligned=True option. Its unittest demonstrates how the old version is misaligned.
Btw, once we figured out the coordinate transform formula, it's easy to implement RoIAlign using grid_sample. This shows that RoIAlign is nothing more than a fused bilinear sampling + averaging. Using grid_sample is about 10%-50% slower than the RoIAlign CUDA kernel.
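A rough sketch of the idea, for one CHW image and one box given in "integer corners" coordinates, with sampling_ratio=1 and no averaging of sub-samples (an illustration, not the production implementation):

```python
import torch
import torch.nn.functional as F

def roi_align_via_grid_sample(img: torch.Tensor, box, K: int) -> torch.Tensor:
    C, H, W = img.shape
    x0, y0, x1, y1 = box
    # centers of the K x K output bins, as continuous coordinates in the input image
    xs = x0 + (torch.arange(K, dtype=torch.float32) + 0.5) * (x1 - x0) / K
    ys = y0 + (torch.arange(K, dtype=torch.float32) + 0.5) * (y1 - y0) / K
    # map continuous coordinates in [0, W] x [0, H] to grid_sample's [-1, 1] range;
    # align_corners=False matches the "integer corners" convention
    gx = xs * 2 / W - 1
    gy = ys * 2 / H - 1
    yy, xx = torch.meshgrid(gy, gx, indexing="ij")
    grid = torch.stack([xx, yy], dim=-1)              # (K, K, 2), last dim is (x, y)
    return F.grid_sample(img[None], grid[None], align_corners=False)[0]
```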
Mask R-CNN is trained to predict masks of a fixed resolution (e.g. 28x28) restrained inside given boxes (we call them "RoIMasks"). But in the end, we often want to obtain full-image masks. A "paste mask" operation is needed to paste the small RoIMask into the given region of the image.
This operation is an inverse of RoIAlign, so it should be implemented similarly to our derivation above. In Detectron, it was implemented with some magic rounding & resize that are not exactly the inverse of RoIAlign. Fixing it in detectron2 increases the mask AP by 0.1~0.4.
Obviously, the paste mask operation can introduce aliasing in the results due to the low-resolution RoIMask. This is the motivation behind our work on PointRend.
PointRend is a segmentation method that focuses on point-wise features, where a "point" is not necessarily a pixel, but any real-valued coordinate. Pointly-Supervised Instance Segmentation, also from our team, uses point-wise annotations to train segmentation models. Both projects involve heavy use of point sampling and coordinate transforms. Having a clear and consistent convention of pixels and coordinates was important to their success.
Due to some sloppy code in the early days of deep learning libraries, today we're facing multiple versions of resize functions. Together with the two different coordinate-system conventions, they easily cause hidden bugs in computer vision code.
This article revisits these historical technical debts and shows how these fun details matter in modeling and training. I hope they will help you make proper choices.
]]>这几年来, 从 FAIR 的几位大佬身边学习到的最多的是对待 research 的态度. 因此说说写 paper 和做实验的体会.
实验是为了证明或强化文章里给出的 claim/hypothesis 的.
Ross ICCV 2019 tutorial 最后谈了谈怎么写 paper. 第 126 页说, 文章中所有的 claim, 理想情况下都应该要么是文献中已有的 claim, 要么是有实验能够证明的 claim.
举个例子, BatchNorm paper 的实验可以 claim 很多东西, 包括 BatchNorm 让结果很好, 对初始化不敏感, 大 learning rate 也不炸. 但是文章说 BatchNorm "reduce internal covariate shift", 就遭到了一些人的质疑. 如著名的 Ali Rahimi Neurips 2017 test-of-time award presentation(B 站) 里的五连问 (第 17 分钟).BatchNorm paper 中把 "internal covariate shift" 粗略定义为 feature distribution 的变化, 但唯一一个与此相关的实验是 Fig.1 (b)(c). 虽然实验是合理的, 但结果确实不算不太显著. 甚至 GroupNorm paper 的 Fig.6 展示的结果可能都更强.
Ross 在 tutorial 中的建议是: 如果一个 claim 没有得到实验的有力支持, 在表述上可以弱化一些, 例如写 "intuitively/hypothetically, 如何如何...".
类似于这个 BatchNorm 的例子, 很多 paper 中一个常见的问题是, 实验只证明了结果好, 文字里却讲了个故事并 overclaim 了结果好的原理.
这里的 claim 并不仅仅指明显的 "We claim ...". 还可以有其他表现形式:
Troubling Trends in Machine Learning Scholarship 的 talk 里也指出了这些问题.
由于整个领域还是以实验驱动的, 对原理的研究不深, 所以原理常常都是 speculation / intuition, 在写作时容易 overclaim. 因此要注意弱化表述.
从结果深入到原理, 对应的实验大约有这样的几类:
System-level 实验, 就是直接跟已有的结果比. 这类实验可以没有控制变量: 两个非常不同的方法 (e.g. SVM vs. CNN) 照样可以在尽量相似的设定下比较结果. 它证明的 claim 是 " 这整个系统能够达到好的性能 ".
Ablations, 也即控制变量, 用来证明 " 系统由于这个方法 (而不是其他因素) 性能得到提高 ". 这就需要严格控制其他因素.
深入分析, 试图解释 " 为什么某种方法能够提高性能 ". 但是在 deep learning 中, 由于理论工具的缺乏, 这类实验往往不容易设计.
如果一篇论文提出了全新的系统, 结果还特别好 (例如 AlexNet, Bert, AlphaGo 这种 breakthrough 级别), 那么即使仅有 system-level 的实验也没关系: 以后总会有别人去更深入的研究的.Yann LeCun 针对 Ali Rahimi 的 presentation 曾经说过, 历史上 engineering 往往比 science 快一步, 很多科技的发展过程都是先做 work 了再去研究为什么的.
另一个极端是, 如果一篇论文对结果毫无提高, 但是通过详细的分析帮助读者理解了更多原理, 或提供了理解原理的视角和工具, 那同样也是很好的工作.
当然, 大部分工作既没有 breakthrough 级别的结果, 也很难给出令人信服的分析 (毕竟炼丹), 因此往往需要多种类型的实验结合: 有什么样的实验, 决定了论文能做出什么 claim, 进一步才能 justify 论文的价值.
举个例子, 每个人都熟悉的 ResNet paper, 同样的模型, 文章可能有下面几种不同写法, 对应不同的实验和 claims:
我们设计了一类 VGG 的变种, 有五个模型叫做 "MSRANet {18,34,50,101,152}", claim 它们达到了 SOTA. 实验内容是跟以前的 SOTA 比一比.
其实 AlexNet 文章就是这么写的, 但是 AlexNet 的 SOTA 结果本身就是一个巨大的 breakthrough. 如果 ResNet 也这么写, 影响力会小得多, 毕竟 SOTA 是短暂的. 说的不好听一点, 这个模型被下一个人拿过去改一改, 再换个名字成了新的 SOTA, 可能就没有人记得 "MSRANet" 了.2013 年的 ZF-Net 大概就是这么一个地位.
我们设计了包含 residual connection 的 "bottleneck block / basic block" 方法, 能够提高模型性能. 实验做一些 ablations, 确认了这个 claim.
这个方法有一些能够自圆其说的 intuition, ablations 证明有效, 同时也能达到 SOTA. 这就类似于大多数的好 paper.
而 ResNet 原文的层次就更高一些了: 文章标题说的是 "residual learning", 内容强调的是 residual connection 对优化的好处. 其他的各种 block, ResNet-50, 只是测试这个 idea/claim 的手段.
这个大 claim 对实验和分析的要求就更高了, 以至于 ResNet 没分析完, 到了 ResNet-v2 (pre-activation) 继续把这个故事讲了下去, 专门分析 residual connection 的重要性.
最后, 实践证明 residual connection 确实是 deep learning 今天为止最重要的发明之一, 几乎统治了所有领域, CNN 和 Transformer 都离不开它. 这远比 ResNet-50 到底长什么样, 里面的 block 到底是什么要重要得多.
人们一般希望探索结果背后的原理, 因为这是科学研究的本质, 也让科研工作更有价值. 这导致了上面提到的那种 overclaim 现象.
还有另一种现象, 俗称马后炮, 或 "Harking" (Hypothesizing After Results are Known). 也即先做实验, 在有结果之后 "看图说话 / 强行解释", 找一个可以被这个实验证明的结论, 或可以解释实验结果的猜想 / 原理. 然而在写作时, 先 "We hypothesize/claim ...", 再 "设计实验" 证明自己的 hypothesis/claim.
"Harking" 这个词最早出现在心理学研究里. 在 deep learning 中也有对此的批评: HARK Side of Deep Learning. 它之所以是一种不太好的 research practice, 是因为它背离了科学研究的目标. 科学与迷信的一大区别, 是科学应当不仅能够解释已知, 还能够预测未知. 如果一个工作仅仅追求找到一个解释, 来与现有的少量实验结果兼容的话, 这个解释未必在其他实验中适用, 因此可能不是一个科学的结论.
然而, 如今 deep learning 是以实验为基础的科学. 其研究过程确实经常要先做实验, 看到结果, 才能提出猜想或决定下一步的实验. 因此马后炮行为一般都存在. 但是, 一个科学的马后炮研究者应当在制造出一个猜想或结论之后, 再去新的实验里尝试预测一下未知, 打一个马前炮.
一个我自己深有感触的例子, 是我参与的这篇 Feature Denoising for Improving Adversarial Robustness. 起初我们猜测 non-local 会帮助 adversarial training, 并很快得到了实验验证. 这时候, 一个中规中矩的 paper 写法就是 claim "non-local 对 adversarial training 有用", 实验就是有 / 无 non-local 的 ablations.
有意思的是, 我们的 claim 是 "denoising layers 对 adversarial training 有用". 这是一个更大的 claim, 而且其实是一个马后炮的结论, 因为它仅仅是 intuitively 能解释已有的实验结果:non-local 在传统 vision 里用作 denoising, 而对抗样本的扰动可以看做为 noise.
为了支持这个更大的 claim, 我们要打几个马前炮, 用它去预测更多的实验:
这些实验是在有了 claim 之后, 专门为了验证这个 claim 而设计的实验. 它们的正面结果让我们对这个 claim 更有信心, 即便它最初是靠马后炮和直觉猜出来的. 从马后炮到马前炮, 是 researcher 的自我要求.
好的研究应当能经受时间和实践的检验, 因此一个好的研究者应自己先审视自己的 claim, 并真心的尝试用实验检验它们. 有机会再详细写写怎么设计科学的实验.
]]>STB_GNU_UNIQUE
就是 ELF 中一个不太好的设计, 带来了不少语义冲突. 拥有 STB_GNU_UNIQUE
binding 的符号, 即使在被用 RTLD_LOCAL
方式装载的时候, 也会拥有 global linkage. 另外它还会导致 dlclose 无效. 网上对此有很多吐槽, 例如这里, 这里.
这个 binding 最初的引入似乎是由于一些全局符号的内在状态不能重复多次, 因此把这些符号标记为 unique, 即使从多个 plugins 里装载了多次, 符号也只有一个定义. 但是另一方面, 程序也会有一些全局符号的状态必须是 local 的. 到底哪种行为是用户需要的, 编译器是不知道的. 结果是, gcc "聪明" 的自动把 template function & inline function 里的 static variable 标记为了 unique
其实 C++ 标准确实规定了这样的 variable 必须是 "single entity". 理论上说 gcc 没做错, 但这并不总是用户的预期行为, 而 C++ 标准也没提供别的办法. 如果要禁用 unique binding, 可以使用 -fno-gnu-unique
重编译, 或者暴力 patch 编译好的 ELF binary.
STB_GNU_UNIQUE
导致了 PyTorch 1.8.0 最近的一个严重 bug, 影响了所有 R-CNN 模型, torchvision / detectron2 / mmdetection 里都有用户报告. 重新编译 PyTorch 太麻烦了, 为了以后更快验证此类问题, 我就写了一个暴力 patch ELF 的脚本:
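原脚本在此处没有保留下来. 下面是一个大致等价的示意 (并非原脚本; 依赖 pyelftools, 只考虑 64-bit ELF):

```python
# 把所有 STB_GNU_UNIQUE 符号的 binding 改成 STB_WEAK, visibility 改成 STV_HIDDEN.
# Elf64_Sym 布局: st_name(4) st_info(1) st_other(1) st_shndx(2) st_value(8) st_size(8).
import sys
from elftools.elf.elffile import ELFFile   # pip install pyelftools

STB_GNU_UNIQUE, STB_WEAK, STV_HIDDEN = 10, 2, 2

with open(sys.argv[1], "r+b") as f:
    elf = ELFFile(f)
    for sec in elf.iter_sections():
        if sec.name not in (".symtab", ".dynsym"):
            continue
        for i in range(sec.num_symbols()):
            off = sec["sh_offset"] + i * sec["sh_entsize"]
            f.seek(off + 4)
            st_info, st_other = f.read(2)
            if st_info >> 4 != STB_GNU_UNIQUE:
                continue
            new_info = (STB_WEAK << 4) | (st_info & 0xF)    # 保留 symbol type
            new_other = (st_other & ~0x3) | STV_HIDDEN      # 低 2 bit 是 visibility
            f.seek(off + 4)
            f.write(bytes([new_info, new_other]))
```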
以上脚本把所有 STB_GNU_UNIQUE
符号的 binding 改成了 WEAK
, visibility 改成了 HIDDEN
. 符号表的 entry 结构可参考 /usr/include/elf.h::Elf64_Sym
.
用这个脚本 patch 了一下 libtorch_cuda_{cpp,cu}.so
之后, 以上 bug 就消失了. 同时, 这样我也能够方便的确认另一个看似相关的 bug 还是跟 STB_GNU_UNIQUE
有关系.
然而, PyTorch 就可以写 IfElse 了?
最近 detectron2 遇到的产品 / 部署的需求越来越多, 看看 PyTorch 五花八门的部署 / 加速方案里对 IfElse 都有什么限制吧:
虽然 researcher 用 PyTorch 写 IfElse 很开心, 欠的技术债终究要换一种方式还回来的. 坚持对 researcher 友好的后果就是对产品不友好.
本质上说, 如果用户写了 IfElse, 就意味着这段代码只能在单一进程的 Python 解释器里运行 -- 这本身就是一个巨大无比的 limitation. 需要用各种方式来 workaround:
虽然 TensorFlow 的 autograph 没深入用过, 但是从原理上看比以上方案都更合理. 当然, autograph 的实现是建立在 TensorFlow 已经有了足够多的 control flop operator 的前提下, 可以把 IfElse 变成 tf.cond
. 而 PyTorch 在面向用户的 API 里仍然一个这样的 operator 都没有 (虽然 torchscript IR 里有), 并且可能以后也不会有.
UPDATE: 关于这件事写了一篇详细的文章: TorchScript: Tracing vs. Scripting
Three years ago, I wrote an article Unawareness of Deep Learning Mistakes: buggy code can still train and appear to work, so it's difficult for users to realize that their code is wrong.
What's even more difficult to find out is when the bug comes from the deep learning library we use. Imagine: what if the library unfortunately computes wrong results for certain parts of our model during training? The training will probably still work to some extent thanks to the magic of SGD, so how could we ever possibly find out such bugs? I'll share some experience and lessons.
"Bugs" in this article specifically refer to silent bugs that lead to wrong computation results,but no errors.
Such bugs exist in deep learning libraries and will continue to exist, because these libraries are young, and new features such as operators and training paradigms will continue to emerge in them as research develops.
Such bugs in deep learning are very hard to notice. A model typically contains billions of floating point operations (FLOPs) grouped into hundreds of operators. Even with small bugs, it may still train, converge, and appear to work well. Maybe it works slightly worse, or it fails occasionally, but it's extremely difficult for a user to associate a suspicious result with a concrete bug in libraries. After all, there are many other explanations of a bad result that need to be ruled out: the model simply does not work; incorrect model implementation; bad hyperparameters; bugs in users' training code, etc.
The situation gets worse when the buggy part of computation is not even explicitly written by users, but implicitly generated. Auto-generated computation such as auto differentiation and graph optimization is often not well exposed to users at all, making it more difficult to observe the bug. For example, pytorch/5801 is a bug in gradient computation that was found during the development of ELF OpenGO at FAIR. Models can still work to some extent with the bug, which hid the bug for a long time. It has unfortunately wasted many months in the project.
PyTorch has a "silent correctness" issue label, which shows many bugs of this kind. Most of these issues are also labeled as "high priority", which says a lot about the severity of such bugs.
Compared to users' training code, which may also have many silent bugs, deep learning libraries have some advantage in testability. They provide well-defined small building blocks (e.g. operators and their gradients), so they are more testable than end-to-end training. But I've seen a few limitations of unittests in the context of deep learning:
A test only covers a tiny input space, but other inputs may cause bugs.
As an example, pytorch/36485 computes softmax incorrectly only if the number of classes C satisfies (C > 1024) && (C % 4 != 0), which is rare in real applications. It was found in the development of MoCo, which uses 65537 classes. After noticing a regression in the model's accuracy, the root cause was later found by bisection.
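A sketch of the kind of check that widens coverage over the input space: sweep over unusual shapes and compare the CUDA result against a float64 CPU reference (the shapes and tolerance here are only illustrative, and pytorch/36485 itself has long been fixed).

```python
# Sketch: sweep shapes and compare a CUDA softmax against a float64 CPU reference.
import torch
import torch.nn.functional as F

for num_classes in (1024, 1025, 2048, 65537):
    x = torch.randn(8, num_classes)
    ref = F.softmax(x.double(), dim=1)                  # high-precision reference
    out = F.softmax(x.cuda(), dim=1).double().cpu()     # implementation under test
    print(num_classes, (out - ref).abs().max().item())  # a large error flags a bad shape
```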
Behaviors under combinations of contexts are hard to test exhaustively.
Deep learning libraries usually separate the definition of computation from its execution. As a result, a computation may run under different combinations of runtime context: graph/eager mode (TensorFlow), eager/tracing/scripting mode (PyTorch), eager/jit/pjit mode (JAX), fusion with other computations, the device to run on, the level of parallelism to use, the underlying compute library and algorithm to choose from, etc. Unittests are often insufficient to cover such a huge space.
This issue gets worse in higher-level interfaces (e.g. Keras). TensorFlow is well-known for its many high-level ways to do the same thing: users can write a model under graph or eager mode, using either object-oriented or functional style, with either raw TF APIs or the Keras/Estimator interface, and Keras has many more modes within itself. Handling these combinations gets more challenging, because a high-level component has much richer semantics (and therefore more side effects) that are often not strictly defined and are harder to test than a pure-math operator.
For example, tensorflow/25175 and tensorflow/40638 are two silent bugs in Keras causing models to not train properly. Both are due to unconventional combinations in the ways TensorFlow / Keras interact with each other.
Concurrency bugs that happen nondeterministically.
Deep learning software and hardware stacks by design have a high degree of parallelism, which provides room for concurrency bugs. Concurrency bugs such as race conditions may happen only on certain programs or hardware, or may not be reproducible at all. They are difficult to notice, report, and debug.
As an example, pytorch/18465 is a use-after-free concurrency bug I found. The only symptom I observed was that some tensor values in my model were unexpectedly modified. Drawing any conclusions beyond that was challenging, because any simplification I applied to the model could cause the bug to disappear. A lot of hours were put into tracking it down and reproducing it with minimal examples. And there is little chance that a unittest could guard against such bugs.
I'll share stories of two more silent bugs that I found in TensorFlow and PyTorch, where they both compute wrong gradients for some operators. Both bugs stayed unnoticed for more than a year waiting to be discovered by me, presumably because users can hardly blame bad training on wrong gradients rather than on their own models.
nn.SyncBatchNorm
Notice the bug
I started to try out PyTorch's nn.SyncBatchNorm in the summer of 2019 because the MoCo project needed this layer. To gain some trust in this layer (I knew that BatchNorm is often implemented wrong; see this later paper of mine), the first thing I did was to try it on some baselines I'm familiar with: a Mask R-CNN in detectron2.
Luckily, this was before TensorFlow introduced the next bug I would find later. So when I compared it with my TensorFlow implementation of Mask R-CNN, which also supports SyncBatchNorm, I could see that most results in detectron2 were a few AP (average precision) worse.
I know every detail of the two implementations since I wrote both of them, and their gap is negligible when not using SyncBatchNorm. So I was relatively confident that such a large gap was a library bug in PyTorch.
Confirm the bug
Next, we decided to just reimplement a correct SyncBatchNorm. It turned out to be quite easy, and this was later released in detectron2. Comparing the results of the two implementations further confirmed that the bug is related to nn.SyncBatchNorm.
Narrow down the bug
From the experiments on various models, I noticed that suboptimal results only appear if SyncBN is added to Mask R-CNN's mask head; adding it to all other components is OK. Therefore I hypothesized that the computation results are wrong when the batch size differs across workers, since that's how the mask head differs from the other components. This hypothesis can be verified quite easily. After sharing our findings with the code owner, the root cause in gradient computation was fixed.
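For illustration, a sketch of such a verification (not the original experiment): give each worker a different batch size and compare nn.SyncBatchNorm's outputs and input gradients against a single-process BatchNorm on the concatenated batch. It assumes two GPUs and a torchrun launch.

```python
# Sketch: check SyncBatchNorm with uneven per-worker batch sizes against a
# single-process BatchNorm reference. Assumes 2 GPUs; launch with
#   torchrun --nproc_per_node=2 check_syncbn.py
import torch
import torch.distributed as dist
from torch import nn

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
torch.manual_seed(0)

sizes = [2, 5]                                   # different batch size per worker
chunks = [torch.randn(n, 8) for n in sizes]      # same data generated on every rank
x = chunks[rank].cuda().requires_grad_()

sync_bn = nn.SyncBatchNorm(8).cuda()
out = sync_bn(x)
out.square().sum().backward()                    # every rank participates in backward

# Reference: ordinary BatchNorm over the whole concatenated batch.
full = torch.cat(chunks).cuda().requires_grad_()
ref_out = nn.BatchNorm1d(8).cuda()(full)
ref_out.square().sum().backward()

start = sum(sizes[:rank])
ok_fwd = torch.allclose(out, ref_out[start:start + sizes[rank]], atol=1e-5)
ok_bwd = torch.allclose(x.grad, full.grad[start:start + sizes[rank]], atol=1e-4)
print(f"rank {rank}: forward match={ok_fwd}, input-grad match={ok_bwd}")
```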
nccl_ops.all_sum
NCCL is widely used to reduce gradients among GPUs. However, it turns out that TensorFlow can do it wrong sometimes. This bug may affect all NCCL-based multi-GPU data-parallel training. Interestingly, it also affects SyncBatchNorm in TensorFlow if using NCCL.
Notice the bug
In the summer of 2020 I gave TF v1.15 a try. I planned to just do some basic benchmarks of my code, but a few Mask R-CNN trainings blew up with NaNs after 10~20 minutes of training. This had not happened before.
Confirm the bug
My first thought was that I had broken my Mask R-CNN implementation at some commit. But after trying a few combinations of code versions, it became clear that TensorFlow was to blame, because the same code could train in TF v1.14, even when I made sure both used identical versions of CUDA/CuDNN.
Narrow down the bug
I know that no one in the TF team would use my entire training code to debug, so I have to narrow it down myself. But this was never easy, because wrong results in any step of the whole training system can lead to NaNs, and there is nowhere to start looking. Moreover, the bug does not happen deterministically, and when I tried to simplify my code, it started to happen less frequently.
Luckily, there is still a painful but practical way to go: bisection. So I bisected over TensorFlow commits between v1.14 and v1.15, building TensorFlow and rerunning the training at each step, until the offending commit was found.
Unfortunately, the offending commit seems correct to me. This means the commit, which increases parallelism in NCCL, probably triggers a bug that dates back even earlier.
Further narrow down the bug
After playing with the offending commit a bit, given the non-deterministic behavior of the bug and the content of the commit, my hypothesis was that the way TensorFlow uses NCCL contains concurrency bugs.
My original code only uses NCCL's all_sum to all-reduce gradients. To add a simple check of its results, I used tf.add_n to all-reduce the gradients again, and added tf.debugging.Assert to ensure that the two results match. Unsurprisingly, the results don't always match -- a large discrepancy appears once in a while between the results of tf.add_n and nccl_ops.all_sum.
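A sketch of what such a check can look like in TF 1.x graph mode (assuming at least two GPUs; this illustrates the structure of the check rather than a guaranteed reproducer of the bug):

```python
# Sketch: sanity-check nccl_ops.all_sum against tf.add_n (TF 1.x, >= 2 GPUs).
import tensorflow as tf
from tensorflow.python.ops import nccl_ops

# Stand-ins for per-GPU gradients of the same variable.
grads = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        grads.append(tf.random.normal([1 << 20]))

nccl_sums = nccl_ops.all_sum(grads)      # one summed copy per GPU
ref_sum = tf.add_n(grads)                # reference reduction
# Consume every NCCL output so that all collective ops actually run.
diffs = [tf.reduce_max(tf.abs(s - ref_sum)) for s in nccl_sums]
check = tf.debugging.Assert(tf.reduce_max(tf.stack(diffs)) < 1e-3, data=diffs)

with tf.Session() as sess:
    sess.run([check] + diffs)            # raises InvalidArgumentError on mismatch
```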
This is where the heavy lifting ended: I had turned the silent training bug into an obvious error. The bug is no longer about a failed training which "I think" should succeed, but is now about something that's obviously wrong in TensorFlow: we added tensors in two different ways and the results don't match! No one is obligated to trust the correctness of my training code, but everyone has to admit that nccl_ops.all_sum and tf.add_n must not produce different results.
The rest is easy: I started to simplify my training code for a better understanding of the bug, removed all dependencies, and eventually made a small enough self-contained reproducible script and reported a bug. Beyond that, it is no longer my responsibility.
Summarizing from my own experience, the following are important to fight silent bugs in deep learning libraries:
Reproducing known results is the only way to discover silent bugs in model training. This is how we have an "expected output", so that we can notice if anything unexpected is happening.
Narrowing down is necessary, at least in the open source environment. Unless a small enough piece of code clearly demonstrates a bug in the library, it's not the library owners' responsibility to understand and debug user code. After all, a bug often lives in user code rather than in the library. The general guidelines about how to ask good questions and write good bug reports apply to deep learning as well.
Bisection is slow and costly, but effective. When there are no obvious clues and its cost is affordable, do a bisection. If anything can be better than bisection, it would be a trisection or k-section to reduce its latency, because verifying whether a commit works or not may require training a model for quite a while.
Bisection is not always applicable. If there isn't a known-good historical version as a reference, other more creative debugging methods will be needed.
Know the library well and understand its internals, so we can make reasonable hypotheses and investigate them. It's often helpful to dig into library code: a few lines of debugging code at the right place can provide valuable information that cannot be easily obtained in user code.
Silent bugs exist in deep learning libraries, and are extremely hard to find. What does this mean for everyone working on deep learning?
As an average user, follow what the experts are using. Silent bugs exist but are hard to find. Without enough confidence in our own ability to always discover such bugs, follow the experts.
A library without years of battle testing may have many sharp edges or hidden bugs. With a mature library like PyTorch or TensorFlow, any bug you run into is more likely to have been discovered by others already. This applies not only to libraries as a whole, but also to different features of a library, modules within a library, extensions of a library, etc.
This is not saying we should use the most popular thing. On the contrary, high-level frameworks that build over-simplified APIs to gain popularity among non-experts (e.g. Keras) are something a serious researcher would rather avoid: they may have silent bugs buried underneath, simply because the intended user group is not capable of noticing them.
To make your code/library popular, reproduce known results to increase credibility. "Following the experts" tends to create a monopoly. To break that, deep learning training libraries can earn trust by reproducing known results, rather than just providing examples of arbitrary toy models. This is a core principle in tensorpack that I have followed since the beginning, and it is probably the most effective way to convince a user that your library/implementation does not have hidden silent bugs.