Common Python Reference Cycle Patterns

In Python, when a set of objects forms a reference cycle, none of them can reach a zero refcount. In this case, even if these objects all go out of scope and are no longer accessible, they will not be released immediately.

The Python ecosystem typically accepts reference cycles as an inevitable issue, and relies on garbage collection (GC) to avoid leaks. The Python interpreter triggers a GC from time to time; it detects all unreachable objects and releases them regardless of their refcount.

However, in high performance deep learning systems, GC is not always a good choice.
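The behavior described above can be seen in a minimal sketch. The `Node` class and `make_cycle` helper are hypothetical names used only for illustration, and the example assumes CPython, where refcounting frees acyclic objects immediately but cycles wait for the collector.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


def make_cycle():
    # a and b reference each other, so their refcounts never drop to zero
    # even after both local names go out of scope.
    a, b = Node(), Node()
    a.other, b.other = b, a
    # A weak reference lets us observe the object without keeping it alive.
    return weakref.ref(a)


gc.disable()  # keep automatic collection from running during the demo
try:
    ref = make_cycle()
    alive_before = ref() is not None  # still alive: only the cycle holds it
    collected = gc.collect()          # explicit GC pass finds the cycle
    alive_after = ref() is not None   # released once the GC has run
finally:
    gc.enable()

print(alive_before, alive_after, collected)
```

Disabling automatic collection here only makes the demo deterministic; it is not a recommendation for production code.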


Registration Does Not Scale Well

People have many different opinions about config systems. Having worked with various styles of configs, I also want to write about what a great config subsystem in a large-scale (in terms of system complexity, number of users, etc.) system should look like.

The design space is complex, so in this article I'll start with a smaller topic: registration in config systems. I'll show why this common pattern, though it works fine for small-scale projects, does not scale well in the long term. I'll also discuss an alternative.
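For readers unfamiliar with the pattern, registration usually looks something like the sketch below. All names here (`MODEL_REGISTRY`, `register`, `SmallNet`, `build_from_config`) are hypothetical and chosen for illustration; this is one common shape of the pattern, not any particular library's API.

```python
from typing import Any, Callable, Dict

# Global table mapping a string name to the object it refers to.
MODEL_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that records an object in the registry under `name`."""

    def decorator(obj):
        if name in MODEL_REGISTRY:
            raise KeyError(f"'{name}' is already registered")
        MODEL_REGISTRY[name] = obj
        return obj

    return decorator


@register("small_net")
class SmallNet:
    def __init__(self, width: int = 8):
        self.width = width


def build_from_config(cfg: Dict[str, Any]):
    # The config refers to the class by its registered string only.
    cls = MODEL_REGISTRY[cfg["name"]]
    return cls(**cfg.get("args", {}))


model = build_from_config({"name": "small_net", "args": {"width": 16}})
print(type(model).__name__, model.width)
```

Note that the lookup only works if the module defining `SmallNet` has already been imported, and that every registered name shares one global namespace.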


Demystify RAM Usage in Multi-Process Data Loaders

A typical PyTorch training program on 8 GPUs with 4 dataloader workers per GPU would create at least 8 × (4 + 1) = 40 processes. A naive use of PyTorch dataset and dataloader can easily replicate your dataset's RAM usage 40 times. This issue has probably affected everyone who has done anything nontrivial with PyTorch. In this post, we will explain why it happens, and how to avoid the 40x RAM usage.
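The process count for this setup can be worked out as follows. This is a small arithmetic sketch that assumes one main training process per GPU in addition to its dataloader workers; the variable names are illustrative.

```python
num_gpus = 8
workers_per_gpu = 4

# Assumption: each GPU is driven by one main process,
# which in turn spawns its own dataloader workers.
processes_per_gpu = 1 + workers_per_gpu
total_processes = num_gpus * processes_per_gpu

# If every process ends up holding its own copy of the dataset,
# RAM usage is multiplied by the number of processes.
print(total_processes)
```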


Effective Use of Python 'logging' Module

In large systems, logs can be terrifying: they are huge in volume, and hard to understand. This note lists some suggestions and common misuses of Python's logging module. The aims are to:

  • Reduce redundant logs & spam from libraries.
  • Allow more control of logging behaviors.
  • Make logs more informative to users.
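As a rough illustration of these goals, the sketch below uses only standard `logging` calls. The logger names `myapp`, `myapp.train`, and `noisy_lib` are hypothetical, and this is one possible setup rather than the specific recommendations of the full note.

```python
import logging

# Named, hierarchical loggers instead of the root logger give callers
# per-module control over logging behavior.
app_logger = logging.getLogger("myapp")
train_logger = logging.getLogger("myapp.train")  # child of "myapp"

# An informative format: timestamp, logger name, level, message.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
)
app_logger.addHandler(handler)
app_logger.setLevel(logging.INFO)

# Quiet a chatty third-party library without touching its code.
noisy_logger = logging.getLogger("noisy_lib")
noisy_logger.setLevel(logging.WARNING)

train_logger.info("starting epoch %d", 1)  # emitted via the "myapp" handler
noisy_logger.info("internal detail")       # dropped: below WARNING
```

Because `myapp.train` has no level of its own, it inherits `INFO` from its parent `myapp`.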

On Environment/Package Management in Python

Python's package management is a mess. I'm involved in a few open source projects and I often help users address their environment & installation issues. A large number of these environment issues essentially come down to incorrectly or accidentally mixing multiple Python environments together. This post lists a few common pitfalls and misconceptions around this.
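When environments may have been mixed, a useful first step is to check which interpreter is running and where a package is actually imported from. The sketch below uses only the standard library; the `describe_environment` helper is a hypothetical name, and `json` is just a stand-in for whichever package is under suspicion.

```python
import importlib.util
import sys


def describe_environment(package: str = "json") -> dict:
    """Report the running interpreter and where `package` would load from."""
    spec = importlib.util.find_spec(package)
    return {
        # Path of the Python binary actually executing this code.
        "executable": sys.executable,
        # Root of the current environment.
        "prefix": sys.prefix,
        # True inside a venv, where prefix differs from the base install.
        "in_venv": sys.prefix != sys.base_prefix,
        # File the package would be imported from, or None if not found.
        "package_origin": spec.origin if spec is not None else None,
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `package_origin` points outside `prefix` for a third-party package, that is a hint that two environments are being mixed.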


Classify WeChat Audio Messages using Speaker Recognition

Problem

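Exporting WeChat chat history has always been a hassle, especially on the iPhone. A few days ago I wanted to export some voice messages, so I looked through the iPhone's file system and found that WeChat stores its voice messages under /var/mobile/Applications/{app id}/Documents/{user id}/Audio/{friend id}/*.aud

The problem is that WeChat stores the audio from both sides of a conversation in the same directory, with no obvious way to tell them apart. Reverse engineering WeChat's chat record format would probably be difficult, so I thought of using the speaker recognition system I built last semester to handle this automatically.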
