Registration Does Not Scale Well

People have many different opinions about config systems. Having worked with various styles of configs, I also want to write about what a great config subsystem in a large-scale (in terms of system complexity, number of users, etc.) system should look like.

The design space is complex, so in this article I'll start with a smaller topic: registration in config systems. I'll show why this common pattern, though it works fine for small-scale projects, does not scale well in the long term. I'll also discuss an alternative.

Registration for configs

Configs are often constrained to include only primitive types (numbers and strings), and there are a lot of good reasons to keep this property.

A global registry in a config system is typically a Mapping[str, Any]. Its purpose is to allow users to refer to complex objects through simple strings in a config, overcoming the constraint of primitive types.

Objects can be added to a registry like this:

# Maps string to a class/type.
@ModelRegistry.register
class MyModel:
    ...

# Maps string to an object.
DatasetRegistry.register("my_dataset", MyDataset(...))

This allows users to select which model / data to use by setting cfg.model.name = "MyModel" or cfg.dataset.train = "my_dataset".
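
For concreteness, a registry along these lines could be sketched as follows. This is a minimal illustration under assumptions, not the implementation of any particular library: the `Registry` class, its `register`/`get` methods, and the list standing in for `MyDataset(...)` are all hypothetical and chosen only to match the two usages above.

```python
from typing import Any, Dict


class Registry:
    """A hypothetical minimal registry: a named dict from str to any object."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._objects: Dict[str, Any] = {}

    def register(self, name_or_obj: Any, obj: Any = None) -> Any:
        # Form 1: registry.register("some_name", some_object)
        if isinstance(name_or_obj, str):
            self._add(name_or_obj, obj)
            return obj
        # Form 2: used as a bare decorator; the class/function name is the key.
        self._add(name_or_obj.__name__, name_or_obj)
        return name_or_obj

    def _add(self, name: str, obj: Any) -> None:
        if name in self._objects:
            raise KeyError(f"{name!r} is already registered in {self._name}")
        self._objects[name] = obj

    def get(self, name: str) -> Any:
        return self._objects[name]


ModelRegistry = Registry("model")
DatasetRegistry = Registry("dataset")


@ModelRegistry.register
class MyModel:
    pass


# A plain list stands in for MyDataset(...) in this sketch.
DatasetRegistry.register("my_dataset", [1, 2, 3])

# What cfg.model.name = "MyModel" and cfg.dataset.train = "my_dataset" resolve to.
model_cls = ModelRegistry.get("MyModel")
train_data = DatasetRegistry.get("my_dataset")
```

Note that both lookups only work because the registering code has already run, which is the root of the costs discussed next.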

Registration has a clear benefit, but at a larger scale, some of its downsides can become serious.

Pay for what you use

"Users should only pay (in compute cost and mental cost) for what they use" is a general design philosophy that I have found important in almost all aspects of software design.

The registration pattern breaks this philosophy by running unnecessary registration code: users will only provide one (or very few) strings in their config, but they have to pay the overhead of registering every candidate string that some user might need.

To make matters worse, the overhead has to happen very early in a program, typically at import time. Import speed is crucial for developer ergonomics: unlike other code that may run asynchronously with development, import often blocks developers.

The registration overhead includes:

  1. Cost to import extra Python modules that contain registration code, and all their dependencies.
  2. For registries that map strings to non-trivial objects (not just types/functions), the cost to create these objects.
    • A better practice is to avoid such registries: don't store objects in the registry, but store functions that create these objects if possible. However, this does not always solve the problem: the function may have to be a closure that closes over non-trivial objects, in which case the objects still have to be created at registration time.
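
The difference between registering objects and registering factories can be sketched as below. Everything here is hypothetical: `expensive_dataset` stands in for any costly constructor, and a simple counter stands in for the real construction cost.

```python
from typing import Any, Callable, Dict

construction_count = 0  # counts how many datasets were actually built


def expensive_dataset(name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a dataset that is costly to construct."""
    global construction_count
    construction_count += 1
    return {"name": name}


names = [f"ds{i}" for i in range(100)]

# Eager: every candidate is constructed at registration (import) time.
eager_registry = {name: expensive_dataset(name) for name in names}
eager_cost = construction_count  # 100 objects built, though only one is used

# Lazy: only cheap factories are stored; construction happens on lookup.
construction_count = 0
lazy_registry: Dict[str, Callable[[], Dict[str, Any]]] = {
    name: (lambda name=name: expensive_dataset(name)) for name in names
}
dataset = lazy_registry["ds7"]()
lazy_cost = construction_count  # 1 object built
```

Even in the lazy version, the module that fills the registry and everything it imports must still be imported, so cost (1) remains.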

These costs are negligible for small-scale projects, but they can become quite bad when there are hundreds or more objects to register. Bad patterns are guaranteed to appear at larger scale: there will be some users doing non-trivial registration (e.g. registering objects in a for loop) that's slow or even has unintended side effects. I have had to work with many projects that take > 10s to import, and the most common reason for slow imports is registration.

The import overhead is also greatly magnified by Python's multiprocessing module: all subprocesses will have to spend the time and RAM to rerun the imports.

Global states

Registries are typically defined as a global dictionary, so they share many of the problems inherent to global state.

Name conflicts

It's not uncommon that different users register different objects under the same name -- at a large scale that's guaranteed to happen.

Such conflicts can live in two users' own code for a long time, unnoticed, until one day someone needs to depend on both. The only viable solution is usually to rename one, which breaks all of its users.

Overwrites

To complicate the issue even more, people sometimes decide to resolve name conflicts by overwriting what's already registered. For example, an "overwrite" option is provided in the registry of iopath, mobile_cv, and paxml. Using this option may introduce hard-to-debug problems, because now an innocent import statement may silently change the behavior of user code.
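
How an overwrite can silently change behavior can be illustrated with a toy registry. The `register` function and its `overwrite` flag below are hypothetical: they only mimic the general shape of such an option and are not the API of any of the libraries named above.

```python
from typing import Any, Dict

_REGISTRY: Dict[str, Any] = {}


def register(name: str, obj: Any, overwrite: bool = False) -> None:
    """Hypothetical registration function with an opt-in overwrite flag."""
    if name in _REGISTRY and not overwrite:
        raise KeyError(f"{name!r} is already registered")
    _REGISTRY[name] = obj


# Team A's module registers its model at import time.
register("resnet", "team_a.ResNet")

# Team B's module picks the same name; without overwrite this fails loudly.
try:
    register("resnet", "team_b.ResNet")
    conflict_detected = False
except KeyError:
    conflict_detected = True

# With overwrite=True the conflict "goes away", but whichever module is
# imported last now decides what Team A's existing configs resolve to.
register("resnet", "team_b.ResNet", overwrite=True)
resolved = _REGISTRY["resnet"]
```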

Despite this, note that overwriting is actually necessary when working in notebooks, where it's common to "reload" code (and therefore re-register objects) on the fly. Here is some code I use to always enable overwrite during reload.

Pickle & multiprocessing

When running a function using a multiprocessing.Process created with a safe start_method like "spawn", the child process receives a pickled closure from its parent, so it knows what to run. However, this pickle does not include any global state. This implies that if a function accesses global state, it may behave differently depending on whether it runs in the subprocess or the parent process. Python's documentation has a clear warning about this:

if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.

The ray framework can run a pickled Python function remotely, and therefore it has similar (and even more counter-intuitive) issues.
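
A small sketch of why the pickle carries no global state: Python pickles a module-level function by reference (its module and qualified name), so whatever a registry contained in the parent is not part of the payload. `REGISTRY` and `lookup` are hypothetical names, and the sketch assumes it runs as a regular script so that `lookup` can be pickled.

```python
import pickle
from typing import Any, Dict

REGISTRY: Dict[str, Any] = {}


def lookup(name: str) -> Any:
    """Reads global state; its result depends on what has been registered."""
    return REGISTRY[name]


# Registration done at runtime in the parent (e.g. inside main()).
REGISTRY["my_dataset"] = "contents-known-only-to-the-parent"

payload = pickle.dumps(lookup)

# The payload only names the function; the registry contents are absent.
has_function_name = b"lookup" in payload
has_registry_state = b"contents-known-only-to-the-parent" in payload
```

A spawned child that unpickles such a payload re-imports the module and sees only what is registered at import time; a registration performed later in the parent is simply missing there.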

Obscure Provenance

Since the registration is globally accessible, it's not easy to find where in the code an object is registered (or modified, if overwrite is allowed) just by reading the code. When a user sees cfg.dataset.name = 'dataset_X' and is curious what is "dataset_X", a global string search is almost the only way to find it out without running the code. And the search does not always work: if the name is programmatically generated, then the string cannot be found directly in source code, e.g.:

for dataset in ["ds1", "ds2"]:
    for r in [0.1, 0.5]:
        DatasetRegistry.register(f'my_{dataset}_ratio{r}', MyDataset(dataset, r))

In this case, users will have to be more creative about what strings to search.


In C++, registries cause more trouble because construction and destruction of global objects are very tricky. In Safe Static Initialization, No Destruction I talked about a few of PyTorch's C++ bugs related to this. Luckily, in Python, there are better alternatives.

Alternative: module name + variable name

If the only goal of registration is to provide a name → object mapping, then a simple alternative in Python is to use obj.__module__ + '.' + {variable name} as the name, which may look like some_library.some_module.MyClass. "Variable name" can be obj.__qualname__ for classes & functions.

Given this string, one can then call a simple function such as the standard library's pydoc.locate to obtain the object it refers to: modules will be imported on-demand by importlib.
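
As a quick illustration of the round trip from object to name and back, here is a minimal sketch that uses a standard-library class in place of a user-defined one. The helper `full_name` is a hypothetical name introduced only for this example.

```python
import pydoc
from collections import OrderedDict


def full_name(obj) -> str:
    """Build the 'module.qualname' string for a class or function."""
    return f"{obj.__module__}.{obj.__qualname__}"


name = full_name(OrderedDict)   # "collections.OrderedDict"
located = pydoc.locate(name)    # imports `collections` on demand

# pydoc.locate returns None instead of raising when nothing is found.
missing = pydoc.locate("no_such_module_xyz.Thing")
```

Because a failed lookup yields None, callers will probably want to check the result and raise a clearer error themselves.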

Use registration:

# my_lib/my_module.py:
@ModelRegistry.register()
class MyModel(...):
    ...

# main.py --name=MyModel:
from my_lib import ModelRegistry
import my_lib.my_module  # import to register
model = ModelRegistry.get(args.name)

Use full qualname:

# my_lib/my_module.py:
class MyModel(...):
    ...

# main.py --name=my_lib.my_module.MyModel
import pydoc
model = pydoc.locate(args.name)

This pattern has some obvious advantages over registration:

  1. No need to import any unused modules. Modules are imported on-demand.
  2. No global states.

There are some common concerns about this pattern, but they are not hard to address.

  1. It's slightly harder to dynamically create candidates: there is no "registry" to add objects to, and the only equivalent is to edit the globals() dictionary directly.

    for dataset in ["ds1", "ds2"]:
        for r in [0.1, 0.5]:
            globals()[f"my_{dataset}_{r}"] = create_dataset(dataset, r)

    I don't consider this a big issue because it's actually discouraging bad practice: in a proper config system (e.g. one that's based on recursive instantiation) there should be no need to dynamically generate candidates like above. I hope to get to this in a future article.

  2. The names in config have to match the names of classes/functions in code.

    This has the benefit of clarity on one hand. But on the other hand, code owners have more responsibility to maintain backward compatibility, especially after renaming their classes and files. The standard good practice suffices to address this: distinguish private vs. public symbols; keep an alias from the deprecated name to the new name; etc.

  3. The names are too long.

    This is a real problem. Here are some possible ways to address it:

    • A $PATH-like mechanism can be used to specify which modules to search for names. The search path can include common prefixes like "my_lib.my_module" so that users only have to provide "MyModel".

    • There can be a registry-like Mapping[str, str] that maps from "MyModel" to "my_lib.my_module.MyModel" so that users don't have to write long strings. This mapping doesn't have to be global and doesn't introduce import overhead. This can help with problem (2) as well.

    • This is just a UI-level issue. Having a better config frontend, e.g. using Python code as the config language, can make this issue disappear! Let me save this for a future article.
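
The first two ideas above (a search path and a string-to-string alias table) could be combined roughly as follows. This is only a sketch under assumptions: `resolve`, `search_path`, and `aliases` are hypothetical names, and standard-library modules stand in for "my_lib.my_module".

```python
import pydoc
from typing import Any, Mapping, Optional, Sequence


def resolve(
    name: str,
    search_path: Sequence[str] = (),
    aliases: Optional[Mapping[str, str]] = None,
) -> Any:
    """Hypothetical resolver: short name -> object, importing lazily."""
    # 1. An explicit alias maps a short name to a full name (plain strings only).
    full_name = (aliases or {}).get(name, name)
    # 2. Try the name as given, then under each prefix in the search path.
    candidates = [full_name] + [f"{prefix}.{full_name}" for prefix in search_path]
    for candidate in candidates:
        obj = pydoc.locate(candidate)
        if obj is not None:
            return obj
    raise LookupError(f"Cannot resolve {name!r}; tried {candidates}")


# Standard-library modules stand in for "my_lib.my_module" here.
by_path = resolve("OrderedDict", search_path=["collections"])
by_alias = resolve("Decoder", aliases={"Decoder": "json.decoder.JSONDecoder"})
```

Neither the search path nor the alias table imports anything until a name is actually requested, and both can be passed in explicitly rather than kept as global state.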
