Getting to the bottom of the Python import system

Mon Jan 16 2023
E.W. Ayers

What happens when you type import foo.bar.baz in Python? The answer is really complicated! Read this if you've ever found yourself wondering how the import system actually resolves, creates and initialises modules.

The complexity comes from caching, legacy layers, and a pile of nitpicky edge cases; section 1.2 goes into these in more detail.

Recommended reading is Chapter 5 of the Python Language Reference.

0.1. What is a module?

A Python module is a Python object with type ModuleType. Every module has a __name__ attribute. Loaded modules live in a dictionary called sys.modules, keyed by name.
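These facts are easy to check at the REPL; here is a small sketch using the stdlib json module:

```python
import sys
import types

import json

# every imported module is an instance of ModuleType ...
assert isinstance(json, types.ModuleType)
# ... knows its own name ...
assert json.__name__ == 'json'
# ... and is cached in the sys.modules dictionary under that name
assert sys.modules['json'] is json
```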

0.2. What is a package?

A package is a module with a __path__ attribute. The idea is that a package is a module that can contain other modules. If a module m is a member of a package p, we have m.__package__ == p.__name__ (note that __package__ stores the package's name, not the package object itself).
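For example, json is a package while json.decoder is a plain module inside it:

```python
import json
import json.decoder

# a package carries a __path__ listing the directories its submodules live in
assert hasattr(json, '__path__')
# a plain module inside it does not
assert not hasattr(json.decoder, '__path__')
# __package__ holds the *name* of the containing package
assert json.decoder.__package__ == 'json'
```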

1. What happens when you import?

We'll come back to relative imports. When you type import foo.bar.baz as x, this is syntactic sugar for x = importlib.import_module('foo.bar.baz'). If we were to reimplement import_module, it would look something like this:

  1. Check the sys.modules cache to see if it's already there.

  2. Resolve the module by calling importlib.util.find_spec(name), to return a thing called a ModuleSpec. A module spec is a load of metadata about the module and a Loader object that decides how the module object is created and initialised.

  3. Create the module using the given Loader object

  4. Add metadata attributes like __name__ to the module

  5. Add it to sys.modules

  6. Initialise the module.

  7. Return the module

In pseudo-python:

(1)
def import_module(name: str):
    # if the module is already loaded in the
    # sys.modules cache, just return that
    if name in sys.modules:
        m = sys.modules.get(name)
        assert m is not None
        return m
    # resolve the module name
    spec: ModuleSpec = importlib.util.find_spec(name)
    if spec is None:
        # we couldn't find a module with that name.
        raise ModuleNotFoundError(name)
    # create the module
    module = spec.loader.create_module(spec)
    # add metadata attributes to the module:
    # ie __name__, __spec__, __package__, __file__, ...
    _init_module_attrs(spec, module)
    sys.modules[name] = module
    # initialise the module
    spec.loader.exec_module(module)
    return module
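Back to the syntactic-sugar claim above: you can check that import_module hands back the very same cached object that the import statement binds:

```python
import importlib

import json.decoder as x

# the `import ... as` statement and import_module both go through
# sys.modules, so they produce the identical module object
y = importlib.import_module('json.decoder')
assert x is y
```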

Caveats: the real implementation also handles loaders whose create_module returns None (the machinery then creates a default module object itself), removes the entry from sys.modules again if exec_module raises, and takes locks so that concurrent imports are safe.

1.1. What is importlib.util.find_spec doing?

How this works is really complicated. The basic task is to take a module name and spit out a ModuleSpec, which is all of the information needed to load a module into the python runtime.

1.1.1. Summary

Let's start by stating the usual path that find_spec takes:

  1. Start with the module name "foo.bar.baz"

  2. Make sure parent modules foo and foo.bar are imported.

  3. If there is a parent module, set paths = foo.bar.__path__; otherwise use sys.path. The paths are directories that the import system should look in to find modules. Eg for me numpy.__path__ = ['~/.pyenv/versions/3.10.6/lib/python3.10/site-packages/numpy']. sys.path typically contains the directory of the running script, any entries from PYTHONPATH, your site-packages directory, and the paths of any folders you have installed with pip install -e.

  4. The system looks in all of the paths directories for either baz.py or baz/__init__.py.

  5. If it finds one of those it returns a ModuleSpec with the loader being a SourceFileLoader.

  6. In the case of __init__.py, the module is a package (ie the module's __path__ attribute is set to be the directory of the file)
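The short version above can be checked by inspecting a real spec:

```python
import importlib.util

spec = importlib.util.find_spec('json')
print(spec.name)  # json
# on a normal CPython install the stdlib json module
# is loaded from source, via a SourceFileLoader
print(type(spec.loader).__name__)
# a package's spec carries submodule search locations,
# which become the module's __path__ after loading
print(spec.submodule_search_locations is not None)  # True
```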

1.1.2. Longer Summary

  1. Start with the module name "foo.bar.baz"

  2. Make sure parent modules foo and foo.bar are imported.

  3. If there is a parent module, set paths = foo.bar.__path__; otherwise use sys.path.

  4. For each 'meta finder' in sys.meta_path, run find_spec("foo.bar.baz", paths).

  5. Usually, this falls through to the last finder in the sys.meta_path list called PathFinder.

  6. PathFinder, for each p in paths and each hook in sys.path_hooks, runs hook(p).find_spec("foo.bar.baz") and returns the first result that doesn't throw an ImportError or return None.

  7. Usually, this falls through to a FileFinder(p).find_spec('foo.bar.baz') which does the following.

  8. Get the tail module name: "baz". We succeed if any of the following exist in the p directory: baz.py, baz/__init__.py, or a bare directory baz/ (a 'namespace package'; we'll come back to this case).

  9. A ModuleSpec is returned with the loader being a SourceFileLoader. If the extension above was .pyc then a SourcelessFileLoader is used.

1.1.3. The Gory Details

There is a list of MetaPathFinder objects living in sys.meta_path. You can modify sys.meta_path to include your own entries. A MetaPathFinder has one method, find_spec, that returns a module spec given a module name and an optional list of filesystem paths to search for the module.

importlib.util.find_spec will run through all of the finders in sys.meta_path, making sure that parent packages (ie, modules with a __path__ attribute) are imported first. If there is a parent module (eg foo is the parent package of foo.bar), foo.__path__ is passed as the path argument to the finder. The pseudocode for this is below.

(2)
def find_spec(name, paths=None, target=None):
    parts = name.split('.')
    # parts = ["foo", "bar", "baz"]
    if len(parts) > 1:
        parent_name = ".".join(parts[:-1])  # 'foo.bar'
        parent_module = import_module(parent_name)
        # the __path__ field on parent_module is
        # a list of file-paths that are used to resolve the module.
        paths = parent_module.__path__
    for finder in sys.meta_path:
        spec = finder.find_spec(name, paths)
        if spec is not None:
            return spec
    return None

There are lots of MetaPathFinders in sys.meta_path that do various things, and libraries like to add their own too. The main, fallback finder is called PathFinder (source) and essentially does the following (+ caching + error handling + legacy + 'namespaces'):

(3)
class PathFinder:
    @classmethod
    def find_spec(cls, fullname, paths=None):
        if paths is None:
            paths = sys.path
        for path in paths:
            # find the first hook that doesn't throw
            finder = None
            for hook in sys.path_hooks:
                try:
                    finder = hook(path)
                    break
                except ImportError:
                    continue
            if finder is None:
                continue
            spec = finder.find_spec(fullname)
            if spec is None:
                continue
            return spec
        return None

So, there is a list of functions called sys.path_hooks of type List[Callable[[str], PathEntryFinder]], where each returned PathEntryFinder is yet another abstract class that you have to call find_spec on, this time with no path argument.

In sys.path_hooks, the two default 'path hooks' are a zip importer and a FileFinder (source). FileFinder is the main one. A FileFinder is initialised with a path : str, the directory that the finder is in charge of searching, together with a list of extension suffixes (".py", ".pyc", ...) and their corresponding loaders (SourceFileLoader, SourcelessFileLoader). For a suffix x, FileFinder looks for a file p/baz{x} or p/baz/__init__{x} and returns a ModuleSpec with the relevant loader.
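You can watch this machinery work by handing the hooks a scratch directory yourself (the directory and the module name demo below are invented for this sketch):

```python
import os
import sys
import tempfile

# a scratch directory containing one module file
d = tempfile.mkdtemp()
with open(os.path.join(d, 'demo.py'), 'w') as f:
    f.write('ANSWER = 42\n')

# try each path hook in turn; the zipimporter hook raises an
# ImportError for a plain directory, so we fall through to FileFinder
finder = None
for hook in sys.path_hooks:
    try:
        finder = hook(d)
        break
    except ImportError:
        continue

print(type(finder).__name__)  # FileFinder on a normal CPython install
spec = finder.find_spec('demo')
print(spec.origin)            # .../demo.py
```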

1.1.4. How to extend find_spec?

So, if you want to extend the module loading system with your own stuff, you can:

  • add your own MetaPathFinder to sys.meta_path

  • add your own path-entry hook to sys.path_hooks

  • or simply append directories to sys.path
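As a sketch of the first option, here is a toy MetaPathFinder that serves a made-up module name (hello_virtual is invented for this demo) without any file on disk:

```python
import sys

from importlib.abc import Loader, MetaPathFinder
from importlib.util import spec_from_loader

class HelloLoader(Loader):
    def create_module(self, spec):
        return None  # None means: let the machinery create a default module

    def exec_module(self, module):
        # "initialise" the module by populating its namespace
        module.greeting = 'hello from a custom finder'

class HelloFinder(MetaPathFinder):
    def find_spec(self, fullname, path=None, target=None):
        if fullname == 'hello_virtual':
            return spec_from_loader(fullname, HelloLoader())
        return None  # fall through to the remaining finders

sys.meta_path.insert(0, HelloFinder())

import hello_virtual

print(hello_virtual.greeting)  # hello from a custom finder
```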

1.2. Why is this so complicated?

  1. Caching: Each of the stages I outlined above also has a caching stage. Additionally, you need mechanisms to invalidate the cache so you can do live-reload operations.

  2. Legacy: there used to be just one finder class, Finder, but this wasn't good enough because you need different finders for different cases, so an extra layer of meta-finders was added to find the finders.

  3. Nitpicky edge cases:

    • namespace modules

    • packages

    • loading modules from non-python source

    • loading modules direct from archives

    • lots of different places where packages can be stored: environments, conda, the internet etc.

2. How does the import system decide to add __path__?

Given any module, you can make it a package by simply adding a __path__ attribute. However, if your module is an __init__.py file, the import system automatically sets __path__ to a list containing the directory that holds the __init__.py.

3. What about relative imports?

A relative import is an import where the module name being imported starts with a dot, for example from . import foo or from .foo import x. To resolve it, you take the current module m that is running the import; you take its parent package name m.__package__ (caveats); and you prepend that to the relative name and do an absolute import.

If there are multiple dots, as in from ..foo import x, you climb one extra package level for each extra dot.
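The stdlib exposes this dot-stripping logic as importlib.util.resolve_name, which makes the rule easy to check (the names a.b.c and foo here are just placeholders):

```python
import importlib.util

# one dot: stay in the current package
assert importlib.util.resolve_name('.foo', 'a.b.c') == 'a.b.c.foo'
# two dots: climb one package level first
assert importlib.util.resolve_name('..foo', 'a.b.c') == 'a.b.foo'
```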

This definition of relative import sucks because it means that in order to use them, your Python files need to be inside a package to import from each other. The shortcut way to do this is to just add __init__.py files everywhere.

I recommend never using relative imports except inside of __init__.py files. It's just not worth it.

4. What are namespace packages?

A namespace package is a python package that doesn't have an associated module (ie no __init__.py). The idea is you can split a package across multiple files. See this Stack Overflow answer for more detail. Adding namespace packages complicates the logic for find_spec.
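A quick way to see one in action is to build a directory tree with no __init__.py at all (the names nspkg, sub and mod are invented for this demo):

```python
import importlib
import os
import sys
import tempfile

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'nspkg', 'sub'))
# note: no __init__.py anywhere in the tree
with open(os.path.join(tmp, 'nspkg', 'sub', 'mod.py'), 'w') as f:
    f.write('VALUE = 42\n')

sys.path.insert(0, tmp)
mod = importlib.import_module('nspkg.sub.mod')
print(mod.VALUE)  # 42

import nspkg
# namespace packages still get a __path__, but it is a special
# lazily-recomputed object rather than a plain list
print(type(nspkg.__path__).__name__)
```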

5. Sadly, __main__.

When you execute a python file with python foo.py, the given file is not loaded as the module foo. Instead, it is loaded as a special module called __main__. The main problem that this causes is that it breaks relative imports, since __main__.__package__ is None or the empty string rather than a real package name, so there is nothing to resolve the dots against. The main recommendation seems to be that you should just avoid using relative imports.
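A throwaway script (written to a temp file here) shows what the module gets named when run directly:

```python
import subprocess
import sys
import tempfile

# a script that reports its own module identity when executed
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write("print(__name__, repr(__package__))\n")
    path = f.name

out = subprocess.run([sys.executable, path],
                     capture_output=True, text=True).stdout
print(out.strip())  # __main__, with '' or None as the package
```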

6. Importing resources

[todo] this section is still under construction [todo]

Another cool thing that you can do with the Python import system is 'import' files that are not Python files. You can import data files or executable binaries.

Usually, if you want to read a file from a Python script, you call open('path/to/file'), but this assumes you know where the file is on disk. By 'importing' files instead, you access them relative to the package itself, so they are found wherever the package is installed, even when it was downloaded from PyPI.

Two places document this: the importlib.resources section of the standard library docs and the importlib-resources backport's documentation.

I'll try to keep with the example given in 'importlib-resources'. We have some folder structure:

(4)
mypkg/
    __init__.py
    resource.txt
    foo.py

Now in foo.py I can write:

(5)
from importlib.resources import files
data = files('mypkg').joinpath('resource.txt').read_text()

7. Module resolution failures that always get me

7.1. Basic importing from a directory is broken

Suppose our working directory looks like this:

(6)
asdf/
    b.py  # ← from asdf.c import X; Y = 5; print(X)
    c.py  # ← X = 4
a.py      # ← from asdf.c import X; print(X)

If I run python asdf/b.py, it fails with ModuleNotFoundError: No module named 'asdf', because sys.path[0] is the asdf/ directory itself rather than the project root. If I run python a.py, it will be ok!

One answer is to replace the import in b.py with from c import X. Then you can run python asdf/b.py and it's ok. But now, if we add a line from asdf.b import Y to a.py, we will get "no module named c".

I can't see how this is anything other than a flaw in Python. There is no way to import between the directories that doesn't break.

I usually get around this by making the root project folder a package with a pyproject.toml and then running pip install -e . (an editable install). But it's so miserable that I have to do that.