This post is a write-up of my talk at PyCon ’22. I will explain how we distribute our internal libraries, what we tried in the past, and what we ended up using now. At my company medaire, our main product is set up as a dockerized microservice architecture. Keep that in mind, as a lot of these experiences are specific to that kind of setup. Hopefully it will be helpful for other kinds of setups as well.
At first we started out with just a handful of components, say: retrieve data, process (ML), generate report, and finally send that out. These would have some external dependencies like `numpy`, `matplotlib`, or `sqlalchemy`, which we store in a `requirements.txt` and pull from PyPI, the official Python Package Index.
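Such a `requirements.txt` is nothing fancy, just one requirement per line (the version pins below are only for illustration):

```
# requirements.txt -- external dependencies, resolved against the public PyPI
numpy==1.22.3
matplotlib==3.5.1
sqlalchemy==1.4.36
```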
Over time, we added more and more components: a work queue manager, telemetry, etc. These started to share more and more code, so it made sense to extract that into a common library. But where do we store that?
A short detour/rant: Python gives us a beautiful easter egg explaining some of its core design philosophy. Just type `python -m this`. And when it comes to packaging and package distribution, the ecosystem falls short of line 13:

> There should be one—and preferably only one—obvious way to do it.
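For reference, this is what the command prints (abridged):

```
$ python -m this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
...
There should be one-- and preferably only one --obvious way to do it.
...
```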
While languages like Rust or Go have their canonical package manager built in, in the Python ecosystem there are many options to choose from:

- `pip`, which replaced `easy_install`, is almost official, but needs to be installed separately
- Conda is very popular in data science and also handles many other kinds of packages; it was designed specifically because `pip` can be pretty hard to set up for beginners
- Poetry is the new kid on the block
- and there are even more
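To illustrate, adding the same third-party package looks slightly different with each of these tools (the package name is just an example):

```
pip install requests      # plain pip, usually inside a virtual environment
conda install requests    # Conda, pulling from a conda channel
poetry add requests       # Poetry, which also updates pyproject.toml and its lock file
```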
And at least for `pip` you can also install packages in various formats from many different kinds of sources in `requirements.txt`:

- a local path like `../../mylib`
- a wheel, which can contain compiled extensions
- an egg, which is basically an older predecessor of the wheel format
- a git URL like `git+ssh://git@github.com/user/repo@4398f09345a3004`
- a URL to a zip file of a package directory
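A `requirements.txt` mixing these kinds of sources could look roughly like this (all paths, names, and URLs are made up):

```
# requirements.txt -- different kinds of requirement lines pip understands
../../mylib                                            # local path
./wheels/mylib-1.2.0-py3-none-any.whl                  # pre-built wheel
git+ssh://git@github.com/user/repo@4398f09345a3004     # git URL pinned to a commit
https://example.com/packages/mylib-1.2.0.zip           # zip of a package directory
```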
And because you need to separate the dependencies of different projects, so that you can install different versions side by side, you also need to use some sort of environment separator like:

- [venv][venv], which is in the standard library
- but the Python Packaging Authority actually recommends using the `virtualenv` wrapper
- but also recommends using `pipenv`
- and Conda has its own thing, conda environments
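As a baseline, the standard-library `venv` workflow looks something like this (the directory name `.venv` is just a common convention):

```
python -m venv .venv                # create an isolated environment in ./.venv
source .venv/bin/activate           # activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt     # dependencies now land inside .venv only
```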
So as you can see, this ecosystem can be quite confusing, especially for beginners.
So in the end, we decided that most of this logic is pretty generic and could very well be open source. We just put it on a public GitHub repository, used tags for versioning and pointed our `requirements.txt` to the git URL as shown above. We did not need to set up any kind of authentication to GitHub on our workstations or CI, because the code was out in the open anyway.
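In practice the requirement line looked something like this (repository name and tag are made up):

```
# requirements.txt -- public library, pinned to a release tag
git+https://github.com/example-org/common-lib.git@v1.2.0
```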
That worked pretty well for a while, but then we added more classification components that contained confidential business logic. So while these components also contained shared code that would be worth extracting into a library, we did not want to make this one open source. So our old solution was just not cutting it.
We decided to stick with what we know: PyPI works great for external dependencies, so why not use it for internal dependencies? We set up our own private PyPI instance, built wheels for our internal libraries and pushed them there. Now we can add those to our `requirements.txt` just like the external ones. But over time, we found more and more problems with this approach.
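The workflow was roughly: build a wheel, push it to the private index, then resolve against that index. A sketch using standard tooling (`build`, `twine`); hostnames and package names are made up:

```
# build a wheel for the internal library
python -m build                     # produces e.g. dist/mylib-1.2.0-py3-none-any.whl

# upload it to the private index
twine upload --repository-url https://pypi.internal.example/ dist/*

# consume it like any external dependency
pip install --extra-index-url https://pypi.internal.example/simple/ mylib==1.2.0
```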
- Authentication: We do not want our private PyPI to be accessible from the open internet. In the end, that was the whole point of setting this up. So we put it behind a VPN. But now we need to set up the VPN on all the workstations that need to be able to build our Docker images. And on top of that, we have to set up the VPN connection on our CI. That is a lot of work, and you have to take care not to accidentally expose those credentials.
- Speed of iteration: We found that a PyPI is well suited to serving stable versions of libraries; think of it as an archive of packages that you can go back to when you need to rebuild old image versions. But iterating on the library while working on a feature in another project can be a hassle: make some changes to the library, build a new WIP wheel, rebuild the project to pull the new version, find the typo, rinse and repeat. Afterwards, clean up all those WIP wheels… oh wait, I do not have permission to delete here (which is a good thing). You see where I am going.
- Single point of failure: Now we have this critical service; if the PyPI is down, no one can build images and we are all blocked. We had problems with the availability and performance of the instance, but we actually just want to write code, not host infrastructure.
- Version compatibility: In a lot of cases you need to rebuild your wheels when you upgrade your Python version, which can be a hassle, especially if you have to rebuild old legacy versions that some old project still depends on.
- Security: Dependency confusion attacks are very real, as demonstrated in this blogpost and news coverage. If someone adds a package with the same name as one of your internal dependencies to the public PyPI, `pip` will (with default settings) silently favor that one. That can be a route to inject malicious code into your codebase!
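To make the last point concrete: when an extra index is configured, pip collects candidates from all indexes and simply picks the highest matching version, regardless of which index it came from (package name and URL below are made up):

```
# both indexes are consulted and versions are compared across them
pip install --extra-index-url https://pypi.internal.example/simple/ mylib

# if someone uploads a public "mylib" 99.0 to pypi.org,
# pip will happily prefer it over the internal 1.2.0
```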
After trying out different approaches to handle this problem, we settled on not using a package manager at all, and instead including all the internal dependencies explicitly, directly in the project's repository. And that is exactly what git submodules give us.
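As a minimal sketch of the idea (repository URL and path are made up), each internal library lives inside the project repository as a submodule pinned to an exact commit:

```
# vendor the internal library directly into the project repository
git submodule add git@github.com:example-org/common-lib.git libs/common-lib

# clone the project together with its pinned submodule revisions
git clone --recurse-submodules git@github.com:example-org/project.git
```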
please check back in a couple of days for the complete version