OCI Artifacts Explained
The OCI (Open Container Initiative) manages a few specifications and projects related to the storage, distribution, and execution of container images. If you’ve ever run a Docker container, you’ve interacted with these specifications, whether you know it or not.
OCI registries are typically used for container images, but it’s actually possible (and even a good idea in some cases) to use them for storing and distributing other types of data. There are a couple of techniques for doing this, and one of them is commonly referred to as “OCI Artifacts”. Even though they’re in relatively widespread use, I’ve seen a lot of confusion recently around what an OCI Artifact actually is. Here’s my take:
OCI Artifacts are not a new specification, format, or API. They’re a set of somewhat-contradictory conventions for how to store things other than images inside an OCI registry.
This post explains what these conventions and contradictions are and why you might want to consider using them, and it covers some of the activity and proposals going on in the OCI right now.
What even is an OCI?
To fully understand what an OCI Artifact is, we first have to understand what an OCI image is. Let’s start with the confusing part — you’re probably not actually using OCI images! Yes, even if you run hundreds of thousands of containers pulled from an OCI-compliant registry into an OCI-compliant runtime, deployed across a thousand-node Kubernetes cluster, you’re probably not using OCI images. What? OK fine, this is a bit pedantic, but this entire blog post is pedantic. Let me explain.
First, Docker images and OCI images are only mostly the same thing. In the early days of the OCI, Docker was by far the most widespread container format. As vendors pushed for a neutral specification, reusing the Docker image manifest format made the most sense as a starting point. The only problem was that this format contained a bunch of hard-coded references to Docker the company.
The Docker manifest (known as Image Manifest Version 2, Schema 2) looks roughly like this:
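A sketch of that shape, written as a Python dict and printed as JSON. The sizes and digests are made-up placeholders rather than values from a real image; the three mediaType strings are the ones Docker uses for the manifest, the config, and the layers.

```python
import json

# Shape of a Docker manifest. Sizes and digests are placeholders,
# not taken from a real image.
docker_manifest = {
    "schemaVersion": 2,
    # mediaType for the manifest as a whole
    "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
    "config": {
        # mediaType for the image config blob
        "mediaType": "application/vnd.docker.container.image.v1+json",
        "size": 7023,
        "digest": "sha256:" + "a" * 64,
    },
    "layers": [
        {
            # mediaType for each filesystem layer blob
            "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
            "size": 32654,
            "digest": "sha256:" + "b" * 64,
        },
    ],
}

print(json.dumps(docker_manifest, indent=2))
```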
If you’ve ever worked with Docker images, this should look pretty familiar. There’s some boilerplate at the top, a config section, and a list of layers. The OCI format could use its own blog post or series, so I won’t dive too far in here. The boilerplate is the important part to look at to understand what an OCI image is and why this isn’t one. Specifically, the mediaType field refers to the entire manifest. Here, we can see that it is set to application/vnd.docker.distribution.manifest.v2+json, with the docker string still present. The OCI reserved some new mediaTypes with IANA. Those are:
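For reference, here are the config and gzipped-layer media types from the OCI image specification, sketched as Python constants next to their Docker counterparts. The `to_oci` helper is a hypothetical name used only for illustration.

```python
# OCI media types for the config blob and for gzipped layer blobs.
OCI_CONFIG = "application/vnd.oci.image.config.v1+json"
OCI_LAYER = "application/vnd.oci.image.layer.v1.tar+gzip"

# The Docker strings they correspond to.
DOCKER_TO_OCI = {
    "application/vnd.docker.container.image.v1+json": OCI_CONFIG,
    "application/vnd.docker.image.rootfs.diff.tar.gzip": OCI_LAYER,
}


def to_oci(media_type: str) -> str:
    """Map a Docker config/layer media type to its OCI counterpart.

    Media types that are not in the table are returned unchanged.
    """
    return DOCKER_TO_OCI.get(media_type, media_type)


print(to_oci("application/vnd.docker.container.image.v1+json"))
```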
If you’re paying close attention, you’ll see that there are only two mediaTypes included here, but Docker used three earlier. I’ll explain that in a bit. Nothing else substantially changed in the format after the transition from Docker to OCI other than these strings.
Most registries added support for the new mediaTypes over time, but tooling that wanted to remain portable needed to continue to support the old formats. This led to a cycle, which is still somewhat present today: tooling worked fine, so no one was begging for support for these new formats, so registries were slow to add support, so tooling continued to remain backward compatible with the old types. The upshot is that you’re probably not actually using OCI images; you’re using Docker images.
So we now know what OCI images are. What are OCI Artifacts? The Artifacts initiative set out to make the registry format a bit more generic, to allow users to store and distribute arbitrary files. We’ve already had blob storage and binary distribution systems for years, so why would we want to reuse a container format? Well, it turns out the container format has some really nice features that make it suitable for other environments (mostly the content-addressable API).
It’s also widely supported by platform vendors, and a registry happens to be the only real dependency for bringing up a production Kubernetes environment: you need to host the k8s images themselves somewhere. While it is possible to run a registry in a Kubernetes cluster, that starts to bring in some chicken-and-egg problems.
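To make the content-addressable part concrete: blobs in a registry are referred to by a digest of their bytes, so a client can check that what it fetched is what it asked for. A minimal sketch, assuming SHA-256 digests; `digest` and `verify` are illustrative helper names, not part of any registry client.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return the content address of a blob in the 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify(blob: bytes, expected: str) -> bool:
    """Check that a fetched blob matches the digest it was requested by."""
    return digest(blob) == expected


payload = b"any bytes at all, not just image layers"
address = digest(payload)

print(address)
print(verify(payload, address))         # same bytes, same address
print(verify(payload + b"!", address))  # changed bytes no longer match
```

Nothing in this scheme cares what the bytes are, which is part of why the format works for things other than images.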
So we have a pretty good API that’s widely supported and can reasonably be assumed to be available. Why not stick other things in here? Kubernetes-adjacent projects like Helm, TektonCD, OPA, and more started to pick up on this trend, allowing for configuration, charts and policy modules to be stored in an OCI registry. Other package managers followed suit too, like Homebrew and even WASM.
These other formats figured out ways to fit their data model into the OCI image manifest, mostly successfully. The manifest gives you one config blob and then a list of layer blobs, but clients can choose to interpret these however they wish. Most seem to use just one layer for all package data, but some make use of multiple.
Things mostly work out fine in this model, but if you have a single registry with different types of data inside of it, things can get confusing for clients. This is normally what mediaTypes are for: you can indicate what type of data is contained in a blob. Clients read this before interpreting data and either accept or ignore the data they want. This works great for the layer fields, but the OCI image manifest does not actually allow users to set the manifest.mediaType for the overall object.
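The accept-or-ignore step for layers can be sketched as a simple filter. The media type strings, the sample manifest, and the `usable_layers` helper are all hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical layer media type that this imaginary client understands.
SUPPORTED = {"application/vnd.example.chart.content.v1.tar+gzip"}

# Hypothetical manifest mixing a layer the client understands with one it does not.
manifest = {
    "schemaVersion": 2,
    "config": {"mediaType": "application/vnd.example.chart.config.v1+json"},
    "layers": [
        {"mediaType": "application/vnd.example.chart.content.v1.tar+gzip",
         "digest": "sha256:" + "c" * 64, "size": 1024},
        {"mediaType": "application/vnd.example.other.v1",
         "digest": "sha256:" + "d" * 64, "size": 2048},
    ],
}


def usable_layers(manifest: dict, supported: set) -> list:
    """Keep only the layers whose mediaType the client knows how to handle."""
    return [
        layer for layer in manifest.get("layers", [])
        if layer.get("mediaType") in supported
    ]


for layer in usable_layers(manifest, SUPPORTED):
    print(layer["mediaType"], layer["digest"])
```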
Remember how we had three mediaTypes in a Docker image (one for the manifest, one for the config, and one for the layers), but OCI images only had two (one for the config and one for the layers)? This is due to backwards compatibility issues with older Docker clients. The manifest.mediaType must either be the legacy Docker one or unset. This means there’s no great place to indicate what is contained by the overall manifest, so most artifact types just don’t set this.
To try to improve on this model, the OCI Artifacts project set out to define some conventions here. They decided that the best place to indicate the type of the overall artifact would be the config.mediaType field.
This isn’t ideal, but it’s the best option really available. That means a random artifact type might look like:
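A sketch of such a manifest as a Python dict. The `application/vnd.example.thing...` media types are invented for illustration, the sizes and digests are placeholders, and `artifact_type` is a hypothetical helper showing how a client might read the type back.

```python
import json

# Hypothetical artifact manifest. Following the description in this post,
# there is no top-level mediaType; the custom config.mediaType carries the
# type of the whole artifact.
artifact_manifest = {
    "schemaVersion": 2,
    "config": {
        "mediaType": "application/vnd.example.thing.config.v1+json",  # invented
        "size": 2,
        "digest": "sha256:" + "e" * 64,  # placeholder
    },
    "layers": [
        {
            "mediaType": "application/vnd.example.thing.content.v1.tar+gzip",  # invented
            "size": 4096,
            "digest": "sha256:" + "f" * 64,  # placeholder
        },
    ],
}


def artifact_type(manifest: dict) -> str:
    """Read the artifact type from config.mediaType, per the convention."""
    return manifest["config"]["mediaType"]


print(json.dumps(artifact_manifest, indent=2))
print(artifact_type(artifact_manifest))
```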
You can see that the config blob is set to a custom type to indicate the format for the entire object. That’s what an OCI artifact is! Something:
- other than an image
- stored in a registry
- that sets a custom type in the config.mediaType field
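Here’s the contradiction: not enough registries accept custom values in this field for tooling to be able to rely on it. The OCI image specification even recommends against using this if you care about portability: it advises that manifests concerned with portability stick to the media types the specification itself defines.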
Many registries still don’t even support OCI mediaTypes at all, or have an allowlist preventing custom config.mediaType values. This leads back to our circle of stale specifications: tools can’t rely on being able to set this field, so they just don’t. People either communicate information about what type of content is in each image out-of-band, or use other conventions/heuristics to figure out whether they understand how to work with the content of an image.
Tool authors can’t rely on this feature working, so they don’t technically use OCI Artifacts per the OCI definition, they just store arbitrary artifacts inside of OCI (or even Docker) manifests and work out how to fetch them some other way.
I hope this post explains things a bit more clearly. An OCI Artifact is not a new type, or a new specification registries and tools need to add support for. It’s a documented convention around how to use a specific field inside of a manifest to indicate what the manifest contains, that’s unfortunately not widely supported enough to be used yet. Some specifications rely on it (WASM, mostly), while the others don’t.
So what’s next? I do think that storing other things inside a registry is a great idea, and even maintain a bunch of tools that do this. I’d love to see some changes to the specification to make this easier. Requiring registries to allow other config.mediaTypes might be one place to start. There are a few other proposals around completely new API versions floating around that seem useful. There are also some other proposed changes in flight around linking stored data together in ways not possible today.
These all seem great, but getting these rolled out widely enough to be useful to tooling will take years if it happens at the same pace as the last round of specification changes. Hopefully we can speed things up!