Getting Serious About Open Source Security
A not-so-serious look at a very serious problem
A Blast From the Past
2019 was a crazy time to be writing software. It's hard to believe how careless we were as an industry. Everyone was just having fun slinging code. Companies were using whatever code they found lying around on NPM, Pip, or Maven Central. No one even looked at the code these package managers were downloading for them. We had no idea where these binaries came from or even who wrote most of this stuff.
And don't even get me started on containers! There was no way to know what was inside most of them or what they did. Yet there we were, pulling them from Docker Hub, slapping some YAML on them, and running them as root in our Kubernetes clusters. Whoops, I just dated myself. Kubernetes was a primitive system written mostly in YAML and Bash that people used to interact with before Serverless came and saved us all.
Looking back, it's shocking that the industry is still around! How we didn't have to cough up every Bitcoin in the world to stop our databases from getting leaked or our servers from being blown up is beyond me. Thankfully, we realized how silly this all was, and we stopped using whatever code had the most GitHub stars and started using protection.
We’re Under Attack
No, really. Every time you `go get` or `mvn fetch` something, you’re doing the equivalent of plugging a thumb drive you found on the sidewalk into your production server.
You’re taking code from someone you’ve never met and then running it with access to your most sensitive data. Hopefully, you at least know their email address or GitHub account from the commit, but there’s no way to know if this is accurate unless you’re checking PGP signatures. And let’s be honest, you’re probably not doing that.
This might sound like I’m just fear-mongering, but I promise I’m not. This is a real problem that everyone needs to be aware of. Attacks like this are called supply-chain attacks, and they are nothing new. Just last month, an active RCE vulnerability was found in an open source package on PyPI that was being used to steal SSH and GPG credentials.
There are lots of variations on this same play that make use of different social-engineering techniques in interesting ways. One attacker used a targeted version of this to steal cryptocurrency from a few specific websites. Another group performed a “long-con” where they actually produced and maintained a whole set of useful open source images on Docker Hub for years before slowly adding, you guessed it, crypto-mining.
The possibilities are endless, terrifying, and morbidly fascinating. And they're happening more and more often. If reading about attacks like these is your kind of thing, the CNCF has started cataloging known instances of them. Snyk also just published a post detailing how easy it is to inject code like this in most major languages — GitHub even hides these diffs in code review by default! Russ Cox has also been writing about this problem for a while.
OK, there's a bit of hyperbole up there (Kubernetes doesn't have that much Bash in it), but open source is under attack, and it's not OK. Some progress is being made in this area — GitHub and others are scanning repositories, binaries, and containers, but these tools only catch known vulnerabilities. They have no mechanism to handle intentional, malicious ones before they are discovered, which are at least as dangerous.
The brutal fact is that there is no way to be confident about the code you find on most artifact repositories today. The service might be compromised and serve you a different package from the one the author uploaded. The maintainer's credentials might have been compromised, allowing an attacker to upload something malicious. The compiler itself might have been hacked, or even the compiler that compiler used (PDF warning)! Or, the maintainer could have just snuck something in on purpose.
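Until that kind of confidence exists, the closest practical defense is to pin artifacts by digest and verify them yourself before use. Here's a minimal sketch in Python — the pinned value below is just the SHA-256 of the bytes b"hello", standing in for the digest of a vetted release tarball you'd record in a lockfile:

```python
import hashlib

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True only if the artifact's SHA-256 matches the pinned digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

# Hypothetical pinned digest, recorded when the dependency was first vetted.
# (This is the well-known SHA-256 of b"hello", used here as a stand-in.)
pinned = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

print(verify_artifact(b"hello", pinned))     # True: bytes match the pin
print(verify_artifact(b"tampered", pinned))  # False: the artifact changed
```

This doesn't tell you the code was ever good — only that you're running the same bytes you reviewed last time, which rules out the registry (or a middlebox) silently swapping the package out from under you.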
For any given open source package, we need to be able to confidently assert what code it’s composed of, what toolchains and steps were used to produce the package, and who was responsible for each piece. This information needs to be made available publicly. A reliable, secure view of the supply-chain of every open source package will help make these attacks easier to prevent and easier to detect when they do happen. And the ability to tie each line of code and action back to a real individual will allow us to hold attackers accountable.
How Do We Get There?
We need to work as an industry to start securing open source software, piece by piece.
Artifact repositories need to support basic authentication best practices like 2FA, artifact signing, and strong password requirements. Docker Hub, PyPI, and NPM support 2FA, but there’s no way to see if a maintainer of a package is using it. Most container registries don't support signatures yet, though work is ongoing.
Software build systems need to make reproducible, hermetic builds possible and easy. Debian has started doing some great work here, but they're basically alone. Every docker build gives you a new container digest. Tar and gzip throw timestamps everywhere. It's possible to get reproducible builds in Go, Java, and most other major languages, but it’s not necessarily easy. See the recently published whitepaper on how Google handles much of this internally for more information.
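The gzip timestamp problem is easy to see for yourself: the gzip header embeds a modification time, so byte-identical input can produce different archives depending on when you ran the build. Pinning the timestamp — the same trick the reproducible-builds community uses via SOURCE_DATE_EPOCH — makes the output deterministic. A quick demonstration using Python's standard library:

```python
import gzip

payload = b"identical source bytes"

# gzip embeds an mtime in the archive header, so the same input
# compressed at two different "times" yields different bytes.
a = gzip.compress(payload, mtime=1)
b = gzip.compress(payload, mtime=2)
print(a == b)  # False: same input, different archives

# Pinning the timestamp to a fixed value makes the build reproducible.
c = gzip.compress(payload, mtime=0)
d = gzip.compress(payload, mtime=0)
print(c == d)  # True: deterministic output

# The payload round-trips either way; only the header metadata differed.
print(gzip.decompress(c) == payload)  # True
```

Every tool in the build pipeline — archivers, compilers, packagers — has to make this same choice before the final artifact digest becomes stable, which is why reproducibility is hard to retrofit.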
SCM providers need strong identity mechanisms so we can associate code back to authors confidently. Git commit logs can be easily forged, and signed commits are not in common use. Even with them, you still have no idea who is on the other end of a PR, only that the signature matches. This isn't just an issue for security. It can also be a licensing nightmare if you don't know the real author or license of code you're accepting.
There is value in allowing developers to work anonymously, but there is also a cost. We need to balance this with systems that apply a higher level of scrutiny to anonymous code. We also need to allow other individuals to "vouch for" patches that they’ve examined, maybe similar to how Wikipedia handles anonymous edits.
And finally, all of this needs to be tied together in secure CI/CD systems and platforms that implement binary transparency for public packages. Putting the packaging steps in the hands and laptops of developers leaves way too large an attack surface. The ability to push a package that will run in prod is the same as having root in prod. By moving the build and upload steps into secure CI/CD systems, we can reduce the need to trust individuals.
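Binary transparency boils down to an append-only, tamper-evident log of package digests that anyone can replay and audit. Real systems (like Certificate Transparency) use Merkle trees for efficient proofs, but a toy hash chain is enough to show the core idea — this is an illustrative sketch, not any real transparency-log format:

```python
import hashlib

def chain(prev_head: str, entry: str) -> str:
    """Fold a new log entry into the running head hash."""
    return hashlib.sha256((prev_head + entry).encode()).hexdigest()

# Append some package releases to a toy transparency log.
entries = ["pkg-a@1.0", "pkg-b@2.3", "pkg-a@1.1"]
head = "0" * 64  # genesis value
log = []
for entry in entries:
    head = chain(head, entry)
    log.append(entry)

# Any auditor can replay the published entries and confirm the head.
replayed = "0" * 64
for entry in log:
    replayed = chain(replayed, entry)
print(replayed == head)  # True: the log checks out

# Silently swapping an old entry changes every subsequent head,
# so tampering after the fact is evident to anyone who replays.
tampered = "0" * 64
for entry in ["pkg-a@1.0-evil", "pkg-b@2.3", "pkg-a@1.1"]:
    tampered = chain(tampered, entry)
print(tampered == head)  # False: the rewrite is detectable
```

The point is that a registry operating such a log can't serve one version of a package to its author and a different one to you without leaving cryptographic evidence.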
OK, but What Can I Do Now?
First, start by securing your code as much as possible. Make sure you have copies of every dependency you're using stored somewhere. Make sure you review all code you're using, including OSS. Set up and mandate the use of 2FA across your organization. Publish, and actually check, the signatures and digests of the software you're using.
Log enough information in your build system so you can trace every artifact back to its sources, and every deployment back to its artifacts. Once you've done all of this, you'll be pretty far ahead of everyone else. You're not completely safe, though.
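In practice, "logging enough information" can start as simply as having the build system emit a provenance record alongside each artifact, linking the output digest to its inputs. The field names below are hypothetical — loosely the kind of thing a CI system might capture, not any standard format:

```python
import hashlib
import json

def provenance_record(artifact: bytes, source_commit: str, builder: str) -> str:
    """Emit a JSON record tying an artifact digest back to its inputs.
    Field names here are made up for illustration."""
    record = {
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        "source_commit": source_commit,
        "builder": builder,
    }
    return json.dumps(record, sort_keys=True)

# A fake binary and commit, standing in for real build outputs.
rec = provenance_record(b"fake binary bytes", "abc123", "ci.example.com/build/42")
print(rec)
```

At deploy time, the same lookup runs in reverse: given the digest of what's running in prod, the records tell you exactly which commit and which build produced it — which is the question you'll be asking at 2 a.m. during an incident.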
That's where we need to work together. If you're interested in helping out, there are many ways to get involved, and I'm sure there are a lot of efforts going on. We're just getting started on several initiatives inside the Continuous Delivery Foundation, like our new Security SIG. We're also hoping to make it easier to build and use secure delivery pipelines inside the TektonCD open source project.
We would love your help, no matter your expertise! For example, I'm far from a security expert, but I've spent a lot of time working on developer tools and CI/CD systems. Feel free to reach out to me directly if you have any questions or want to get involved. I'm on Twitter and GitHub.