Making “many eyes” a reality with automation
Many advocates of open source software claim it is easier to make secure because it has “many eyes” looking at it. It is probably true that in the limit enough eyes can catch all bugs, but there’s no evidence we’ll ever hit that limit. Thankfully, we can help add more eyes to OSS with automation. This post explains how I built a platform to hunt for malicious behavior hiding in plain sight.
The specific type of malware I’m looking for in this post embeds itself in open source packages and exploits the fact that many package managers allow packages to invoke custom code during installation. Many high-level languages need this to do things like compile native extensions, but it leaves a pretty wide attack vector open against developers.
I got the idea to use Falco for this after @DanPopNYC sent me a few links on Twitter. I had been meaning to play with Falco for a while, but never had the perfect project to try it out on. Finally an idea struck!
I’ve been working to help Jordan Wright productionize the malware scanner he built for his PyPI analysis, and I hadn’t come up with a way to easily scale out the process monitoring components. I was hoping to find something that would allow us to run each analysis job in a Kubernetes Pod to make it easier to set up ephemeral environments and scale out the workload to many machines, and Falco seemed to fit that bill perfectly. Jordan’s original code used sysdig to generate capture data during installation and to filter through the logs later, and Falco provides basically that exact feature for an entire Kubernetes cluster. I got Falco up and running in a small GKE cluster using the Helm chart, then wrote up a couple of rules to look for suspicious behavior.
My high-level plan was to install each package in an application directory, and alert on any file access that happened outside of that application directory. Some file access is expected here (things like shared libraries and common build tools), but other types of access might indicate malicious behavior. I decided to start with a broad Falco condition for anything outside a few allowed directories:
You’ll also notice Falco supports filtering based on Kubernetes metadata, like labels. I used these to cut out some of the noise.
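To make the allowlist idea concrete, here is a minimal Go sketch of the same kind of check written as plain code rather than as a Falco condition. It is not the rule I deployed: the directory list and the `isOutsideAllowed` helper are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// allowedDirs is a hypothetical allowlist; a real one would be tuned
// to the base image and the package manager under test.
var allowedDirs = []string{"/app", "/usr/lib", "/lib", "/tmp"}

// isOutsideAllowed reports whether p falls outside every allowed directory.
// Paths are cleaned first so that "/app/../etc" is not mistaken for "/app".
func isOutsideAllowed(p string, allowed []string) bool {
	clean := path.Clean(p)
	for _, dir := range allowed {
		dir = path.Clean(dir)
		// Match the directory itself or anything beneath it. The trailing
		// slash keeps "/application" from matching "/app".
		if clean == dir || strings.HasPrefix(clean, dir+"/") {
			return false
		}
	}
	return true
}

func main() {
	samples := []string{
		"/app/node_modules/left-pad/index.js",
		"/etc/passwd",
		"/app/../root/.ssh/id_rsa",
	}
	for _, p := range samples {
		fmt.Printf("%-40s suspicious=%v\n", p, isOutsideAllowed(p, allowedDirs))
	}
}
```

Cleaning the path before comparing matters here, since an install script can easily hide an escape behind `..` segments.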
The next question was how to get data out of Falco and into the place where I eventually wanted to analyze it. Falco supports a few methods here out of the box, like httpOutput and FalcoSidekick. I decided to go with httpOutput to send events to a small Go server running next to Falco. I wanted to do a bit more filtering and cleaning of the data before storing it, so I was going to have to write another program anyway.
That was pretty easy to get up and running in Go, and I deployed it into my cluster with ko. The code is basically just an http.Server that exposes a single handler. Falco sends JSON events there in real time, and the server unmarshals them into structs, groups them by pod name, and then uploads them to GCS after the job is done. You can see the full code here.
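The shape of that server can be sketched roughly as follows. This is not the real code linked above: the `falcoEvent` struct covers only a subset of fields, and I am assuming the pod name arrives under the `k8s.pod.name` key of `output_fields`. The `collector` type and its methods are hypothetical names, the GCS upload is left out, and `httptest` stands in for a real listener so the example terminates.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// falcoEvent is an assumed subset of the JSON that Falco posts.
type falcoEvent struct {
	Rule         string                 `json:"rule"`
	Priority     string                 `json:"priority"`
	Output       string                 `json:"output"`
	OutputFields map[string]interface{} `json:"output_fields"`
}

// collector groups incoming events by pod name (hypothetical type).
type collector struct {
	mu    sync.Mutex
	byPod map[string][]falcoEvent
}

func newCollector() *collector {
	return &collector{byPod: map[string][]falcoEvent{}}
}

// ingest decodes one event and files it under its pod name.
func (c *collector) ingest(raw []byte) error {
	var ev falcoEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return err
	}
	pod, _ := ev.OutputFields["k8s.pod.name"].(string)
	if pod == "" {
		pod = "unknown"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byPod[pod] = append(c.byPod[pod], ev)
	return nil
}

// count returns how many events have been stored for a pod.
func (c *collector) count(pod string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.byPod[pod])
}

// ServeHTTP is the single handler: read the body, then ingest it.
func (c *collector) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	raw, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	if err := c.ingest(raw); err != nil {
		http.Error(w, "bad event", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	c := newCollector()
	// A real deployment would call http.ListenAndServe instead.
	srv := httptest.NewServer(c)
	defer srv.Close()

	body := `{"rule":"demo","priority":"Warning","output":"demo",` +
		`"output_fields":{"k8s.pod.name":"pkg-left-pad"}}`
	resp, err := http.Post(srv.URL, "application/json", strings.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("events for pkg-left-pad:", c.count("pkg-left-pad"))
}
```

Keeping the decoding in `ingest` rather than in the handler makes the grouping logic easy to exercise without any networking.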
Next it was time to test it out! My goal was to be able to analyze multiple languages, and nothing I had done yet was language-specific. I started with a few test runs, using npm and pip on a few hundred packages each.
I got a list of the 100 most depended-upon packages from npm using their UI, and a list of the top 100 downloaded Python packages here. Then I put together a couple of short bash scripts to create a Pod for each of these in a loop using kubectl run, and watched it all work!
My tiny 3-node shared-core cluster was able to churn through each dataset in around 5–10 minutes. Adding nodes to the cluster should let it scale horizontally for bigger workloads. If you’re interested in the data, it’s all available publicly on GCS. For this initial dataset, you can browse and access the JSON results with gsutil ls gs://ossf-malware-analysis-results/v01/.
Up next: hooking this up to feed parsers to automatically run the job for new package uploads to popular registries, figuring out how to configure Falco to watch for interesting network activity, and getting this data into BigQuery or somewhere else for analysis!
I plan on posting anything interesting I find on Twitter, so follow me there for more updates.