> So how do we guard against this type of attack? How do we know this hasn't already happened to some of us? What is the potential fallout from this hack? It seems quite significant.
Verified builds. That means deterministic builds plus trusted build infrastructure. Deterministic means, roughly, that a given git commit should produce the same binaries no matter who compiles it. That requires compiler support and sometimes changes to the code.
To verify that you haven't been compromised, do a verified build from two independent roots of trust and compare the resulting binaries. Add more roots of trust to reduce the probability that all of them are compromised.
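The comparison step can be sketched roughly like this. It assumes each root of trust leaves its build output in a directory; the function names and the choice of SHA-256 are mine, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_builds(root_a, root_b):
    """Return the relative paths that are missing from one tree or differ in content."""
    a, b = digest_tree(root_a), digest_tree(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result only says the two builds agree with each other. It says nothing if both roots of trust share the same compromise, which is why adding more independent roots helps.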
Establishing a trusted root build environment is tricky because very little software has deterministic builds yet. Once more software does, it'll be much easier.
Here's my best shot at it:
Get a bunch of fresh OpenBSD machines. Don't network them together. Add some Windows machines if you're planning to use VS.
Pick 3 or more C compilers. Grab the source and verify the PGP signatures on a few machines using a few different clients. Compile each compiler as far as possible with each of the others. This won't be fully possible, because a compiler's source sometimes relies on extensions that only that compiler supports, but it's the best we can do at this point. Then build all your compilers again with each of these stage-2 compilers, and repeat until every compiler has been built through every ordering of the others (stage-N compilers for N compilers). At this point any deterministic build by a particular compiler (GCC, LLVM, VS) should match exactly, even though the compilers themselves were compiled in different orders by different compilers. This partially addresses Ken Thompson's paper "Reflections on Trusting Trust": a persistent compiler backdoor would have to be mutually compatible across many different compilers, otherwise it shows up as mismatched output from some compiler build ancestries but not others. Now you have some trusted compiler binaries.
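To make the bookkeeping concrete, here is a rough sketch of enumerating build ancestries and flagging the ones whose final compiler binary doesn't match the rest. It doesn't run any compiler. The hashes are assumed to come from your own builds, and both function names are hypothetical.

```python
from itertools import permutations


def ancestries(compilers, target):
    """Yield every ordered chain of the other compilers that ends by building `target`."""
    others = [c for c in compilers if c != target]
    for length in range(1, len(others) + 1):
        for chain in permutations(others, length):
            yield chain + (target,)


def find_suspect_ancestries(hash_by_ancestry):
    """Group ancestries by final binary hash and return those outside the largest group.

    Assumption: the largest group is treated as the baseline. That is only a
    heuristic; any mismatch at all means something needs investigating.
    """
    groups = {}
    for chain, digest in hash_by_ancestry.items():
        groups.setdefault(digest, []).append(chain)
    if len(groups) <= 1:
        return []
    baseline = max(groups.values(), key=len)
    return sorted(c for g in groups.values() if g is not baseline for c in g)
```

For example, `list(ancestries(["gcc", "llvm", "vs"], "gcc"))` gives the four chains that end in gcc. You would record the hash of the gcc binary each chain produced and pass that mapping to `find_suspect_ancestries`.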
Git repository hashes can be the root of trust for the remaining software. Using a few different git client implementations, verify that all the hashes match across the entire Merkle tree. Build the software with trusted compilers of your choice on multiple machines and verify that the results match where possible.
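You don't need a git client at all to spot-check object hashes. In a SHA-1 repository an object id is the SHA-1 of a short header (object type, a space, the body length in bytes, a NUL byte) followed by the raw body. A minimal sketch, with a function name of my own choosing:

```python
import hashlib


def git_object_id(kind, body):
    """Compute the id git assigns to an object in a SHA-1 repository.

    kind is the object type ("blob", "tree", "commit" or "tag");
    body is the raw, uncompressed object content as bytes.
    """
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()
```

Trees list the ids of their blobs and subtrees, and commits name their tree and parents, so checking a commit id this way transitively covers everything beneath it. That is what makes the commit hash usable as a root of trust, subject to the strength of the hash function.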
At this point you should have compilers, kernels, and system libraries that are most likely true to the verified source code.
Make a couple build farms and keep them administratively separate. No common passwords, ssh keys, update servers, etc. Make sure builds on both farms match before trusting the binaries.
The good news is that most of this can be done by the open source community. If everyone starts sharing the hashes of their git trees before builds and the hashes of the resulting binaries, we could start building a global consensus on which software can currently be built deterministically and, of that software, which binaries are very likely to be true translations of the source code.
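Here's a rough sketch of what tallying those shared hashes might look like. The report format (builder, tree hash, binary hash) and the `min_builders` threshold are assumptions of mine, not an existing protocol.

```python
from collections import defaultdict


def consensus(reports, min_builders=2):
    """Return {tree_hash: binary_hash} for trees that independent builders agree on.

    reports is an iterable of (builder, tree_hash, binary_hash) tuples.
    A tree is accepted only if every report for it names the same binary
    hash and at least `min_builders` distinct builders reported it.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for builder, tree_hash, binary_hash in reports:
        seen[tree_hash][binary_hash].add(builder)

    agreed = {}
    for tree_hash, by_binary in seen.items():
        if len(by_binary) != 1:
            continue  # builders disagree: non-deterministic build or a compromise
        (binary_hash, builders), = by_binary.items()
        if len(builders) >= min_builders:
            agreed[tree_hash] = binary_hash
    return agreed
```

Trees that drop out because of disagreement are the interesting ones: either the build isn't deterministic yet or somebody's toolchain is producing something different. Counting distinct builders only means something if the builders really are administratively separate.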
To me it sounds like they hacked the editor/code signing tools to insert malicious code on save/commit by devs. Having an iron-clad CI toolchain doesn't help you with that. We need to focus on how to defend the devs.
That's the point of a trusted build farm. Devs commit changes to git, and either request a build or the build farm polls for commits and builds the latest commit on trusted hardware+toolchain.
An attacker could still change the code, but the change would be detectable because git would preserve the malicious parts in the repo. It would also tie a specific malicious binary to a particular commit, making it easy to find the malicious code itself.
As long as not all developers are compromised, whoever does the code review would see the malicious code when they pull the branch to review it.
> further tie a specific malicious binary to a particular commit
Git uses SHA-1 for hashes, right? Aren't there demonstrations that SHA-1 is broken, so that in theory you could craft a replacement commit that hashes to the same value?
SHA-1 collisions are hard to produce, especially when the data you inject needs to look like code to a human and compile correctly. But the concern is valid, so it's good that git is moving toward a stronger hash function.
EDIT: https://wiki.debian.org/ReproducibleBuilds is Debian's attempt at this.