Brad
Selfish Heathen
 
Join Date: May 2004
Location: Zone of Pain
 
2021-08-07, 11:00

Quote:
Originally Posted by tomoe
Can someone ELI5 the matching hash comment and implications?

turtle already touched on some MD5 examples earlier, and I'm pretty sure I'm repeating some of the previous posts, but hopefully I can recap and expand on the whole technical situation and its concerns all at once.

Hashing is a common process that takes some file of any type and of any size and produces a new fixed-length (and usually relatively small) number. MD5 is a good example to demonstrate this hashing process because it's been around for ages and most computers have a built-in program that can make MD5 hashes. If you open your Terminal app, type "md5 " (with the space), drop any file (not a folder) into the window, and press enter, you'll see it quickly spits out something like this:

Code:
$ md5 /Users/bradsmith/Downloads/IMG_3363.JPG
MD5 (/Users/bradsmith/Downloads/IMG_3363.JPG) = 36ff331972ac66f4c555628ee19b99b5
That value "36ff331972ac66f4c555628ee19b99b5" is a number (written in hexadecimal instead of decimal) that was calculated from the file's contents. Repeating the MD5 command on the same file will always produce the same output. If you run the command on many different files, you'll see that the length of the generated number is always the same but its value changes dramatically from file to file. Even two text files that differ by only one letter produce very different hashes. For example, MD5 hashing the phrase "hello world" versus "hallo world" will produce:

Code:
$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
$ echo "hallo world" | md5
c092aa310a370d3d1b6ecf5eae0a0ce4
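Note that even though these inputs differ by only one letter, the generated hashes are totally different. This behavior is often called the "avalanche effect," and it's one of the properties expected of a "cryptographically secure" hashing algorithm, along with it being impractical to find two different inputs that produce the same hash. (MD5 itself is no longer considered secure on that second point, but it's still fine for demonstrating the idea.) Generating and comparing hashes is an essential part of modern secure computing and communications.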
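If you'd rather poke at this outside of Terminal, here's a minimal sketch using Python's standard hashlib module. The helper names (md5_hex, differing_bits) are just mine for illustration. It repeats the two echo commands above and counts how many of the 128 output bits actually change.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the 128 bits differ between two MD5 digests."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

# echo appends a newline, so it is included here to mirror the shell commands.
hello = md5_hex(b"hello world\n")
hallo = md5_hex(b"hallo world\n")

print(hello, len(hello))  # always 32 hex characters, whatever the input size
print(hallo, len(hallo))
print("bits that differ:", differing_bits(hello, hallo), "of 128")
```

For a well-behaved hash you'd expect roughly half of the bits to flip for any small change to the input, which is why the two outputs look unrelated.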

However, Apple's not just using any standard, open hashing algorithms like MD5 or SHA for this system, and some of the discussion points about MD5 don't exactly apply here.

What Apple has built for hashing appears to be much more complex than MD5 and has some interesting benefits and potential flaws. Where the MD5 hash just looks at the input as raw data and doesn't attribute any "meaning" to one part over another, Apple's hashing tries to look at the input the way we humans look at a picture: it generalizes the image content into what are effectively "features" to a human eye before it calculates an output value. In their technical overview, they give an example of a color photo of a tree and a black-and-white version of the same picture, and their algorithm gives these two images the same hash even though, as raw data, they are obviously two very different files.
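To be clear, the following is NOT Apple's algorithm, which is far more sophisticated. It's a toy "average hash" that I'm swapping in purely to illustrate how a hash can look at image content instead of raw bytes. The 8x8 "photo" and every function name here are made up for the example.

```python
import hashlib

def luma(pixel):
    """Integer brightness of an (r, g, b) pixel using common luma weights."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000

def average_hash(image):
    """Toy 'average hash': one bit per pixel, set when brighter than the mean."""
    levels = [luma(p) for row in image for p in row]
    mean = sum(levels) / len(levels)
    bits = "".join("1" if v > mean else "0" for v in levels)
    return f"{int(bits, 2):0{len(bits) // 4}x}"

def to_grayscale(image):
    """Replace every pixel with a gray pixel of the same brightness."""
    return [[(luma(p),) * 3 for p in row] for row in image]

def raw_md5(image):
    """MD5 of the raw pixel bytes, ignoring what the picture 'looks like'."""
    data = bytes(c for row in image for p in row for c in p)
    return hashlib.md5(data).hexdigest()

# Hypothetical 8x8 "photo"; a real system would first shrink a full-size image.
color = [[((x * 32) % 256, (y * 32) % 256, ((x + y) * 16) % 256)
          for x in range(8)] for y in range(8)]
gray = to_grayscale(color)

print("MD5 color:", raw_md5(color))
print("MD5 gray: ", raw_md5(gray))
print("avg color:", average_hash(color))
print("avg gray: ", average_hash(gray))
```

The two MD5 values differ because the bytes differ, while the toy content-based hash is identical for both versions because it only cares about the pattern of light and dark.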

While that is a clever way of preventing people from making slight changes to try to bypass naive tools like MD5 (changing one pixel would make a totally different MD5 hash, like my "hello world" example), it does present some possibly massive problems.

Remember that hashes are "fixed length"? That feature is a good thing because it means you can't infer much about the size of the original data that was hashed. A one-byte file's hash is exactly as long as a trillion-byte file's hash. However, there are far more possible files than there are possible hash values, so you also introduce the very real possibility of two completely unrelated files producing the same hash. A good hashing algorithm is sufficiently complex and generates a sufficiently large number to make this extremely unlikely, but since Apple's algorithm is by design trying to generalize maybe-similar images so that they generate the same output, there is a very real risk that false positives become far more likely.
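Here's a rough sketch of why the size of the output matters. I'm deliberately crippling MD5 by keeping only its first 16 bits (tiny_hash and the "file-N" names are invented for this demo), which leaves just 65,536 possible values, so unrelated inputs collide almost immediately.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> str:
    """First 16 bits of MD5: deliberately tiny so collisions are easy to find."""
    return hashlib.md5(data).hexdigest()[:4]

def find_collision():
    """Hash made-up names until two different ones share a tiny hash."""
    seen = {}
    for i in count():
        name = f"file-{i}".encode()
        digest = tiny_hash(name)
        if digest in seen:
            return seen[digest], name, digest
        seen[digest] = name

first, second, shared = find_collision()
print(first, "and", second, "both hash to", shared)
```

With only 65,536 buckets a collision is guaranteed by the 65,537th input, and in practice it shows up after a few hundred. The full 128-bit output makes accidental collisions astronomically rarer, but never impossible.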

A clever individual could reverse-engineer the algorithm and a target hash to produce a perfectly safe and innocuous image that the feature detector thinks matches a feature set that has been reported in a hash as CP. This clever individual could then distribute that image around and cause a bunch of false positive reports. Or an unsuspecting user might take a perfectly safe and innocuous photo that just happens to fall into the right part of the feature detection, and then they get a surprise visit from the FBI, or Apple gets subpoenaed to hand over their data, or they get put on some kind of watch list.
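As a very simplified illustration of that first scenario, here's a sketch against a toy "brighter than average" hash. Again, this is not Apple's algorithm, and forge and the sample data are hypothetical. Attacking a real perceptual hash would take far more work than this, but the principle is the same: once you know how the features are computed, you can build different content that lands on the same hash.

```python
def average_hash(levels):
    """Toy average hash over a flat list of brightness values (0-255)."""
    mean = sum(levels) / len(levels)
    return "".join("1" if v > mean else "0" for v in levels)

def forge(target_bits, dark=60, bright=200):
    """Build new brightness values whose toy hash equals target_bits."""
    return [bright if b == "1" else dark for b in target_bits]

# Stand-in for the "reported" image: 64 arbitrary brightness values.
original = [(x * 37 + 11) % 256 for x in range(64)]
target = average_hash(original)

# A completely different set of pixels built only from the target hash.
forged = forge(target)

print("same hash:  ", average_hash(forged) == target)
print("same pixels:", forged == original)
```

The forged data shares no pixel values with the original beyond coincidence, yet the toy hash can't tell them apart.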

In turtle's earlier example, his photo could be confused for Hitler (sorry, but you invoked Godwin's Law here first!) due to some unknown-to-us arrangement of features in the photos, even though the raw data making up the image is completely different from that of the "matching" photo of the führer.

Fooling AI/ML-based image processing systems is already a small but growing area of interest. Disrupting self-driving car systems is a related area that's been getting lots of research and press in recent years. It's only a matter of time before people try to figure out and exploit this system too.
