
Deduplication is a hard problem: what constitutes a "dupe" is somewhat arguable.

My first approach just used metadata, but that doesn't aggregate files stripped of their original metadata, like what you get from a Google Takeout, so I had to add image hashing as well. I actually generate three mean hashes in L*a*b color space for PhotoStructure (many hashes ignore color). I've also found that metadata needs to be normalized, including captured-at time and even exposure metadata. It's a lot of whack-a-mole, especially as new cameras and image formats are released every year.
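
For anyone curious, here's roughly what a per-channel mean hash in L*a*b looks like. This is just a hedged sketch, not PhotoStructure's actual code: the skimage/Pillow approach, HASH_SIZE, and the helper names are all my own assumptions for illustration.

    # Hypothetical sketch: one 64-bit mean ("average") hash per L*, a*, b*
    # channel, so hue/chroma differences aren't ignored the way they are by
    # grayscale-only hashes.
    from PIL import Image
    import numpy as np
    from skimage.color import rgb2lab

    HASH_SIZE = 8  # 8x8 grid -> 64 bits per channel

    def lab_mean_hashes(path: str) -> tuple[int, int, int]:
        """Return one 64-bit mean hash per L*a*b channel."""
        img = Image.open(path).convert("RGB").resize(
            (HASH_SIZE, HASH_SIZE), Image.Resampling.LANCZOS
        )
        lab = rgb2lab(np.asarray(img) / 255.0)  # shape (8, 8, 3)
        hashes = []
        for ch in range(3):
            plane = lab[:, :, ch]
            bits = (plane > plane.mean()).flatten()  # 1 if above channel mean
            hashes.append(int("".join("1" if b else "0" for b in bits), 2))
        return tuple(hashes)

    def hamming(a: int, b: int) -> int:
        """Number of differing bits between two 64-bit hashes."""
        return bin(a ^ b).count("1")

Two files are then hash-dupes when the Hamming distance for all three channels falls under some small threshold you pick.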

I described more about what I've written for PhotoStructure (which does deduplication for both videos and images) here: https://photostructure.com/faq/what-do-you-mean-by-deduplica... -- it might help you avoid some of the pitfalls I've had to overcome.



Thank you. It looks interesting. I was heading to much the same place with the order of precedence for matches, while quietly wondering if there's a class of bad edit that makes the post-modified file bigger rather than smaller. Seems unlikely, but not impossible.
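
To make "order of precedence" concrete, I had something like this in mind. Purely hypothetical, not PhotoStructure's actual rules; the Candidate fields and the ordering are just one illustration.

    # Hypothetical sketch of a "which copy do I keep?" precedence key.
    # Larger tuple wins.
    import os
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        path: str
        width: int
        height: int
        has_original_metadata: bool  # e.g. captured-at EXIF still present

    def keep_key(c: Candidate) -> tuple:
        return (
            c.width * c.height,          # prefer more pixels
            c.has_original_metadata,     # prefer copies that kept their EXIF
            os.path.getsize(c.path),     # then the larger file
            -os.path.getmtime(c.path),   # then the older copy
        )

    def best_copy(candidates: list[Candidate]) -> Candidate:
        return max(candidates, key=keep_key)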

A lot of my dupes are Google dupes, but spread across about 4 cameras with a mixture of original and compressed sizes.

A lot of my local copies had jhead run on them to "fix" the time, so they have modified EXIF.
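
That's exactly the case where hashing the decoded pixels instead of the file bytes helps. Rough sketch only; pixel_digest is a made-up helper, not from jhead or any of the tools above.

    # Hypothetical sketch: when only EXIF was rewritten (e.g. by jhead),
    # the decoded pixels are usually still identical, so hashing the pixel
    # data rather than the whole file still catches these as dupes.
    import hashlib
    from PIL import Image

    def pixel_digest(path: str) -> str:
        """SHA-256 of the decoded RGB pixel data, ignoring all metadata."""
        with Image.open(path) as img:
            return hashlib.sha256(img.convert("RGB").tobytes()).hexdigest()

    # Two copies that differ only in EXIF compare equal here:
    # pixel_digest("IMG_1234.jpg") == pixel_digest("IMG_1234_jhead.jpg")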

A small number have me playing with IPTC to try to auto-name things for tag matching.

Your program looks to be the one which understands the corner cases.



