
You don’t have to screenshot every page. Convert each page of the PDF to a PNG/TIFF image and OCR those. This is very easy to automate. If this works at the level of Unicode code points, you’re not blocking OCR, you’re obfuscating text. Anything that renders the PDF to a raster format will produce an OCR-able document.
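
To make "very easy to automate" concrete, here is a rough sketch of the render-then-OCR loop. It assumes `pdftoppm` (from poppler-utils) and `tesseract` are installed and on the PATH. The function names, the 300 DPI default, and the `eng` language are my own choices for illustration, not anything taken from the tool being discussed.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    # Passing "stdout" as the output base makes tesseract print the text
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    # "page-07.png" -> 7 (pdftoppm zero-pads when there are many pages)
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize every page, OCR each image, return one string per page."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf_path, prefix, dpi), check=True)
        images = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        return [
            subprocess.run(
                ocr_command(img, lang),
                check=True,
                capture_output=True,
                text=True,
            ).stdout
            for img in images
        ]
```

Something like `ocr_pdf("exhibit.pdf")` (a hypothetical filename) would then give back the text page by page, no matter how the glyphs were mapped inside the PDF, since OCR only ever sees pixels.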

If you’re a divorce attorney who used this to convert documents in response to a discovery request, and the opposing side had a valid reason for needing the unobfuscated text, then you’re probably going to end up having a nice conversation with the judge about acceptable formats.

Sending compressed TIFFs would probably be just as good. The files would be a bit larger, but it would be just as effective at stopping automated scraping of text, and less likely to piss off a judge. Any opposing firm sophisticated enough to automate scraping the text from a normal PDF would be able to OCR these files just as easily.

Or maybe you have a second site that sells the decoder, so you get to sell to both sides. Not a bad business model, if you can work it.


