
Web scraping is generally a legally precarious proposition.

Copies of content stored in RAM count as "copies" for copyright purposes, which means that just downloading a page without authorization, even before you've done anything with it, is an act of infringement. If you download 1,000 pages from a site and it's determined that the site's license doesn't apply to you, you've committed 1,000 acts of copyright infringement just by holding them in RAM, before any question of "derivative works" even arises.

Then there is the issue of storing a page's contents for analysis and re-hosting portions of it for display. For image search, Perfect 10 v. Amazon held that although Google's storage and re-hosting of images were unauthorized copies, they were a "fair use" under copyright law and therefore not infringing. Because fair use is an affirmative defense decided case by case, this is not a very portable ruling, and it should be expected to be applied unevenly (and already has been with respect to RAM copies, at least in Ticketmaster v. RMG). A judge may find your non-Google image search non-transformative, and therefore unfair and infringing. If Google had been sued on this point when it was a smaller company, it almost surely would have lost.

Then there is the CFAA, which makes it unlawful to access any server "without authorization" or in a way that "exceeds authorized access". The CFAA prescribes both criminal and civil penalties. Aaron Swartz was being prosecuted under this law for scraping publicly-funded research out of a paywalled database.

It can be, and frequently is, argued that accessing a site with an "automated method" like a scraper automatically "exceeds authorized access" because of clauses in the Terms of Service that forbid users from doing so. This argument is often successful, though there are notable exceptions.

On the flip side, robots.txt has been treated as legitimate authorization for CFAA purposes in some cases, so if you always obey robots.txt, you will at least have that to fall back on. Search engines are well understood, and most site owners probably won't object to your indexing their site unless your scraper is so aggressive that it causes performance problems. A search engine usually doesn't need to crawl any single site in particular to be successful, so if someone really doesn't want you in there, you can just move on to the next one.
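The "always obey robots.txt" part is mechanical to implement. A minimal sketch using Python's standard-library urllib.robotparser (the bot name, URLs, and sample rules below are hypothetical examples, not anyone's real policy):

```python
# Sketch: check robots.txt before fetching, with urllib.robotparser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In a real crawler you would fetch the live file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse a hypothetical sample inline to keep the sketch self-contained.
sample = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""
rp.parse(sample.splitlines())

agent = "MyCrawler"  # hypothetical user-agent string
print(rp.can_fetch(agent, "https://example.com/articles/1"))    # True
print(rp.can_fetch(agent, "https://example.com/private/data"))  # False
print(rp.crawl_delay(agent))  # 10 -- seconds to wait between requests
```

Honoring the Crawl-delay value (when present) is also the easy way to avoid the "super-aggressive scraper causes performance problems" scenario above.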

Google was designed for a self-published, late-90s web where the whole internet was accessible via URL. A modern search engine would recognize that the common man's voice is no longer found through links on common-mans-site.com, but rather through his social media footprint on sites like Facebook and Twitter, and that accessing those profiles is critical to assessing what normal people actually find valuable online. The old method leaves you stuck in corpo-vision, since almost no one has a personal web site anymore.

Now, if Facebook or Twitter learned that accessing their data streams was critical to a major competitor's algorithm, they could sue under this cluster of anti-scraping laws and get access permanently shut off (unless, of course, you're big enough that the judges decide everything you do is fair and authorized, as they did in Perfect 10 v. Amazon, but you can't expect that to apply). If Facebook and/or Twitter wanted to launch a web search product, I have every confidence that they would have access to a much better dataset than Google.

This has a lot of negative implications for the future of the internet as an open platform, and drives the momentum back toward the bad old days of a keyword-based web controlled by a few central services.

I am not a lawyer and this is based on my layman's understanding. Consult an experienced lawyer specialized in this area before you believe anything I say.



Thanks for the detailed response. It would be nice to have some way to "iterate quickly" (and cheaply) in the legal realm, as a lot of laws seem to have unintended consequences, probably because they are not foreseeable by the folks writing the law.


I also enjoyed cookiecaper's post, thank you for sharing. Is it possible to compete against the Facebook/Twitter inbound data streams and against Google's corpo-vision if you're operating from another country with less restrictive laws? Geo-fences are the most annoying political invention ever, we know that, but by obeying unhealthy laws, we encourage them. Not every law makes sense and should be obeyed blindly.



