As a web data solutions company, we often encounter questions about the legality of web scraping. Before answering that question, let’s first understand the term “web scraping”. Simply put, it is the part of web crawling (finding web pages and downloading them) that extracts data from those pages to gather relevant information. The key factor is that a bot (similar to Googlebot) performs this activity in an automated fashion, thereby eliminating manual work by a person. When bots hit web pages to grab content, they behave much like a browser requesting those pages. So why is there so much hoopla around “scraping”? It can be attributed primarily to disrespect for the established protocols.
Here are some of the ground rules that must be followed by anyone looking to crawl data from the web:
- Robots.txt file
This file specifies how a website would like to be crawled. It lists the accessible pages, the restricted pages, and the request limits, along with the bots that are explicitly allowed or blocked from crawling. Check out this post to learn more about reading and respecting the robots.txt file.
Another important checkpoint is the terms & conditions page, which spells out how the data may be gathered and used, along with other guidelines. Ensure that you’re not violating anything mentioned on this page.
- Public content
Unless you have permission from the site, stick to the data that is available to the public. This means if the data can be accessed only by logging in, it is meant for the site users, not for the bots.
- Crawl frequency
The robots.txt file may specify the crawl frequency and the rate at which bots can hit the site. You must stick to these limits, and if none are mentioned, the onus is on you to ensure that the site’s server is not overloaded by hits. This keeps the scraper polite, so that the server does not exhaust its resources and fail to serve its actual users.
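To make the ground rules above concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It checks which URLs a bot may fetch and works out how long to wait between requests. The sample robots.txt content, the `example-bot` user agent, the example.com URLs, the helper names, and the 5-second fallback delay are all our own assumptions for illustration, and not every site declares `Crawl-delay` or `Request-rate`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

# Assumed fallback delay (seconds) when the site states no limit.
DEFAULT_DELAY_SECONDS = 5.0


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str) -> float:
    """Seconds to wait between requests for this user agent."""
    delay = parser.crawl_delay(user_agent)
    if delay is not None:
        return float(delay)
    rate = parser.request_rate(user_agent)
    if rate is not None and rate.requests > 0:
        # e.g. "Request-rate: 1/10" means 1 request per 10 seconds.
        return rate.seconds / rate.requests
    return DEFAULT_DELAY_SECONDS


def plan_crawl(parser, user_agent, urls):
    """Return the URLs the bot may fetch and the delay to use."""
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return allowed, polite_delay(parser, user_agent)


def crawl(parser, user_agent, urls, fetch):
    """Fetch allowed URLs one by one, pausing between requests.

    `fetch` is any callable that downloads a single URL.
    """
    allowed, delay = plan_crawl(parser, user_agent, urls)
    for index, url in enumerate(allowed):
        if index > 0:
            time.sleep(delay)
        fetch(url)
```

For a live site, you could call `set_url()` and `read()` on the `RobotFileParser` instead of parsing an inline string. Note that this sketch covers only robots.txt; the terms & conditions page and the public-content rule still need to be checked by a person.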
Apart from these mandatory rules, there are other best practices for web scraping, which have been covered in this post. Coming back to our first question, i.e., whether web scraping is legal or not: we can safely say that if you’re adhering to the above-mentioned rules, you’re within the legal perimeter. Still, you must get this verified by a lawyer to be completely on the safe side. There have been several lawsuits and legal disputes over scraping, such as Facebook vs. Pete Warden, Associated Press v. Meltwater U.S. Holdings, Inc., Southwest Airlines Co. v. BoardFirst, LLC, and more.
That said, there is a larger question: should powerful companies that host petabytes of publicly available data (especially user-generated data) be selective about who gets access to it? This question arises from the recent legal proceedings involving LinkedIn (owned by Microsoft) and hiQ Labs. For the uninitiated, hiQ Labs is a startup that was scraping data from public profiles on LinkedIn to train its machine learning algorithms. In May, LinkedIn sent a cease and desist (C&D) letter to hiQ instructing it to stop scraping data from its social network. The letter cited several cases, including Craigslist Inc. v. 3Taps Inc., in which the verdict went against 3Taps, which was found in violation of the Computer Fraud and Abuse Act for bypassing the IP-blocking techniques deployed by Craigslist. We should also note that LinkedIn had implemented technical measures to prevent hiQ from accessing the public data. However, hiQ Labs responded by filing a suit against LinkedIn in June, alleging that LinkedIn violated antitrust laws.
One of the major issues raised by hiQ concerns LinkedIn’s anticompetitive practices: hiQ states that LinkedIn wanted to roll out its own analytics and data science solutions, which hiQ’s offerings might undercut. hiQ also states that LinkedIn had known about it for years and that LinkedIn had even accepted an award from hiQ at a data analytics conference.
Coming to the crux of the issue, no “authorization” is required to access the public profile pages on LinkedIn. Hence, LinkedIn’s claim that scraping this data may breach the Computer Fraud and Abuse Act by bypassing an authentication requirement doesn’t have a strong foundation. What makes this case special is that hiQ is scraping only data that is publicly available, whereas in other cases the scrapers breached users’ privacy or used data without notice. Consider the manual alternative: anyone could click through every profile, copy all the information, and then feed the data into a computing system. Although theoretically feasible, this is an inefficient and error-prone way to collect data, as it would demand enormous time and manpower. That’s the primary reason we have programmable bots to do this task in an automated and repetitive fashion.
LinkedIn allows search engines to crawl and index its public pages to promote its network. Then why shouldn’t other applications and websites get a level playing field by benefiting from the same data? Thus, the point to ponder is: do powerful companies have the right to stop robots from scraping public data from their websites? Moreover, when the data has been made public by the users, how can the platform go so far as to claim the right to block others from accessing it?
Although the case is far from over, the latest ruling says that hiQ and its algorithms are free to crawl the data and LinkedIn has to allow it. The judge seemed receptive to hiQ’s argument that its collection of public data could be an activity protected by the First Amendment, and gave the following order:
To the extent LinkedIn has already put in place technology to prevent hiQ from accessing these public profiles, it is ordered to remove any such barriers.
Here is the link to download the copy of the court order if you are interested in learning more.
For now, we can consider this battle and the court’s latest response a victory for free speech for the players in the data solutions business. It also lays the groundwork for internet companies that could otherwise have become entangled in criminal cases for accessing web pages that are public for the entire world to see. The ball is now in LinkedIn’s court, and this might very well turn into a free-speech argument.
The final verdict will go beyond LinkedIn and hiQ Labs and could set the precedent on just how much control businesses will have over publicly available data that is hosted by their services. We believe that there should be absolutely no restriction on access to public data over the internet, and innovation must not be restrained by legal strong-arming or pursuing the anti-competitive agenda of a small group of powerful companies.