Stairway to heaven, if you’re in the business of web scraping, that is.
It Is Legal to Scrape Public Web Data. There is a massive amount of data available in the public domain of the web, yet little has been done to date to put it to use. That is changing: service companies now provide data as a service or build solutions that are backed by data. If you want to know the prices of 20,000 items across 5 different websites, there are services that can help you with that. Whether you are hiring recruits or deciding what price to list your house at, web scraping can help. However, even though web scraping usually involves collecting data from the open Internet, many companies are opposed to it. Why? They claim their users' data as their own and insist that they alone have any right to it. A strong push for free and open access to public data was seen recently in the hiQ vs. LinkedIn case.
Scraping data proved daunting for hiQ Labs, a data analytics company that had been scraping publicly accessible data from LinkedIn. LinkedIn chose to invoke the Computer Fraud and Abuse Act (CFAA) and accused hiQ of accessing the information “without authorization”. However, in a landmark move, the U.S. Ninth Circuit Court of Appeals ruled in favor of hiQ Labs, thus paving the way for the “open internet”.
hiQ vs. LinkedIn
The CFAA is a federal cyber-security law that was created to prevent hacking of government security systems “without authorization”. But the vagueness of the term “authorization” meant that companies could mold it to fit their own needs whenever necessary, as in the hiQ vs. LinkedIn case. What hiQ did was simple: it used scraped data to create HR-related analytics products. For instance, Keeper identified employees who were likely to leave, while Skill Mapper assessed employees and found skill gaps in the workforce. But then LinkedIn launched a similar set of products in 2017, and that is when things started going south.
On May 23, 2017, LinkedIn sent a cease and desist letter to hiQ demanding that hiQ stop scraping data from its site. Two weeks later, hiQ filed suit for injunctive relief against LinkedIn.
It was clear to the court that hiQ would not survive as a company without the data from LinkedIn. Furthermore, the data on LinkedIn was publicly available, as users had not kept the information behind a password. “There is little evidence that LinkedIn users who choose to make their profiles public actually maintain an expectation of privacy,” the court said.
hiQ also claimed tortious interference with contract: LinkedIn was simply trying to market its own products while throwing its competitor under the bus. While LinkedIn deemed the aggressive competition legal, the court did not.
LinkedIn tried to play the CFAA card. According to the law, “whoever… intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains … information from any protected computer … shall be punished” by fine or imprisonment. Further, “any person who suffers damage or loss by reason of a violation” of that provision may bring a civil suit “against the violator to obtain compensatory damages and injunctive relief or other equitable relief.”
However, the data was not protected by a user ID and password, and hence LinkedIn’s argument fell apart. The court ruled that the CFAA did not apply to the case. The data was public; no unlawful “breaking and entering” took place.
The problem with CFAA
While the ruling is a major win for data analytics, it also sheds light on an earlier Ninth Circuit case that has managed to blur the reach of the CFAA: Facebook v. Power Ventures, a ruling that was also cited in LinkedIn’s cease and desist letter.
Power Ventures was a company that allowed an individual to manage all their social media accounts from one place. Unlike the LinkedIn case, where the data was publicly available, Power Ventures would ask for consent from the user. Therefore, it was the user, and not Facebook, that granted Power Ventures access to the data. Hence, though the company was “within authorization” in a way, it was still found to violate the CFAA.
Therein lies the trouble with the CFAA. While in theory it should prevent hacking, it has become little more than a tool for major corporations. Every large enterprise interprets the law in its own way and uses it to its advantage. Power Ventures was just an add-on feature that users chose for themselves; hiQ created analytical products in a space that LinkedIn had set its eyes on. Since the bigger companies wanted these third parties off their turf, they called on the mighty CFAA.
While the court has made it harder to invoke the CFAA whenever a company sees fit, it has still not shut the door completely. The more recent Stackla v. Facebook involved yet another company that landed in controversy over web scraping.
With new cases popping up every now and then, it will eventually fall to the courts to clarify the CFAA and terms like “without authorization”. Data is present everywhere, and drawing a distinction between the legal and the illegal is of prime importance. A monopoly on data would be dangerous for innovation, and in the fast-paced world of the Internet, innovation is everything.
With the win in its bag, hiQ has cleared the path for the application of open web data. Web crawling and extraction is the cheapest way to gather data, and for far too long it has been viewed with skepticism. One must understand that the only way small and big companies can compete on a level playing field is if the Internet and the data present on it remain free for all to use.
Can Google claim that the data it shows for a search result is its own? Can Wikipedia stop us from learning from its pages? After all, most of the information available in the public domain of the internet belongs to individuals or the market, and no company can claim to have a monopoly over it. What companies can compete on instead is how well they use the data and what services they create. These services can digest open data and produce valuable output that businesses can use.