Data Crawling Ethics and Best Practices

Janet Williams

February 5, 2015
Blog, Data

The boundary between ethical and unethical practice in data scraping and crawling keeps shifting, which makes it a sensitive issue. What counts as ethical data crawling is subjective and open to debate. In certain cases, seeking ethical approval before crawling data may be a legal requirement. One reason is that web users cannot see the crawlers and other programs that automatically download their browsing information over the internet. Despite that, crawling is increasingly used by people from many fields, including web users, creators of email spam lists, and others looking to profit from the information.

Ethical procedures can be set up to assess suspicious and recurrent activities. Some activities may not be directly harmful, yet they can lead to problems in the long run. In such cases, a set of guidelines or a code of professional conduct must be established so that offenders can be penalized.

Technology can be a boon or a bane depending on how it is used. There is a growing consensus that data scraping has created new moral problems that take considerable intellectual effort to solve. Web crawlers are now used by a wide range of service providers, and as a result a large share of internet users is affected by them. Academic researchers use crawlers to track the web, companies use them to gather information about their customers and potential market segments, and many others use them as well. However, crawling brings no direct benefit to the owners of the websites being crawled, and it may carry commercial and other disadvantages for them.

Programmers can control several aspects of a crawler's operation. For instance, the number of URLs visited per second can be set in code. Other aspects of web crawling are outside the programmer's control. For instance, available network bandwidth limits the speed at which a crawler can download web pages.
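As an illustration of controlling the number of URLs visited per second, here is a minimal sketch of a rate limiter in Python. The class name `RateLimiter` and its parameters are hypothetical and chosen only for this example; a real crawler would call `wait()` before each request.

```python
import time


class RateLimiter:
    """Allow at most `max_per_second` calls to wait() per second (illustrative)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._next_allowed = None    # no request has been made yet

    def wait(self):
        """Pause until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()
        # The following request may start one interval after this one.
        self._next_allowed = now + self.min_interval
        return delay


# Usage: at most 2 URLs per second.
# limiter = RateLimiter(max_per_second=2)
# for url in urls:
#     limiter.wait()
#     fetch(url)  # fetch() is a placeholder for the crawler's download step
```

A fixed interval between requests is the simplest choice; it caps the load the crawler places on a server regardless of how fast the network allows pages to be downloaded.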

Data Crawling May Cause the Following Issues

1. Denial of Service

A major concern for website owners is that crawling may slow down the web server through repeated requests for web pages. It may even use up limited bandwidth. The effect is quite similar to an attack by a virus. Increased use of a limited resource by an arbitrary source degrades the service for everyone else. A server busy answering such requests may be slow to respond to other users, which defeats its primary purpose of serving genuine customers.

2. Cost

Web crawlers impose costs on the owners of the websites they crawl by using up their bandwidth allocation. Different web hosts offer different server facilities and charge for them in different ways.

Some hosts advertise unlimited web space together with limited bandwidth, so crawling an entire website can cause problems quickly. Hosts that restrict the amount of web space available may pose challenges as well. Exceeding the bandwidth allowance can have serious consequences, ranging from excess charges to having the website disabled.
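To make the bandwidth concern concrete, the following sketch estimates how much of a site's monthly allowance a full crawl would consume. All figures (page count, average page size, allowance) are hypothetical numbers chosen for illustration, not measurements from any real host.

```python
GB = 10**9  # decimal gigabyte, as commonly used in hosting plans


def crawl_transfer_bytes(page_count, avg_page_bytes):
    """Approximate data transferred when every page is downloaded once."""
    return page_count * avg_page_bytes


def share_of_allowance(transfer_bytes, monthly_allowance_bytes):
    """Fraction of the monthly bandwidth allowance used by the crawl."""
    return transfer_bytes / monthly_allowance_bytes


# Hypothetical site: 50,000 pages averaging 100 kB, on a 10 GB/month plan.
transfer = crawl_transfer_bytes(page_count=50_000, avg_page_bytes=100_000)
share = share_of_allowance(transfer, 10 * GB)
print(f"Crawl transfers {transfer / GB:.1f} GB, "
      f"{share:.0%} of the monthly allowance")
```

Under these assumed numbers a single complete crawl would use half of the owner's monthly bandwidth, which is why an estimate of this kind is worth making before crawling a whole site.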

3. Privacy

Information that appears on the web is generally treated as public. Even so, privacy may be invaded when that information is used in certain ways. For instance, spam lists may be generated from email addresses found in web pages, and internet directories may be generated automatically. Some argue that informed consent is needed; others disagree and emphasize the complexity of the issue.

4. Copyright

Crawlers may be involved in illegal activity because they make copies of copyrighted material without the owner’s permission. Copyright infringement is one of the most important legal issues that search engines need to address. A particular problem with the Internet Archive is that it makes copies of web pages freely available. Website owners use the robots.txt mechanism to keep their sites out of the archive.
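As a sketch of how a crawler can honor the robots.txt mechanism, the example below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent names `example-archive-bot` and `research-bot`, and the example.com URLs are all hypothetical; a real crawler would load the file from the target site, for example with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archiving crawler is excluded from the whole
# site, and every other crawler is excluded from /private/ only.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The excluded archiving crawler may not fetch anything.
print(parser.can_fetch("example-archive-bot", "https://example.com/index.html"))
# Other crawlers may fetch public pages but not the /private/ area.
print(parser.can_fetch("research-bot", "https://example.com/index.html"))
print(parser.can_fetch("research-bot", "https://example.com/private/a.html"))
```

Checking `can_fetch()` before every request is what lets a site owner's opt-out take effect; the file itself is only a request and has no force unless the crawler consults it.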

Guidelines for Crawler Owners

Researchers’ crawlers are designed to behave ethically in a pragmatic way, but they are grouped together with unethical crawlers because both violate the existing guidelines. A new set of guidelines is therefore needed that serves both types of crawler operation and takes privacy issues into account.

At present, a wide range of web hosting packages is available, and their technological capabilities change constantly. A fixed list of rights and wrongs would therefore become outdated quickly. A framework that maximizes utility and minimizes negative impact will help researchers make sound decisions about crawling the web. Crawl parameters should be decided per crawl and per site rather than through a uniform code of conduct.
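One way to express per-site crawl parameters in code is a small lookup table with a cautious default, sketched below. The `CrawlPolicy` fields, the host names, and all numeric values are hypothetical and only illustrate the idea of deciding parameters site by site.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CrawlPolicy:
    delay_seconds: float  # pause between requests to the same site
    max_pages: int        # upper bound on pages fetched in one crawl


# Cautious default for sites about which nothing is known (assumed values).
DEFAULT_POLICY = CrawlPolicy(delay_seconds=10.0, max_pages=500)

# Per-site overrides, e.g. gentler settings for a small, low-bandwidth site.
SITE_POLICIES = {
    "small-site.example": CrawlPolicy(delay_seconds=30.0, max_pages=100),
    "large-portal.example": CrawlPolicy(delay_seconds=2.0, max_pages=10_000),
}


def policy_for(url):
    """Return the crawl policy for the host in `url`, or the default."""
    host = urlsplit(url).hostname or ""
    return SITE_POLICIES.get(host, DEFAULT_POLICY)


print(policy_for("https://small-site.example/about"))
print(policy_for("https://unknown-site.example/"))
```

Keeping the parameters in data rather than hard-coding them makes it easy to adjust a single site's settings when its owner or hosting conditions call for it.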

Web crawling involves a number of participants whose needs must be taken into account:

The owner of the web site
The hosting company
The crawler operator’s organization
Users of the resulting data

A crawler operator should maximize the potential benefits of a crawl and minimize its disadvantages.

The following guidelines for ethical crawling are based on Koster’s (1993) recommendations:

Keep in mind the social implications of the information extracted, especially the privacy implications of information aggregated during crawling
Look for alternative sources that can meet your needs, such as the Google API and the Internet Archive
Avoid crawling websites for teaching or training purposes unless it is justifiable or necessary
Keep in mind the financial implications for the website owners
Do not take advantage of naïve site owners who will be unable to identify the cause of their bandwidth charges
Be prepared to pay for the crawling costs if requested
Acquire an in-depth understanding of the cost implications of crawling big sites, small sites, and everything in between
Balance the costs and benefits of each web crawling project and ensure that the social benefits outweigh the disadvantages
Email webmasters of large enterprises to inform them about the crawl so that they can opt out if they want

These guidelines seem appropriate for most types of crawling, with the exception of web competitive intelligence, where the main objective is to gain a strategic advantage over competitors. If this type of crawling becomes a problem, it needs to be addressed through legal means. Legislation becomes essential if crawling grows to the point where it invades privacy.