Data Crawling: Ethics & Good Practices
The shifting boundary between ethical and unethical practices in Data Crawling is a sensitive issue. Deciding what is ethical in Data Crawling is subjective and open to debate, and in some cases seeking ethical approval is a legal requirement. One difficulty is that web users cannot notice crawlers and other programs that automatically download their information over the internet. Despite that, Data Crawling is increasingly used by people from many fields, including ordinary web users, creators of email spam lists and others looking for monetary gain from information.
Ethical procedures can be set up to assess suspicious and recurrent activities. Some activities may not cause direct harm but may still lead to problems in the long run. For such cases, a set of guidelines or a code of professional conduct should be established so that offenders can be held to account.
Technology can be either a boon or a bane depending on how it is used. There is a growing consensus that Data Crawling has created new moral problems that require considerable intellectual effort to solve. Data crawlers are now used by a wide range of service providers, so a large share of the internet subscriber base is affected by them. They are used in particular by academic researchers to track the web and by companies to gain information about their customers and potential market segments, among others. However, Data Crawling brings no direct benefit to the owners of the websites being crawled, and may bring commercial and other disadvantages to them.
Several aspects of Data Crawling can be controlled by the programmer. For instance, the number of URLs visited per second can be set in the program. Other aspects are outside the programmer's control. For instance, limited network bandwidth restricts the speed at which web pages can be downloaded.
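As a rough illustration of how the request rate can be set in the program, the sketch below caps the number of requests per second. The class name `RateLimiter` and its parameters are hypothetical, and a real crawler would likely need per-host limits as well.

```python
import time


class RateLimiter:
    """Caps how many requests per second a crawler issues (hypothetical helper)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        # Minimum number of seconds that must separate two requests.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would call `wait()` on a shared `RateLimiter` instance immediately before each page request, so that lowering `max_per_second` directly reduces the load placed on the server.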
Data Crawling may cause the following issues:
Denial of Service
A major concern of website owners is that web crawling may slow down the web server through repeated requests for web pages, and may even use up limited bandwidth. The effect is similar to that of a virus attack. Increased use of a limited resource by an arbitrary source degrades the service: a server busy responding to a crawler may be slow to respond to other users, undermining its primary objective of serving genuine customers.
Cost
Web crawlers impose costs on the owners of the websites they crawl by using up their bandwidth allocation. Different web hosts offer different server facilities and charge in different ways.
Some hosts advertise unlimited web space but limited bandwidth, in which case crawling an entire website can quickly cause problems. Hosts that restrict the amount of web space available may face challenges as well. Exceeding the bandwidth allocation can have serious consequences, ranging from excess charges to having the website disabled.
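The scale of the bandwidth problem can be estimated with simple arithmetic before a crawl starts. The sketch below is only illustrative: the function name and all figures (page count, page size, monthly allowance) are assumptions, not measurements from any real site.

```python
def crawl_bandwidth_share(num_pages, avg_page_bytes, monthly_allowance_bytes):
    """Return the fraction of a site's monthly bandwidth one full crawl would use."""
    if monthly_allowance_bytes <= 0:
        raise ValueError("monthly_allowance_bytes must be positive")
    return (num_pages * avg_page_bytes) / monthly_allowance_bytes


# Hypothetical site: 20,000 pages of 50 KB each, hosted with a 1 GB monthly allowance.
share = crawl_bandwidth_share(20_000, 50_000, 1_000_000_000)
print(f"{share:.0%}")  # prints "100%"
```

Under these assumed figures a single complete crawl would consume the whole monthly allowance, leaving nothing for the site's ordinary visitors.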
Privacy
Everything that appears on the web is publicly accessible, so the privacy issue may seem clear-cut. Privacy may still be invaded, however, if web information is used in certain ways. For instance, spam lists may be generated from email addresses found in web pages, and internet directories may be generated automatically. Some argue that such uses require informed consent; others disagree and emphasize the complexity of the issue.
Copyright
Crawlers may be involved in illegal activity because they make copies of copyrighted material without the owner's permission. Copyright infringement is one of the most important legal issues that search engines need to address. A particular problem with the Internet Archive is that it makes copies of web pages freely available. Website owners can use the robots.txt mechanism to keep their sites out of the archive.
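A crawler can honour the robots.txt mechanism by checking each URL before requesting it. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the agent names `ExampleArchiveBot` and `ResearchCrawler`, and the `example.com` URLs are all hypothetical; a real crawler would fetch the site's actual robots.txt file rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archiving crawler is excluded from the whole
# site, and every other crawler is excluded from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The excluded archiver may not fetch any page.
archive_allowed = parser.can_fetch("ExampleArchiveBot", "https://example.com/page.html")

# Any other crawler may fetch public pages but not the private section.
public_allowed = parser.can_fetch("ResearchCrawler", "https://example.com/page.html")
private_allowed = parser.can_fetch("ResearchCrawler", "https://example.com/private/data.html")

print(archive_allowed, public_allowed, private_allowed)  # prints "False True False"
```

Note that robots.txt is a voluntary convention: it only protects a site from crawlers that choose to check it.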
Guidelines for crawler owners
Researchers' crawlers are generally designed to behave ethically in a pragmatic way, yet they are grouped together with unethical crawlers because both may violate the existing guidelines. A new set of guidelines is therefore needed that serves both types of crawler operation while taking privacy issues into account. A wide range of web hosting packages is currently available, with constantly changing technological capabilities, so a fixed list of rights and wrongs would quickly become outdated. A framework that maximizes utility and minimizes negative impact will instead help researchers make sound decisions about web crawling. Decisions about crawl parameters must be made on a per-crawl and per-site basis rather than through a uniform code of conduct.
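One way to make crawl parameters site-specific rather than uniform is to keep a table of per-site settings with a conservative default. The sketch below is only an assumption about how this might look: the `CrawlPolicy` fields, the host names and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlPolicy:
    """Crawl parameters chosen for one site (hypothetical structure)."""
    max_pages: int        # upper bound on pages fetched in one crawl
    delay_seconds: float  # pause between consecutive requests


# Conservative default for sites about which nothing is known.
DEFAULT_POLICY = CrawlPolicy(max_pages=1000, delay_seconds=5.0)

# Per-site overrides, e.g. a gentler policy for a small site on a limited plan.
SITE_POLICIES = {
    "small-site.example": CrawlPolicy(max_pages=100, delay_seconds=30.0),
    "large-site.example": CrawlPolicy(max_pages=50_000, delay_seconds=1.0),
}


def policy_for(host):
    """Return the policy for a host, falling back to the default."""
    return SITE_POLICIES.get(host.lower(), DEFAULT_POLICY)
```

For example, `policy_for("small-site.example")` returns the gentler policy, while an unlisted host receives `DEFAULT_POLICY`.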
Web crawling involves a number of participants whose interests need to be considered:
- The owner of the web site
- The hosting company
- The crawler operator’s organization
- Users of the resulting data
A crawler operator needs to maximize the potential benefits and minimize the disadvantages for these participants.
The following guidelines for ethical crawling are based on Koster's (1993) recommendations:
- Keep in mind the social implications of the information extracted, especially the privacy implications of information aggregated during crawling
- Look for alternative sources that can fulfil your needs, such as the Google API and the Internet Archive
- Avoid crawling websites for teaching or training purposes unless it is justifiable or necessary
- Keep in mind the financial costs that crawling imposes on website owners
- Do not take advantage of naïve site owners who will be unable to identify the cause of their bandwidth charges
- Be prepared to pay for the crawling costs if requested
- Acquire an in-depth understanding of the cost implications of crawling big sites, small sites and others
- Balance the costs and benefits of each Web Crawling project and ensure that the social benefits outweigh the disadvantages
- Email webmasters of large enterprises to inform them about the crawl so that they can opt out if they want
These guidelines seem appropriate for most types of crawling, with the exception of web competitive intelligence, where the major objective is to gain a strategic advantage over competitors. If this type of crawling becomes problematic, it needs to be addressed legally. Legislation becomes essential if crawling rises to a level at which it invades privacy.