So, you are writing an article on a “not-so-common” topic, and you can’t find much information on it, because it was a secret affair and was hushed up by the government. That does not mean that you have hit a brick wall. Maybe you are just searching in the wrong place.
According to recent studies, only about four percent of the internet has been indexed, meaning that the remaining ninety-six percent has not, and anything that has not been indexed is very difficult to find. It simply does not show up in search engines. Say you are searching for the “Revolt of 1857”, and there are unindexed websites in the deep web with loads of information about the revolt. They just won’t show up, whether you use Google, Bing, or DuckDuckGo.
The deep web is in itself a massive repository of information, mostly unindexed by automated search engines, but readily available to those who know how to reach in, or who know the tools that will help them do so.
At the other end of the spectrum is the Surface Web, or Static Web: the collection of websites indexed by automated search engines. Whether you call it a search bot or a web crawler, it follows URLs, indexes the content, and then relays the results back to the search engine’s central repository, where they are consolidated and served in response to user queries.
Ideally, the process would cover the entire Web, but in practice it is subject to the vendor’s time and storage constraints. The pain point, be it searching or crawling, lies in the indexing. A bot that you create cannot report something that isn’t indexable. This is why major search engines cover only a fraction of the possible finds.
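To make the crawl-and-index loop concrete, here is a minimal sketch in Python. It is not how any real search engine works; the `TOY_WEB` dictionary, the `example.org` URLs, and the `crawl` and `search` helpers are all made up for illustration, and the "web" lives in memory so nothing is fetched over the network. It shows two things from the paragraphs above: a page that nothing links to is never discovered, and a page budget (standing in for time and storage constraints) cuts the crawl short.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "https://example.org/": ("home page", ["https://example.org/a", "https://example.org/b"]),
    "https://example.org/a": ("revolt of 1857 overview", ["https://example.org/b"]),
    "https://example.org/b": ("contact details", []),
    # No page links here, so a link-following crawler never discovers it.
    "https://example.org/archive?q=1857": ("revolt of 1857 primary sources", []),
}


def crawl(seed, web, max_pages=100):
    """Breadth-first crawl from `seed`; returns an index of word -> set of URLs."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    fetched = 0
    while queue and fetched < max_pages:  # max_pages models the crawl budget
        url = queue.popleft()
        page = web.get(url)
        if page is None:  # dead link: nothing to index
            continue
        text, links = page
        fetched += 1
        for word in text.split():  # index the content
            index.setdefault(word, set()).add(url)
        for link in links:  # follow the URLs found on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Answer a user query from the index alone."""
    return sorted(index.get(word, set()))
```

Searching this index for `1857` after crawling from the home page returns only the overview page. The archive page holds the same keyword, but since the crawler never reached it, the search cannot report it.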
What makes it “DEEP”?
You will have difficulty scraping these categories of sites:
- Proprietary sites
- Sites that need registration
- Sites with scripts running
- Dynamic sites
- Ephemeral sites
- Sites that are blocked by local webmasters
- Sites that are blocked by search engine policy
- Sites with specific special formats
- Searchable databases
Proprietary sites generally require a fee if you want to crawl them. Registration sites require a login ID and password. A bot can index script code, but it can’t always determine what the script actually does. On dynamic websites, data is created on demand; it has no existence prior to the query and only limited existence afterward. If you have ever noticed an interesting link on a social media or news site but found it inaccessible later on, you have encountered an ephemeral website. As for special formats, most that were not indexable before, such as PDFs, are easily indexed now.
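For the "blocked by local webmasters" category, one common mechanism (an assumption here, since the article does not name one) is a robots.txt file that tells crawlers which paths to stay out of. The sketch below uses Python's standard `urllib.robotparser` module to check a list of URLs against such a file. The `ROBOTS_TXT` rules, the `my-bot` user agent, and the `allowed_urls` helper are invented for the example, and the rules are parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a webmaster might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that the given robots.txt lets this user agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in-memory rules, no network call
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

A well-behaved bot would run a check like this before fetching anything, which is one reason member areas and search result pages tend to stay out of the index.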
However, the most valuable deep web resource of all is the searchable database. There are a huge number of secure databases holding information worth billions, but they are mostly unscrapable. They serve as the back end to front-end search bars on various sites, which let you view part of the data at a time, but never the whole.
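The "part of the data at one go" behaviour can be sketched as a paged search. In the Python below, `RECORDS` and `search_page` are hypothetical stand-ins for a site's back-end database and its search bar, and `collect_all` is an illustrative helper; a real site would sit behind HTTP requests, rate limits, and terms of use that this sketch ignores. The point is only that each query returns a slice, so gathering more means issuing one request per page.

```python
# Hypothetical back-end records hidden behind a front-end search bar.
RECORDS = [f"record-{i}" for i in range(1, 8)]


def search_page(query, page, page_size=3):
    """Stand-in for a site's search form: one slice of the matches per request."""
    matches = [record for record in RECORDS if query in record]
    start = page * page_size
    return matches[start:start + page_size]


def collect_all(query, max_pages=50):
    """Request page after page until an empty page (or the page cap) is reached."""
    results = []
    for page in range(max_pages):
        batch = search_page(query, page)
        if not batch:  # no more matches for this query
            break
        results.extend(batch)
    return results
```

Even then, you only ever see what matches the queries you thought to ask, which is why these databases stay largely out of reach of a generic crawler.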
So how do you crawl the deep web?
There are academia-specific search engines like Factbites, which source information from dictionaries, encyclopedias, universities, and many other non-profit .org sites. The Deep Web is easily accessible to those who know how to navigate its mazes. Many individuals and institutions have helped put together invisible Web directories that can be used as a starting point for your web scraping search. Some examples:
- The University of Michigan’s OAIster (pronounced “oyster”) encourages people to “find the pearls” in the Deep Web. It has millions of records from institutions ranging from African Journals Online to the Library Network of Western Switzerland, so you can guess the diversity.
- LookSmart’s https://www.findarticles.com/ lets you search through print publications for articles, be it popular magazines or scholarly journals.
- The Library Spot is another collection of databases, online libraries, references, and other good information collected from the Deep Web. They also have a featured “You Asked For It” section, where they answer popular readers’ questions.
- The UCLA online library has vast holdings, including special collections that are found only in the deep web.
- An interesting find is www.infoplease.com and its searchable Deep Web databases. It displays results from encyclopedias, dictionaries, almanacs, and other resources, extracted only from the Deep Web.
- The Central Intelligence Agency (yes, the CIA, which you must recognize from the many Hollywood movies you have watched) has the World Factbook, a searchable directory of flags of the world, as well as reference maps, country profiles, and much more. It is a great resource if you are working on geographical content.
- The University of Idaho has a Repository of Primary Sources, which contains innumerable links to manuscripts, archives, rare books, and more. It covers not only the US but also other countries.
- If you are into agriculture and want to find plants with certain characteristics, you can probably find something that catches your eye in the USDA’s Plants Database in the Deep Web.
- The Human Genome Database has a ton of information: almost everything discovered by humans about the human genome.
- For medical questions, the Combined Health Information Database is a user-friendly subject directory that provides answers to almost any healthcare question.
This article may end here, but the deep web is an unending source of information that might help you in your business pursuits and even in personal enrichment. If you really want to leverage the data found there and extract it in a structured format that you can use as per your needs to grow your business, you should take the help of a provider who has been working in this field and helping other successful businesses.