History of Internet Archiving & Web Crawling – It’s Been 20 Years.
In the one second it took you to start reading this article, a great deal changed on the World Wide Web. In that second, Google processed more than 40,000 search requests (that’s about 3.5 billion per day and 1.2 trillion per year), and the web saw 7,000 tweets, 1,100 Tumblr posts and 730 new Instagram photos. Taking all of this together, the internet archive generated 35 TB of raw data, and you hardly blinked twice.
Fascinating stats, right? Twenty years ago nothing was this easy or this huge. In 1998, Google handled about 10,000 queries per day, and data as a service (DaaS) was barely a concept. Now Google tunes every parameter with the help of data gathered from various web sources, and DaaS is a mainstream technology.
We find it astonishingly cool that ‘history of the internet’ is still one of the most popular search phrases on Google. Admittedly, the internet (and internet archiving) has become a lifeline for most of us. It is no longer just a service to us; it acts like one of our life-supporting organs. Today, every query we send to Google travels around 3,000 miles to be processed and to bring back the closest answer, so that we feel nearer to knowledge.
None of this would have happened without the internet, or without the internet archiving techniques that emerged 20 years ago.
What is internet archiving?
Internet archiving means collecting and preserving the contents of the internet so that current and future generations can use them. One well-known non-profit effort is the European Archive Foundation, which took up scientific internet archiving around 2004 and was renamed the Internet Memory Foundation in 2010.
Wayback Machine: The Time Machine
Internet archiving began its journey back in 1996 with two bright minds, Brewster Kahle and Bruce Gilliat. They developed software (a web crawler) that could crawl and download the web pages available to general web users. For commercial use, Brewster Kahle started the prestigious company Alexa Internet, which now provides information on the browsing behaviour, global ranking and web traffic of 30 million websites. This colossal amount of data feeds the contents of the Wayback Machine.
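To make the idea of "crawl and download" concrete, here is a minimal sketch of a breadth-first crawler in Python. It is not the software Kahle and Gilliat wrote; the names `LinkExtractor`, `crawl` and `FAKE_WEB` are our own illustrations, and the three-page site lives in memory so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: html} for fetched pages.

    `fetch(url)` is supplied by the caller and returns the HTML text,
    or None when the page is unavailable.
    """
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to store
        archive[url] = html  # "download" the page into the archive
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before queueing.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# Hypothetical three-page site held in memory (an assumption for the demo).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/a#top">A again</a> <a href="/missing">gone</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

A real crawler would swap `FAKE_WEB.get` for a function that fetches pages over HTTP, and would also need politeness rules (robots.txt, rate limits) that this sketch leaves out.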
The Wayback Machine, launched in October 2001, is a backward timeline of internet content covering the last 20 years. With it, any regular user can see the different archived versions of the content at a particular web address. Currently this organisation, through its Archive-It program and along with 440 other organisations, stores 4.1 million items. We love to call it the ‘time machine’ of the World Wide Web. According to the last report, published in 2014, it contained 9 petabytes of web data and was growing at a rate of 20 terabytes per week.
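For readers who want to look up an archived version themselves, the sketch below builds the two kinds of address the Wayback Machine exposed publicly at the time of writing: a snapshot URL and an availability query. The helper names `snapshot_url` and `availability_query` are ours, and the URL patterns should be treated as assumptions to check against the current Internet Archive documentation. The code only builds strings, so it makes no network calls.

```python
from urllib.parse import urlencode


def snapshot_url(page_url, timestamp):
    """Address of the capture closest to `timestamp`.

    `timestamp` uses the YYYYMMDDhhmmss form and may be truncated,
    for example "20011024" for 24 October 2001.
    """
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def availability_query(page_url, timestamp):
    """Query URL that asks which capture is closest to `timestamp`."""
    params = urlencode({"url": page_url, "timestamp": timestamp})
    return f"https://archive.org/wayback/available?{params}"


print(snapshot_url("http://example.com/", "20011024"))
print(availability_query("example.com", "20011024"))
```

Opening the first address in a browser shows the page as it was captured around that date; the second is meant for programs, which can read the response to find the closest capture.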
Internet archiving & Web Crawling:
We brand the current page of world history the ‘age of data’, and internet archiving, together with the evolution of web crawling, has made this possible. For today’s data scientists, data is as valuable as gold. And if data is gold, then the internet archive is the goldmine and web crawlers are the skilled gold miners.
Let’s take a deeper look.
Back in 1999, Google took an entire month to crawl and build an index of 50 million web pages. In 2012 the same job took less than a minute, and nowadays it takes less than a second. Today a single search query works through this mammoth database within 0.2 seconds to fetch its answer.
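A little arithmetic shows how big that jump is. The sketch below converts the figures above into pages per second, assuming a 30-day month for 1999 and treating "less than a minute" as exactly 60 seconds, so the 2012 rate and the speedup are lower bounds. The variable and function names are our own.

```python
PAGES = 50_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages, seconds):
    """Average indexing throughput over the given time span."""
    return pages / seconds


# 1999: one month (assumed to be 30 days) for 50 million pages.
rate_1999 = pages_per_second(PAGES, 30 * SECONDS_PER_DAY)

# 2012: under a minute, so 60 seconds gives a lower bound on the rate.
rate_2012 = pages_per_second(PAGES, 60)

speedup = rate_2012 / rate_1999

print(f"1999: about {rate_1999:.1f} pages per second")
print(f"2012: at least {rate_2012:,.0f} pages per second")
print(f"Speedup: at least {speedup:,.0f}x")
```

Under these assumptions, roughly 19 pages per second in 1999 became more than 800,000 pages per second in 2012.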
In these 20 years of internet archiving, web crawlers have redefined the techniques of web crawling and data extraction. Over time, the web became more dynamic and complex in order to give its users a higher level of interaction. A hundred years ago, our medium for exchanging information with the rest of the world was limited to audio devices such as the radio. Later came television, which gave us an audio-visual medium. Now a digital persona represents us on social platforms, and today we stand at the edge of virtual reality.
The internet archive carries all the data that human civilization pours onto the web, but this sea of information is accessible only because web crawlers do the heavy lifting of gathering and indexing it in a scientifically organized manner for a single purpose: reuse. Commercially, companies collect and churn through this huge amount of data to extract the information hidden in it and design a better future.
So, what’s your take on this? Do feel free to share with us.
If you are on a quest for more data to power your business, it’s time to talk to us about your requirements.