Web crawling in the Web 2.0 era: a spider’s view
What is Web 2.0?
With Web 2.0, the watchword is ‘user-generated web content’ or ‘user-centric web content’; for web crawlers, though, it opened up a world of unfamiliar content. We live in a time when technology matures faster than we can keep pace with it. We have grown accustomed to the word ‘upgrade’ and to the comfort of embracing ever-smarter technology, to the point where anything that is not upgraded is written off as dead. Eventually, the same became true of the world wide web.
But it would be a serious mistake to see the internet’s journey from Web 1.0 to Web 2.0 as a mere technology upgrade. Admittedly, it was not a well-planned, premeditated update to a technical specification; the real change is in the way we interact with the web. Web 2.0 is far friendlier to its users than its predecessor, as users can now create web content and share it with others without any technical knowledge of web publishing or its allied programming languages. Most popular social networking sites, such as Wikipedia, YouTube and Facebook, are built on this Web 2.0 model.
What’s the difference between Web 1.0 and Web 2.0?
Web 1.0 was anything but interactive; a radical change was waiting to step in and reshape the web’s future. For the regular internet user, Web 1.0 was just a colossal information storehouse. Searchers pursued their hunt for knowledge by clicking through static web pages that held information relevant to their query. But there is a clear line between gathering information and interacting with it. Web 1.0 offered little interactivity: social media was yet to become a buzzword, and there were several technical layers a user had to penetrate before any interaction could take place. If we dig into the basic functional attributes of Web 1.0, we find that:
- Web 1.0 housed more static pages than dynamic ones
- Server-side scripting support was rare on web servers
- Users had to go through clunky guestbooks to interact with a web document
- The browser fetched content straight from a web server’s filesystem, as more versatile web database technologies such as RDBMSs were not yet a popular or widely known standard
- Basic ‘mailto’ forms served as the feedback channel, handing the message off to the user’s email client rather than submitting it to the server
So, before Web 2.0 stepped in, the world wide web was more an information store than an interactive medium for communication. Today, by contrast, we package and publish every small update in our daily lives to popular social media sites, inviting further interaction from our friends and relatives.
Web 2.0 got more social and user-centric.
Today, something represents you on the web and interacts with your friends, colleagues and the rest of the world even while you are fast asleep, and this is true of every single person with access to the internet. So let’s break down the main concepts of Web 2.0 to understand how our interactions with the web have changed. They are:
- Social media networks
- Personal publishing platforms
- Mass user participation
- Podcasting (process of making audio and video files available on the web)
- Rich user experience
- User-driven data tagging and classification
- Web-based applications that behave like full-fledged desktop software
So Web 2.0 engages its users in a far more interactive way. Users can publish their thoughts, create social content, take part in live debates and express their views in comments. Nowadays we tend to share almost every little happening on social channels, and the product reviews and usage reports published there often help us make informed purchasing decisions. In every way, Web 2.0 is more tightly integrated with the internet of things (IoT) that now runs through our daily lives.
Web crawling in Web 2.0: the challenges
If we look into the established standards of web crawling in the Web 1.0 era, the key processes were:
- a protocol-driven approach
- communication through a (possibly secure) socket connection to the host or IP address
- issuing HTTP requests and interpreting the responses
- parsing the responses to collect resources such as data, links and other targeted components
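The classic protocol-driven loop above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production crawler: the `fetch` helper stands in for the request/response steps, and the parsing step is demonstrated on an inline static page so the sketch runs without network access.

```python
# A minimal sketch of the Web 1.0 crawling loop: fetch a page over HTTP,
# then parse the response to collect links. The inline 'page' string is
# sample data standing in for a fetched static document.
from html.parser import HTMLParser
from urllib.request import urlopen  # handles the socket + HTTP request/response steps


class LinkExtractor(HTMLParser):
    """Collects href targets -- the 'resource collection' step."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Protocol-driven fetch: one HTTP GET over a (possibly TLS) socket."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Parsing demonstrated on a static, Web 1.0-style page:
page = ('<html><body><a href="/about.html">About</a>'
        '<a href="mailto:webmaster@example.com">Contact</a></body></html>')
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about.html', 'mailto:webmaster@example.com']
```

Note that nothing in this loop executes scripts: whatever the server sends back is all the crawler ever sees, which is precisely why this approach struggles with Web 2.0 pages.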
This approach breaks down on Web 2.0 sites, where much of the content is generated on the fly by scripts rather than served as static pages. Thus event-driven crawling stepped into the scenario, built to work with the new standard: the DOM, or Document Object Model. Technically, the DOM is an application programming interface that defines the structure of the documents in a web page and gives web crawlers a feasible way to access and manipulate them. This new crawling approach is not a single-step one; it comprises four well-defined steps:
- analysis of the JavaScript
- analysis of its linkage with Ajax
- DOM event handling and processing
- dynamic content extraction
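The four steps above can be illustrated with a toy simulation. A real event-driven crawler drives a headless browser; here a tiny mock DOM stands in for one so the whole cycle — fire each discovered event handler, let its ‘Ajax’ callback mutate the page, then diff the DOM to extract what changed — fits in a few lines. Every class and handler name below is invented for the illustration.

```python
# Toy illustration of event-driven crawling. The MockDOM plays the role of
# a browser DOM whose clickable elements run 'Ajax' handlers that add new
# content nodes; the crawler triggers each event and extracts the delta.


class MockDOM:
    """Stand-in for a browser DOM: clickable elements whose handlers
    (steps 1-2: the 'analysed' JavaScript/Ajax linkage) add content."""

    def __init__(self):
        self.content = ["static welcome text"]
        self.event_handlers = {
            "load-more-btn": lambda: self.content.append("lazy-loaded article"),
            "comments-tab": lambda: self.content.append("user comments"),
        }


def crawl_dynamic(dom):
    seen = set(dom.content)                 # content visible before any event
    extracted = list(dom.content)
    for element, handler in dom.event_handlers.items():
        handler()                           # step 3: fire the event ('click')
        for node in dom.content:            # step 4: diff the DOM, extract
            if node not in seen:            #         only the new content
                seen.add(node)
                extracted.append(node)
    return extracted


print(crawl_dynamic(MockDOM()))
# ['static welcome text', 'lazy-loaded article', 'user comments']
```

The key design point is the diff against `seen`: dynamic content only exists after an event fires, so the crawler must compare DOM states before and after each event rather than parse the page once.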
Through these steps an array of riddles got solved: penetrating the intricate application logic of Ajax and Flash, the linchpins behind Web 2.0’s rich user interfaces, and understanding the structure of the data sets that flow between applications in encrypted form.
Moreover, in this data-driven world it is entirely possible that the application to be crawled aggregates information from multiple sites as RSS feeds in order to deliver an information-rich interface to its users. This too has been a problem for web crawlers, since it is hard for them to recover a clear structure from the data a particular application has accumulated, and Web 2.0 crawlers have had to be designed to combat these issues.
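Aggregated RSS feeds are, at least, one structured surface a crawler can rely on. Below is a minimal sketch, using only the Python standard library, of pulling item titles and links out of an RSS 2.0 document; the feed is inline sample data, not any real site’s feed.

```python
# Minimal RSS 2.0 parsing sketch: extract (title, link) pairs from the
# <item> elements of a channel. The 'rss' string is invented sample data.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Aggregator</title>
  <item><title>Post one</title><link>http://example.com/1</link></item>
  <item><title>Post two</title><link>http://example.com/2</link></item>
</channel></rss>"""


def parse_rss(xml_text):
    """Return (title, link) pairs for every <item> in the channel."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


print(parse_rss(rss))
# [('Post one', 'http://example.com/1'), ('Post two', 'http://example.com/2')]
```

Because RSS fixes the element names (`item`, `title`, `link`), a crawler gets a predictable schema here even when the application’s own pages are dynamically generated — which is exactly what makes feeds a useful entry point into otherwise opaque Web 2.0 applications.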