Web Crawling in the Web 2.0 Era: A Spider’s View

August 28, 2015
Last updated: June 28, 2023

What is Web 2.0?

With Web 2.0, the buzzword is ‘user-generated web content’ or ‘user-centric web content’, but for web crawlers it opened up a world of unknown web content. We live in a time when technology matures faster than ever. We are now well accustomed to the word ‘upgrade’ and to the benefits of embracing smarter technology, and anything that is not upgraded is soon written off as dead. The same holds true for the World Wide Web.

But it would be a serious mistake to see the journey of the internet from Web 1.0 to Web 2.0 as a mere technology upgrade. It was not a planned, premeditated update to a technical specification; the change is in the way we interact with the web. Web 2.0 is much friendlier to its users than its predecessor, as users can now create web content and share it with others without any technical knowledge of web publishing or the related programming languages. Many of the most popular sites, such as Wikipedia, YouTube and Facebook, are built on this Web 2.0 model.

What’s the difference between Web 1.0 and Web 2.0?

Web 1.0 was anything but interactive, and a radical change was waiting to step in and shape the future of the web. For regular internet users, Web 1.0 was just a colossal information storehouse. Searchers pursued their hunt for knowledge by browsing static web pages that held information relevant to their search query. But there is a clear line between gathering information and interacting with it. Web 1.0 offered little interactivity: social media was yet to become a buzzword, and users had to get through several technical layers to interact with a page at all. If we dig into some basic functional attributes of Web 1.0, we can see that:

Web 1.0 housed more static pages than dynamic ones
Server-side scripting support was rare on web servers
Users had to go through pesky guestbooks to interact with a web document
The browser fetched content from a web server’s filesystem, as more versatile database technologies such as RDBMS were not yet a popular or well-known standard for the web
Basic ‘mailto’ forms were used as the feedback channel, as web browsers had no built-in email client

So, before Web 2.0 stepped in, the World Wide Web was more an information store and less an interactive medium for communication. Today, we customize and publish every small update from our daily lives on popular social media sites to get further interaction from our friends and relatives.

Web 2.0 got more social and user-centric.

Today, your online presence represents you on the web and interacts with your friends, colleagues and the rest of the world even while you are fast asleep, and this is true for every single person who has access to the internet. Let’s break down the main concepts of Web 2.0 to understand how our interaction with the web has changed. They are:

Social media networks
Personal publishing platforms
Mass user participation
Podcasting (the process of making audio and video files available on the web)
Rich user experience
User-enabled data tagging and data classification
Web-based applications that act as full-fledged desktop services

So, Web 2.0 engages its users in a more interactive way. Put simply, users can publish their thoughts, create social content, take part in live social debates and express their views as comments. Nowadays, we tend to share almost every little happening on social channels. At times, the product reviews and usage experiences published there help us make an informed purchasing decision. So, in every way, Web 2.0 is closely integrated with the connected world of the Internet of Things (IoT) that now pervades our daily lives.

Web crawling in Web 2.0: Challenges

Web 2.0 brought in a host of new and advanced web programming components and architectural patterns, such as Ajax, SOA (Service-Oriented Architecture), JSON (JavaScript Object Notation), REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). These advancements upset the established practices of web crawling and pushed it to reach a new level almost overnight.

If we look closely at how web crawling worked in Web 1.0, the key processes were:

A protocol-driven approach
Communication through a socket connection to the host or IP address
HTTP request and response interpretation
Parsing of responses to collect resources such as data, links and other targeted components
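
The processes above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular crawler is implemented: the raw response is a canned string standing in for what a socket read would return, and the host `example.com`, the page content and the helper names `split_response`, `LinkCollector` and `extract_links` are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Canned HTTP response standing in for what a socket read would return.
# The host, paths and page content are hypothetical.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    '<html><body><a href="/about.html">About</a>'
    '<a href="http://example.com/news.html">News</a></body></html>'
)


def split_response(raw):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    status_code = int(lines[0].split(" ")[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag in a static page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def extract_links(raw, base_url):
    """Interpret the response, then parse the body for links to follow."""
    status_code, headers, body = split_response(raw)
    if status_code != 200 or "text/html" not in headers.get("content-type", ""):
        return []
    collector = LinkCollector(base_url)
    collector.feed(body)
    return collector.links


links = extract_links(RAW_RESPONSE, "http://example.com/index.html")
```

Everything such a crawler knows about a page comes from the markup in that single response, which is exactly the assumption that Web 2.0 pages break.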

Web 2.0 was built with a philosophy of slashing page loading time by letting web pages update asynchronously with small packets of data from the server instead of reloading the entire page each time. These changes were inevitable in order to enhance the user experience and present rich media almost instantly, and Ajax (Asynchronous JavaScript and XML) was the obvious standard. But Ajax posed a tough challenge to the existing practice of web crawling. In Web 2.0, the protocol-driven approach to web crawling was no longer a solution, because a crawler that only reads the HTML returned by the first request never sees the content that scripts load afterwards. Crawling had to evolve and get smarter to match the new web application programming standard and the new logical structure of web data components.
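
To see why the older approach falls short, consider the small sketch below. It assumes a hypothetical Ajax page whose HTML is only an empty shell and whose content arrives later as a JSON packet; the endpoint `/api/feed.json`, the packet contents and the `StaticView` class are invented for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical Ajax page: the server sends an empty shell, and a script
# fills the "feed" container later from a JSON endpoint.
PAGE_SHELL = """
<html><body>
<div id="feed"></div>
<script>
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/api/feed.json");
  xhr.send();
</script>
</body></html>
"""

# Hypothetical packet the endpoint would return asynchronously.
AJAX_PACKET = '{"items": [{"title": "First post", "url": "/posts/1"}]}'


class StaticView(HTMLParser):
    """Record the anchors and visible text a non-scripting crawler sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script source is not visible page text, so it is skipped.
        if not self._in_script and data.strip():
            self.text.append(data.strip())


view = StaticView()
view.feed(PAGE_SHELL)
static_links = view.links   # empty: the shell contains no anchors
static_text = view.text     # empty: the container has not been filled yet

# The real content only exists inside the asynchronous packet.
dynamic_links = [item["url"] for item in json.loads(AJAX_PACKET)["items"]]
```

The static pass comes back with nothing to follow, while the single link on the page is only reachable by someone, or something, that runs the script and reads the packet.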

Thus event-driven crawling entered the scene, built to work with the new standard, namely the DOM or Document Object Model. Technically, the DOM is an application programming interface that defines the structure of the documents in a web page and gives web crawlers a practical way to access and manipulate them. This new web crawling approach is not a single-step one. It mainly consists of four well-defined steps. They are:

Analysis of JavaScript
Analysis of its linking with Ajax
DOM event handling and processing
Dynamic content extraction
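
The four steps can be approximated with the rough sketch below. A real event-driven crawler would typically execute the page's JavaScript in a browser engine and fire DOM events there; this example only imitates that with static pattern matching, which is an assumption made to keep it self-contained. The page, the `/ajax/news.json` endpoint, the `FAKE_SERVER` replies and all function names are hypothetical.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical Ajax page and server reply, used in place of a live site.
PAGE = """
<html><body>
<a href="#" onclick="loadNews()">Show news</a>
<div id="news"></div>
<script>
function loadNews() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/ajax/news.json", true);
  xhr.send();
}
</script>
</body></html>
"""
FAKE_SERVER = {"/ajax/news.json": '{"links": ["/story/1", "/story/2"]}'}


class ScriptAndEventCollector(HTMLParser):
    """Gather inline scripts (step 1) and DOM event attributes (step 3)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self.events = []          # (tag, event attribute, handler code)
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        for name, value in attrs:
            if name.startswith("on") and value:
                self.events.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts.append(data)


def ajax_endpoints_by_function(script):
    """Step 2: map each function name to the URLs passed to xhr.open()."""
    endpoints = {}
    # Simplification: assumes each function ends with "}" at line start.
    pattern = r"function\s+(\w+)\s*\([^)]*\)\s*\{(.*?)\n\}"
    for name, body in re.findall(pattern, script, flags=re.S):
        urls = re.findall(r'\.open\(\s*"[A-Z]+"\s*,\s*"([^"]+)"', body)
        if urls:
            endpoints[name] = urls
    return endpoints


def crawl_events(page, fetch):
    """Follow each DOM event to its Ajax call and collect the new links."""
    collector = ScriptAndEventCollector()
    collector.feed(page)
    endpoints = ajax_endpoints_by_function("".join(collector.scripts))

    discovered = []
    for _tag, _event, handler in collector.events:
        called = re.match(r"\s*(\w+)\s*\(", handler)
        if not called:
            continue
        # Step 3: "fire" the event by following its handler to its endpoints.
        for url in endpoints.get(called.group(1), []):
            # Step 4: extract the dynamic content from the reply.
            discovered.extend(json.loads(fetch(url))["links"])
    return discovered


found = crawl_events(PAGE, FAKE_SERVER.__getitem__)
```

Passing `fetch` in as a parameter keeps the sketch free of network calls; in practice it would be replaced by a real HTTP request, and the pattern matching by actual script execution.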

These steps solved an array of riddles, such as penetrating the difficult application logic of Ajax and Flash, which are the linchpins of the rich user interface of Web 2.0, and understanding the structure of the data sets that flow between applications in an encrypted format.

Moreover, in this data-driven world it is quite possible that the target application to be crawled stores information from multiple sites as RSS feeds in order to deliver an information-rich interface to its users. This has also been a problem for web crawlers, as it is hard for them to get a clear structure of the data accumulated by a particular application, and crawlers for Web 2.0 have had to be designed to deal with these issues.
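
As a small illustration of recovering structure from such aggregated data, the sketch below groups the items of an RSS 2.0 feed by the site each one links to. The feed content, the `site-one.example` and `site-two.example` hosts and the `group_items_by_source` helper are assumptions made for the example, not part of any real application.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical aggregated feed mixing items from two source sites.
AGGREGATED_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Aggregated headlines</title>
    <item><title>Story A</title><link>http://site-one.example/a</link></item>
    <item><title>Story B</title><link>http://site-two.example/b</link></item>
    <item><title>Story C</title><link>http://site-one.example/c</link></item>
  </channel>
</rss>
"""


def group_items_by_source(feed_xml):
    """Return {source host: [item titles]} for an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    grouped = defaultdict(list)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        # The host part of the link identifies the originating site.
        grouped[urlparse(link).netloc].append(title)
    return dict(grouped)


by_source = group_items_by_source(AGGREGATED_FEED)
```

Grouping by link host is only one possible choice; a feed that carries a per-item `source` element could be grouped on that instead.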
