A roundup of the 7 most exciting web crawl use cases of 2013
2013 was an exciting year of growth at PromptCloud, and it feels like it went by at a mile a minute. Here’s a look back at some of the most interesting tasks that kept us on our toes and took our crawl platform forward. These are the requirements that came along:
1. Product data normalization– Early in 2013, a semantic search engine for products needed data from all e-commerce sources on the web. While acquiring ongoing data feeds at this scale was the challenge for our client, the real pain at our end was conforming to the customer’s data schema, the most intricate we had encountered. The key, therefore, was normalizing data from every source into that single schema, and our team pulled it off. The client continues to receive data from multiple sources on a regular, automated basis.
2. Scraping shipping line websites– It was an exciting phase when one of the forerunners of the shipping industry approached us to scrape sailing schedules from various shipping line websites in order to track transshipment across ports globally.
All of these websites were AJAX-based, and millions of queries had to be made on each to get to the final result. We drew on existing components of our system, picked up domain knowledge through client interactions, adhered to the servers’ politeness policies, and optimized the queries. It was not an easy journey. Today, our platform sails smoothly through these crawls and grabs all the sailing schedules in the customer’s desired format without any manual intervention.
3. URL Freshness Check– Not until one of our existing customers approached us with this need did we realize our platform could serve it too. The requirement was to check whether URLs on a very large website still existed. The module was added to the client’s project, and freshness reports are now uploaded to the client’s servers each day.
4. Mass-scale crawls– Although our platform is designed to handle large-scale crawls, we had never really gone beyond 300 sites per client in the past (regardless of the volumes per site). A social media intelligence firm wanted data from ~20,000 sources based on a set of 5,000 key phrases. A few components were added to the existing stack to automate such large-scale crawls without getting into the details of each site for data extraction. Once the solution was validated with the client, we officially introduced Mass-scale crawls, where a little more than meta info can be extracted from relevant pages. A couple of other clients from the finance industry have since subscribed to this offering.
5. Named-entity Recognition– We also got our hands dirty with named-entity recognition (NER) last year. We aggregated jobs data for one of our clients from a predefined set of sites. The entities the client was interested in were marked up within the data, which was later analyzed at the client’s end. NER is now provided as a service as part of our DaaS offering.
6. Low-latency– Our platform had previously worked with a crawl latency of 15-20 minutes. One of our clients expected a latency of 5 minutes to track user sentiment via reviews of iTunes apps. The system was tweaked to lower the latency, and data now gets uploaded every time there’s something new. Low latency is an integral part of crawls that run for sentiment analysis (including Twitter crawls), and we see a clear trend of clients adopting it.
7. Search-based crawls– Last year saw us getting into a lot of search-based crawls, either within a site or on the web. We searched a list of ISBNs on a re-commerce site to extract each book’s details; we searched for certain routes to retrieve results from flight and bus ticketing websites; and we ran web searches to analyze the top results for a particular keyword on a known domain.
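To make one of these concrete, the URL freshness check from item 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the module we actually run: the names `classify_status`, `http_status` and `freshness_report`, the "live"/"gone"/"unknown" labels, and the rule that 404 and 410 mean a URL no longer exists are all hypothetical choices made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable

# Assumed rule: these status codes mean the page no longer exists.
GONE_STATUSES = {404, 410}


def classify_status(status: int) -> str:
    """Map an HTTP status code to a freshness label (labels are hypothetical)."""
    if 200 <= status < 400:
        return "live"
    if status in GONE_STATUSES:
        return "gone"
    # Server errors, throttling and the like: recheck later rather than guess.
    return "unknown"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the status code of a HEAD request to `url`."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return err.code


def freshness_report(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Build a {url: label} report; network failures are labeled 'unknown'."""
    report: Dict[str, str] = {}
    for url in urls:
        try:
            report[url] = classify_status(fetch_status(url))
        except OSError:
            # DNS failures, refused connections and timeouts all derive from OSError.
            report[url] = "unknown"
    return report
```

Passing the fetcher in as `fetch_status` keeps the classification logic testable without touching the network. A production run over a very large site would also need the politeness delays and scheduling mentioned in item 2, which this sketch leaves out.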
Being a vertical-agnostic solution, we’ve been lucky to serve customers across domains with the same ease. According to our analysis, price comparison (both eCommerce and re-commerce) and social media monitoring are trending. These are the verticals that have immersed themselves in Big Data as it gets further democratized.
We do have a roadmap for the next year and are eager to turn on the execution engine. We’ll keep you posted as unprecedented use cases unfold. Do leave your comments here.