This post has two precursors: Part 1 and Part 2. Here’s the third batch of big data use cases that we’ve been getting our hands dirty with.
The source site that I’d like to collect data from limits the number of results it returns. So please use a search engine to collect as many results as possible for each keyword on my list and for a particular date range, restricted to that source domain.
Re-commerce is really growing in my country and I’d like to build a price comparison site for used goods. I’d like to gather prices on all used products being sold on e-commerce sites, as well as the ones that can be bought by these sites for a price. Please also get me all the barcode numbers, along with other product details, so that we can use these as identifiers to search other similar sites.
We are in the hygiene industry and deal with a lot of data. We’d like to use your hosted indexing service to index a maximum of 2 TB of data every month and run a maximum of 20,000 queries per day.
We need to extract phone numbers from an online classifieds website for all the listings present. We have had difficulty using another piece of software on this site since the information is buried several page levels deep. Since you say it’s feasible, please provide me with only the mobile numbers starting with these digits. Since we haven’t given you any other detail, feel free to use your judgement to extract whatever you think makes sense (the last sentence is what usually happens).
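The "only the mobile numbers starting with these digits" part of a request like this usually comes down to a small post-processing step once the listings have been crawled. Here is a minimal sketch of that step, assuming the numbers have already been extracted as raw strings; the function names and the sample prefixes are hypothetical and not taken from any actual client request.

```python
import re


def normalize(raw: str) -> str:
    """Strip everything except digits, e.g. '(987) 654-3210' -> '9876543210'."""
    return re.sub(r"\D", "", raw)


def filter_by_prefix(numbers, prefixes):
    """Return unique normalized numbers that start with any wanted prefix.

    The prefixes are an assumption: in practice they would come from the
    client's list of digits. Input order is preserved.
    """
    wanted = tuple(prefixes)
    seen = set()
    result = []
    for raw in numbers:
        digits = normalize(raw)
        if digits.startswith(wanted) and digits not in seen:
            seen.add(digits)
            result.append(digits)
    return result
```

Calling `filter_by_prefix(["(987) 654-3210", "812 555 0000"], ["98", "97"])` would keep only the first number. Real listings would also need country-code handling, which this sketch leaves out.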
I would like you to collect reviews from this popular travel site, as described, for these hotel brands only, within a particular continent. Please provide us with all details on author, review date, hotel name and the review content.
We are looking for data from various e-commerce websites that categorize products for men, women and children. Please provide this data, including item name, size, rate, color and page URL.
We are a personal travel planner in beta looking to enhance our venue database. In that regard, please collect as much information as possible on destinations, hotels and restaurants around the world.
I am currently working on a mobile app and am looking for the most comprehensive database of restaurants & bars in the US. The required data attributes for each restaurant include name, address, phone, website, cuisine, menu URL and hours of operation. Also, please discover the most popular sources for this kind of information to avoid ending up with subsets.
We have a list of target sports sites that we’d like to crawl and then cluster the content based on keywords that are the names of sports celebrities. Being a global sports brand, we need this data for our internal analysis.
We would like to collect airfare data from multiple websites for various sectors on a daily basis to run a fare analysis project across all airlines. This should include both one-way and round-trip data along with the timings, airlines and flight codes. Please deliver the data as a CSV, along with any monitoring involved.
We are interested in mass-scale crawls for two specific markets for which we often struggle to capture enough data for our clients. Our aim is maximum coverage, so please gather all data feeds and upload them at specified frequencies (which can be weekly or monthly). The source list includes ~10,000 sites.
We are looking for a solution that could provide us with 400,000 data points each month from the books and electronics categories of some e-commerce sites. Please crawl these sites on a daily basis to feed into our analytics platform.
I am mostly interested in “location” data for businesses all across Canada and the USA. I also have a list of companies that I’m interested in gathering data on. Please extract the comprehensive store data from sites as you discover them.
We need pricing information for approximately 200,000 retail products. Please provide us with detailed product information for a list of GTIN/EAN numbers from time to time.
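When a request arrives as a list of GTIN/EAN numbers, a cheap first step is to reject malformed identifiers before any crawling starts. The sketch below uses the standard GTIN check-digit rule (alternating weights of 3 and 1, counted from the rightmost data digit). The function names are our own, and whether a given client's list needs this filtering at all is an assumption.

```python
def gtin_check_digit(data_digits: str) -> int:
    """Compute the check digit for the data part of a GTIN-8/12/13/14."""
    total = 0
    # Weights alternate 3, 1, 3, 1, ... starting from the rightmost data digit.
    for i, ch in enumerate(reversed(data_digits)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Return True if code is all ASCII digits, a known length, and checks out."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])
```

A passing check digit only means the number is well formed; it says nothing about whether the product exists on any source site.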
We are a social intelligence engine enabling people to discover and share information via our platform. We’d like to gather consumer complaint information from multiple sites, along with the brand each complaint was made against.
On a daily basis, we will upload a list of URLs on our FTP server and would like you to check which ones are still alive and which don’t exist anymore. This list will consist of advertisement URLs that are pretty dynamic.
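A liveness check like this one can be sketched with nothing but the Python standard library. The version below is a rough illustration under a few assumptions of ours: a HEAD request is enough, any 2xx or 3xx response counts as "alive", and 404 or 410 counts as "gone". Everything else, including network errors, is reported as "unknown" so it can be retried later. The labels and function names are hypothetical.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse liveness label (our own labels)."""
    if 200 <= status < 400:
        return "alive"
    if status in (404, 410):
        return "gone"
    return "unknown"


def check_url(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request and classify the result.

    Some servers reject HEAD; a production version would likely fall back
    to GET, which this sketch does not do.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code for 4xx/5xx responses.
        return classify_status(err.code)
    except (urllib.error.URLError, TimeoutError, ValueError):
        # DNS failures, refused connections, timeouts, malformed URLs.
        return "unknown"
```

For a daily list, the same function would simply be mapped over every URL pulled from the FTP drop, with the "unknown" bucket retried before anything is declared dead.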
We are interested in crawling a directory of broker listings along with the broker profiles. Please collect this data for a list of zip codes that we provide.
Being in the copyright business, I need to look for illegal download links on the popular file sharing sites. I’d like you to crawl web pages for certain filename extensions and other keywords, and collect links for such illegal files on them.
We’d like to build a database of all world stadiums with their venue details and capacities. Please discover the relevant sources and provide us with a CSV to import into our DB.
We already crawl a few e-commerce sites in-house but there are some complicated sites that we’d want you to crawl. These sites need interactive crawling where you click on certain buttons/forms to display the data. Please provide us with the product price data from such sites in a structured format.
Read together, these use cases clearly show where big data trends are heading these days. We’ll continue this series.