It’s amusing how Big Data is knocking on doors these days, so it took a while to settle down after the overwhelming response to the previous post. Here’s the next one.
a) Notes that applied to the previous batch apply here too.
b) Only public data gets crawled in the process and robots.txt is strictly adhered to.
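As a rough illustration of note (b), here is a minimal sketch of a robots.txt check using Python's standard `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, and the `example.com` URLs are made up for the example, and this is not a description of our actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real crawler would download it from the site's /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in ("https://example.com/products/1", "https://example.com/private/report"):
    print(url, is_allowed(parser, "example-crawler", url))
```

A crawler would typically run a check like this before every request and skip any URL the rules disallow.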
- We’d like to know what’s going on in the home appliances industry within these countries concerning these brands. Please get us all the data you come across on forums, blogs, news, or other reviews.
- We are into building a product-centric search engine. Keep collecting product catalogs from this huge list of sites that we give you. But what’s more important is sticking to this deeply involved schema we suggest. Oh wow! You can discover sites as well? Bring it on.
- Currently, we have a big team of data entry guys who manually search for relevant real estate listings in our target cities. Since we’d like to get this automated, please acquire data from all of these real-estate websites.
- We are in the process of building an Expedia for oceans and seas. Since the websites on our list are not just plain HTML, please help us collect all shipping lines and sea routes from these websites daily.
- I’m building a travel search engine. Yes, there are many already out there but we bring an added layer of social and some machine learning on the ground data. To facilitate this idea, we are interested in gathering hotel addresses and reviews, destination reviews, traveler photos, and author profiles across the major travel websites.
- We are aiming to become a single database for all reviews, whether products or travel. On top of that, we would also like to add a categorization layer, so please get us all relevant data from various sites with these markups.
- I would ultimately like all postings by a user, along with the originating URL, that correlate with selected shopping sites I provide, like Amazon, Victoria’s Secret, Etsy, etc., and so would like to collect as many feeds as possible from this popular content-sharing website.
- We are into all things tech. We have a variety of projects that range from getting all the tech questions ever asked on tech forums to getting the list of all file extensions from around the web. Not many record-level details are required here, but all are in the name for us. Speed is prime, so a quick turnaround is appreciated.
- We are looking to crawl a bunch of HTML pages and track server response codes.
- We would like to track any news on these car models in this particular country. Please provide a data dump of the same every 2 days.
- Crawl about 100 e-commerce websites to bring together product images. Using this, we’d like to analyze what kind of products are appealing to the audience these days.
- Get me product feeds from the Indian E-commerce market with all product-level details and specifications. I need this to build an analytics engine.
- I have a simple one-time request. Please compile a database of all business listings in my locality.
- We measure the marketing performance of enterprises and optimize their marketing budgets. For one such project, we would like to track a few mobile handsets every week. We would like to see the different levels of pricing and financing options available, as well as how the ratings vary.
- We have built a lot of crawlers ourselves and are just not happy with the way we have done it. We are looking for a solution that can eventually crawl more than 5000 retail sites. We will use this data to deliver innovative software solutions.
- We perform a lot of competitive intelligence internally for our clients and are looking for a more automated solution where we can feed in keywords, and query the data depending on client requirements. So also index the data that you deliver.
- We have built a profiling system for lawyers based on several parameters. To enable this, we research the web extensively. Our researchers simply Google using the lawyer’s name and about 100 relevant keywords for each to fetch the results that can go into our system. We need a solution like yours to automate the process and summarize the retrieved data in an easily editable format.
- We provide some thought-provoking content to our readers from politics, sports, and other everyday things. Editorial being our strength, we would like a solution to facilitate all this content based on some keywords that we provide to you.
- We are interested in making a semantic job search engine for which we would like to acquire all job listings from these sources and catalog them.
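
One of the requests above asks to crawl a bunch of HTML pages and track server response codes. A minimal sketch of that idea, using only Python's standard library, might look like the following. The function names `fetch_status` and `tally_status_codes`, the `example-crawler` user agent, and the sample URLs are all assumptions made for illustration, not part of any client's setup.

```python
from collections import Counter
from typing import Callable, Iterable
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a URL, or 0 if no response arrived.

    Note: urlopen follows redirects, so this reports the final status.
    """
    request = Request(url, method="HEAD", headers={"User-Agent": "example-crawler"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError; keep their code.
        return error.code
    except OSError:
        # DNS failures, refused connections, and timeouts land here.
        return 0


def tally_status_codes(
    urls: Iterable[str], fetch: Callable[[str], int] = fetch_status
) -> Counter:
    """Count how many URLs returned each status code."""
    return Counter(fetch(url) for url in urls)


# Offline demonstration with a fake fetcher and made-up URLs.
fake_codes = {"https://example.com/": 200, "https://example.com/missing": 404}
counts = tally_status_codes(fake_codes, fetch=lambda url: fake_codes[url])
print(counts)
```

Passing the fetcher in as a parameter keeps the tally logic testable without touching the network; dropping the `fetch` argument would use the real `fetch_status`.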