With its expertise in analytics and research, the client wanted to add a data layer to its existing setup that would provide continuous, free-flowing feeds free of “noise”, so that the team could focus solely on interesting approaches to analytics. They wanted easy access to a complete product listing from specific categories, with all product specifications and prices listed together. The client previously had a data team that gathered data manually from various web sources, but the results were limited and the effort was high. Even with the manual effort, structuring the data for import into their database was a challenge. The client needed clean data that could be uploaded to their database in order to run the comparison engine and perform other monitoring activities.

The client provided us with the list of sources to be crawled and the data points required. The extraction was to run daily, which meant fresh data sets had to be provided every day. Our team set up crawlers to fetch the required data fields from the source sites provided by the client. This use case falls under our site-specific crawl offering, since the websites in the list differed in structure and design. The client needed the extracted data in CSV format, uploaded to their S3 servers. The initial setup was completed in a few days and the crawlers started delivering data immediately. About 200,000 records were delivered to the client during the first crawl.
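To make the CSV delivery step more concrete, here is a minimal sketch of how one day's records might be serialized and given a date-stamped file name before upload. The column names, the `records_to_csv` and `daily_key` helpers, and the key layout are all assumptions made for illustration, not the client's actual schema or our production code. The upload itself is left out because it depends on the client's bucket and credentials.

```python
import csv
import io
from datetime import date

# Hypothetical column set; the real data points were specified by the client.
FIELDS = ["product_name", "price", "category", "source_url"]


def records_to_csv(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=FIELDS,
        extrasaction="ignore",  # drop keys that are not in FIELDS
        lineterminator="\n",
    )
    writer.writeheader()
    for rec in records:
        # Missing fields become empty cells so every row has the same shape.
        writer.writerow({f: rec.get(f, "") for f in FIELDS})
    return buf.getvalue()


def daily_key(source, day):
    """Build a date-stamped object key (assumed layout) for one source's daily file."""
    return f"crawls/{source}/{day.isoformat()}.csv"
```

Keeping a fixed header and filling missing values with empty cells means each daily file can be loaded into a database table without per-file adjustments, which is the property the client was after.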
A crawler was set up to extract product prices and specifications automatically, on a daily basis, for predefined categories only. Based on the schema provided by the client, a template was created to structure the extracted data. The final data was delivered in XML format via the Data API on a daily basis without any manual intervention from either side. Each record in a dataset carried all details, i.e. product name, product price, availability status, short and long descriptions, all image URLs, SKU, dimensions, category, brand, source, and the source URL from which it was fetched.
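The record structure listed above can be sketched as a small data class with an XML serializer. This is only an illustration: the class name `ProductRecord`, the field and element names, and the `record_to_xml` helper are assumptions, and the actual template followed the client's own schema, which is not reproduced here.

```python
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProductRecord:
    """One product record; field names are hypothetical."""
    name: str
    price: str
    availability: str
    short_description: str
    long_description: str
    sku: str
    dimensions: str
    category: str
    brand: str
    source: str
    source_url: str
    image_urls: List[str] = field(default_factory=list)


def record_to_xml(record):
    """Convert a ProductRecord into a <product> element."""
    root = ET.Element("product")
    for key, value in asdict(record).items():
        if key == "image_urls":
            # A record can carry several images, so nest one <url> per image.
            images = ET.SubElement(root, "image_urls")
            for url in value:
                ET.SubElement(images, "url").text = url
        else:
            ET.SubElement(root, key).text = value
    return root
```

Calling `ET.tostring(record_to_xml(record))` would then produce the XML bytes for a single record; a daily dataset would wrap many such elements in one root element.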