Use case from Site-specific crawl and extraction
Client: Data analytics and business research shop for e-commerce and retail
Challenge: With its expertise in analytics and research, the client wanted to add a data layer to its current set-up that would provide continuous, free-flowing feeds, free of “noise”, so that the team could focus only on interesting approaches to analytics. They wanted easy access to a complete product listing from specific categories, with all the product specifications and prices listed together. The client previously had a data team that gathered data manually from various web sources, but the results were limited and the effort was high. Even with the manual effort, structuring the data for import into their database was a challenge. The client needed clean data that could be uploaded into their DB in order to run the comparison engine and perform other monitoring activities.
The Solution: A crawler was set up to extract product prices and specifications, for predefined categories only, automatically on a daily basis. Based on the schema provided by the client, a template was created to structure the extracted data. The final data was delivered in XML format via the Data API on a daily basis without any manual intervention from either side. Each record within a dataset had all details, namely product name, product price, availability status, short and long descriptions, all image URLs, SKU, dimensions, category, brand, source and the source URL from which it was fetched.
- Any changes within the source sites were taken care of, and the client was shielded from such issues
- Any schema changes were made as requested
- Other categories could be added as per changing requirements
- Productivity increased since the data team could work on other projects, and the client expanded into other verticals
- Low data turnaround time improved the ability to market the client’s services and capabilities
- The value added by the project was 50 times the spend
- Data quality levels increased dramatically without any time investment from the team
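
To make the delivered format concrete, the record structure described in the solution could be serialized roughly as follows. This is only a minimal sketch: the element names, the `record_to_xml` and `dataset_to_xml` helpers, and the sample values are all hypothetical, since the client's actual schema is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, one per record detail listed in the solution.
FIELDS = [
    "name", "price", "availability", "short_description",
    "long_description", "sku", "dimensions", "category",
    "brand", "source", "source_url",
]


def record_to_xml(record: dict) -> ET.Element:
    """Build a <product> element from one extracted record."""
    product = ET.Element("product")
    for field in FIELDS:
        # Missing fields become empty elements so every record has the same shape.
        ET.SubElement(product, field).text = str(record.get(field, ""))
    # A product can have several images, so they get a container element.
    images = ET.SubElement(product, "images")
    for url in record.get("image_urls", []):
        ET.SubElement(images, "image").text = url
    return product


def dataset_to_xml(records: list) -> str:
    """Serialize a list of records into one <products> document."""
    root = ET.Element("products")
    for record in records:
        root.append(record_to_xml(record))
    return ET.tostring(root, encoding="unicode")


# Made-up sample record, for illustration only.
sample = {
    "name": "Example Kettle",
    "price": "24.99",
    "availability": "in_stock",
    "sku": "ABC-123",
    "category": "Kitchen",
    "brand": "ExampleBrand",
    "source": "example-shop",
    "source_url": "https://example.com/kettle",
    "image_urls": ["https://example.com/1.jpg", "https://example.com/2.jpg"],
}
xml_text = dataset_to_xml([sample])
```

Keeping the field list in one place means a requested schema change or a new category would touch only the template, not the crawler itself.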