Content Extraction Service

Client: A popular media website from Brazil

Context: Site-specific crawl and extraction

Challenge:
The client wanted content extracted continuously from Brazilian websites to power their news portal. The list of sources included popular blogs, news sites, forums and a few content bookmarking sites. The required data points were publication date, author name, title, main text content and tags.

Solution:

Once the client provided the list of source websites and data points, our team started work on the project. Because this use case involved news data, the sites had to be crawled frequently so that fresh data sets could be delivered every day. Since each site in the list had a different structure and design, we used site-specific crawl and extraction, with a crawler set up for each source. Once our team finished setting up the web crawlers, the data started flowing in. The data was then cleaned, formatted as XML and uploaded to the client’s Dropbox. More than 300,000 records were delivered per day.
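To illustrate the clean-and-format step described above, here is a minimal sketch in Python of how extracted records could be normalized and serialized to XML. It is not the pipeline used in this project: the `Article` class, the `clean` and `to_xml` helpers, and the XML element names (`records`, `record`, `tags`, `tag`) are all assumptions made for illustration, and only the five data points come from the client's brief.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    """One extracted record with the five data points from the brief.

    The class and field names are hypothetical, chosen for this sketch.
    """
    published: str  # publication date, assumed to be an ISO 8601 string
    author: str
    title: str
    text: str
    tags: List[str] = field(default_factory=list)


def clean(value: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(value.split())


def to_xml(articles: List[Article]) -> str:
    """Serialize cleaned records into one XML document string.

    The element names are assumptions; the real delivery schema is not
    described in the case study.
    """
    root = ET.Element("records")
    for art in articles:
        rec = ET.SubElement(root, "record")
        ET.SubElement(rec, "published").text = clean(art.published)
        ET.SubElement(rec, "author").text = clean(art.author)
        ET.SubElement(rec, "title").text = clean(art.title)
        ET.SubElement(rec, "text").text = clean(art.text)
        tags = ET.SubElement(rec, "tags")
        for tag in art.tags:
            ET.SubElement(tags, "tag").text = clean(tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = Article(
        published="2016-01-01",
        author="  Ana   Silva ",
        title="Example headline",
        text="First line\n   second line",
        tags=["politica", " brasil "],
    )
    print(to_xml([sample]))
```

In a site-specific setup, each crawler would fill in `Article` objects using its own extraction rules for that site, while a shared serializer like `to_xml` keeps the delivered format identical across sources.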

Benefits to the client:
  • Our team handled every technical aspect of the crawling process
  • The initial setup took only 3 days to complete, and the supply of data was consistent
  • Since monitoring was set up for each site, the quality and consistency of the data were top notch
  • Although some of the sites used dynamic coding practices, our tech stack handled them well
  • The client could launch their news portal with the data at short notice
  • The cost incurred was far lower than what an in-house crawling setup would have cost them

