Although a crawling frequency can be specified, the optimal frequency is hard to determine. The problem is that sites may not update as frequently as they are crawled. The result is suboptimal crawling, redundant data, and a negative impact on the target site from frequent, unproductive crawls.
The solution is intelligent adaptive crawling, in which the crawler uses machine learning to identify the pages that are updated more frequently. As a result, crawls run more often on frequently updated pages than on dormant ones. The crawlers adjust themselves automatically, establishing optimal frequencies based on site behavior and changes. They also refine the list of URLs to process and extend the archive with semantic information about the extracted content.
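To make the idea of adapting frequency to observed changes concrete, here is a minimal Python sketch. It does not use machine learning; as a simple stand-in, it halves a page's recrawl interval whenever the content hash has changed since the last visit and lengthens the interval when it has not. The names `PageState` and `update_interval`, the interval bounds, and the scaling factors are all assumptions chosen for illustration, not part of any particular crawler.

```python
import hashlib
from dataclasses import dataclass

# Assumed bounds for illustration only: recrawl no more often than hourly,
# and at least once every 30 days.
MIN_INTERVAL_S = 3600.0
MAX_INTERVAL_S = 30 * 86400.0


@dataclass
class PageState:
    """Per-URL bookkeeping (hypothetical structure)."""
    interval_s: float = 86400.0  # assumed starting interval: one day
    last_hash: str = ""          # empty until the page has been crawled once


def content_hash(body: bytes) -> str:
    """Fingerprint the page body so two crawls can be compared."""
    return hashlib.sha256(body).hexdigest()


def update_interval(state: PageState, body: bytes) -> bool:
    """Record one crawl result and adapt the recrawl interval.

    Returns True if the page changed since the previous crawl.
    The first crawl only stores the hash and leaves the interval alone.
    """
    new_hash = content_hash(body)
    first_visit = state.last_hash == ""
    changed = (not first_visit) and new_hash != state.last_hash

    if changed:
        # Page is active: crawl it more often, down to the lower bound.
        state.interval_s = max(MIN_INTERVAL_S, state.interval_s / 2)
    elif not first_visit:
        # Page is dormant: back off, up to the upper bound.
        state.interval_s = min(MAX_INTERVAL_S, state.interval_s * 1.5)

    state.last_hash = new_hash
    return changed
```

A scheduler would call `update_interval` after each fetch and queue the next visit `state.interval_s` seconds later. A learned model could replace the fixed halving and 1.5x factors with a predicted change rate per page; the surrounding bookkeeping would stay much the same.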