The key is to make your content available to the crawler in two versions: one JS-enabled at an ‘AJAX-style’ URL, and a second at a conventional HTML URL.
It has been a while since Google’s bots started crawling AJAX sites, yet such a site still excludes web crawlers from other engines.
At PromptCloud we solved this problem with simple GET requests, even though AJAX pages typically work through POST requests that a normal bot cannot easily trace.
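As a rough illustration of the idea, a traced POST call can sometimes be replayed as a plain GET when the server also accepts the same parameters in the query string. The function below is a hypothetical sketch of that translation; the endpoint and parameter names are placeholders, not PromptCloud's actual implementation.

```python
from urllib.parse import urlencode

def post_to_get(endpoint, form_params):
    """Translate a traced POST's form parameters into a GET URL.

    This only works when the server accepts the same parameters in
    the query string as in the POST body (many AJAX endpoints do).
    """
    # Sort the parameters so the generated URL is deterministic.
    return f"{endpoint}?{urlencode(sorted(form_params.items()))}"
```

For example, `post_to_get("http://example.com/search", {"q": "shoes", "page": 2})` yields a single GET URL that an ordinary crawler can fetch and revisit.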
From our experience with numerous AJAX sites on the web, we have crossed that technical barrier. But even with the AJAX problem solved, challenges remain when running AJAX crawls.
Some of these include:
Challenge: page content is rendered by client-side JavaScript. Solution: a headless browser that emulates human interaction with the web page, without a graphical interface.
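A minimal sketch of the headless-browser approach, here using Playwright (one of several possible tools; the URL, selector, and function name are placeholders, and a real crawl would add click and scroll steps to mimic a user):

```python
def render_page(url, wait_selector="body"):
    """Render a JS-heavy page in a headless browser and return its HTML.

    Sketch only: assumes Playwright and its Chromium build are installed
    (`pip install playwright && playwright install chromium`).
    """
    # Import inside the function so this module stays importable
    # even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector(wait_selector)  # wait for JS-rendered content
        html = page.content()  # the DOM after scripts have run
        browser.close()
    return html
```

The crawler then parses the returned HTML exactly as it would a conventional static page.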
Challenge: POST requests often return incomplete responses. Solution: allocating higher bandwidth to POST requests to reduce incomplete responses.
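Beyond bandwidth, one defensive pattern (an assumption on our part, not necessarily the method described above) is to validate each response for completeness and retry with backoff. Here `send` is a hypothetical callable wrapping the HTTP client:

```python
import time

def post_with_retry(send, max_attempts=3, backoff=1.0):
    """Retry a POST until the response looks complete.

    `send` is a placeholder callable returning (body, expected_length);
    in a real crawler it would wrap the HTTP client call and read the
    Content-Length header.
    """
    for attempt in range(1, max_attempts + 1):
        body, expected = send()
        if expected is None or len(body) >= expected:
            return body  # response is complete (or length is unknown)
        if attempt < max_attempts:
            time.sleep(backoff * attempt)  # back off before retrying
    raise RuntimeError("response still incomplete after retries")
```

Truncated bodies are retried instead of being parsed as if they were whole, which is where most silent data loss happens.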
Challenge: stateful sessions. Solution: the crawler needs to track View State and pass validation; to ensure nothing breaks down midway, a mechanism is employed to restore saved states.
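The state-restore mechanism can be as simple as periodically checkpointing the session to disk and resuming from the last checkpoint after a failure. A minimal sketch, assuming a JSON file as the store (the file name and state fields are illustrative):

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical checkpoint path

def save_state(state, path=STATE_FILE):
    """Persist the View State token and crawl position."""
    # Write to a temp file first, then rename, so a crash
    # mid-write never corrupts the existing checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_state(path=STATE_FILE):
    """Resume from the last checkpoint, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"viewstate": None, "page": 1}
```

The crawler calls `save_state` after each successfully processed page, so a restart picks up at the last validated position instead of re-crawling from scratch.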
Challenge: strict request and response formats. Solution: requests must be sent in the exact format the server expects (Content-Type or media type, Accept fields, etc.), and responses must likewise be parsed according to their Content-Type.
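A sketch of both halves using only the standard library; the header values shown are typical for JSON AJAX endpoints, but the exact values must be copied from what the page's own JavaScript sends:

```python
import json
import urllib.request

def build_ajax_request(url, payload):
    """Build a POST request in the format the endpoint expects."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json; charset=utf-8")
    req.add_header("Accept", "application/json")
    # Many endpoints check this header to distinguish AJAX calls.
    req.add_header("X-Requested-With", "XMLHttpRequest")
    return req

def parse_response(content_type, raw):
    """Dispatch parsing on the response's Content-Type."""
    if content_type.startswith("application/json"):
        return json.loads(raw)
    return raw.decode("utf-8", errors="replace")
```

A server that receives the wrong Content-Type will often return an error page or an empty body, so getting these headers right is a prerequisite for everything downstream.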
Overall, AJAX crawling requires more compute power in addition to technical expertise. And because there is no uniformity on the web, there is always a new challenge to overcome in this landscape.