FAQs

Questions on pricing, ETAs, SLAs and more.
Q. What do you consider as a record?

One row in your schema constitutes a record.
E.g., if you are extracting data from an e-commerce site, the details of one product (say, product name, URL, ID, price, shipping cost, etc.) would be one record.
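As an illustration (the field names below are hypothetical, not a fixed schema), one such product record could be represented as:

```python
# One record = one row in the agreed schema. The fields shown here are
# illustrative; the actual schema is defined per project.
record = {
    "product_name": "Blue Widget",
    "url": "https://example.com/p/123",
    "product_id": "123",
    "price": 19.99,
    "shipping_cost": 4.50,
}

# However many fields it carries, this still counts as a single record.
print(len(record))
```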

Q. Can you build a mailing list by crawling popular listing sites?

We do not provide data from the entire web; we only crawl specific sites that host the information you're looking to extract. Validating list-building data would require a lot of manual intervention. Moreover, our model is better suited to recurring crawls, whereas list building is more of a one-time data acquisition, so we will not be able to help with this. It is a better idea to buy such data from companies specializing in building lists than to develop a crawler for it.

Q. Once set up, can I initiate a crawl by myself?

Yes, you can initiate crawling from your end via CrawlBoard, the crawl management tool developed exclusively for our clients. You can also set up billing alarms and receive crawl-related notifications.

Q. Do you have ready to use data from popular sources like TripAdvisor, Amazon etc.?

Ours is a custom model where we deliver data based on each requirement, instead of crawling a fixed set of sites and having users subscribe to feeds. However, you can check out our DataStock solution to download pre-crawled web datasets.
We also have free e-commerce, travel and job feed sections, where you can get free data from some popular sites for a predefined number of categories.

Q. Can you crawl sites which require login?

We can extract data that sits behind a login, provided you share the login credentials from your end. However, we will not be able to help if there is a CAPTCHA or the site legally blocks automated login.

Q. Is there any way you could frame a workflow and harvest content (prices and reviews) according to our needs if we provide the sources?

Sure, that is possible. In fact, that was our first offering; we have added more over time.

Q. For the data to be useful to us, the node names in the XML structure would have to be normalized and abstracted. Is this a service you would offer?

Absolutely. That’s exactly what we aim to provide: data in a structure that works best for your business.

Q. Do you have the capability to crawl specific items/products on a site? If so, can you guarantee that you will get all the reviews for the specified item/product? And what's the pricing?

Crawling specific products on a site is possible too. There are two options here: either you provide us with the list of URLs for these products, or we figure out a way to discover these products on the site. Pricing depends on the number of products we’ll crawl for you and is site-based.

Q. Can your platform perform multi-lingual (non-English) crawls too?

Yes. To date, we have crawled sites in German, Danish, Norwegian, Chinese, Japanese, Hebrew, Spanish, French and Finnish.

Q. Are you focused on custom crawling or on offering large standardized crawls? Can we expect to send you specific sites and get a rapid turnaround of customized crawls, or are you selling access to large standardized crawls?

We are focused on custom crawling. We take requirements from the client (the list of sites they would like crawled) and primarily do vertical-specific crawls. We do both deep crawls and incremental crawls, and the turnaround time for the first feed to arrive is less than 2 days. Thereafter, feeds arrive continuously at the specified frequency.

Q. If we provide you with a list of URLs, can you crawl those and deliver in a format we specify?

Yes. We normally discover relevant pages to crawl internally, so if you already have the list, that's even better, as long as the sites involved allow bots.

Q. We would like to validate if our list of URLs are still live. Is that something you can offer?

Definitely. To learn how this works, read our blog post: https://www.promptcloud.com/blog/custom-404-freshness-checker-for-urls/
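As a rough sketch of how such a freshness check can work (a generic illustration, not PromptCloud's actual implementation), a checker can issue a lightweight HEAD request per URL and classify the HTTP status code:

```python
# Sketch of a URL liveness checker. Assumes only standard HTTP status
# semantics; the real service's logic is not public.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def classify_status(code):
    """Map an HTTP status code to a liveness verdict."""
    if 200 <= code < 300:
        return "live"
    # urlopen follows redirects automatically, so a 3xx verdict is rare.
    if code in (301, 302, 307, 308):
        return "redirected"
    return "dead"

def check_url(url, timeout=10):
    """Issue a HEAD request and classify the response."""
    try:
        req = Request(url, method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except HTTPError as e:      # 4xx/5xx responses raise HTTPError
        return classify_status(e.code)
    except URLError:            # DNS failure, refused connection, etc.
        return "unreachable"
```

At scale, the same classification would be run concurrently over the whole URL list rather than one request at a time.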

Q. Can your crawlers perform interaction-based crawling?

Yes, our platform includes multiple crawlers designed to handle various levels of crawling. We regularly perform interactive crawling with ease and accuracy, abstracting all the details away from clients. We can do a demo if you like.

Q. How does your mass scale crawl offering work?

Mass scale crawls are for you if you have a large number of sites to crawl and are interested in high-level details on the web pages. The primary use case of our mass scale crawl offering is social media monitoring. As part of this offering, we crawl thousands of websites (including social media sites like Twitter) to extract relevant data in near real-time. You have the flexibility to provide the sources you’d like us to crawl, the geographies to focus on, and dynamically supplied keywords or phrases to base your crawls on, as well as the schema and format in which you’d like the data delivered. In addition, we take care of end-to-end monitoring, so you only need to download the data from our API (or we can set up FTP) without being involved in any other process.

Q. Do you deliver data only from a set of sites?

Ours is a custom model where we deliver data based on each requirement, instead of crawling a fixed set of sites and having users subscribe to feeds.

Q. Can we specify parameters/filters to extract only the relevant data?

Yes. We take care of any kind of data normalization as long as it can be done programmatically. We also help you query the data using such filters by providing a search API layer (our hosted indexing offering).

Q. Do you offer stand-alone license-based solutions too?

Our product is developed as a typical DaaS and cannot operate on a software license agreement considering the monitoring and overhead costs. Such a solution calls for replicating our entire technology stack for every license.

Q. Can you provide the source code of the crawler you set up?

No. Our model is more on the lines of a managed service using our proprietary platform which is not public. Hence the crawlers we set up are only meant for running on PromptCloud’s DaaS platform.

Q. What formats can you provide the data in?

The most preferred format (both for us and across clients) is XML considering its robustness. We can also do CSV and JSON.

Q. Is using your API the only way to get the data?

No. You can directly download the data from CrawlBoard with a few clicks. Although all data gets uploaded to our API, we can also push it to your AWS S3 account, FTP server or API; in those cases, feeds will arrive in batch mode.

Q. What's the turnaround time like?

That’s one of our SLAs. We work on quick turnarounds from receiving the requirements to uploading the first set of feeds in a structured format. From then on, data is uploaded at the frequency set by the client. For popular forums or group sites, this would typically be multiple times a day, whereas for smaller ones it could be weekly or monthly.

Q. Do you provide raw data, annotated data, or a search interface?

We crawl public data from HTML pages and present the extracted data in a structured format (normally XML). The exact schema is decided beforehand in consultation with the client. We also provide a search component on top of this data; read more about hosted indexing here. We can also apply client-specific normalizations to the data, if any.

Q. What does Advanced Filtering* include?

Deduplication, normalization, keyword-based searches, geographical searches and other behavior-specific searches on the extracted data.

Q. Is there a freemium version of this solution?

We do have free e-commerce, travel and job feed sections, where you can get free data from some popular sites for a predefined number of categories. You can find more information about it here.
However, for data types other than the ones above, we entertain paid PoCs, and the costs are adjusted once you’re on board.

Q. Do you support one-time data collection?

Although our model is most suited for recurring data needs, we do entertain one-time requests once in a while if your requirements excite us.

Q. Can I dynamically add or remove sites?

Yes. New sites will incur their own individual one-time setup fees though.

Q. Can we get the raw HTML?

Yes. We can either provide the HTML content as fields in the data we deliver, or upload the HTML files directly to one of your file-sharing servers (FTP, Amazon S3, etc.) at an additional cost.

Q. Do you offer content extraction? If so, how do you extract and maintain it?

Yes! As per our general model, we set up dedicated crawlers for each target site.

Q. How do you structure your data?

We should generally be able to extract high-level data points from the target sites with a well-defined structure and tags.

Q. What is the maximum frequency you can scrape data at?

Frequency depends on your specific requirements. We can extract data at a frequency ranging from a few minutes to once a month.

Q. Can we get contextual information from the web page?

It varies from site to site. However, we should generally be able to provide the preceding URL from which we discovered the final page URL.

Q. How do you maintain your code in order to deal with website structural changes?

While setting up crawlers, we set up automated checkpoints to monitor structural changes. If a site changes its structure, we are notified and fix the crawlers accordingly.

Q. How do we access data on our side?

The data can be delivered in XML, JSON or CSV format. The default mechanism for delivering data is via our RESTful API. We can also push the data to one of your file sharing servers (FTP, Amazon S3, Dropbox, Box or MS Azure). If you're not very technically inclined, you can simply use the one-click data download option on CrawlBoard.
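For example, consuming one record from a delivered XML feed on your side might look like this (the schema and field names below are purely illustrative):

```python
# Parse one record from a delivered XML feed into a plain dict.
# The <record> structure here is a made-up example, not an actual schema.
import xml.etree.ElementTree as ET

feed = """\
<records>
  <record>
    <name>Blue Widget</name>
    <price>19.99</price>
    <url>https://example.com/p/123</url>
  </record>
</records>
"""

root = ET.fromstring(feed)
for rec in root.findall("record"):
    # Child tag names become dict keys; text content becomes values.
    row = {child.tag: child.text for child in rec}
    print(row)
```

The same loop works unchanged whether the feed arrives via the API, S3, FTP or a CrawlBoard download.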

Q. Do you have an IP rotation service?

Yes, our platform handles IP rotation by default and includes mechanisms to deal with other common blocking issues.

Q. What kind of infrastructure do you offer?

As a client, you'd have access to our portal, CrawlBoard. This would be your centralized portal for technical support, billing and keeping tabs on crawler activities and stats. You'd also be able to schedule ad-hoc crawls for the future. Error handling happens via our ticketing system.

Q. What are your clauses for discontinuing an agreement?

In most cases, we expect you to notify us at least a month in advance so we can release your project-specific resources. Each contract has a specific term with termination and renewal clauses.

Q. Can I crawl any website?

As a crawling company, we respect robots.txt and crawl a site only if bots are allowed in its robots.txt file. If crawling is disallowed in robots.txt, it may be technically feasible but involves legal issues for us as well as our clients. Also, where bots are allowed and we deliver data, it is up to clients to conform to the site's Terms of Service regarding the usage of that data.
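The robots.txt check itself is straightforward; in Python, for instance (with a hypothetical robots.txt body and bot name):

```python
# Checking robots.txt permissions with the standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would fetch the live
# file from https://<site>/robots.txt instead.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyBot", "https://example.com/products/123"))  # True
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
```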

Q. Can you crawl sites that block your IP?

Yes, we have system components in place to overcome IP blocking.

Q. Can you crawl sites that disallow bots?

No. We respect robots.txt and crawl a site only if bots are allowed in its robots.txt file.

Q. Is crawling legal?

Yes. For a more convincing answer, read our blog post here.

Q. What's the billing frequency?

Monthly.

Q. Do you have any referral program?

We have a generous referral program for all of our existing customers. Get up to $100 credit for every friend you successfully refer and use that to pay for our data solutions.

Q. How is the monthly bill calculated based on volumes?

The monthly bill is calculated based on the crawling frequency and data volume. E.g., if we crawl a site weekly and deliver, say, 50,000 records in a month, the cost for this site on the monthly bill would be: $25 (volume fee) + $79 (site maintenance & monitoring fee) = $104 for the month. Note: this assumes a volume fee of $5 per 10k records (prorated) and a monthly site maintenance & monitoring fee of $79 per site.
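The per-site calculation can be sketched as follows (the default rates are the illustrative ones from the example above, not a quote):

```python
def monthly_bill(records, rate_per_10k=5.0, maintenance_fee=79.0):
    """Prorated volume fee plus the flat per-site maintenance &
    monitoring fee. Default rates are illustrative, not a quote."""
    volume_fee = rate_per_10k * records / 10_000
    return volume_fee + maintenance_fee

# E.g. 20,000 records in a month for one site:
print(monthly_bill(20_000))  # 89.0
```

Because the volume fee is prorated, partial 10k blocks are charged fractionally (e.g. 5,000 records add $2.50 rather than a full $5).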

Q. What does the monthly site maintenance & monitoring fee cover?

The monthly maintenance and monitoring fee covers technical support, overheads in maintaining the data pipeline and related infrastructure as well as fixing the crawlers in case a target site undergoes structural changes.

Q. Does the frequency of crawls change pricing?

Yes, our pricing plan is based on crawling frequency. Volume charges may increase based on the number of records we deliver, which may be directly related to the crawl frequency.

Q. Can I dynamically add or remove sites?

Yes, new sites can be added at any time. Newly added sites will, however, incur their own setup fee and the time needed to set up their crawlers.

Q. Do you offer volume discounts?

Yes! In fact, our offerings are also classified based on volumes. We will also be happy to work out attractive discounts in case your monthly data volumes are expected to be in millions.

Q. Is the multiple delivery option free of cost?

No, for customized delivery mechanisms (FTP, Amazon S3, Dropbox or Box), there will be an additional fee of $30 per month. Our default delivery mechanism is via PromptCloud Data API, which is free.

Q. What payment modes do you use?

We accept all the major credit cards.

Q. What are your different pricing schemes?

Our default pricing structure can be found here: https://www.promptcloud.com/pricing. However, we may be able to group similar websites and set up crawlers accordingly. Hence, there is a possibility of offering a different pricing model for you.

Q. Can you give us a demo?

Ours is a custom solution, and we do not have specific software that can be demonstrated. The final deliverable would be data files in a format you specify. The best we can do is share sample data from past projects that are similar in nature.

Q. Can we run a proof of concept to evaluate your offerings?

To provide a proof of concept, we have to set up the crawlers in their entirety, which is a key step in the whole process. Hence, this is a paid engagement. We provide a 30-day paid PoC for a maximum of 2 sites.

Q. How do you differ from other providers?

We deal with large-scale data and operate on a Data as a Service (DaaS) model, so you do not have to be involved in any of the setup or monitoring; we take care of end-to-end data delivery. Our solution has been quite useful for clients who wanted to scale with data and had issues both crawling at that scale and converting unstructured data to structured. Beyond that, we have a pretty low turnaround time, and our setup is capable of uploading data every few minutes from a more active site.

Ready to share your requirements?
 
 
SUBMIT REQUIREMENT