One row in your schema would constitute a record.
Eg: If you are extracting data from an ecommerce site, details about one product (say, product name, url, id, price, shipping cost etc.) would be one record.
As of now, we do not have a start/stop feature that can be initiated from your end. There needs to be a pre-defined frequency associated with the crawls. In some cases, we may be able to work out a solution for triggering crawls from your end, given the process can be programmatically setup.
Ours’ is a custom model where we deliver data depending on each requirement, instead of crawling certain sites and having users subscribe to feeds. If urgent, we may be able to re-process past data (if available) at a nominal price.
Though we do have a Free eCommerce Feed section, wherein you can get free data from Indian ecommerce sites for a predefined number of categories. You can check out more information about it here.
We can extract data that are behind login. We would require the login credentials from the Clients' end. However, we will not be able to help if there is a captcha or the site legally blocks automated login.
Sure that is possible, in fact that was our first offering, over the period we have added more. You can read more about this offering here.
Absolutely. That’s what we would like to provide, give you the data in a structure that works best for your business.
Crawling specific products on a site is possible too. There are 2 options here- either you provide us with the list of URLs for these products or we discover a way to figure these products on the site. Pricing depends on the number of products we’ll crawl for you and is site-based.
Yes. Till date, we have crawled sites in German, Danish, Norwegian, Chinese, Japanese, Hebrew and Spanish, French and Finnish.
We are focused on custom crawling. We take requirements from the client (which is the list of sites that they would like to be crawled) and primarily do vertical specific crawls. We do both deep crawl as well as incremental crawls; and turnaround time for the first feed to arrive is less than 2 days. Thereafter, feeds arrive continuously as per the specified frequency.
Yes. We internally discover relevant pages to crawl. So if you already have the list, that’s even better as long as the sites involved allow bots.
Totally. To know how this works, you can read through our blog post.: https://www.promptcloud.com/blog/custom-404-freshness-checker-for-urls/
Yes, we have multiple crawlers designed into our platform to handle various levels of crawling. We more than often have performed interactive crawling with much ease and accuracy abstracting clients from all details. We can do a demo if you like.
Mass scale crawls are for you if you have numerous number of sites to crawl and interested in very high level details on the web pages. The primary use case of our mass scale crawl offering is social media monitoring. As part of this offering, we crawl thousands of websites (and social media sites like Twitter) to extract relevant data in near real-time. In this case, you have the flexibility of providing sources that you’d like us to crawl, geographies to focus, dynamically provide keywords or phrases to base your crawls on, as well as the schema + format in which you’d like the data delivered. In addition, we take care of end-to-end monitoring and hence you only need to download the data from our API (or we can set up FTP) without you having to be involved in any other process.
Our’s is a custom model where we deliver data depending on each requirement, instead of crawling certain sites and having users subscribe to feeds.
Yes. We take care of any kind of data normalization as long as that can be done programatically. We also help you query the data using such filters by providing a search API layer (hosted indexing offering).
Our product is developed as a typical DaaS and cannot operate on a software license agreement considering the monitoring and overhead costs. Such a solution calls for replicating our entire technology stack for every license.
No. Our model is more on the lines of a managed service using our proprietary platform which is not public. Hence the crawlers we set up are only meant for running on PromptCloud’s DaaS platform.
The most preferred format (both for us and across clients) is XML considering its robustness. We can also do CSV, XLS or JSON.
No. Although all data gets uploaded to our API, we can push it to your AWS S3 accounts, FTP servers or APIs. However, in the latter cases, feeds will arrive in a batch mode.
That’s one of our SLAs. We work on quick turnarounds from receiving the requirements to uploading data in a structured format for the first set of feeds. There on, data gets uploaded as per the frequency given by the client. Usually, for popular forums or group sites, it would ideally be multiple times a day whereas for smaller ones, it could be weekly or monthly.
We crawl public data from the HTML pages and present the extracted data in a structured format (normally XML). Exact schema is decided beforehand in consultation with the client. We also provide a search component on top of this data. Read more about hosted indexing here. We can also add client specific normalizations to data if any.
Deduplication, Normalization, Keyword based searches, Geographical searches and other behavior specified searches from the extracted data.
We do have a Free eCommerce Feed section, wherein you can get free data from Indian ecommerce sites for a predefined number of categories. You can check out more information about it here.
However for other data types apart from the ones provided above, we entertain PoCs (paid) and the costs are adjusted once you’re on board.
Although our model is most suited for recurring data needs, we do entertain one-time requests once in a while if your requirements excite us.
Yes. New sites will incur their own individual one-time setup fees though.
