It is well known that proxy management plays an important part in any web crawling project. For anyone looking to crawl and extract data at a relatively large volume, proxy services have become an absolute must. It is also common for proxy management to consume about as much time as developing the crawlers themselves.
In this post, we list the different types of proxy services and provide the data points you need to select the right proxy service for your web scraping project.
Before moving to proxy services, we must first go through IP addresses and the way they work. According to Wikipedia, an IP address is a numerical label assigned to each device connected to a computer network that uses the Internet Protocol for communication. Generally, IP addresses look like the following: 220.127.116.11
A proxy is a third-party server that lets you send and receive network requests through it, using its IP address instead of yours. Once a proxy is in place, any website you send a request to can only identify the IP address of the proxy server. This enables you to access the data anonymously.
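To make the idea concrete, here is a minimal sketch of routing requests through a proxy using only Python's standard library. The proxy address is a placeholder from a documentation-only IP range and the function name `build_proxied_opener` is made up for illustration; your provider will give you the real host, port, and credentials.

```python
import urllib.request

# Hypothetical proxy endpoint (203.0.113.0/24 is reserved for documentation).
PROXY_URL = "http://203.0.113.10:8080"


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener(PROXY_URL)
# With a working proxy, the target site would see the proxy's IP, not yours:
# response = opener.open("https://example.com", timeout=10)
```

The actual request is left commented out because it only works with a live proxy.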
At this point, the web is moving from IPv4 to IPv6. The switch to IPv6 makes a significantly larger number of IP addresses available, but note that proxy providers still primarily use IPv4.
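To get a feel for the difference in scale, the short sketch below uses Python's built-in `ipaddress` module to compare the two address spaces and to check which version a given address belongs to. The helper name `ip_version` and the sample IPv6 address are illustrative only.

```python
import ipaddress

# Total number of addresses in each protocol's address space.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128


def ip_version(address: str) -> int:
    """Return 4 or 6 depending on the protocol version of the address."""
    return ipaddress.ip_address(address).version


print(ip_version("220.127.116.11"))  # IPv4 address from the example above
print(ip_version("2001:db8::1"))     # IPv6 address from the documentation range
```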
As a web scraping best practice, you should ideally use your company name as the user agent when using a third-party proxy service. This helps site owners get in touch with you if crawling results in unusually high server load or if they wish to ask you not to extract data from their site.
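One way to follow this practice is to set an identifying `User-Agent` header on every request. The sketch below assumes a made-up company ("ExampleCorp"), a made-up info page, and a made-up contact address; replace them with your own details.

```python
import urllib.request

# Hypothetical identity string: bot name, info page, and contact address.
COMPANY_USER_AGENT = "ExampleCorpBot/1.0 (+https://example.com/bot; crawler@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Create a request that identifies the crawler's owner to the site."""
    return urllib.request.Request(url, headers={"User-Agent": COMPANY_USER_AGENT})


request = build_request("https://example.org/catalog")
# urllib normalizes the stored header name to "User-agent".
print(request.get_header("User-agent"))
```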
We have covered the basics of proxy services and their importance in web scraping, so in this section we’ll explore how proxy pools can be used in your web scraping projects.
Just as with using your own system’s IP address, using only one proxy severely limits access to geo-specific data, data volume, and overall crawling efficiency. Hence, you need a proxy pool through which you can make multiple requests; in other words, the traffic load on the site is distributed across different proxies.
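The simplest way to spread traffic over a pool is round-robin rotation. The following is a rough sketch, not a production scheduler: the proxy addresses are placeholders and `assign_proxies` is a name invented for this example.

```python
from itertools import cycle

# Hypothetical pool of proxy endpoints (documentation-only IP range).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    rotation = cycle(proxies)
    return [(url, next(rotation)) for url in urls]


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
for url, proxy in assign_proxies(urls, PROXIES):
    print(url, "via", proxy)
```

With three proxies and five URLs, the fourth request wraps around to the first proxy again, so no single IP carries all the load.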
All of the above-mentioned factors have a large influence on the efficiency of your proxy management system. Obviously, with a low-quality proxy configuration, you will frequently encounter blocking from a website and this would result in poor data quality.
Next, let’s look at the categories of IPs available for a proxy setup.
I’m sure you have already looked at the types of proxies available for your crawling projects. If so, you will have noticed one thing on every proxy provider’s site: they all claim to be the top proxy service without offering any credible reason. This makes it challenging to assess the right proxy vendor.
Let’s now go through the three primary types of IP addresses and provide the positive and negative factors associated with them to help you choose the right proxy type.
Datacenter IPs are the most conventional type of proxy IP. Generally, they are IPs of servers hosted in data centers, and because of their abundance they are the cheapest. You can create a state-of-the-art crawling infrastructure by building a powerful proxy management system on top of datacenter IPs.
As the name suggests, residential IPs belong to a person’s home network. Since this type of IP is very difficult to acquire, it is usually very expensive. Netnut and Luminati are two notable vendors in this space.
In most cases, you can achieve the same outcome with datacenter IPs as with residential IPs, so use residential IPs only when absolutely necessary. Also note that there is a legal angle here, since you would be using a person’s residential IP.
Mobile IPs are the IPs of mobile devices. Since collecting IPs of mobile devices is very complex, they are also quite costly. Hence, choose these proxy services only when you need to crawl data available on the mobile web. As with residential IPs, there is a legal concern here as well, since the mobile phone owner might not be aware that you are using their network for web crawling.
Since resource management is important in any business project, it is advisable to select datacenter IPs (because of their very low cost) along with a solid proxy management system. This allows you to get the same data quality at a much lower cost than with mobile and residential IPs.
Now let’s look at another criterion for proxy services, i.e., anonymity.
Transparent, Anonymous and Elite Proxies.
Essentially, a proxy can send three different headers: `REMOTE_ADDR`, `HTTP_X_FORWARDED_FOR`, and `HTTP_VIA`.
Here, `REMOTE_ADDR` is the one that carries the IP address of the connecting client. It is set even when you are simply browsing the web without a proxy.
Transparent proxy → This proxy type makes your IP visible by transmitting your real IP address in the `HTTP_X_FORWARDED_FOR` header. This means a website that looks beyond `REMOTE_ADDR` and also checks proxy headers will still identify your actual IP address.
The `HTTP_VIA` header is also sent, which reveals that you are using a proxy server.
Anonymous proxy → This proxy type never transmits your actual IP address in the `HTTP_X_FORWARDED_FOR` header; instead, the header carries the IP address of the proxy or is left empty.
The `HTTP_VIA` header in this case also reveals that you are using a proxy service to make access requests.
Elite proxy → An elite proxy sends only `REMOTE_ADDR`; the other headers are empty, so you look like a normal web user who is not using any proxy.
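The three categories above can be expressed as a small check on what the target server receives. This is only a sketch under the definitions given in this post: `classify_proxy` is a hypothetical helper, `server_vars` stands for the variables the server sees, and the IP addresses are placeholders.

```python
def classify_proxy(server_vars: dict, real_ip: str) -> str:
    """Classify a proxy as transparent, anonymous, or elite.

    server_vars: variables seen by the target server for one request.
    real_ip: the client's actual IP address, used to detect leaks.
    """
    forwarded = server_vars.get("HTTP_X_FORWARDED_FOR", "")
    via = server_vars.get("HTTP_VIA", "")
    forwarded_ips = [ip.strip() for ip in forwarded.split(",")]

    if real_ip in forwarded_ips:
        return "transparent"  # the real IP leaks through the forwarding header
    if forwarded or via:
        return "anonymous"    # real IP hidden, but proxy usage is visible
    return "elite"            # only REMOTE_ADDR is present


REAL_IP = "198.51.100.7"   # placeholder client address
PROXY_IP = "203.0.113.10"  # placeholder proxy address

print(classify_proxy({"REMOTE_ADDR": PROXY_IP, "HTTP_X_FORWARDED_FOR": REAL_IP,
                      "HTTP_VIA": "1.1 proxy"}, REAL_IP))
print(classify_proxy({"REMOTE_ADDR": PROXY_IP, "HTTP_VIA": "1.1 proxy"}, REAL_IP))
print(classify_proxy({"REMOTE_ADDR": PROXY_IP}, REAL_IP))
```

In practice you would run this against an endpoint you control that echoes back the received values, one request per proxy.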
At this point, you have a fair understanding of the different types of proxies and their usage depending on the business case. But choosing the right proxy type is only the first step; another important part of the project is actually managing the pool of proxies. This ensures that you don’t get blocked easily and that you get maximum value for your investment.
If your project entails large volume data collection, simply acquiring a proxy pool will not help you build a powerful data acquisition pipeline for the long term. At some point, your proxies will get blocked and your data extraction project will be stopped.
So, you need to factor in the following:
It is fairly easy to manage a pool of around 15 IP addresses, but when you need to scale to hundreds or thousands of proxies, it can become very complex. In this case, there are broadly three options for you. Let’s explore them here:
Here you acquire a pool of private or anonymous proxies, then either adapt existing proxy management software or build a management system from scratch based on your requirements. As you can imagine, this approach costs the least. However, you should opt for it only if you have a dedicated team of engineers who can continuously monitor the system and manage the proxy pool. It is also a good option when you have no leeway in the allocated budget.
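To give a rough idea of what an in-house management layer involves, here is a minimal sketch of a pool that rotates proxies and temporarily benches the ones that appear blocked. Everything in it is an assumption made for illustration: the class name `ProxyPool`, treating HTTP 403 and 429 as block signals, and the failure and cooldown thresholds. A real system would also need session handling, geo targeting, and request throttling.

```python
import time

# Assumption: these status codes are treated as signs of blocking.
BLOCK_STATUS_CODES = {403, 429}


class ProxyPool:
    """Round-robin pool that benches proxies after repeated block responses."""

    def __init__(self, proxies, max_failures=3, cooldown_seconds=300.0,
                 clock=time.monotonic):
        self._order = list(proxies)
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = {proxy: 0 for proxy in self._order}
        self._benched_until = {}
        self._next = 0

    def get(self):
        """Return the next proxy that is not currently benched."""
        now = self._clock()
        for _ in range(len(self._order)):
            proxy = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self._benched_until.get(proxy, float("-inf")) <= now:
                return proxy
        raise RuntimeError("no healthy proxies available")

    def report(self, proxy, status_code):
        """Record the outcome of a request made through the given proxy."""
        if status_code in BLOCK_STATUS_CODES:
            self._failures[proxy] += 1
            if self._failures[proxy] >= self._max_failures:
                # Bench the proxy for the cooldown period, then start fresh.
                self._benched_until[proxy] = self._clock() + self._cooldown
                self._failures[proxy] = 0
        else:
            self._failures[proxy] = 0
```

A crawler would call `pool.get()` before each request and `pool.report(proxy, status)` after it, so blocked proxies rest for a while instead of being retried immediately.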
You can go for a solution in which the vendor that supplies the IP addresses also handles proxy rotation and location-based targeting. This way you can focus on blocking detection, session allocation, and request moderation while the vendor handles the fundamentals of proxy management.
You can completely outsource the project to a fully managed crawling service provider such as PromptCloud, which can take care of end-to-end data acquisition requirements, including complete proxy management. In this case, you only have to focus on applying the data, as the data delivery pipeline is fully handled by the service provider.
As you can see, each of these options has its own positive and negative points, so make the choice based on the nature of the project and your budget.
We have explored various technical facets of the proxy selection according to your project requirements, but legality is another important (yet commonly left out) factor that needs to be considered.
Using a proxy IP to browse a website is legal, but there are some considerations to keep in mind to ensure that you always stay within legal boundaries.
Building a robust proxy system makes you really powerful, but that can lead to some disasters as well. Since you can send millions of requests to a website and collect huge amounts of data, it is easy to end up disrupting the website’s server by consuming its computing resources. This can be catastrophic for the website owner. Hence, you must be very careful about the website’s resources.
As a company that collects data from the web, you must build polite crawlers that respect the robots.txt file and follow its guidelines strictly. This ensures that you are not causing any harm to the site. Apart from that, if the site owner reports that your crawlers are consuming too many resources, you must immediately throttle access or stop data collection, depending on the scenario. As long as you are not going too far with your powerful proxy management system, you should be fine in terms of legality.
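Checking robots.txt before fetching a page can be done with Python's standard `urllib.robotparser`. The sketch below parses a sample file inline so it runs without network access; the bot name and the rules themselves are made up for illustration. In a real crawler you would load the site's actual file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCorpBot"  # hypothetical crawler name

# Sample rules for illustration only; a real site defines its own.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/report")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests
print(allowed, blocked, delay)
```

Honoring the crawl delay between requests, for example with `time.sleep(delay)`, is a simple way to keep the load on the site reasonable even when many proxies are available.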
Also, since GDPR came into force, you need to check whether you have consent from the residential IP owner when you deploy this type of proxy. This is important because, under GDPR, IP addresses are treated as personal data. So, make sure that any EU residential proxies you use are GDPR compliant for web crawling via a home or mobile IP.
This is relatively straightforward when you are using your own residential proxies. But if you purchase them from a vendor, check that the vendor has obtained GDPR-compliant consent for web crawling projects.