The famous Chinese military strategist Sun Tzu once said:
“If you know the enemy and know yourself, you need not fear the result of a hundred battles.”
Although this comes from his famous book “The Art of War”, it applies in many other settings, such as setting up an e-commerce business. Suppose you have the right idea, the right people, the right USP, the right products, and the right prices. That is not enough. You also need enough data on your competition to judge whether your business model can succeed or is doomed from the very beginning. This is why many new companies entering the e-commerce market first scrape Amazon to see how they measure up against the Goliath of e-commerce.
You may be thinking that you could just as well scrape some other website, or scrape several websites to get to know the market better. Consider this: Amazon delivers to almost every corner of the world and has thirteen country-specific websites. They do thorough market research before setting or changing prices. So if your products compare well against their data, you can be fairly confident that your prices and product details are competitive. They also have the largest repository of products in the market, so if you can automate this code with the help of a service provider, you can scrape product data by brand and by category and build your own product database before you set up shop. This can save considerable man-hours and money when you are just starting your business.
In the “How to extract hotel data from travel site” article, we already showed you how to set up the environment. If you are new to Python, just follow those steps; everything remains the same. Install Atom and Python, use pip to install BeautifulSoup (the package name on pip is beautifulsoup4), then copy and paste this program into the editor and save it as amazon_data_extractor.py.
If you have difficulty copying the code, you can also download it from here:
( https://drive.google.com/open?id=1lf8afUELJILffU2FY-nIdzscYe2JaWT0 )
You can download the file and open it in Atom.
#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse
import urllib.error
from bs4 import BeautifulSoup
import ssl
import json

# For ignoring SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Enter Amazon Product Url- ")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
html = soup.prettify('utf-8')
product_json = {}

# This block of code will help extract the brand of the item
for divs in soup.findAll('div', attrs={'class': 'a-box-group'}):
    try:
        product_json['brand'] = divs['data-brand']
        break
    except KeyError:
        # This div has no data-brand attribute; try the next one
        pass

# This block of code will help extract the product title of the item
for spans in soup.findAll('span', attrs={'id': 'productTitle'}):
    name_of_product = spans.text.strip()
    product_json['name'] = name_of_product
    break

# This block of code will help extract the price of the item in dollars
for divs in soup.findAll('div'):
    try:
        price = str(divs['data-asin-price'])
        product_json['price'] = '$' + price
        break
    except KeyError:
        # This div has no data-asin-price attribute; try the next one
        pass

# This block of code will help extract the image URL of the item
for divs in soup.findAll('div', attrs={'id': 'rwImages_hidden'}):
    for img_tag in divs.findAll('img', attrs={'style': 'display:none;'}):
        product_json['img-url'] = img_tag['src']
        break

# This block of code will help extract the average star rating of the product
for i_tags in soup.findAll('i', attrs={'data-hook': 'average-star-rating'}):
    for spans in i_tags.findAll('span', attrs={'class': 'a-icon-alt'}):
        product_json['star-rating'] = spans.text.strip()
        break

# This block of code will help extract the number of customer reviews of the product
for spans in soup.findAll('span', attrs={'id': 'acrCustomerReviewText'}):
    if spans.text:
        review_count = spans.text.strip()
        product_json['customer-reviews-count'] = review_count
        break

# This block of code will help extract top specifications and details of the product
product_json['details'] = []
for ul_tags in soup.findAll('ul', attrs={'class': 'a-unordered-list a-vertical a-spacing-none'}):
    for li_tags in ul_tags.findAll('li'):
        for spans in li_tags.findAll('span', attrs={'class': 'a-list-item'}, text=True, recursive=False):
            product_json['details'].append(spans.text.strip())

# This block of code will help extract the short reviews of the product
product_json['short-reviews'] = []
for a_tags in soup.findAll('a', attrs={'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'}):
    short_review = a_tags.text.strip()
    product_json['short-reviews'].append(short_review)

# This block of code will help extract the long reviews of the product
product_json['long-reviews'] = []
for divs in soup.findAll('div', attrs={'data-hook': 'review-collapsed'}):
    long_review = divs.text.strip()
    product_json['long-reviews'].append(long_review)

# Saving the scraped html file
with open('output_file.html', 'wb') as file:
    file.write(html)

# Saving the scraped data in json format
with open('product.json', 'w') as outfile:
    json.dump(product_json, outfile, indent=4)

print('----------Extraction of data is complete. Check json file.----------')
Let me explain with an example. I will show you an Amazon product page with reviews, and what the program returns when the page's URL is fed to it.
Let’s take this Acer Chromebook:
Fig:- Amazon product page
When you run the program, it will print “Enter Amazon Product Url- ”
When it does, paste the URL of the product page shown above. The following JSON will then be generated as product.json in your current directory.
You can compare your generated JSON against this file:
( https://drive.google.com/open?id=1hOTaEufBiS1sxrLhQ3kmdb2MyukP3ZoL )
{
    "brand": "Acer",
    "name": "2018 Newest Acer 14-inch HD Chromebook LED Anti-glare Display, Intel Dual-Core Celeron 3855u 1.6GHz processor, 4GB RAM, 16GB SSD, HDMI, USB 3.0, Webcam, 802.11a Wifi, Bluetooth, Google Chrome OS",
    "price": "$229.00",
    "img-url": "https://images-na.ssl-images-amazon.com/images/I/41nlp137qeL._SX300_QL70_.jpg",
    "star-rating": "4.2 out of 5 stars",
    "customer-reviews-count": "79 customer reviews",
    "details": [
        "14\" Anti-Glare HD WLED Backlit (1366x768) Display with Acer ComfyView Technology, Built-in media reader",
        "Intel dual-core Skylake Celeron 3855U 1.60 GHz processor 2M Cache, Intel HD Graphics 510, Built-in HD webcam with microphone",
        "4GB LPDDR3 Memory, 16 GB eMMC Flash Memory, Built-in cloud support-easily save your files to your Google Drive account for secure access wherever you go",
        "High Speed 802.11a WiFi, Bluetooth, HDMI, 2x USB 3.0, 1x USB 3.1 Type-C, 1 x Headphone/Microphone Combo Jack",
        "Google Chrome OS, Up to 10 hours Battery life, Color: Black"
    ],
    "short-reviews": [
        "Best and safer computer to surf the Web and Watch Videos of all kinds.",
        "... only had this for a couple days but I love it. I went from a macbook to this ...",
        "Great for school work....and Netflix",
        "This is NOT a 2018 Newest Acer - manufacture date is 08/2016",
        "AWESOME PRODUCT...SIMPLE TO USE",
        "I love this laptop",
        "... it is the biggest screen - Mom seems to love it.",
        "easy to use"
    ],
    "long-reviews": [
        "I already have an 11.5\" Acer Chromebook, bought few years ago and based on the success of this first one, I decided to go ahead and buy another one of a larger size. I just LOVE, LOVE my two Acer Chromebooks. In the past I made the mistake of buying regular Microsoft Windows type of computer, and few of them still have at home, but after discovering the Acer Chromebook, I must say, I wish I knew earlier about the superiority of a Chromebook over a Windows, Microsoft product....I do not want to open a Microsoft Windows type of computer war vs an Acer Chromebook here....I am just saying...if you are happy with your Microsoft Windows product, I am not criticizing your purchase, what I am saying is that having both products at home...I Love the the simplicity of the Acer Chromebook over the \"OTHERS\"....I am fully aware that a Microsoft Windows type of computer can do other things that a Chromebook cannot do, but if you are just surfing the web and consuming much web media, Facebook, or Yahoo Mail, YouTube, Netflix....Nothing beats the Acer Chromebook for reliability and speed. Acer Chromebook turns on at the speed of light, and no viruses of any kind to worry about. Also, it does not slow down when updating the system like Microsoft Windows constantly does....Yes, at 15.6 inches screen, it is indeed a little jewel adn the price was very good....Thank You Amazon. And Thank You ACER.",
        "I have only had this for a couple days but I love it. I went from a macbook to this and expected to be disappointed. I didn't have to adjust any of the settings or anything with this laptop. I logged into my gmail and everything was perfect. So far I have no complaints. I can't believe how inexpensive it was.",
        "Had it for over a month now. It's fast, easy-to-use machine with a large screen. Bought it for my daughter to use for homework. Our school, like many use Google Docs for much of the work that is done online. A chromebook is all they need - no need for Microsoft! Of course my daughter likes the large screen for watching Netflix when the homework is completed!",
        "NOT a 2018 Chromebook as stated in the title. I was expecting a Chromebook manufactured in late 2017 or 2018. This particular Chromebook was manufactured in August of 2016. Very misleading title......definitely not the \"2018 Newest\". The Chromebook does seem to be pretty decent though.",
        "I was tired of Windows 10 faulty updates, and needed a reliable computer. The Acer was recommended to me by several friends. I got it last week and love it. Easy to set up and understand, no heat from hard drive, quick, less hassle, nice quality, etc. I still have my Dell with Windows 10, but for right now this Acer Chromebook is my first choice. It takes a little while getting used to a keyboard with fewer keys, and more closely spaced together, but I have no other qualms about this Chromebook. I am learning more about the Chromebook daily, and appreciate its simplicity and no need for added security. Excellent value, and performance. I was pleased with the Acer's price and features. Easy to hook up to Ethernet with adapter if needed.",
        "I love this laptop! While I am used to Macbooks, my beloved recently died. I was not on the market for another macbook (not by choice) and needed a new laptop quick. I did some quick research and saw good reviews on this laptop. I ordered on Amazon prime and received within 3 days. I'm impressed! Chomebook has made my life extremely easy. Everything is connected! Something about the keyboard makes it easy to type. I LOVE IT!",
        "A little heavy but it is the biggest screen - Mom seems to love it... no more desktop screen clutter... Thanks",
        "bought for an elderly relative. easy to use!"
    ]
}
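The scraped values are all plain strings, which makes them awkward to compare with your own prices. As a rough sketch (not part of the original program), the hypothetical helpers below show one way to turn fields such as "$229.00" or "4.2 out of 5 stars" into numbers. The function names and the new keys (price-value, rating-value, review-count-value) are assumptions made purely for illustration.

```python
import re


def parse_price(text):
    """Turn a string such as '$229.00' into a float; None if no number is found."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def parse_rating(text):
    """Take the leading number from a string such as '4.2 out of 5 stars'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_review_count(text):
    """Take the leading integer from a string such as '79 customer reviews'."""
    match = re.match(r"\s*(\d[\d,]*)", text)
    return int(match.group(1).replace(",", "")) if match else None


def normalize(product):
    """Return a copy of the scraped dict with numeric fields added where present."""
    result = dict(product)
    if "price" in product:
        result["price-value"] = parse_price(product["price"])
    if "star-rating" in product:
        result["rating-value"] = parse_rating(product["star-rating"])
    if "customer-reviews-count" in product:
        result["review-count-value"] = parse_review_count(
            product["customer-reviews-count"]
        )
    return result
```

You could call normalize() on the dictionary loaded from product.json. The original string fields are left untouched, so nothing the scraper produced is lost.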
You will also see that we have saved the scraped HTML page as output_file.html in the same working directory. Here is a link to the scraped HTML document for this particular product page:
( https://drive.google.com/open?id=1KT8WR_pBJMykkocpJX7PMF-1lfIt7Qu4 )
You can try scraping data from this HTML using BeautifulSoup.
You can also try some more product URLs to see what JSON they produce.
Each data point will appear in the JSON under its own label. One or two might be missing if the product does not have them, or if the scraper is unable to locate them.
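Since a data point can occasionally be missing, it may help to check each generated file before relying on it. The sketch below is one possible way to do that, assuming the key names used by the program above; the helper names missing_fields and load_product are hypothetical and not part of the original script.

```python
import json

# Keys the scraper above tries to fill in for every product page
EXPECTED_KEYS = [
    "brand",
    "name",
    "price",
    "img-url",
    "star-rating",
    "customer-reviews-count",
    "details",
    "short-reviews",
    "long-reviews",
]


def missing_fields(product):
    """Return the expected keys that are absent or empty in the scraped dict."""
    return [key for key in EXPECTED_KEYS if not product.get(key)]


def load_product(path="product.json"):
    """Load the JSON file written by the scraper."""
    with open(path, encoding="utf-8") as infile:
        return json.load(infile)


# Example use, once product.json exists in the working directory:
#     print(missing_fields(load_product()))
```

An empty list counts as missing here, because the program always creates the details and review lists even when it finds nothing to put in them.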
Fig:- HTTP error 503
If you run into HTTP error 503, note that it is a server-side error. In this case, though, it most likely means that Amazon is blocking your attempts to scrape data. So when scraping large amounts of data from Amazon, it is advisable to get help from experts such as PromptCloud, who have been working in this industry and can set up a system that automatically scrapes all the data you need, so that you can focus on your business without worrying about the data.
Five e-commerce sites are setting up shop, while ten are closing, every single day. If you do not follow a strategy of continuously comparing your data with your competitors’, you will end up far behind. It is better to follow some basic principles and grow slowly than to grow fast by giving huge discounts with VC money, and then fall with a thud.
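For an occasional 503 on a single page, waiting and trying again is sometimes enough. The following is only a minimal sketch of that idea using urllib from the standard library; fetch_with_retry and its parameters are hypothetical names, not part of the original program, and it does nothing to get around deliberate blocking.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, retries=3,
                     delay=2.0, sleep=time.sleep):
    """Fetch url and return the body as bytes, retrying on HTTP 503.

    The wait doubles after each failed attempt. The last HTTPError is
    re-raised if every attempt fails. retries is assumed to be at least 1.
    """
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Only 503 is treated as temporary; anything else is re-raised at once
            if err.code != 503 or attempt == retries - 1:
                raise
            sleep(delay * (2 ** attempt))
```

The opener and sleep parameters exist only so the function can be tried out without touching the network. In the program above, you would pass a function that calls urllib.request.urlopen with the same context argument.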
Disclaimer: The code provided in this tutorial is only for learning purposes. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.