Web Scraping Examples

August 19, 2019 | Category: Blog
Last Updated on by Tarun

Data has become a key component of every company’s growth strategy. It comes in many types: public data, user data, competitor data, stock market data, product data, job data, weather forecasts, and more. However, analyzing data, building machine learning models, and finding patterns that would otherwise remain invisible are the third or fourth steps in a process that begins with collecting the data.

When it comes to collecting data, there are many sources available, from newspapers and websites to internal databases and books. However, sources that require manual collection are difficult to use for two reasons: the chance of mistakes is higher, and the process is very slow. Also, the only way to increase your data collection capacity is to put more people to work on it.

A better way to collect data in today’s ecosystem is to leverage the internet, that is, to crawl data off the web. This has multiple benefits. Data on the web is refreshed from time to time, so if you refresh your scraped data as well, you will always have the latest figures. The internet is also an ever-growing repository of data, so you do not need to look for new data sources every day. Once you have set up a system to crawl data from certain sites and use the scraped data within your business workflow, you can keep using the same system for many years. Hence, today we will discuss some of the top web scraping examples, as well as some of the code-based DIY solutions provided by our team at PromptCloud.

Real estate data

Real estate data is among the most sought-after data in the world. Most machine learning books and courses start with a set of houses, their details, and their prices to teach linear regression before moving on to more complex ML models. Some of the top real estate websites in the US contain millions of records of homes, both on and off the market. They even contain rental prices, estimates of future home prices, and more. We scraped data from leading sites, and the JSON records below show the multiple data points we captured.

Example 1

[code language="python"]
{
  "description": "327 101st St #1A, Brooklyn, NY is a 3 bed, 3 bath, 1302 sq ft home in foreclosure. Sign in to Trulia to receive all foreclosure information.",
  "link": "https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215",
  "price": {
    "amount": "510000",
    "currency": "USD"
  },
  "broad-description": "Very Large Duplex Unit with 1st floor featuring a Finished Recreational Room, an Entertainment Room and a Half Bathroom. Second Level Features 2 Bedrooms, 2 Full Bathrooms, a Living Room/Dining Room and an Outdoor Space. There is Verrazano Bridge views.\n View our Foreclosure Guides",
  "overview": [
    "Condo",
    "3 Beds",
    "3 Baths",
    "Built in 2006",
    "5 days on Trulia",
    "1,302 sqft",
    "$392/sqft",
    "143 views"
  ]
}
[/code]

Example 2

[code language="python"]
{
  "Details_Broad": {
    "Number of Rooms": 4,
    "Floor Size (in sqft)": "1,728"
  },
  "Address": {
    "Street": "638 Grant Ave",
    "Locality": "North baldwin",
    "Region": "NY",
    "Postal Code": "11510"
  },
  "Title": "638 Grant Ave, North Baldwin, NY 11510 | MLS #3137924 | Zillow",
  "Detail_Short": "638 Grant Ave , North baldwin, NY 11510-1332 is a single-family home listed for-sale at $299,000. The 1,728 sq. ft. home is a 4 bed, 2.0 bath property. Find 31 photos of the 638 Grant Ave home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 3137924",
  "Price in $": 299000,
  "Image": "https://photos.zillowstatic.com/p_h/ISzz1p7wk4ktye1000000000.jpg"
}
[/code]
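The two records above use different field names for the same facts, so a common next step is to map them onto one schema. The snippet below is a minimal sketch of that idea, not the code we used: the output keys (`price_usd`, `sqft`, `price_per_sqft`) and the helper names are assumptions made for illustration, and only a subset of each record is reproduced.

```python
import re

# Subsets of the two example records shown above
trulia_record = {
    "price": {"amount": "510000", "currency": "USD"},
    "overview": ["Condo", "3 Beds", "3 Baths", "Built in 2006",
                 "5 days on Trulia", "1,302 sqft", "$392/sqft", "143 views"],
}
zillow_record = {
    "Details_Broad": {"Number of Rooms": 4, "Floor Size (in sqft)": "1,728"},
    "Price in $": 299000,
}


def to_int(text):
    """Keep only the digits, e.g. '1,302 sqft' -> 1302."""
    return int(re.sub(r"[^\d]", "", text))


def build_row(price, sqft):
    """Hypothetical common schema shared by both sources."""
    return {"price_usd": price, "sqft": sqft, "price_per_sqft": round(price / sqft)}


def normalize_trulia(record):
    # The floor size is the overview entry that ends in ' sqft' (not '/sqft')
    sqft_entry = next(item for item in record["overview"] if item.endswith(" sqft"))
    return build_row(int(record["price"]["amount"]), to_int(sqft_entry))


def normalize_zillow(record):
    sqft = to_int(record["Details_Broad"]["Floor Size (in sqft)"])
    return build_row(record["Price in $"], sqft)


rows = [normalize_trulia(trulia_record), normalize_zillow(zillow_record)]
```

For the first record, the computed price per square foot rounds to 392, which agrees with the `$392/sqft` entry the site itself displays.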

Hotel data from top travel portals

Hotel booking websites contain a ton of data, such as prices, reviews, ratings, the number of people who rated each hotel, and more. Some time back, we showed in an earlier article how to crawl data from the largest hotel review and booking company.

Using the HTML parsing library called Beautiful Soup, we were able to crawl multiple data points. Using the small piece of code given below, you can hit the website, get the HTML content and convert it to a Beautiful Soup object. Once this is done, parsing the object and finding specific data points in specific tags that have certain attributes is a simple task.

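[code language="python"]
import ssl
import urllib.request
import warnings

from bs4 import BeautifulSoup

warnings.simplefilter("ignore")
# Ignore SSL certificate errors (this disables certificate verification)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Enter Hotel Url - ")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Pretty-printed, UTF-8 encoded copy of the page, and a dict for the scraped fields
html = soup.prettify("utf-8")
hotel_json = {}
[/code]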

Code to get the HTML content of a web-page and convert it into a Beautiful Soup object
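To illustrate the parsing step, here is a minimal sketch of finding data points in tags with specific attributes. It runs against an inline HTML string rather than a live page; the markup, the `id` and `class` values, and the `parse_hotel` helper are all invented for this example and do not reflect any real hotel site's layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real pages use their own tags and attribute values
SAMPLE_HTML = """
<div class="hotel">
  <h1 id="hotel-name">Example Hotel</h1>
  <span class="rating" data-score="4.5">4.5 of 5</span>
  <span class="review-count">1,234 reviews</span>
</div>
"""


def parse_hotel(html):
    soup = BeautifulSoup(html, "html.parser")

    # find() returns the first matching tag, or None if nothing matches
    name = soup.find("h1", attrs={"id": "hotel-name"})
    rating = soup.find("span", attrs={"class": "rating"})
    reviews = soup.find("span", attrs={"class": "review-count"})

    review_count = None
    if reviews is not None:
        # '1,234 reviews' -> 1234
        review_count = int(reviews.get_text(strip=True).split()[0].replace(",", ""))

    return {
        "name": name.get_text(strip=True) if name is not None else None,
        "rating": float(rating["data-score"]) if rating is not None else None,
        "review_count": review_count,
    }


hotel_json = parse_hotel(SAMPLE_HTML)
```

Checking each result for `None` before reading it keeps the scraper from crashing when a page is missing one of the expected tags.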

Social media data

One of the biggest sources of user data is social media. Whether you want to check if people like a particular song, a movie or a company, social media data can help you understand user sentiments as well as keep track of your company’s public reputation. At PromptCloud, we have scraped data from Twitter™️, Instagram™️, and even Youtube™️. The data points in all three were different. For example, from Instagram, we scraped data like this.

[code language="python"]
User: Ariana Grande (@arianagrande)
Followers: 130.5m
Following: 1,348
Posts: 3,669
[/code]

Data scraped from Instagram accounts

However, the data points that we scraped from Youtube™️ were entirely different. An example is the data scraped from the page of a famous song that itself sparked an online challenge.

[code language="python"]
{
  "TITLE": "Drake – In My Feelings (Lyrics, Audio) \"Kiki Do you love me\"",
  "CHANNEL_NAME": "Special Unity",
  "NUMBER_OF_VIEWS": "278,121,686 views",
  "LIKES": "2,407,688",
  "DISLIKES": "114,933",
  "NUMBER_OF_SUBSCRIPTIONS": "614K",
  "HASH_TAGS": [
    "#InMyFeelings",
    "#Drake",
    "#Scorpion"
  ]
}
[/code]

Data scraped from Youtube™️ pages
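The counts in both social media examples arrive as display strings ("130.5m", "614K", "278,121,686 views"), so they usually need converting to numbers before analysis. The following is a small sketch of one way to do that; `parse_count` and its suffix table are our own illustrative names, and it assumes the count is always the first token of the string.

```python
# Multipliers for the abbreviated suffixes seen in the scraped values
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text):
    """Convert a display count such as '614K' or '278,121,686 views' to an int."""
    # Take the first token and drop thousands separators
    token = text.strip().split()[0].replace(",", "")
    suffix = token[-1].lower()
    if suffix in _MULTIPLIERS:
        return int(round(float(token[:-1]) * _MULTIPLIERS[suffix]))
    return int(token)


views = parse_count("278,121,686 views")
subscribers = parse_count("614K")
followers = parse_count("130.5m")
```

Note that abbreviated values such as "614K" are already rounded by the site, so the converted number is only as precise as what the page displayed.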

For Twitter, note that we needed a developer account, and that for each account we could crawl only that user’s most recent 3,240 tweets. As you can see, different web scraping examples call for different approaches and produce different outcomes.

Song lyrics from sites like Genius™️

People have been collecting song lyrics since time immemorial. The only difference is that now you can crawl song lyrics in a few seconds with a piece of code, instead of spending minutes or hours doing it manually. One such example is an article where we showed how to crawl song lyrics and other related data from a popular music website called Genius.

Since the website contains a lot more than just song lyrics, we were able to capture data points like comments, titles and date of release too.

Stock market data, such as from Yahoo™️ Finance

Stock market data is a huge repository that is usually analyzed by people who study the market and decide where to place their bets. Both current and historical data are valuable. One website that can be scraped quite easily for stock information about different companies is Yahoo Finance. Stock information means more than the current stock price: we were able to crawl many other data points using this process.

These are the data points we scraped for Apple™️

[code language="python"]
{
  "PRESENT_VALUE": "198.87",
  "PRESENT_GROWTH": "-0.08 (-0.04%)",
  "OTHER_DETAILS": {
    "PREV_CLOSE": "198.95",
    "OPEN": "199.20",
    "BID": "198.91 x 800",
    "ASK": "198.99 x 1000",
    "TD_VOLUME": "27,760,668",
    "AVERAGE_VOLUME_3MONTH": "28,641,896",
    "MARKET_CAP": "937.728B",
    "BETA_3Y": "0.91",
    "PE_RATIO": "16.41",
    "EPS_RATIO": "12.12",
    "EARNINGS_DATE": [
      "30 Apr 2019"
    ],
    "DIVIDEND_AND_YIELD": "2.92 (1.50%)",
    "EX_DIVIDEND_DATE": "2019-02-08",
    "ONE_YEAR_TARGET_PRICE": "193.12"
  }
}
[/code]
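Every value in the record above is a string, so a typical follow-up is to convert the fields to numbers and sanity-check them against each other. The sketch below does this for a few fields; the helper names are ours, the record is flattened for brevity, and the accepted formats are assumed from this one example rather than from any specification.

```python
import re
from decimal import Decimal

# A few fields from the Apple record above, flattened for brevity
quote = {
    "PRESENT_VALUE": "198.87",
    "PRESENT_GROWTH": "-0.08 (-0.04%)",
    "PREV_CLOSE": "198.95",
    "MARKET_CAP": "937.728B",
}

# Matches strings shaped like '-0.08 (-0.04%)'
_GROWTH = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*\(([+-]?\d+(?:\.\d+)?)%\)")
_SCALE = {"M": 10**6, "B": 10**9, "T": 10**12}


def parse_growth(text):
    """Return (absolute change, percent change) as Decimals."""
    match = _GROWTH.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected growth format: {text!r}")
    return Decimal(match.group(1)), Decimal(match.group(2))


def parse_market_cap(text):
    """Convert an abbreviated value such as '937.728B' to a whole number."""
    text = text.strip()
    scale = _SCALE.get(text[-1].upper())
    if scale is None:
        return int(Decimal(text.replace(",", "")))
    return int(Decimal(text[:-1]) * scale)


change, percent = parse_growth(quote["PRESENT_GROWTH"])
market_cap = parse_market_cap(quote["MARKET_CAP"])

# Sanity check: the reported change should equal present value minus previous close
computed_change = Decimal(quote["PRESENT_VALUE"]) - Decimal(quote["PREV_CLOSE"])
```

`Decimal` is used instead of `float` so that a comparison such as 198.87 minus 198.95 against the reported -0.08 is exact rather than subject to binary rounding.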

Product, pricing, and review data from eCommerce websites

For information on different products and their current market prices, there is no better place to gather data than big eCommerce companies like Amazon™️. While Amazon™️ has different page layouts across categories, subcategories, and even regions of the world, you can safely crawl a small amount of data across a limited set of categories, as we showed on a page where we scraped product data and pricing information.

Using the code, you can extract the price of an item and its top features. Once the list of links you need to crawl regularly is ready, you can run your code at a set frequency. This way you can keep track of price changes for each item and take advantage of them.
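The price-tracking idea can be sketched as follows. This is only an illustration under stated assumptions: `record_price` is a hypothetical helper, the prices are made-up values standing in for whatever your scraper returns on each run, and the scheduling itself (for example a cron job) is left out.

```python
from datetime import datetime, timezone


def record_price(history, price, observed_at=None):
    """Append a price observation and return the change since the previous one.

    Returns None for the first observation, when there is nothing to compare with.
    """
    if observed_at is None:
        observed_at = datetime.now(timezone.utc)
    previous = history[-1]["price"] if history else None
    history.append({"observed_at": observed_at, "price": price})
    if previous is None:
        return None
    return round(price - previous, 2)


# Each call stands in for one scheduled scraper run (prices are made up)
history = []
first_change = record_price(history, 19.99)
second_change = record_price(history, 17.49)

if second_change is not None and second_change < 0:
    print(f"Price dropped by {abs(second_change):.2f}")
```

Keeping the full history, rather than only the latest price, also lets you look at longer-term trends later.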

News data from websites like BBC, New York Times, Al Jazeera

News aggregators are in high demand today. They are one of the best web scraping examples that directly help users increase their productivity. People no longer have time to go through newspapers or even entire webpages. So what do news aggregators do differently?

  • News aggregators gather news and show only a line or two explaining a news article in brief. In case you want to know more, you can click on a link and they would direct you to an actual news webpage.
  • They aggregate news articles from big news agencies like the BBC™️ and the New York Times™️ and often this helps in providing you with a fuller picture with more details.
  • With time the app ascertains your likes and dislikes and presents you with news articles depending on your past usage.

These are some of the things that set news aggregators apart, and yet the first step in all of them is aggregating the data, which often just means scraping news articles from different websites.
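The first behavior in the list above, showing a line or two plus a link to the full story, can be sketched like this. The article records, source names, and URLs are invented for illustration, and the sentence splitting is a naive regular expression, not what a production aggregator would necessarily use.

```python
import re


def make_teaser(article, max_sentences=2):
    """Reduce a scraped article to a headline, a short summary, and a link."""
    # Naive split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", article["body"].strip())
    return {
        "source": article["source"],
        "headline": article["headline"],
        "summary": " ".join(sentences[:max_sentences]),
        "link": article["url"],
    }


# Hypothetical scraped articles from two different sources
articles = [
    {
        "source": "Example News A",
        "headline": "City opens new bridge",
        "url": "https://example.com/a/bridge",
        "body": "The bridge opened on Monday. Traffic is expected to ease. "
                "Officials plan a ceremony next week.",
    },
    {
        "source": "Example News B",
        "headline": "Bridge opening draws crowds",
        "url": "https://example.com/b/bridge",
        "body": "Hundreds gathered for the opening.",
    },
]

feed = [make_teaser(article) for article in articles]
```

Carrying the `source` field through to the feed is what later makes it possible to show the same story from several agencies side by side.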

Scraping Job Data

Recruiting is one industry that, like real estate, has received a huge boost from web scraping and the internet boom. These days, you can crawl job listings from both company websites and popular internet-based job boards, and then use the collected data to boost your business. Whether you are a recruitment firm or a consultancy, or you run a job board yourself, scraping job data is a must. One of our many web scraping solutions, JobsPikr, makes it very simple to get updated job listings to run your business. It is a completely autonomous job discovery tool that can fetch you fresh job listings using filters such as title, location, post and more.
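To make the idea of filtering listings by title and location concrete, here is a generic sketch over already-scraped records. It is not JobsPikr's API or implementation; the field names, the sample listings, and the `filter_jobs` helper are assumptions made purely for illustration.

```python
def filter_jobs(listings, title=None, location=None):
    """Return listings whose title contains `title` and whose location equals `location`.

    Both filters are optional and case-insensitive.
    """
    def matches(job):
        if title and title.lower() not in job["title"].lower():
            return False
        if location and location.lower() != job["location"].lower():
            return False
        return True

    return [job for job in listings if matches(job)]


# Hypothetical scraped listings
listings = [
    {"title": "Senior Data Engineer", "location": "New York", "company": "Example Corp"},
    {"title": "Data Analyst", "location": "Austin", "company": "Sample Inc"},
    {"title": "Frontend Developer", "location": "New York", "company": "Demo LLC"},
]

data_jobs = filter_jobs(listings, title="data")
ny_data_jobs = filter_jobs(listings, title="data", location="new york")
```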

Scraping image and textual data required for research

Research projects that involve machine learning models require a huge amount of data. Even to train a computer to tell a picture of a dog from a picture of a cat, you would need thousands of pictures of dogs and cats. Such data requirements are met through web scraping solutions, and scientists today crawl Google Images and other image sources to get images for their projects. I used Twitter data to gather images that were uploaded to the social media site during a flood. I was trying to separate images that were related to the flood from those that were not.

For content creation

Companies need to build high-quality content on a regular basis to increase visibility, educate customers, build a brand, and boost sales. Scraping content on the internet helps marketing and advertising folks get better ideas, brainstorm, and come up with new ways of attracting customers and increasing sales.

While we have covered only some web scraping examples here, the possibilities are endless, and web scraping can be put to use by different businesses in different scenarios. At the end of the day, it helps make processes and decisions smarter using the power of data.


© Promptcloud 2009-2020 / All rights reserved.