Web Scraping Examples | Web Data Mining

Abhisek Roy

August 19, 2019
Blog, Web Scraping

Data has become a key component of the growth strategy for every company. Plenty of sources are available for collecting it, but collecting data manually is difficult for two reasons: a) it is error-prone, and b) it is time-consuming. A better way to collect data is to crawl it off the web, in short, web scraping. Once you have set up a system to crawl data from certain sites and feed the scraped data into your business workflow, you can keep using the same system for many years. Today we will discuss some of the top web scraping examples we have come across at PromptCloud.

Scraping Real Estate Data Using Python

Real estate data is among the most sought-after data in the world. Most machine learning books or courses start with a set of houses, their details, and their prices to teach linear regression before moving on to more complex ML models. Some of the top real estate websites across the US contain millions of records of homes, both on and off the market. They even contain rental prices, estimates of what homes will be worth after some years, and more. We scraped the data from leading sites, and you can check the links along with the JSON records below, which hold multiple data points.

Example 1

[code language="python"]
{
  "description": "327 101st St #1A, Brooklyn, NY is a 3 bed, 3 bath, 1302 sq ft home in foreclosure. Sign in to Trulia to receive all foreclosure information.",
  "link": "https://www.trulia.com/p/ny/brooklyn/327-101st-st-1a-brooklyn-ny-11209--2180131215",
  "price": {
    "amount": "510000",
    "currency": "USD"
  },
  "broad-description": "Very Large Duplex Unit with 1st floor featuring a Finished Recreational Room, an Entertainment Room and a Half Bathroom. Second Level Features 2 Bedrooms, 2 Full Bathrooms, a Living Room/Dining Room and an Outdoor Space. There is Verrazano Bridge views.\n View our Foreclosure Guides",
  "overview": [
    "Condo",
    "3 Beds",
    "3 Baths",
    "Built in 2006",
    "5 days on Trulia",
    "1,302 sqft",
    "$392/sqft",
    "143 views"
  ]
}
[/code]

Example 2

[code language="python"]
{
  "Details_Broad": {
    "Number of Rooms": 4,
    "Floor Size (in sqft)": "1,728"
  },
  "Address": {
    "Street": "638 Grant Ave",
    "Locality": "North baldwin",
    "Region": "NY",
    "Postal Code": "11510"
  },
  "Title": "638 Grant Ave, North Baldwin, NY 11510 | MLS #3137924 | Zillow",
  "Detail_Short": "638 Grant Ave , North baldwin, NY 11510-1332 is a single-family home listed for-sale at $299,000. The 1,728 sq. ft. home is a 4 bed, 2.0 bath property. Find 31 photos of the 638 Grant Ave home on Zillow. View more property details, sales history and Zestimate data on Zillow. MLS # 3137924",
  "Price in $": 299000,
  "Image": "https://photos.zillowstatic.com/p_h/ISzz1p7wk4ktye1000000000.jpg"
}
[/code]
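
As a rough illustration of what can be done with records like these once they are scraped, the sketch below pulls the floor area out of the `overview` list of Example 1 and derives a price per square foot. The helper name `price_per_sqft` is hypothetical, and the parsing assumes the Example 1 layout only; records from other sites (such as Example 2) would need their own field handling.

```python
import json
import re

# A trimmed copy of Example 1; only the fields used below are kept.
record = json.loads("""
{
  "price": {"amount": "510000", "currency": "USD"},
  "overview": ["Condo", "3 Beds", "3 Baths", "Built in 2006", "1,302 sqft", "$392/sqft"]
}
""")


def price_per_sqft(listing):
    """Return the price per square foot, or None if no floor area is listed.

    Hypothetical helper: assumes the Example 1 layout, where the floor area
    appears in "overview" as an entry such as "1,302 sqft".
    """
    for entry in listing["overview"]:
        match = re.fullmatch(r"([\d,]+)\s*sqft", entry)
        if match:
            area = int(match.group(1).replace(",", ""))
            return float(listing["price"]["amount"]) / area
    return None


print(round(price_per_sqft(record)))
```

The rounded result can be compared against the "$392/sqft" entry that the site itself publishes, which is a cheap sanity check on the scraped values.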

Scraping Hotel Data from Top Travel Portals

Hotel booking websites contain a ton of data such as prices, reviews, ratings, the number of people who rated the hotel, and more. In another article, we showed how to crawl data from the largest hotel review and booking company.

Using the HTML parsing library called Beautiful Soup, we were able to crawl multiple data points. Using the small piece of code given below, you can hit the website, get the HTML content and convert it to a Beautiful Soup object. Once this is done, parsing the object and finding specific data points in specific tags that have certain attributes is a simple task.

[code language="python"]
import ssl
import urllib.request
import warnings
from bs4 import BeautifulSoup

warnings.simplefilter("ignore")  # Suppress warnings raised during the request
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input("Enter Hotel Url - ")
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
html = soup.prettify("utf-8")
hotel_json = {}  # Empty dict to hold the scraped data points
[/code]

Code to get the HTML content of a web page and convert it into a Beautiful Soup object.
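
The "find specific data points in specific tags that have certain attributes" step can be sketched as follows. To keep the example runnable without extra installs, it uses the standard library's `html.parser.HTMLParser` instead of Beautiful Soup, and it works on made-up markup: the tag names, `class`/`id` values, and the `FieldExtractor` class are all assumptions for illustration, not the structure of any real hotel site.

```python
from html.parser import HTMLParser

# Made-up markup standing in for a downloaded hotel page.
SAMPLE_HTML = """
<div class="hotel">
  <h1 id="HEADING">Sample Grand Hotel</h1>
  <span class="rating">4.5</span>
  <span class="review-count">1,204 reviews</span>
</div>
"""


class FieldExtractor(HTMLParser):
    """Collect the text of tags whose class or id matches a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # maps attribute value -> output key
        self.current_key = None   # key of the tag currently being read
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for value in (attrs.get("class"), attrs.get("id")):
            if value in self.wanted:
                self.current_key = self.wanted[value]

    def handle_data(self, data):
        # Store the first non-blank text that follows a matching tag.
        if self.current_key and data.strip():
            self.fields[self.current_key] = data.strip()
            self.current_key = None


parser = FieldExtractor({"HEADING": "name", "rating": "rating", "review-count": "reviews"})
parser.feed(SAMPLE_HTML)
hotel_json = parser.fields
print(hotel_json)
```

With a Beautiful Soup object, the same kind of lookup is usually written as a call such as `soup.find("span", class_="rating")`, followed by reading the tag's text.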

Scraping Social Media Data

One of the biggest sources of user data is social media. Whether you want to check if people like a particular song, a movie, or a company, social media data can help you understand user sentiment as well as keep track of your company’s public reputation. At PromptCloud, we have scraped data from Twitter™️, Instagram™️, and even YouTube™️. The data points were different in all three. For example, the data scraped from an Instagram account looks like this:

[code language="python"]
User: Ariana Grande (@arianagrande)
Followers: 130.5m
Following: 1,348
Posts: 3,669
[/code]

Data scraped from Instagram accounts

However, the data points that we scraped from YouTube™️ were entirely different. An example is the data scraped from the page of a famous song that led to an online challenge.

[code language="python"]
{
  "TITLE": "Drake – In My Feelings (Lyrics, Audio) \"Kiki Do you love me\"",
  "CHANNEL_NAME": "Special Unity",
  "NUMBER_OF_VIEWS": "278,121,686 views",
  "LIKES": "2,407,688",
  "DISLIKES": "114,933",
  "NUMBER_OF_SUBSCRIPTIONS": "614K",
  "HASH_TAGS": [
    "#InMyFeelings",
    "#Drake",
    "#Scorpion"
  ]
}
[/code]

Data scraped from YouTube™️ pages
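
The counts in both samples arrive as display strings ("130.5m", "614K", "278,121,686 views"), so they usually need to be converted to numbers before any analysis. A minimal sketch of that conversion follows; the function name `parse_count` and the set of suffixes handled are assumptions based only on the values shown above.

```python
import re

# Multipliers for the suffixes seen in the samples above ("130.5m", "614K").
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text):
    """Convert a display string such as "614K" or "278,121,686 views" to an int."""
    # The suffix must directly follow the number, so the "v" in " views"
    # or the "m" in " minutes" is never mistaken for a multiplier.
    match = re.match(r"\s*([\d,]*\.?\d+)([kmb]\b)?", text, flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"no number found in {text!r}")
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    return round(number * SUFFIXES.get(suffix, 1))


for raw in ("130.5m", "1,348", "278,121,686 views", "614K"):
    print(raw, "->", parse_count(raw))
```

Abbreviated values such as "130.5m" are already rounded by the site, so the converted number is an approximation of the true count.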

For Twitter, note that we needed a developer account, and that we could crawl tweets for each account only up to roughly the 3,200 most recent tweets of that user. As you can see, different web scraping examples can call for different approaches and produce different outcomes.

Scraping Song Lyrics Using Python from Sites Like Genius™️

People have been collecting song lyrics for a very long time. The difference now is that you can crawl song lyrics in a few seconds with a piece of code, instead of spending minutes or hours doing it manually. One such example is the article where we showed how to crawl song lyrics and other related data from a popular music website called Genius.

Since the website contains a lot more than just song lyrics, we were able to capture data points like comments, titles, and date of release too.

Scraping Stock Data Using Python from Sites Like Yahoo™️ Finance

Stock market data is a huge repository that people studying the market analyze when deciding where to place their bets. Both current and historical data are valuable. One website that can be scraped quite easily for stock information about different companies is Yahoo Finance. Stock information means more than just the current stock price: we were able to crawl many other data points using this process.

These are the data points we scraped for Apple™️:

[code language="python"]
{
  "PRESENT_VALUE": "198.87",
  "PRESENT_GROWTH": "-0.08 (-0.04%)",
  "OTHER_DETAILS": {
    "PREV_CLOSE": "198.95",
    "OPEN": "199.20",
    "BID": "198.91 x 800",
    "ASK": "198.99 x 1000",
    "TD_VOLUME": "27,760,668",
    "AVERAGE_VOLUME_3MONTH": "28,641,896",
    "MARKET_CAP": "937.728B",
    "BETA_3Y": "0.91",
    "PE_RATIO": "16.41",
    "EPS_RATIO": "12.12",
    "EARNINGS_DATE": [
      "30 Apr 2019"
    ],
    "DIVIDEND_AND_YIELD": "2.92 (1.50%)",
    "EX_DIVIDEND_DATE": "2019-02-08",
    "ONE_YEAR_TARGET_PRICE": "193.12"
  }
}
[/code]
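
Every value above is scraped as text, so a typical next step is to convert the fields to numbers and cross-check them against each other. The sketch below does this for a subset of the record; the variable names are illustrative, and the market-cap parsing assumes only the "B" (billion) suffix seen here.

```python
import re

# Subset of the quote shown above.
quote = {
    "PRESENT_VALUE": "198.87",
    "PRESENT_GROWTH": "-0.08 (-0.04%)",
    "OTHER_DETAILS": {
        "PREV_CLOSE": "198.95",
        "MARKET_CAP": "937.728B",
        "PE_RATIO": "16.41",
        "EPS_RATIO": "12.12",
    },
}
details = quote["OTHER_DETAILS"]

price = float(quote["PRESENT_VALUE"])
prev_close = float(details["PREV_CLOSE"])

# "-0.08 (-0.04%)" -> absolute change and percentage change.
change, pct = map(float, re.findall(r"[-+]?\d+(?:\.\d+)?", quote["PRESENT_GROWTH"]))

# "937.728B" -> dollars; only the B suffix seen above is handled.
market_cap = float(details["MARKET_CAP"].rstrip("B")) * 1e9

# Cross-checks: recompute two fields from the others.
computed_change = round(price - prev_close, 2)
pe_from_eps = round(price / float(details["EPS_RATIO"]), 2)

print(change, pct, computed_change, pe_from_eps)
```

Here the recomputed change and price-to-earnings ratio agree with the scraped `PRESENT_GROWTH` and `PE_RATIO` fields, which suggests the record was captured consistently.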

Scrape Product Data, Pricing, and Review from eCommerce Websites

For information on different products and their current market prices, there is no better place to gather data than big eCommerce companies like Amazon™️. While Amazon™️ has different page layouts across categories, subcategories, and even regions of the world, you can safely crawl a small amount of data across limited categories, as we have shown on this page, where we scraped product data and pricing information.

Using the code, you can extract the price of an item and its top features. Once the links you’ll need to crawl regularly are ready, you can run your code at a fixed frequency. This way you can keep track of price changes for that item and take advantage of them.
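
The price-tracking idea can be sketched as a small function that is called once per scheduled run. The scraping call itself is left out; the function name `record_price` and the sample prices are made up for illustration, and scheduling (for example with cron) is assumed to happen outside the script.

```python
def record_price(history, price):
    """Append `price` to `history` and return the change since the last check.

    Returns None on the first observation or when the price is unchanged.
    """
    previous = history[-1] if history else None
    history.append(price)
    if previous is None or price == previous:
        return None
    return round(price - previous, 2)


# Prices as they might come back from three scheduled runs (made-up values).
history = []
changes = [record_price(history, p) for p in (24.99, 24.99, 19.99)]
print(changes)  # [None, None, -5.0]
```

A negative return value signals a price drop, which is the point at which an alert or a purchase decision could be triggered.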

Scraping Data from News Websites Like BBC, New York Times, Al Jazeera

News aggregators are in high demand today. They make for one of the best web scraping examples that have directly helped users increase their productivity. People no longer have time to go through newspapers or even entire web pages. So what do news aggregators do differently?

News aggregators gather news and show only a line or two explaining each article in brief. If you want to know more, you can click on a link that takes you to the actual news webpage.
They aggregate news articles from big news agencies like the BBC™️ and the New York Times™️, which often gives you a fuller picture with more details.
Over time, the app learns your likes and dislikes and presents you with news articles based on your past usage.

These are some of the things that set news aggregators apart, and yet the first step in all of these processes is aggregating the data, which is often just scraping news articles from different websites.
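
The aggregation step described above can be sketched as follows, assuming the articles have already been scraped into dictionaries. The field names, the `build_feed` function, the sample articles, and the `example.com` links are all hypothetical; a real aggregator would also need smarter summarization and duplicate detection than this.

```python
import re


def build_feed(articles, max_sentences=2):
    """Turn scraped articles into brief feed items, skipping duplicate links."""
    seen = set()
    feed = []
    for article in articles:
        if article["link"] in seen:
            continue
        seen.add(article["link"])
        # Split on sentence-ending punctuation followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", article["body"].strip())
        feed.append({
            "source": article["source"],
            "title": article["title"],
            "summary": " ".join(sentences[:max_sentences]),
            "link": article["link"],
        })
    return feed


# Made-up articles standing in for scraped pages from two sources.
articles = [
    {"source": "Source A", "title": "Storm nears coast",
     "link": "https://example.com/a/storm",
     "body": "A storm is nearing the coast. Residents were told to prepare. "
             "Schools will close on Friday."},
    {"source": "Source B", "title": "Markets close higher",
     "link": "https://example.com/b/markets",
     "body": "Markets closed higher today."},
    {"source": "Source A", "title": "Storm nears coast",
     "link": "https://example.com/a/storm",
     "body": "A storm is nearing the coast."},
]

feed = build_feed(articles)
for item in feed:
    print(f'{item["source"]}: {item["title"]} | {item["summary"]}')
```

Each feed item keeps the link back to the original page, matching the "click to read more" behavior described above.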

Scraping Job Data

Recruiting is one industry that, like real estate, has received a huge boost from web scraping and the internet boom. These days, you can crawl job listings from both company websites and popular internet-based job boards and then use the collected data to boost your business. Whether you are a recruitment firm or a consultancy, or you run a job board yourself, scraping job data is a must. One of our many web scraping solutions, JobsPikr, makes it very simple to get updated job listings so you can manage your strategic workforce planning and run the business efficiently. It is a completely autonomous job discovery tool that can fetch fresh job listings using filters such as title, location, post, and more.

Scraping Image and Textual Data Required for Research

Research projects that work with machine learning models require a huge amount of data. Even to train a computer to differentiate between a picture of a dog and a picture of a cat, you would need thousands of pictures of dogs and cats. Such data requirements are met through web scraping solutions, and scientists today crawl Google Images and other image sources to get images for their projects. I used Twitter data to gather images that were uploaded to the social media site during a flood, and tried to separate the images that were related to the flood from those that were not.

Web Scraping for Content Creation

Companies need to build high-quality content on a regular basis to increase visibility, educate customers, build a brand, and boost sales. Scraping content on the internet helps marketing and advertising teams get better ideas, brainstorm, and come up with new ways of attracting customers and increasing sales.

While we have explained only some web scraping examples here, the possibilities are endless, and different businesses can take advantage of web scraping in different scenarios. At the end of the day, it helps make processes and decisions smarter using the power of data.
