How to Scrape YouTube Data using Python

May 28, 2019 | Category: Blog, Marketing

YouTube has emerged as the clear winner among video-sharing websites. It is said to be valued at more than $160 billion, and a massive number of people make a living through the site. These content creators join the YouTube Partner Program and, by monetizing their content, earn money through advertisements.

Why scrape data from YouTube?

YouTube data can be useful for a wide range of use cases, such as:

1. Find the most-liked keywords

Say you run a search to find the top videos displayed on YouTube for some particular words. If you scrape the likes, dislikes, views and titles of each of those videos, you can build a list of keywords that tend to appear in well-received titles. Using those keywords in your own YouTube titles may lead to better revenue.
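As a rough sketch of that idea, the snippet below sums the likes of every video whose title contains a given word. The `videos` list is made-up sample data, not real scraped output, and `keyword_likes` is a hypothetical helper name. A real analysis would probably also filter out common words such as "for" and "the".

```python
from collections import defaultdict

# Hypothetical sample data in the same shape as the scraper's output.
videos = [
    {"TITLE": "Python tutorial for beginners", "LIKES": "1,200"},
    {"TITLE": "Advanced Python tricks", "LIKES": "800"},
    {"TITLE": "Cooking tutorial", "LIKES": "300"},
]


def keyword_likes(videos):
    """Return (word, total likes) pairs, highest total first."""
    totals = defaultdict(int)
    for video in videos:
        likes = int(video["LIKES"].replace(",", ""))
        # Count each word once per title, ignoring case.
        for word in set(video["TITLE"].lower().split()):
            totals[word] += likes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = keyword_likes(videos)
print(ranking[:2])
```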

2. Compare hashtags

By comparing likes and views on videos with particular hashtags, you can get a better idea of which hashtags to use to make your video more popular, or of which hashtags go best with your video title.
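One possible way to run such a comparison is to average the view counts per hashtag. The sample records below are invented for illustration, and `average_views_by_hashtag` is a hypothetical helper. It assumes the view count is formatted like "1,000 views", as in the scraper's output.

```python
from collections import defaultdict

# Hypothetical sample data; real records would come from the scraper.
videos = [
    {"HASH_TAGS": ["#Drake", "#Scorpion"], "NUMBER_OF_VIEWS": "1,000 views"},
    {"HASH_TAGS": ["#Drake"], "NUMBER_OF_VIEWS": "3,000 views"},
    {"HASH_TAGS": [], "NUMBER_OF_VIEWS": "500 views"},
]


def average_views_by_hashtag(videos):
    """Map each hashtag to the mean view count of the videos that use it."""
    views_per_tag = defaultdict(list)
    for video in videos:
        # "1,000 views" -> 1000
        views = int(video["NUMBER_OF_VIEWS"].split()[0].replace(",", ""))
        for tag in video["HASH_TAGS"]:
            views_per_tag[tag].append(views)
    return {tag: sum(v) / len(v) for tag, v in views_per_tag.items()}


averages = average_views_by_hashtag(videos)
print(averages)
```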

3. Find the most popular channels 

Extracting top videos on YouTube can help you create a frequency graph of the channel names that occur, thus enabling you to find the top channels that people enjoy. This, in turn, would also help you understand which topics are most popular among YouTube viewers.
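The frequency count behind such a graph can be sketched with `collections.Counter`. The list of channel names below is a made-up sample. In practice it would hold the CHANNEL_NAME value of every video you scraped.

```python
from collections import Counter

# Hypothetical channel names collected from a batch of scraped videos.
channel_names = [
    "ImagineDragonsVEVO",
    "Special Unity",
    "ImagineDragonsVEVO",
]

# Count how often each channel appears and list the most frequent first.
channel_counts = Counter(channel_names)
top_channels = channel_counts.most_common(2)
print(top_channels)
```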

4. Keep track of the popularity of channels

By extracting the data of newly uploaded videos of a specific YouTube channel, you would be able to find whether a channel’s popularity is increasing or decreasing, or is stagnant.

5. Record likes, dislikes and views on videos

You can create a graph with time on the x-axis and likes, dislikes or views on the y-axis, by scraping data from those videos at regular time intervals.
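To build such a graph you need one timestamped row per scrape. The sketch below shows one way to record it as CSV. `append_snapshot` and the column order are assumptions made for illustration, and the demo writes to an in-memory buffer. On disk you would open a file in append mode and call the function after each scheduled run.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "NUMBER_OF_VIEWS", "LIKES", "DISLIKES"]


def append_snapshot(csv_file, video_details, when=None):
    """Write one timestamped row of counts to an open CSV file object."""
    when = when or datetime.now(timezone.utc)
    row = {"timestamp": when.isoformat()}
    for field in FIELDS[1:]:
        # Fields the scraper did not find are stored as empty strings.
        row[field] = video_details.get(field, "")
    csv.DictWriter(csv_file, fieldnames=FIELDS).writerow(row)


# Demo with an in-memory file and a fixed timestamp.
buffer = io.StringIO()
append_snapshot(
    buffer,
    {"NUMBER_OF_VIEWS": "1,182,556,781 views", "LIKES": "6,693,559", "DISLIKES": "337,823"},
    when=datetime(2019, 5, 28, tzinfo=timezone.utc),
)
print(buffer.getvalue())
```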

Let’s get started with the code:

Since we have already explained the installation and initialisation process in our previous “How to scrape data from” articles, we assume you have completed those steps.

To run the code, use the python command and then enter a YouTube video URL when prompted.

Fig: Running the code from the shell.

Copy the code given below into a file and name it as youtubeDataExtractor.py (although you could actually give it any file name, as long as it ends with ‘.py’).

[code language="python"]

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors
# (note: this disables certificate verification for the request below)

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user

url = input('Enter YouTube video URL: ')

# Making the website believe that you are accessing it using a Mozilla browser

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the HTML page for easy extraction of data

soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
video_details = {}

# Title of the video
for span in soup.find_all('span', attrs={'class': 'watch-title'}):
    video_details['TITLE'] = span.text.strip()

# Channel name, read from the embedded JSON-LD metadata
for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
    channel_description = json.loads(script.text.strip())
    video_details['CHANNEL_NAME'] = channel_description['itemListElement'][0]['item']['name']

# Number of views
for div in soup.find_all('div', attrs={'class': 'watch-view-count'}):
    video_details['NUMBER_OF_VIEWS'] = div.text.strip()

# Likes and dislikes
for button in soup.find_all('button', attrs={'title': 'I like this'}):
    video_details['LIKES'] = button.text.strip()

for button in soup.find_all('button', attrs={'title': 'I dislike this'}):
    video_details['DISLIKES'] = button.text.strip()

# Subscriber count of the channel that uploaded the video
for span in soup.find_all('span', attrs={'class': 'yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count'}):
    video_details['NUMBER_OF_SUBSCRIPTIONS'] = span.text.strip()

# Hashtags: links nested inside the badge spans
hashtags = []
for span in soup.find_all('span', attrs={'class': 'standalone-collection-badge-renderer-text'}):
    for a in span.find_all('a', attrs={'class': 'yt-uix-sessionlink'}):
        hashtags.append(a.text.strip())
video_details['HASH_TAGS'] = hashtags

# Save the raw HTML for manual inspection
with open('output_file.html', 'wb') as file:
    file.write(html)

# Save the extracted data points
with open('data.json', 'w', encoding='utf8') as outfile:
    json.dump(video_details, outfile, ensure_ascii=False, indent=4)

print('----------Extraction of data is complete. Check json file.----------')
[/code]

Once you run the code, you will find a JSON file named data.json in your current directory. We ran the code for some popular music videos, and here are the resulting JSON files:

1. Thunder by Imagine Dragons –

[code language="python"]

{
    "TITLE": "Imagine Dragons – Thunder",
    "CHANNEL_NAME": "ImagineDragonsVEVO",
    "NUMBER_OF_VIEWS": "1,182,556,781 views",
    "LIKES": "6,693,559",
    "DISLIKES": "337,823",
    "NUMBER_OF_SUBSCRIPTIONS": "17M",
    "HASH_TAGS": []
}

[/code]

2. In My Feelings by Drake –

[code language="python"]

{
    "TITLE": "Drake – In My Feelings (Lyrics, Audio) \"Kiki Do you love me\"",
    "CHANNEL_NAME": "Special Unity",
    "NUMBER_OF_VIEWS": "278,121,686 views",
    "LIKES": "2,407,688",
    "DISLIKES": "114,933",
    "NUMBER_OF_SUBSCRIPTIONS": "614K",
    "HASH_TAGS": [
        "#InMyFeelings",
        "#Drake",
        "#Scorpion"
    ]
}

[/code]
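Note that every count in these files is stored as display text ("1,182,556,781 views", "17M", "614K"), so it has to be converted before you can sort or plot it. The helper below is a minimal sketch of such a conversion. `parse_count` is a hypothetical name, and it assumes the K, M and B suffixes mean thousands, millions and billions.

```python
import re

SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert '1,182,556,781 views', '614K' or '17M' to an integer."""
    # A number, optionally followed by a standalone K/M/B suffix.
    match = re.search(r"([\d,.]+)\s*([KMB])?\b", text.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    number = float(match.group(1).replace(",", ""))
    multiplier = SUFFIXES.get((match.group(2) or "").upper(), 1)
    return int(round(number * multiplier))


print(parse_count("1,182,556,781 views"))
print(parse_count("17M"))
print(parse_count("614K"))
```

Abbreviated values such as "17M" are already rounded by YouTube, so the converted number is only an approximation of the real subscriber count.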

The YouTube crawler code explained:

As usual, we first scrape the HTML code from the web page and save it to a file in our local directory, so that we can analyse it and find the data points that can be extracted easily and would be valuable too. Most of the study for data points in the HTML page has to be done manually, by searching for specific keywords or values and finding where they occur.

We have used BeautifulSoup (BS4) for extracting data from specific places in the HTML code:

  1. The span type element having class as ‘watch-title’ is where you can find the Title of the video.
  2. The script element that has a type of ‘application/ld+json’ contains the channel name.
  3. The div element with class watch-view-count, would help you get the number of views of that particular YouTube video.
  4. The button element with the title ‘I like this’ has the count of the number of likes on that particular video.
  5. Similar to the above point, the button element with the title ‘I dislike this’, has the count of the number of dislikes on a particular video.
  6. The span element with class ‘yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count’ is one from which you can extract the number of subscribers to the channel which has uploaded that particular video.
  7. Finding the hashtags associated with a given video is slightly more complicated than the other data-points. First, you have to extract all the spans with class ‘standalone-collection-badge-renderer-text’, and from there one has to extract all the a-tags with class- ‘yt-uix-sessionlink’. By extracting the text in all the a-tags, into an array, you will be able to create a list of hashtags. This array can be added to the result json under a particular key called ‘HASH_TAGS’, in order to get the information in a structured format in the final result json.

Which data points can you scrape from YouTube?

Using the code given above, you can scrape certain data points from any YouTube video, as long as you have its URL. Only the hashtags field may be empty for certain videos, since hashtags are not compulsory on YouTube video pages. The data points that can be scraped are as follows:

TITLE

The title is the first data point we extract, and the most important one. It carries a lot of information, and without it none of the other data points can be put in context.

CHANNEL NAME

Right after the title, the Channel name is important to associate the title with the creator- that is, who created the content. In YouTube, videos are associated by their Channel names and not by their creators because in many cases, more than one person works on videos of a single channel.

NUMBER OF VIEWS

The simplest metric to understand a video’s reach is to find the number of views that it has received. This is also the most important metric associated with a YouTube video and in many ways, it determines how much revenue the video’s creator will make.

LIKES

The likes on a YouTube video are simply the number of viewers who liked the video enough to actually click the thumbs-up button below it.

DISLIKES

Similar to the above data point, the number of dislikes is the number of clicks on the dislike button for a video.
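Likes and dislikes are often easier to compare across videos as a ratio. The sketch below computes the share of likes among all votes, using the Thunder figures from the sample output above. `like_ratio` is a hypothetical helper, and the ratio is just one possible way to combine the two counts.

```python
def to_int(count_text):
    """Convert a count such as '6,693,559' to an integer."""
    return int(count_text.replace(",", ""))


def like_ratio(likes_text, dislikes_text):
    """Return likes / (likes + dislikes), or 0.0 when there are no votes."""
    likes = to_int(likes_text)
    dislikes = to_int(dislikes_text)
    total = likes + dislikes
    return likes / total if total else 0.0


ratio = like_ratio("6,693,559", "337,823")
print(f"{ratio:.1%}")  # prints 95.2%
```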

NUMBER OF SUBSCRIPTIONS

While likes, dislikes and views paint a picture of the popularity of a single YouTube video, the number of subscriptions gives a better idea of how popular the YouTube channel as a whole is. It is the only channel-level metric that this code extracts: the higher it is, the more popular the channel in question.

HASHTAGS

Hashtags have become a popular way of making your content searchable in different mediums. Be it Facebook posts or Instagram pictures, people are using hashtags with different types of online content today so that different types of content can be associated together. That is the reason why ‘trending hashtags’ is a thing today.
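Since hashtags are optional, and any other field is simply left out when its HTML element is not found, code that consumes data.json may want to read it defensively. The following is a minimal sketch under that assumption. `load_details` and the sample JSON string are hypothetical.

```python
import json

TEXT_FIELDS = [
    "TITLE",
    "CHANNEL_NAME",
    "NUMBER_OF_VIEWS",
    "LIKES",
    "DISLIKES",
    "NUMBER_OF_SUBSCRIPTIONS",
]


def load_details(json_text):
    """Parse scraper output and fill in defaults for missing fields."""
    details = json.loads(json_text)
    result = {key: details.get(key, "") for key in TEXT_FIELDS}
    # Hashtags default to an empty list rather than an empty string.
    result["HASH_TAGS"] = details.get("HASH_TAGS", [])
    return result


# Hypothetical output for a page where only two fields were found.
sample = '{"TITLE": "Example video", "LIKES": "6,693,559"}'
details = load_details(sample)
print(details["HASH_TAGS"])
```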

Conclusion

While the given code can only extract some specific data points from a YouTube video page, exploring the HTML of different YouTube pages can help you find more data points that occur under similar HTML elements. Web scraping has no hard and fast rules, since websites themselves keep changing. Hence, knowing what data to scrape and how to scrape it comes only with experience, gained by scraping different web pages with different data formats.


Disclaimer: The code provided in this tutorial is only for learning purposes. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.

 
© Promptcloud 2009-2020 / All rights reserved.