Scraping Song Lyrics using Python from Genius

August 8, 2019 | Category: Blog

Last Updated on by Tarun

What started as Rap Genius in 2009 has evolved into Genius, a music knowledge-sharing media company that serves more than a hundred million people each month. Most internet users who are into music will be familiar with this fast-growing website, which goes beyond the lyrics to explore what is behind a song and what the artist was thinking while writing it. Even well-known artists like Ariana Grande and Kendrick Lamar collaborate with Genius to give the world a deeper insight into their art.

Genius has evolved into a music community where contributors, musicians, and editors come together to deconstruct songs, and this has made it the biggest music lyrics library in the world. Genius has also partnered with music streaming companies to grow its user base: it provides lyrics for Apple Music, and it lets Spotify users see Genius lyrics while they play songs.

If you want to crawl song lyrics, whatever your purpose, Genius is the website to go to. Its database of over twenty-five million songs, albums, artists, and annotations makes it the biggest database of song lyrics anywhere in the world.

Why would one want to crawl data from Genius?

Genius is one of the biggest names in the music industry today, but not everyone is trying to be that big. Some people just want to gather a few lyrics for their own purposes, and others are building a community around a particular genre of music. For all of them, Genius is the place to crawl lyrics and related data, since it offers not only a huge repository of lyrics but also user comments, which reflect user sentiment. With that data you could estimate which genres are currently popular among new users and which types of songs resonate with current events.

How to get started with scraping song lyrics from Genius?

Several languages can be used for web scraping, but among them Python is the easiest to learn and comes in handy for many different types of web scraping projects. Along with Python, you will need some third-party Python packages as well as a text editor. You can follow the instructions on this page, since the setup is the same no matter which website you are scraping. Once your system is set up, you can continue reading.

Where is the code?

We have discussed song lyrics, Genius, and the basic setup. Let us now give you a glimpse of the code before showing how to run it and how it works:

[code language="python"]
#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user
url = input('Enter Genius song lyrics Url- ')

# Making the website believe that you are accessing it using a Mozilla browser
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the HTML page for easy extraction of data
soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')
song_json = {}
song_json["Lyrics"] = []
song_json["Comments"] = []

# Extract the title of the song
for title in soup.find_all('title'):
    song_json["Title"] = title.text.strip()

# Extract the release date of the song
for span in soup.find_all('span', attrs={'class': 'metadata_unit-info metadata_unit-info--text_only'}):
    song_json["Release date"] = span.text.strip()

# Extract the comments on the song: one list of non-empty lines per comment block
for div in soup.find_all('div', attrs={'class': 'rich_text_formatting'}):
    comments = [comment for comment in div.text.strip().split("\n") if comment != ""]
    song_json["Comments"].append(comments)

# Extract the lyrics of the song
for div in soup.find_all('div', attrs={'class': 'lyrics'}):
    song_json["Lyrics"].append(div.text.strip().split("\n"))

# Save the JSON created, with the file name as title + .json
with open(song_json["Title"] + '.json', 'w', encoding='utf-8') as outfile:
    json.dump(song_json, outfile, indent=4, ensure_ascii=False)

# Save the HTML content into an HTML file with the name as title + .html
with open(song_json["Title"] + '.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]

To run the code given above, save it in a file with the .py extension, for example songLyricsExtractor.py, and then run the following command from the terminal:

[code language="python"]
python songLyricsExtractor.py
[/code]

When you run it, you will be prompted to enter a URL. This can be a link to any song on the Genius website. You can use this link for Lana Del Rey’s Looking for America. We scraped this song’s lyrics from Genius to show how the code works in action, which is covered in a later section.

Explanation of code

Before we run the code and look at what it produces, let’s try to understand the code itself. As always, we are using BS4 (Beautiful Soup), a library that makes it simple to parse an HTML page and scrape data from it. At the very beginning, we fetch the webpage and convert it into a Beautiful Soup object, from which we can pick out divs, spans, titles, and other tags with specific attributes. We use these techniques to extract the lyrics, the comments, the title of the webpage, and the release date of the song. Once we have scraped this data, we save it in a JSON file named after the page title with a .json extension. We also save the HTML in a file named after the page title with a .html extension, so that the page can be analyzed later and more data points can be found in the future.
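To make the parsing step easier to follow without touching the network, here is a minimal sketch that runs the same kind of extraction on a tiny inline HTML string. The sample HTML and the function name `extract_song_data` are made up for illustration. The class names `lyrics` and `rich_text_formatting` are the ones used in the script above, and a live Genius page may use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a Genius song page.
SAMPLE_HTML = """
<html>
  <head><title>Some Artist - Some Song Lyrics | Genius Lyrics</title></head>
  <body>
    <div class="lyrics">[Verse 1]
First line
Second line</div>
    <div class="rich_text_formatting">An annotation about the song.</div>
  </body>
</html>
"""


def extract_song_data(html):
    """Pull the title, lyrics and comments out of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    data = {"Title": None, "Lyrics": [], "Comments": []}

    # The page title lives in the <title> tag.
    title_tag = soup.find("title")
    if title_tag is not None:
        data["Title"] = title_tag.text.strip()

    # Each lyrics div becomes one list of lines.
    for div in soup.find_all("div", attrs={"class": "lyrics"}):
        data["Lyrics"].append(div.text.strip().split("\n"))

    # Each comment div becomes one list of non-empty lines.
    for div in soup.find_all("div", attrs={"class": "rich_text_formatting"}):
        lines = [line for line in div.text.strip().split("\n") if line != ""]
        data["Comments"].append(lines)

    return data


song = extract_song_data(SAMPLE_HTML)
print(song["Title"])
print(song["Lyrics"])
print(song["Comments"])
```

Keeping the extraction in a function that takes an HTML string also means you can rerun it against the saved .html file later, without fetching the page again.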

An example of the code in action

On running the code and giving it the link to a song on Genius, you will get a JSON file that looks something like the one given below, which is the output for Looking for America by Lana Del Rey. We have presented only one JSON since they are large, but you can run the code against your favorite songs as well, to extract the lyrics, save them, print them, or do anything you want with them.

[code language="python"]
{
    "Lyrics": [
        [
            "[Verse 1]",
            "Took a trip to San Francisco",
            "All our friends said we would jive",
            "Didn’t work, so I left for Fresno",
            "It was quite a scenic drive",
            "Pulled over to watch the children in the park",
            "We used to only worry for them after dark",
            "",
            "[Chorus]",
            "I’m still looking for my own version of America",
            "One without the gun, where the flag can freely fly",
            "No bombs in the sky, only fireworks when you and I collide",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "",
            "[Verse 2]",
            "I flew back to New York City",
            "Missed that Hudson River line",
            "Took a train up to Lake Placid",
            "That’s another place and time, where",
            "I used to go to drive-ins and listen to the blues",
            "So many things that I think twice about before I do, no",
            "",
            "[Chorus]",
            "I’m still looking for my own version of America",
            "One without the gun, where the flag can freely fly",
            "No bombs in the sky, only fireworks when you and I collide",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind"
        ]
    ],
    "Comments": [
        [
            "\"Looking for America\" is a song Lana wrote on August 5th, 2019 regarding the mass shootings throughout the US, once she got back to L.A. The message of the song relies upon Lana dreaming of a better situation for American people, a topic she has sung before in tracks like \"Coachella – Woodstock in My Mind\" and \"When The World Was at War We Kept Dancing\". Del Rey released the song via streaming platforms on August 9th, 2019 as a single.",
            "The title of the song might be a reference to \"America\" by Simon & Garfunkel:",
            "They’ve all come to look for America",
            "Del Rey took her Instagram to share a video of her singing in the studio with friend and producer Jack Antonoff."
        ],
        [
            "Del Rey shared the song via Instagram adding:",
            "Hi folks came back early from Montecito with my brother this morning and asked Jack Antonoff to come into town because I had a song on my mind that I wanted to write. Now I know I’m not a politician and I’m not trying to be so excuse me for having an opinion- but in light of all of the mass shootings and the back to back shootings in the last couple of days which really affected me on a cellular level I just wanted to post this video that our engineer Laura took 20 minutes ago. I hope you like it. I’m singing love to the choruses I recorded this morning. I’m going to call it ‘Looking for America’."
        ]
    ],
    "Title": "Lana Del Rey – Looking For America Lyrics | Genius Lyrics",
    "Release date": "August 9, 2019"
}
[/code]
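Once the JSON file exists, the lyrics can be loaded back and regrouped for further processing. The sketch below is one possible way to do that and is not part of the original script: `split_sections` is a hypothetical helper that groups lines under headers such as [Verse 1], and the sample data is a trimmed stand-in for the file the scraper writes.

```python
import json


def split_sections(lyric_lines):
    """Group lyric lines under their section headers, e.g. "[Verse 1]"."""
    sections = []
    current = None
    for line in lyric_lines:
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new header starts a new section.
            current = {"section": line[1:-1], "lines": []}
            sections.append(current)
        elif line:
            # Lines that appear before any header go into a catch-all section.
            if current is None:
                current = {"section": "Untitled", "lines": []}
                sections.append(current)
            current["lines"].append(line)
    return sections


# Trimmed stand-in for the scraper's output; in practice you would call
# json.load() on the saved .json file instead.
sample = json.loads("""
{
    "Lyrics": [["[Verse 1]", "Took a trip to San Francisco", "",
                "[Chorus]", "It's just a dream I had in mind"]],
    "Title": "Lana Del Rey - Looking For America Lyrics | Genius Lyrics"
}
""")

for block in sample["Lyrics"]:
    for section in split_sections(block):
        print(section["section"], len(section["lines"]))
```

Grouping by section makes it straightforward to, say, compare choruses across songs or count lines per verse.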

How scalable is this solution?

While the solution we provided is for a single song, you could create a list of links to song-lyrics pages on Genius and then run the code on the list iteratively. You could also use a regex to match the Genius pages that contain song lyrics and then crawl multiple pages from Genius in one go, so that your code itself recognizes pages that have lyrics.
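Both ideas can be sketched as follows, under some assumptions. The URL pattern assumes that Genius song pages end in "-lyrics", which you should verify against the links you collect. `is_lyrics_url` and `scrape_all` are hypothetical names, and `scrape_one` stands for whatever function you wrap the single-song code in. The pause between requests is a courtesy to the site rather than a documented requirement.

```python
import re
import time

# Assumption: song pages look like https://genius.com/<artist-and-song>-lyrics
LYRICS_URL = re.compile(r"^https?://(www\.)?genius\.com/[A-Za-z0-9-]+-lyrics/?$")


def is_lyrics_url(url):
    """Return True if the URL looks like a Genius song-lyrics page."""
    return LYRICS_URL.match(url) is not None


def scrape_all(urls, scrape_one, delay_seconds=2.0):
    """Run scrape_one on every lyrics URL, pausing between requests."""
    results = {}
    targets = [url for url in urls if is_lyrics_url(url)]
    for index, url in enumerate(targets):
        if index > 0:
            time.sleep(delay_seconds)
        try:
            results[url] = scrape_one(url)
        except Exception as exc:
            # Record the failure and keep going with the remaining pages.
            results[url] = {"error": str(exc)}
    return results


candidates = [
    "https://genius.com/Lana-del-rey-looking-for-america-lyrics",
    "https://genius.com/artists/Lana-del-rey",
]


def fake_scrape(url):
    """Stand-in scraper so the sketch runs without network access."""
    return {"Title": url.rsplit("/", 1)[-1]}


print(scrape_all(candidates, fake_scrape, delay_seconds=0))
```

Passing the scraper in as a function keeps the filtering and looping logic testable on its own, and a failure on one page does not stop the rest of the run.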

However, DIY solutions like these are best suited to the budding hobbyist or to someone with a one-time need. If your requirement is more commercial and your web scraping problem calls for features like:

  1. Data delivery in specific formats.
  2. Data refreshed at a regular frequency.
  3. No maintenance and infrastructure costs.

Then you should go with a DaaS provider like PromptCloud. Our team at PromptCloud prides itself on providing enterprise-grade web scraping solutions to business teams across the world, enabling them to use data within their business workflows and make data-driven decisions.

© Promptcloud 2009-2020 / All rights reserved.