What is Rap Genius?
What started as Rap Genius in 2009 has evolved into Genius, a music knowledge-sharing media company that serves more than a hundred million people each month. In this blog post, we will learn how to scrape song lyrics from the web using Python, in a step-by-step guide.
Most internet users who are into music will be familiar with Genius. It is a fast-growing website that shares what goes on behind a song: what the artist was thinking while writing it, and a lot more. Even well-known artists like Ariana Grande and Kendrick Lamar collaborate with Genius to give the world a deeper insight into their art.
If anyone wants to crawl song lyrics, whatever their purpose, Genius is the website to go to. Its database of over twenty-five million songs, albums, artists, and annotations makes it the biggest database of song lyrics anywhere in the world.
Why would One want to Crawl Data from Genius?
Genius has evolved into a music community where contributors, musicians, and even editors come together to deconstruct songs. For this reason, it has become the biggest music lyrics library in the world. Genius has also partnered with music streaming companies to increase its user base.
Genius is the place to scrape and crawl music lyrics and other music-industry data. Not only does it give you a vast repository of lyrics, but you also get access to user comments and the sentiments they express. Imagine predicting which genre is currently a hit among new users based on data extracted from Genius!
How to Web Scrape Song Lyrics from Genius using Python?
When it comes to scraping the web, a few languages support different types of web scraping projects. Among them, Python is the easiest to learn and comes in handy for many kinds of projects. Along with Python, you will need some third-party Python packages and a text editor. Since the setup is the same no matter which website you are scraping, you can follow the instructions in our guide ‘Scrape Amazon Product Reviews and Pricing Data Using Python’.
Where is The Code?
So far we have discussed song lyrics, Genius, and the basic setup. Let’s take a look at the code for scraping song lyrics using Python before I show you how to run it and how it works.
[code language=”python”]
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import ssl
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# For ignoring SSL certificate errors (this turns off certificate verification)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Input from user
url = input('Enter Genius song lyrics Url- ')

# Making the website believe that you are accessing it using a Mozilla browser
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, context=ctx).read()

# Creating a BeautifulSoup object of the html page for easy extraction of data.
soup = BeautifulSoup(webpage, 'html.parser')
html = soup.prettify('utf-8')  # returns bytes, written to disk further down
song_json = {}
song_json["Lyrics"] = []
song_json["Comments"] = []

# Extract Title of the song
for title in soup.find_all('title'):
    song_json["Title"] = title.text.strip()

# Extract the release date of the song
for span in soup.find_all('span', attrs={'class': 'metadata_unit-info metadata_unit-info--text_only'}):
    song_json["Release date"] = span.text.strip()

# Extract the Comments on the song, keeping one list per comment block
# and dropping empty lines
for div in soup.find_all('div', attrs={'class': 'rich_text_formatting'}):
    comments = [comment for comment in div.text.strip().split("\n") if comment != ""]
    song_json["Comments"].append(comments)

# Extract the Lyrics of the song, one list of lines per lyrics block
for div in soup.find_all('div', attrs={'class': 'lyrics'}):
    song_json["Lyrics"].append(div.text.strip().split("\n"))

# Save the json created with the file name as title + .json
with open(song_json["Title"] + '.json', 'w', encoding='utf-8') as outfile:
    json.dump(song_json, outfile, indent=4, ensure_ascii=False)

# Save the html content into an html file with name as title + .html
with open(song_json["Title"] + '.html', 'wb') as file:
    file.write(html)

print('----------Extraction of data is complete. Check json file.----------')
[/code]
To run the code given above, all you need to do is save it in a file with the extension .py. So you could save it to a file with the name songLyricsExtractor.py for example, and then from the terminal run the command:
[code language=”python”]
python songLyricsExtractor.py
[/code]
When you run it, you will be prompted to enter a URL. This can be a link to any song on the Genius website. You can use this link for Lana Del Rey’s Looking for America. We have scraped this song’s lyrics from Genius to show you the code in action, and that is covered later in the blog.
Explanation of code
Before we run the code and look at what it gives you, let’s try to understand the code itself. As always, we are using BS4 (Beautiful Soup), a library that makes parsing an HTML page easy.
At the very beginning, we fetch the webpage and convert it into a Beautiful Soup object, from which we can pick up divs, spans, titles, and other tags with specific attributes. We use these techniques to crawl the lyrics, the comments, the title of the webpage, and the date the song was released. Once we have scraped this data, we save it in a JSON file named after the title of the page, with a .json extension. We also save the HTML in a file named after the title of the page, with a .html extension. This is done so that the HTML page can be analyzed and more data points can be found in the future.
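The extraction steps described above can be tried offline on a small inline HTML snippet. What follows is a minimal sketch rather than the script itself: the `SAMPLE_HTML` markup and the `extract_song` helper are hypothetical, and the class names simply mirror the ones used in the script, which may not match the markup Genius serves today.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; real Genius markup may differ.
SAMPLE_HTML = """
<html>
  <head><title> Example Artist - Example Song Lyrics | Genius Lyrics </title></head>
  <body>
    <span class="metadata_unit-info metadata_unit-info--text_only">January 1, 2020</span>
    <div class="lyrics">[Verse 1]
First line
Second line</div>
    <div class="rich_text_formatting">First note

Second note</div>
  </body>
</html>
"""


def extract_song(html):
    """Pull title, release date, lyrics, and comments out of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    song = {"Lyrics": [], "Comments": []}

    # The <title> tag holds the page title.
    title_tag = soup.find("title")
    if title_tag is not None:
        song["Title"] = title_tag.text.strip()

    # Tags are selected by name plus the exact value of their class attribute.
    date_class = "metadata_unit-info metadata_unit-info--text_only"
    for span in soup.find_all("span", attrs={"class": date_class}):
        song["Release date"] = span.text.strip()

    # Each lyrics block becomes one list of lines.
    for div in soup.find_all("div", attrs={"class": "lyrics"}):
        song["Lyrics"].append(div.text.strip().split("\n"))

    # Each comment block becomes one list of non-empty lines.
    for div in soup.find_all("div", attrs={"class": "rich_text_formatting"}):
        lines = [line for line in div.text.strip().split("\n") if line]
        song["Comments"].append(lines)

    return song


song = extract_song(SAMPLE_HTML)
print(song)
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the selectors against the saved .html file without hitting the website again.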
An Example of the Code in Action
On running the code and giving it the link of a song on Genius, you will get a JSON that looks something like the one given below. The one below is the JSON that we get for Looking for America by Lana Del Rey. We have presented only one JSON since they are large, but you can run the code against your favourite songs as well.
[code language=”python”]
{
    "Lyrics": [
        [
            "[Verse 1]",
            "Took a trip to San Francisco",
            "All our friends said we would jive",
            "Didn’t work, so I left for Fresno",
            "It was quite a scenic drive",
            "Pulled over to watch the children in the park",
            "We used to only worry for them after dark",
            "",
            "[Chorus]",
            "I’m still looking for my own version of America",
            "One without the gun, where the flag can freely fly",
            "No bombs in the sky, only fireworks when you and I collide",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "",
            "[Verse 2]",
            "I flew back to New York City",
            "Missed that Hudson River line",
            "Took a train up to Lake Placid",
            "That’s another place and time, where",
            "I used to go to drive-ins and listen to the blues",
            "So many things that I think twice about before I do, no",
            "",
            "[Chorus]",
            "I’m still looking for my own version of America",
            "One without the gun, where the flag can freely fly",
            "No bombs in the sky, only fireworks when you and I collide",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind",
            "It’s just a dream I had in mind"
        ]
    ],
    "Comments": [
        [
            "\"Looking for America\" is a song Lana wrote on August 5th, 2019 regarding the mass shootings throughout the US, once she got back to L.A. The message of the song relies upon Lana dreaming of a better situation for American people, a topic she has sung before in tracks like \"Coachella – Woodstock in My Mind\" and \"When The World Was at War We Kept Dancing\". Del Rey released the song via streaming platforms on August 9th, 2019 as a single.",
            "The title of the song might be a reference to \"America\" by Simon & Garfunkel:",
            "They’ve all come to look for America",
            "Del Rey took her Instagram to share a video of her singing in the studio with friend and producer Jack Antonoff."
        ],
        [
            "Del Rey shared the song via Instagram adding:",
            "Hi folks came back early from Montecito with my brother this morning and asked Jack Antonoff to come into town because I had a song on my mind that I wanted to write. Now I know I’m not a politician and I’m not trying to be so excuse me for having an opinion- but in light of all of the mass shootings and the back to back shootings in the last couple of days which really affected me on a cellular level I just wanted to post this video that our engineer Laura took 20 minutes ago. I hope you like it. I’m singing love to the choruses I recorded this morning. I’m going to call it ‘Looking for America’."
        ]
    ],
    "Title": "Lana Del Rey – Looking For America Lyrics | Genius Lyrics",
    "Release date": "August 9, 2019"
}
[/code]
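Once a song has been saved, the JSON can be loaded back for analysis. As a rough sketch, the snippet below splits the nested `Lyrics` list into sections keyed by bracketed headers such as `[Verse 1]`. The `split_sections` helper and the shortened `sample` data are illustrative assumptions and not part of the scraper; with a real file you would read it with `json.load` instead.

```python
import json


def split_sections(song):
    """Group lyric lines under bracketed headers such as [Verse 1]."""
    sections = []  # (header, lines) pairs, kept in song order
    for block in song.get("Lyrics", []):
        for line in block:
            line = line.strip()
            if not line:
                continue  # skip the blank separators between sections
            if line.startswith("[") and line.endswith("]"):
                # A new header starts a new section.
                sections.append((line[1:-1], []))
            elif sections:
                sections[-1][1].append(line)
            else:
                # Lines that appear before any header.
                sections.append(("Untitled", [line]))
    return sections


# Shortened sample in the same shape as the saved file.
sample = json.loads("""
{
    "Lyrics": [[
        "[Verse 1]",
        "Took a trip to San Francisco",
        "All our friends said we would jive",
        "",
        "[Chorus]",
        "I'm still looking for my own version of America"
    ]]
}
""")

sections = split_sections(sample)
for header, lines in sections:
    print(header, len(lines))
```

Keeping the sections as ordered pairs rather than a dictionary preserves repeated headers, so a chorus that appears twice is counted twice.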
How Scalable is This Solution?
While the solution we provided handles a single song, you could create a list of links to song-lyrics pages on Genius and run the code on that list iteratively. You could also use a regex to match the Genius pages that contain song lyrics and then crawl multiple pages in one go, so that your code itself recognizes the pages that have lyrics.
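The list-and-regex idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the pattern assumes that lyrics pages look like `https://genius.com/<artist-and-song>-lyrics`, the example URLs are made up, and `is_lyrics_url`, `scrape_all`, and the `scrape_one` callback are hypothetical names. In practice `scrape_one` would wrap the extraction logic from the script above.

```python
import re
import time

# Assumption: lyrics pages end in "-lyrics" directly under the site root.
LYRICS_URL = re.compile(r"^https://genius\.com/[A-Za-z0-9-]+-lyrics$")


def is_lyrics_url(url):
    """Return True if the URL looks like a song-lyrics page."""
    return LYRICS_URL.match(url) is not None


def scrape_all(urls, scrape_one, delay_seconds=1.0):
    """Run scrape_one on every lyrics URL, pausing between requests."""
    results = {}
    for url in urls:
        if not is_lyrics_url(url):
            continue  # skip artist pages, album pages, and so on
        results[url] = scrape_one(url)
        time.sleep(delay_seconds)  # be polite to the server
    return results


# Made-up URLs for illustration only.
candidates = [
    "https://genius.com/Example-artist-example-song-lyrics",
    "https://genius.com/artists/Example-artist",
    "https://genius.com/albums/Example-artist/Example-album",
]

# A stub stands in for the real scraper so the sketch runs offline.
scraped = scrape_all(candidates, scrape_one=lambda url: {"url": url}, delay_seconds=0)
print(list(scraped))
```

Passing the scraper in as a callback keeps the filtering and pacing logic separate from the page parsing, so either part can change without touching the other.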
However, DIY solutions like these are good for the budding hobbyist or someone with a one-time need. If your requirement is more commercial and you need features like:
- Data delivery in specific formats
- Data refreshes at a regular frequency
- No maintenance and infrastructure costs
then you should go with a web scraping service provider like PromptCloud. Our team at PromptCloud prides itself on providing enterprise-grade web scraping solutions to business teams across the world, enabling them to use data within their business workflows and make data-driven decisions.