Comparative Lyrical Analysis – Top Pop and Rap Songs of the Past 50 Years
(2019)
Motivation
I love music, and I listen to a lot of it. I’ve always loved it when songs have good lyrics, and I was curious to see what common words were used in top songs throughout the years, in both pop and rap, and if there was any overlap between the genres. For example, is love a common word (and perhaps therefore a common theme) in top pop songs, and also top rap songs? What does that say about both genres? To explore this, I did a lyrical analysis of top pop and rap songs from the past 50 years.
Data Sources
The first step in any data analysis project is finding actual data to analyze.
One of my sets of data is a collection of song lyrics of top pop songs from 1964 – 2015, called “Billboard 1964-2015 Songs + Lyrics: 50 years of pop music lyrics”. I found this dataset on Kaggle, and it is in CSV form, making it easily downloadable and accessible. It’s 1.95 MB, and has 6 columns and 5100 rows.
This dataset can be found here.
I built my other dataset myself using Genius, a website that hosts millions of song lyrics, and an article from Billboard called “Billboard Top 100 Hot Rap Songs” that lists the Billboard top rap songs and their artists from the same timeframe as the first dataset. I took the artist and song names from this list and put them into a CSV with 100 rows.
Genius has specific URLs, consisting of the artist and song name, that lead to each set of song lyrics. Every URL follows the same pattern, so once I had the artist and song name from the article, I was able to create one function that made the URL to the lyrics and another function to scrape them and throw them into a dataframe. This resulted in a hundred-row dataframe with the artist name, song name, and lyrics.
The Billboard Top 100 Hot Rap Songs list can be found here.
The spreadsheet of names I created can be found here.
The Genius website can be found here.
Data Manipulation
Making URLs
As mentioned above, after I had a list of artist and song names, I had to make them into URLs in order to get to the lyrics pages on the Genius website. Luckily Genius URLs follow a basic pattern which I used in my code as follows:
'https://genius.com/' + artist_name + '-' + song_name + '-lyrics'
All punctuation was removed from both the artist and song name, only the first letter of the artist name was capitalized, and all spaces were replaced with “-”. I initially had several problems creating these URLs that didn’t show up until I tried to scrape the actual data. For example, the song “O.P.P.” by Naughty by Nature has the URL “https://genius.com/Naughty-by-nature-opp-lyrics”. My code originally put “-” in place of each “.”, which produced the wrong URL, so I changed it to remove the periods entirely.
Fortunately, this type of error would crash my code, so I was able to figure out what was wrong and make the appropriate corrections to my code. To do this URL making I primarily used regex, .replace(), and re.sub(). I put all of these actions into a function called get_url(), which spit out the appropriate URL given the artist name and song name.
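To make the rules above concrete, here is a minimal sketch of what a get_url() function could look like. It is a reconstruction from the description, not the project's actual code; the helper name slugify() and the exact regular expressions are assumptions.

```python
import re

def slugify(text):
    """Drop punctuation, then join the remaining words with hyphens."""
    no_punct = re.sub(r"[^\w\s]", "", text)        # "O.P.P." -> "OPP"
    return re.sub(r"\s+", "-", no_punct.strip())   # spaces -> "-"

def get_url(artist_name, song_name):
    """Build a Genius lyrics URL from an artist name and a song name."""
    artist = slugify(artist_name).capitalize()  # only the first letter stays upper case
    song = slugify(song_name).lower()
    return "https://genius.com/" + artist + "-" + song + "-lyrics"

print(get_url("Naughty by Nature", "O.P.P."))
# https://genius.com/Naughty-by-nature-opp-lyrics
```

Removing punctuation before replacing spaces is what keeps “O.P.P.” from turning into “O-P-P-”.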
Cleaning the Data
Luckily my pop song dataset was delivered to me fully cleaned and ready to process, but that wasn’t the case for the rap dataset. I had to do all the same things I did to clean the artist and song names: removing punctuation, symbols, and capitalization. There was one additional issue: Genius includes headings within some of its lyrics that say who’s singing or what part of the song it is (ex: [Beyoncé] or [Chorus]), and those had to be removed too. I did all of this using regex, .replace(), .lower(), and re.sub(), and placed it in a function called clean_lyrics() that takes the lyrics as scraped and spits out the cleaned lyrics.
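A cleaning function along these lines might look like the sketch below. The regular expressions and the sample text are my own assumptions for illustration; only the function name clean_lyrics() comes from the project.

```python
import re

def clean_lyrics(raw_lyrics):
    """Return lower-case lyrics with bracketed headings and punctuation removed."""
    no_headings = re.sub(r"\[[^\]]*\]", " ", raw_lyrics)  # drop [Chorus], [Verse 1], ...
    no_punct = re.sub(r"[^\w\s]", "", no_headings)         # "Don't" -> "Dont"
    return re.sub(r"\s+", " ", no_punct).strip().lower()   # collapse whitespace, lower-case

# Made-up sample text, not real lyrics.
sample = "[Chorus]\nI'm down with it, yeah!\n[Verse 1]\nDon't stop"
print(clean_lyrics(sample))
# im down with it yeah dont stop
```

Dropping apostrophes rather than replacing them with spaces is why words like “im” and “dont” show up in the word counts later on.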
Making DataFrames
My pop songs dataset was easy – I just did read_csv.
The rap dataset was more complicated though. Once I had the URL to get the rap song lyrics and a method to clean them, I wanted to store them in a dataframe with the corresponding artist and song name, and so I did. I created a dataframe of the song and artist names and then created a new column where I called the functions I had made on the first two columns of each row and stored the data into the third column, resulting in a nice dataframe of the song name, artist name, and song lyrics, which matched my pop dataset quite nicely.
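The column-building step can be sketched with pandas as follows. To keep the example runnable without a network connection, get_url() is simplified and fetch_lyrics() is a hypothetical placeholder for the real scraper; the artist and song values are made up.

```python
import pandas as pd

def get_url(artist_name, song_name):
    # Simplified stand-in for the URL builder described earlier.
    return "https://genius.com/" + artist_name + "-" + song_name + "-lyrics"

def fetch_lyrics(url):
    # Hypothetical placeholder: a real version would download and parse the page.
    return "lyrics scraped from " + url

# Two made-up rows standing in for the 100-row CSV of artist and song names.
rap = pd.DataFrame({
    "artist": ["Artist-a", "Artist-b"],
    "song": ["song-one", "song-two"],
})

# Call the helpers on the first two columns of each row and store the result in a third.
rap["lyrics"] = rap.apply(
    lambda row: fetch_lyrics(get_url(row["artist"], row["song"])), axis=1
)
print(rap.shape)  # (2, 3)
```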
Analysis and Visualization
The Most Common Words That Appear in Both Genres
First, I wanted to see what the overlap between genres was for top words used. Sometimes words can be pretty indicative of themes, and so I wanted to see that, as well as if there were certain words regardless of genre that ended up in top songs.
To do this, I had to go through several steps. First, I took a random sample of 100 pop songs so that I could directly compare it to the rap songs, as I had way more data on pop songs than on rap songs and I wanted the comparison to be as balanced as possible. Next, I took the top 20 most common words for each genre using Counter() (the method described in the next section), and saved the returned output as a list of tuples: (word, freq). After that, I made a new list from the list of tuples that contained just the word portion. Once I had those two lists, I ran them through a loop that compared the lists against each other and added any words that appeared in both to a new list that held the most common words of both genres. The list varies as the random sample of the pop songs does, but below is an example from one of the trials:
'im', 'dont', 'know', 'like', 'yeah', 'baby', 'got', 'right', 'come', 'want'
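The sampling and comparison steps can be sketched like this. The helper names, the placeholder song titles, and the word counts are all made up for illustration; the real lists come from Counter().

```python
import random

def words_only(top):
    """Strip the counts from a list of (word, freq) tuples."""
    return [word for word, _ in top]

def shared_words(pop_top, rap_top):
    """Return the words present in both top lists, keeping the pop order."""
    shared = []
    rap_words = words_only(rap_top)
    for word in words_only(pop_top):
        if word in rap_words:
            shared.append(word)
    return shared

# Balance the genres: draw as many pop songs as there are rap songs.
pop_titles = ["pop song %d" % i for i in range(5100)]  # placeholders for the real rows
pop_sample = random.sample(pop_titles, k=100)

pop_top = [("love", 40), ("like", 12), ("want", 9)]    # made-up counts
rap_top = [("like", 55), ("got", 30), ("want", 11)]    # made-up counts
print(shared_words(pop_top, rap_top))  # ['like', 'want']
```

Because random.sample() draws a different subset on each run, the shared list can change between trials unless a seed is fixed with random.seed().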
I wanted to create a visualization with this, so I needed the number of occurrences of each shared word in the pop and rap datasets. I got those by going back through the lists returned by Counter(), and stored everything in 3-tuples: (word, num_in_pop, num_in_rap). Once I had those tuples, I made them into a dataframe, and then into a double bar graph.
This visual is great because it directly compares the data, so you can see which genre uses each shared word more. For example, the word “like” has far more occurrences in our rap song dataset than in our pop sample, but the counts for “want” are much closer. This can also point to repetitiveness in songs, because even though we are looking at the same number of songs of each genre, each word appears much more often in the rap dataset.
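Building the (word, num_in_pop, num_in_rap) tuples for the graph can be sketched as below. The function name count_triples() and the counts are assumptions for illustration only.

```python
def count_triples(shared, pop_top, rap_top):
    """Pair each shared word with its count in the pop and rap top lists."""
    pop_counts = dict(pop_top)  # (word, freq) tuples -> {word: freq}
    rap_counts = dict(rap_top)
    return [(word, pop_counts[word], rap_counts[word]) for word in shared]

pop_top = [("love", 40), ("like", 12), ("want", 9)]   # made-up counts
rap_top = [("like", 55), ("got", 30), ("want", 11)]   # made-up counts
triples = count_triples(["like", "want"], pop_top, rap_top)
print(triples)  # [('like', 12, 55), ('want', 9, 11)]
```

From there, one way to get the double bar graph is pandas.DataFrame(triples, columns=["word", "pop", "rap"]) followed by .plot.bar(x="word"), which draws the pop and rap columns as side-by-side bars for each word.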
Most Popular Words in Top Pop and Rap Songs
My next goal was to find the top words that occurred in the top pop and rap songs, separately. In order to do this, I went through each row in each dataset and made one big string of all the lyrics – one for pop and one for rap. After that, I ran them through code to remove all stop words so we wouldn’t get silly words like “the” or “a” as our top words. Then for each of those word blobs, I used Counter().most_common(20) to get the most common words used in all the songs, and the number of times that they occurred.
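Here is a small sketch of that pipeline. The stop-word set is a tiny stand-in I made up (a real run would use a fuller list), and the sample strings are not real lyrics.

```python
from collections import Counter

# Assumed, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "and", "to", "of", "in", "it", "you", "i"}

def most_common_words(songs, n=20, stop_words=STOP_WORDS):
    """Join all lyrics into one string, drop stop words, and count the rest."""
    blob = " ".join(songs)
    words = [w for w in blob.split() if w not in stop_words]
    return Counter(words).most_common(n)  # list of (word, freq) tuples

songs = ["the love of the night", "a love song", "love and the night"]
print(most_common_words(songs, n=2))  # [('love', 3), ('night', 2)]
```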
Other Analysis
Just for fun, I did a little analysis on individual datasets.
One thing I wanted to see was what percentage of top pop songs did and didn’t contain the most common word, “love”. Because all my lyrics are stored as big strings, this wasn’t too hard to do: I just used .find("love") to see if the word was in the lyrics, and if it was, I added to a count. Then I took the percentage of songs with the word and the percentage without, and made a pie chart, shown below:
Another thing I wanted to see was how many unique words there were in total for each genre. I thought this would be interesting because pop songs tend to be a little repetitive, whereas I tend to think rap songs have a bit more variety in the words used. I did this by putting the big word blob of each set of lyrics into a set, which keeps only the unique words. The results were as follows:
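The counting step can be sketched as follows, using made-up strings rather than real lyrics. One thing to keep in mind is that .find() matches substrings, so “glove” and “lovely” also count as hits; the second function is an assumed alternative that only counts the whole word.

```python
def share_containing(songs, word="love"):
    """Percentage of songs whose lyrics contain `word` anywhere, even inside another word."""
    hits = sum(1 for lyrics in songs if lyrics.find(word) != -1)  # find() returns -1 on a miss
    return 100 * hits / len(songs)

def share_with_whole_word(songs, word="love"):
    """Percentage of songs in which `word` appears as a separate word."""
    hits = sum(1 for lyrics in songs if word in lyrics.split())
    return 100 * hits / len(songs)

songs = ["i love you", "lost my glove", "dance all night", "so lovely"]
print(share_containing(songs))       # 75.0
print(share_with_whole_word(songs))  # 25.0
```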
For pop: 42,340 unique words across 5,100 songs, or about 8.3 unique words per song
For rap: 6,169 unique words across 100 songs, or about 61.7 unique words per song
I also went through each data set and found how many unique words were in each song, and then took the average of all those totals for each genre. This gave me the following results:
Average number of unique words in top pop songs: 108.44
Average number of unique words in top rap songs: 236.18
This shows that, at least for these datasets, rap songs on average use a greater variety of words than pop songs do.
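Both unique-word measures can be sketched in a few lines. The function names and the sample strings are made up for illustration; the idea is just that a set drops repeated words.

```python
def total_unique(songs):
    """Number of distinct words across all songs combined."""
    return len(set(" ".join(songs).split()))

def average_unique_per_song(songs):
    """Mean number of distinct words within each individual song."""
    return sum(len(set(lyrics.split())) for lyrics in songs) / len(songs)

songs = ["la la la love", "love you love me", "dance dance"]
print(total_unique(songs))             # 5  (la, love, you, me, dance)
print(average_unique_per_song(songs))  # 2.0  ((2 + 3 + 1) / 3)
```

The two numbers answer different questions: the first counts the vocabulary of the whole genre, while the second looks at variety inside each song, which is why the per-song averages above differ from the totals divided by the number of songs.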
My Code
My code for this project can be found here.