
Beyonce Topic Modeling
- Audrey Taylor-Akwenye
- April 21, 2019
Since she was 16, Beyonce has been strutting her stuff in stilettos and producing original songs in genres from country to rock! With the recent release of her Netflix documentary "Homecoming", fans have been looking back over the two-decade career of the Queen herself. Below is a breakdown of the major topic patterns that can be found throughout her songs.
Latent Dirichlet Allocation (LDA) is a model that treats each document in a collection as a mixture of topics and each topic as a distribution over words, which lets it surface the major topics in a large body of text. I used the Gensim package to analyze all of Beyonce's song lyrics, including her latest album with her hubby Jay Z.
I compiled the Beyonce lyrics from a larger dataset of over 380,000 lyrics from Metro Lyrics. Because the dataset was so large, I had to host the CSV file on AWS S3. Filtering it down to Beyonce gave me a total of 137 songs.
Using pyLDAvis, we can see the 30 most salient terms from Beyonce's lyrics.

The Code
Build Bigrams and Trigrams
Bigrams are pairs of adjacent words that frequently appear together, such as "single ladies", and treating each pair as a single token preserves some context. Trigrams do the same for runs of three words. We also have to remove stop words. These are commonly used words that don't really add to the context of a text.
Remove Stop Words
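As a rough sketch of this step, stop words can be filtered out with a simple set lookup. The short STOP_WORDS set and the remove_stop_words helper below are hypothetical stand-ins for illustration; in practice a fuller list, such as the English stop words shipped with NLTK, would normally be used.

```python
# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {
    "the", "a", "an", "and", "in", "on", "to", "of",
    "me", "my", "you", "your", "so", "all",
}

def remove_stop_words(tokens):
    """Return the tokens that are not in the stop list, preserving order."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

lyric = "Got me looking so crazy in love".split()
cleaned = remove_stop_words(lyric)
print(cleaned)  # ['Got', 'looking', 'crazy', 'love']
```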
Next we create the corpus. We use Gensim's doc2bow (document to bag-of-words) to encode each song's lyrics as a list of (word id, word count) pairs.
Run the LDA Model
Using Gensim, I build the lda_model, which will create 10 topics. Setting passes to 10 means the LDA model will go through the full set of lyrics 10 times during training, which generally helps the topics converge and can improve the coherence value.
Get LDA Model Coherence Scores
Next we use LDA Mallet to generate coherence scores for our LDA model. After experimenting to find the optimal number of topics, we get the 14 topics shown below.
Interpret the Results
Interpreting the topics is actually the most difficult part of topic modeling because it takes a little domain knowledge. My interpretation is below:
Conclusion
Top Topics by Count
Top 4 Topics Over Time
Audrey Taylor-Akwenye
Data Scientist, Educator, Entrepreneur