Beyonce Topic Modeling
Since 16 in her stilettos, she's been strutting her stuff and producing original songs in genres ranging from country to rock! With the recent release of her Netflix documentary "Homecoming", fans have been looking back over the two-decade career of the Queen herself. Below is a breakdown of the major topic patterns found throughout her songs.
Latent Dirichlet Allocation (LDA) is a topic model that treats each document as a mixture of topics and each topic as a distribution over words, then infers those topics from the word patterns in a collection of texts. I used the Gensim package to analyze all of Beyonce's song lyrics, including her latest album with her hubby Jay Z.
I compiled Beyonce's lyrics from a larger Metro Lyrics dataset of over 380,000 lyrics. Because the full dataset was so large, I hosted the csv file on AWS S3. Filtering it for Beyonce gave me a total of 137 songs.
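The filtering step can be sketched with the standard library alone. This is a minimal illustration, not the original loading code: the column names artist and lyrics, the artist value beyonce-knowles, and the inline sample rows are all assumptions standing in for the real csv file on S3.

```python
import csv
import io

# Tiny inline stand-in for the full lyrics csv (hypothetical rows and columns).
SAMPLE_CSV = """song,artist,lyrics
song-a,beyonce-knowles,placeholder lyrics one
song-b,some-other-artist,placeholder lyrics two
song-c,beyonce-knowles,placeholder lyrics three
"""

def lyrics_for_artist(csv_file, artist):
    """Return the lyrics of every row whose artist column equals `artist`."""
    reader = csv.DictReader(csv_file)
    return [row["lyrics"] for row in reader if row["artist"] == artist]

beyonce_lyrics = lyrics_for_artist(io.StringIO(SAMPLE_CSV), "beyonce-knowles")
print(len(beyonce_lyrics))
```

With the real file, the same function would take an open file handle in place of the in-memory sample.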
Using pyLDAvis, we can see the 30 most salient terms from Beyonce's Lyrics.
The Code
Build Bigrams and Trigrams
Bigrams are pairs of words that frequently appear next to each other, and trigrams are runs of three such words. Detecting them lets the model treat a common phrase as a single token, which preserves some context. We also have to remove stop words, which are commonly used words that don't add much meaning to a text.
import gensim
from gensim.utils import simple_preprocess

# Build the bigram and trigram models (data_words: one list of tokens per song)
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Freeze the trained models into lighter, faster phrase detectors
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# stop_words (the words to drop) and nlp (a loaded spaCy pipeline)
# are assumed to be defined before this point.
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Keep only the lemmas of tokens whose part of speech is allowed
        texts_out.append([token.lemma_ for token in doc
                          if token.pos_ in allowed_postags])
    return texts_out
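To show what min_count and threshold control, here is a rough pure-Python sketch modeled on the default scoring formula documented for gensim's Phrases: (pair count - min_count) * vocabulary size / (count of first word * count of second word). It is a simplification: gensim's own vocabulary bookkeeping differs in detail, and the function name and toy sentences are made up for illustration.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Return {(a, b): score} for adjacent word pairs scoring above `threshold`."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(p for s in sentences for p in zip(s, s[1:]))
    vocab_size = len(word_counts)  # simplified: counts distinct single words only
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_count:
            continue  # pair is too rare to be considered a phrase
        score = (n_ab - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy tokenized sentences (not taken from the lyrics dataset).
toy = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "never", "sleeps"],
]
phrases = score_bigrams(toy, min_count=2, threshold=0.5)
print(phrases)
```

Only the pair that appears often relative to its individual words survives. A higher threshold, such as the 100 used above, keeps fewer phrases.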
Create the Dictionary and Corpus
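Next we create the dictionary and the corpus. The dictionary assigns an integer id to each unique word, and doc2bow (document to bag-of-words) encodes each song as a list of (word id, count) pairs.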
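from gensim import corpora

# Map each unique lemma to an integer id
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
# Encode each song as (word id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]
# Human-readable view of the first song: (word, count) pairs
[[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:1]]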
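For readers new to the bag-of-words encoding, the idea can be sketched without gensim. The helper names build_vocab and doc_to_bow are hypothetical, and the ids are assigned in first-seen order, which is not necessarily how gensim's Dictionary numbers them.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two toy documents (not taken from the lyrics dataset).
docs = [["love", "run", "love"], ["run", "world"]]
token2id = build_vocab(docs)
bow = [doc_to_bow(d, token2id) for d in docs]
print(bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded. Only how often each word occurs in a song is kept, which is all LDA needs.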
Run the LDA Model
Using gensim, I build the lda_model with 10 topics. Setting passes=10 makes the model go through the full set of lyrics 10 times during training, which gives the topic estimates more chances to converge and typically improves the coherence score.
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)
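To get a feel for how chunksize, passes and update_every interact on a corpus of 137 songs, the following sketch does the arithmetic. It is a rough model of the training schedule, assuming single-worker online training with update_every of at least 1 and treating update_every as a number of chunks between model updates. The function names are hypothetical.

```python
import math

def chunk_sizes(num_docs, chunksize):
    """Sizes of the consecutive chunks the corpus is split into."""
    full, rest = divmod(num_docs, chunksize)
    return [chunksize] * full + ([rest] if rest else [])

def training_schedule(num_docs, chunksize, passes, update_every=1):
    """Return (chunk sizes, approximate number of model updates overall)."""
    sizes = chunk_sizes(num_docs, chunksize)
    # One update after every `update_every` chunks, plus one for any leftover.
    updates_per_pass = math.ceil(len(sizes) / update_every)
    return sizes, updates_per_pass * passes

sizes, total_updates = training_schedule(137, chunksize=100, passes=10)
print(sizes, total_updates)  # [100, 37] 20
```

Under these assumptions, each pass sees one chunk of 100 songs and one of 37, so 10 passes give roughly 20 model updates.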
Get LDA Model Coherence Scores
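Next I trained a second model with LDA Mallet, an LDA implementation based on Gibbs sampling, and scored it with gensim's CoherenceModel. After experimenting with the number of topics, I settled on the 14 topics shown below. The code assumes ldamallet has already been trained through gensim's Mallet wrapper, which ships with gensim 3.x and was removed in gensim 4.0.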
from pprint import pprint
from gensim.models import CoherenceModel

# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(
    model=ldamallet,
    texts=data_lemmatized,
    dictionary=id2word,
    coherence='c_v'
)
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
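The search for a good number of topics can be sketched as a simple sweep: score each candidate count and keep the best one. The code below is an illustration of that pattern, not the search actually used here. best_topic_count is a hypothetical helper, and toy_score is a made-up stand-in for a function that would train a model with k topics and return its coherence.

```python
def best_topic_count(candidates, score_fn):
    """Score each candidate topic count and return (best_k, all_scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def toy_score(k):
    # Placeholder scoring function, NOT real coherence values.
    # In practice this would train a model with num_topics=k and
    # return CoherenceModel(...).get_coherence() for it.
    return 1.0 / (1 + abs(k - 6))

best_k, scores = best_topic_count(range(2, 11, 2), toy_score)
print(best_k)
```

Coherence usually does not rise smoothly with the topic count, so it is worth looking at the whole scores dictionary rather than only the maximum.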
Interpret the Results
Interpreting the topics is actually the most difficult part of topic modeling because it takes a little domain knowledge. My interpretation is below:
Beyonce's 14 Song Topics
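One way to make the labeling step easier is to reduce each topic to its few highest-weighted words before trying to name it. The sketch below assumes input shaped like the output of show_topics(formatted=False), that is, a list of (topic id, list of (word, weight)) pairs. The helper name and the placeholder words and weights are invented for illustration and are not results from this model.

```python
def top_words_per_topic(topics, n=3):
    """Map each topic id to its n highest-weighted words."""
    summary = {}
    for topic_id, word_weights in topics:
        ranked = sorted(word_weights, key=lambda pair: pair[1], reverse=True)
        summary[topic_id] = [word for word, _ in ranked[:n]]
    return summary

# Placeholder topics with made-up words and weights.
placeholder_topics = [
    (0, [("word_a", 0.10), ("word_b", 0.30), ("word_c", 0.20), ("word_d", 0.05)]),
    (1, [("word_e", 0.25), ("word_f", 0.15)]),
]
summary = top_words_per_topic(placeholder_topics, n=3)
print(summary)
```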
Conclusion
Through topic modeling, we can see that Beyonce's music covers a wide range of themes from empowerment and independence to love and relationships. Her most common topics center around being strong, independent, and in control.