Beyonce Topic Modeling

  •   Audrey Taylor-Akwenye
  •   21-April 

Since 16 in her stilettos, she's been strutting her stuff and producing original songs in genres from Country to Rock! With the recent release of her Netflix documentary "Homecoming", fans have been looking back over the two-decade career of the Queen herself. Below is a breakdown of the major topic patterns that can be found throughout her lyrics.

Latent Dirichlet Allocation (LDA) is a topic model that analyzes a large collection of text documents and surfaces the major topics running through it. I used the Gensim package to analyze all Beyonce song lyrics, including her latest album with her hubby Jay Z.

I compiled all the lyrics for Beyonce from a larger dataset of over 380,000 lyrics from Metro Lyrics. Because the dataset was too large, I had to host the CSV file on AWS S3. Filtering it down to Beyonce gave me a total of 137 songs.
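Here is a minimal sketch of that filtering step, assuming the CSV has `artist`, `song`, and `lyrics` columns. The column names and sample rows are hypothetical, not taken from the actual dataset. It reads rows with Python's standard `csv` module and keeps only one artist's songs:

```python
import csv
import io

# Hypothetical sample standing in for the full lyrics CSV; the real file
# was hosted on S3 and its column names may differ.
SAMPLE_CSV = """artist,song,lyrics
beyonce,song-a,first placeholder lyric
other-artist,song-b,second placeholder lyric
beyonce,song-c,third placeholder lyric
"""

def filter_by_artist(csv_file, artist):
    """Return the rows whose 'artist' column matches the name (case-insensitive)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["artist"].strip().lower() == artist.lower()]

songs = filter_by_artist(io.StringIO(SAMPLE_CSV), "Beyonce")
print(len(songs))  # number of matching songs in the sample
```

For the real file, the same function could be handed an open file object instead of the in-memory sample.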

Using pyLDAvis, we can see the 30 most salient terms from Beyonce's lyrics.

Blog Image

The Code

  1. Build Bigrams and Trigrams

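    Bigrams are pairs of words that frequently appear next to each other, and trigrams are the same idea for three-word sequences. Gensim's Phrases model detects these recurring combinations so they can be treated as single tokens, which keeps some of the context that individual words lose.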

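    import gensim

    # data_words is assumed to be a list of tokenized songs (one list of words per song)
    # Build the bigram and trigram models
    # min_count ignores pairs seen fewer than 5 times; a higher threshold yields fewer phrases
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
    trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
    # Faster way to get a sentence clubbed as a trigram/bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)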
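To see what `min_count` and `threshold` control, here is a plain-Python sketch of the default scoring rule described in gensim's `Phrases` documentation: a word pair is kept when `(pair_count - min_count) * vocab_size / (count_a * count_b)` exceeds the threshold. The toy sentences and the low threshold are made up for illustration. This is a simplification and not gensim's implementation, whose internal bookkeeping differs in detail, so absolute scores will not match.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and return those whose score exceeds threshold."""
    word_counts = Counter(w for s in sentences for w in s)
    pair_counts = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(word_counts)
    kept = {}
    for (a, b), n_ab in pair_counts.items():
        score = (n_ab - min_count) * vocab_size / (word_counts[a] * word_counts[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept

# Invented toy sentences, already tokenized
toy = [
    ["crazy", "in", "love"],
    ["crazy", "in", "love", "tonight"],
    ["so", "crazy", "in", "love"],
    ["love", "on", "top"],
]
phrases = score_bigrams(toy, min_count=1, threshold=1.0)
print(sorted(phrases))  # pairs that recur often enough to count as phrases
```

Pairs that occur only once score zero here because of the `min_count` discount, which is why raising `min_count` or `threshold` produces fewer phrases.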
  2. Remove Stop Words

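    Stop words are words that are very commonly used in the English language, such as "the" and "and". Removing them helps us focus on the words that give us an understanding of the lyrics. The same step also applies the bigram model and lemmatizes the remaining words, reducing each to its base form.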

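    import spacy
    # remove_stopwords, make_bigrams, and lemmatization are helper functions defined elsewhere
    data_words_nostops = remove_stopwords(data_words)
    data_words_bigrams = make_bigrams(data_words_nostops)
    # spaCy 3+ requires the full model name; the 'en' shortcut only works in older versions
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    # Keep only nouns, adjectives, verbs, and adverbs
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])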
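The helper functions are not shown in this post. As an illustration of the stop-word step only, here is a minimal sketch of what a function like `remove_stopwords` might do. It uses a tiny hand-picked stop list, which is an assumption; a real run would use a full list such as NLTK's English stop words.

```python
# Tiny illustrative stop list; an assumption, not the list used in the analysis.
STOP_WORDS = {"the", "a", "an", "and", "to", "in", "on", "of", "i", "you", "me", "my"}

def remove_stopwords(docs, stop_words=STOP_WORDS):
    """Drop stop words from each tokenized document (a list of lowercase tokens)."""
    return [[tok for tok in doc if tok not in stop_words] for doc in docs]

docs = [["i", "put", "my", "love", "on", "top"], ["to", "the", "left"]]
cleaned = remove_stopwords(docs)
print(cleaned)  # [['put', 'love', 'top'], ['left']]
```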
  3. Convert Lyrics to Bag of Words

    Using doc2bow, I converted each song into a bag of words: a list of (word id, count) pairs that records how many times each word appears in the lyrics, ignoring word order.

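    from gensim import corpora
    # Map each unique word to an integer id
    id2word = corpora.Dictionary(data_lemmatized)
    texts = data_lemmatized
    # Convert each song into a list of (word id, count) pairs
    corpus = [id2word.doc2bow(text) for text in texts]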
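To make the bag-of-words format concrete, here is a plain-Python sketch of the idea behind `Dictionary` and `doc2bow`: give each distinct word an integer id, then represent each document as sorted `(id, count)` pairs. The toy documents are invented, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens found in vocab."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

toy_docs = [["love", "top", "love"], ["halo", "love"]]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Note that word order is discarded: only which words appear, and how often, reaches the topic model.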
  4. Run the LDA Model

    Using gensim, I built the lda_model, which creates 10 topics. The 10 passes mean the model goes through the full set of lyrics 10 times during training, which helps the topics settle and can improve the coherence value.

    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,


    Blog Image
    Top Topics by Count
    Blog Image
    Top 4 Topics Over Time
Audrey Taylor-Akwenye

Data Scientist, Educator, Entrepreneur