P O P D A T A

Beyonce Topic Modeling

  •   Audrey Taylor-Akwenye
  •   21-April 

Since she was 16 in her stilettos, she's been strutting her stuff and producing original songs in genres from Country to Rock! With the recent release of her Netflix documentary "Homecoming", fans have been looking back over the two-decade career of the Queen herself. Below is a breakdown of the major topic patterns that can be found throughout her songs.


Latent Dirichlet Allocation (LDA) is a model that analyzes a large collection of text documents and surfaces the major topics running through them. I used the Gensim package to analyze all of Beyonce's song lyrics, including her latest album with her hubby Jay Z.

I compiled all of Beyonce's lyrics from a larger dataset of over 380,000 lyrics from Metro Lyrics. Because the dataset was so large, I had to host the csv file on AWS S3. Filtering it down to Beyonce gave me a total of 137 songs.
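To make the filtering step concrete, here is a minimal sketch that streams a CSV row by row and keeps a single artist, using only the standard library. The column names (`song`, `artist`, `lyrics`), the artist slug, and the sample rows are all assumptions for illustration; the real file may be laid out differently, and the post does not say how the filtering was actually done.

```python
import csv
import io

# Tiny in-memory stand-in for the real lyrics file. The column names and the
# artist slug are assumptions, and the rows are placeholders.
SAMPLE_CSV = """song,artist,lyrics
song-a,beyonce-knowles,placeholder lyric one
song-b,other-artist,placeholder lyric two
song-c,beyonce-knowles,placeholder lyric three
"""

def songs_by_artist(csv_file, artist):
    """Read rows one at a time and keep only those for the requested artist."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["artist"] == artist]

beyonce_songs = songs_by_artist(io.StringIO(SAMPLE_CSV), "beyonce-knowles")
print(len(beyonce_songs), "songs kept")
```

Because `csv.DictReader` reads lazily, the same function works on any open file object without loading the whole dataset at once.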




Using pyLDAvis, we can see the 30 most salient terms from Beyonce's lyrics.



The Code

  1. Build Bigrams and Trigrams

    Bigrams are pairs of words that frequently appear next to each other (for example "thinking_bout"). Joining them into a single token keeps some of the context that single words lose. Trigrams do the same for runs of three words. We also have to remove stop words. These are commonly used words that don't really add to the meaning of a text.

    
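    import gensim
    import spacy
    from gensim.utils import simple_preprocess
    from nltk.corpus import stopwords

    # Assumed setup: NLTK's English stop words and a small spaCy English model.
    # data_words is the tokenized lyrics, one list of words per song.
    stop_words = stopwords.words('english')
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    # Build the bigram and trigram models
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
    trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

    # Freeze the models for faster phrase detection
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)

    def remove_stopwords(texts):
        """Tokenize each document and drop stop words."""
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

    def make_bigrams(texts):
        """Join detected two-word phrases into single tokens."""
        return [bigram_mod[doc] for doc in texts]

    def make_trigrams(texts):
        """Join detected three-word phrases into single tokens."""
        return [trigram_mod[bigram_mod[doc]] for doc in texts]

    def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
        """Lemmatize each document, keeping only the allowed parts of speech.

        See https://spacy.io/api/annotation
        """
        texts_out = []
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return texts_out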
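To show what `min_count` and `threshold` are doing, here is a small standard-library sketch of the kind of score Gensim's default phrase scorer computes: (pair count - min_count) / (count of first word * count of second word) * vocabulary size. It is a simplification, not Gensim's implementation. In particular, it takes the vocabulary size to be the number of distinct words, and the toy sentences and the `find_bigrams` helper are made up for illustration.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those scoring above `threshold`."""
    word_counts = Counter(word for sent in sentences for word in sent)
    pair_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(word_counts)  # simplification: distinct words only
    kept = {}
    for (first, second), pair_count in pair_counts.items():
        score = (pair_count - min_count) / (word_counts[first] * word_counts[second]) * vocab_size
        if score > threshold:
            kept[(first, second)] = score
    return kept

# Toy tokenized sentences, invented for this example
sentences = [
    ["thinking", "bout", "you"],
    ["thinking", "bout", "love"],
    ["you", "love", "me"],
    ["love", "you", "baby"],
]
phrases = find_bigrams(sentences, min_count=1, threshold=1.0)
print(phrases)
```

Only the pair that always appears together clears the threshold. Raising `threshold` (the post uses 100) makes the model pickier about which pairs get joined.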
    
    
  2. Create the Dictionary and Corpus

    Next we create the corpus. We use doc2bow (document to bag-of-words) to encode each song's lyrics as a list of (word id, count) pairs.

     
    
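    from gensim import corpora
    # Map each unique word to an integer id
    id2word = corpora.Dictionary(data_lemmatized)
    texts = data_lemmatized
    # Encode each song as (word id, count) pairs
    corpus = [id2word.doc2bow(text) for text in texts]
    # Human-readable view of the first song's bag of words
    [[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:1]]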
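For readers who have not met bag-of-words before, this standard-library sketch mimics the idea: give every distinct word an integer id, then describe each document by how many times each id appears. The `build_vocab` and `to_bow` helpers and the toy documents are hypothetical, and the ids Gensim assigns may differ from the ones here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Toy lemmatized documents, invented for this example
docs = [["love", "baby", "love"], ["girl", "run", "world", "girl"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

Word order is thrown away in this encoding; only the counts survive, which is all LDA needs.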
    		 
    
    
  3. Run the LDA Model

    Using Gensim, I build lda_model, which will find 10 topics. Setting passes=10 means the model makes 10 full passes over the lyrics during training, which gives the topics more chances to converge and tends to improve the coherence value.

     
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=10,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha='auto',
                                                per_word_topics=True)
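As a rough back-of-the-envelope check on what these settings mean for 137 songs, the arithmetic below counts how many chunks the model sees. It assumes, as I understand Gensim's online training, that with update_every=1 the model is updated after every chunk of `chunksize` documents.

```python
import math

num_songs = 137   # corpus size reported earlier in the post
chunksize = 100   # documents per training chunk
passes = 10       # full sweeps over the corpus

chunks_per_pass = math.ceil(num_songs / chunksize)
total_chunk_updates = chunks_per_pass * passes
print(chunks_per_pass, total_chunk_updates)
```

So each pass splits the lyrics into 2 chunks (100 songs, then 37), and 10 passes give 20 chunk updates in total.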
    
    
  4. Get LDA Model Coherence Scores

    Next we use LDA Mallet and compute a coherence score for the model. After experimenting to find the optimal number of topics, we get 14 topics, which are shown below.

     
    from pprint import pprint
    from gensim.models import CoherenceModel

    # Show Topics
    pprint(ldamallet.show_topics(formatted=False))

    # Compute Coherence Score
    coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
    coherence_ldamallet = coherence_model_ldamallet.get_coherence()
    print('\nCoherence Score: ', coherence_ldamallet)
    
    OUTPUT of TOPIC DOMINANCE
    [(0,
      '0.044*"big" + 0.041*"back" + 0.037*"talk" + 0.031*"upgrade" + 0.029*"good" '
      '+ 0.028*"hustler" + 0.024*"call" + 0.023*"strong" + 0.023*"tough" + '
      '0.022*"diva"'),
     (1,
      '0.087*"hand" + 0.077*"hold" + 0.071*"man" + 0.060*"daddy" + 0.054*"side" + '
      '0.044*"smack" + 0.033*"air" + 0.028*"care" + 0.027*"said_shoot" + '
      '0.027*"put"'),
     (2,
      '0.092*"girl" + 0.038*"freedom" + 0.028*"wanna" + 0.020*"make" + '
      '0.019*"fall" + 0.019*"kind" + 0.018*"baby" + 0.018*"dress" + 0.018*"daddy" '
      '+ 0.018*"club"'),
     (3,
      '0.060*"make" + 0.050*"pull" + 0.047*"leave" + 0.035*"real" + 0.034*"friend" '
      '+ 0.024*"die" + 0.023*"boss" + 0.018*"bad" + 0.018*"gon" + 0.017*"jump"'),
     (4,
      '0.240*"girl" + 0.165*"run" + 0.127*"world" + 0.048*"motha" + 0.033*"catch" '
      '+ 0.026*"boy" + 0.021*"pray" + 0.020*"bring" + 0.018*"play" + '
      '0.015*"check"'),
     (5,
      '0.079*"slay" + 0.039*"make" + 0.035*"good" + 0.035*"lady" + '
      '0.033*"flawless" + 0.025*"ride" + 0.023*"teach" + 0.022*"rock" + '
      '0.022*"bitch" + 0.018*"hard"'),
     (6,
      '0.049*"work" + 0.046*"hurt" + 0.036*"ring" + 0.033*"lie" + 0.032*"money" + '
      '0.032*"beautiful" + 0.030*"give" + 0.026*"back" + 0.025*"start" + '
      '0.025*"cry"'),
     (7,
      '0.117*"turn" + 0.051*"cherry" + 0.044*"feel" + 0.030*"til" + 0.028*"wait" + '
      '0.026*"stand" + 0.024*"morning" + 0.019*"home" + 0.017*"time" + '
      '0.015*"stop_lov"'),
     (8,
      '0.080*"life" + 0.056*"town" + 0.052*"care" + 0.041*"live" + 0.035*"bow" + '
      '0.025*"ground" + 0.021*"forget" + 0.020*"watch" + 0.020*"man" + '
      '0.020*"bitch"'),
     (9,
      '0.107*"baby" + 0.051*"put" + 0.050*"top" + 0.041*"time" + 0.038*"day" + '
      '0.031*"make" + 0.027*"boy" + 0.023*"whatev" + 0.023*"stay" + '
      '0.023*"immortal"'),
     (10,
      '0.073*"wanna" + 0.048*"tonight" + 0.047*"babe" + 0.046*"body" + '
      '0.040*"baby" + 0.032*"show" + 0.023*"rock" + 0.021*"feel" + 0.018*"move" + '
      '0.018*"dance"'),
     (11,
      '0.313*"love" + 0.077*"crazy" + 0.048*"baby" + 0.023*"break" + 0.021*"feel" '
      '+ 0.020*"hop" + 0.020*"touch" + 0.019*"make" + 0.017*"jealous" + '
      '0.016*"promise"'),
     (12,
      '0.078*"good" + 0.038*"hear" + 0.031*"put" + 0.030*"lose" + 0.025*"wanna" + '
      '0.022*"thing" + 0.022*"time" + 0.022*"face" + 0.021*"boy" + '
      '0.019*"thinking_bout"'),
     (13,
      '0.109*"love" + 0.092*"night" + 0.076*"light" + 0.041*"baby" + 0.036*"kiss" '
      '+ 0.025*"long" + 0.020*"rub" + 0.017*"feel" + 0.014*"boy" + 0.012*"sweet"')]
    
    
    
  5. Interpret the Results

    Interpreting the topics is actually the most difficult part of topic modeling because it takes a little domain knowledge. My interpretation is below:

     
    0.0 : "I'm a Hustler" - "big" "back" "talk" "upgrade" "good" "hustler" "call" "strong" "tough" "diva"
    1.0 : "I will fight you" - "hand" "hold" "man" "daddy" "side" "smack" "air" "care" "said_shoot" "put"
    2.0 : "I'm independent" - "girl" "freedom" "wanna" "make" "fall" "kind" "baby" "dress" "daddy" "club"
    3.0 : "I rep my crew" - "make" "pull" "leave" "real" "friend" "die" "boss" "bad" "gon" "jump"
    4.0 : "Girl Power" - "girl" "run" "world" "motha" "catch" "boy" "pray" "bring" "play" "check"
    5.0 : "I slay" - "slay" "make" "good" "lady" "flawless" "ride" "teach" "rock" "bitch" "hard"
    6.0 : "Don't Play Me" - "work" "hurt" "ring" "lie" "money" "beautiful" "give" "back" "start" "cry"
    7.0 : "S*xual Relations" - "turn" "cherry" "feel" "til" "wait" "stand" "morning" "home" "time" "stop_lov"
    8.0 : "I'm the Queen" - "life" "town" "care" "live" "bow" "ground" "forget" "watch" "man" "bitch"
    9.0 : "Leave Something Behind" - "baby" "put" "top" "time" "day" "make" "boy" "whatev" "stay" "immortal"
    10.0 : "I like to Party" - "wanna" "tonight" "babe" "body" "baby" "show" "rock" "feel" "move" "dance"
    11.0 : "I'm crazy" - "love" "crazy" "baby" "break" "feel" "hop" "touch" "make" "jealous" "promise"
    12.0 : "You cheat, You Crazy" - "good" "hear" "put" "lose" "wanna" "thing" "time" "face" "boy" "thinking_bout"
    13.0 : "But I Love you" - "love" "night" "light" "baby" "kiss" "long" "rub" "feel" "boy" "sweet"
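Going from the raw topic strings to a word list like the one above can be automated. The sketch below is one possible way to do it with a regular expression. The `parse_topic` helper is hypothetical, and the input is the first part of topic 0 from the output shown earlier.

```python
import re

# Matches one weighted term such as 0.044*"big"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Split a Gensim-style topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic_0 = '0.044*"big" + 0.041*"back" + 0.037*"talk" + 0.031*"upgrade" + 0.029*"good"'
terms = parse_topic(topic_0)
label_line = "0 : " + " ".join(f'"{word}"' for word, _ in terms)
print(label_line)
```

The topic name itself still has to come from a human. The code only lines up the words so they are easier to read side by side.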
    
    
    


    Conclusion

Author
Audrey Taylor-Akwenye

Data Scientist, Educator, Entrepreneur