
Who's the @RealQaiQai?

  •   Audrey Taylor-Akwenye
  •   08-April 

Natural Language Processing is a Machine Learning application in which we train a computer to analyze and attempt to interpret human-readable text. Unfortunately, the underlying language of computers is math, so there needs to be a method to translate human-readable text into math. Word to Vector is a method to do just that. I used Basilica to create Word2Vec embeddings of hundreds of tweets to solve one of the biggest social media mysteries of the last year... WHO IS @REALQAIQAI?

In mid-August 2018, a new account, @RealQaiQai, sent her first tweet. This alone doesn’t seem odd, as more than half a million users tweet daily. However, Qai Qai is a small African American baby doll who’s the partner-in-crime to Olympia Ohanian, the daughter of Serena Williams and Alexis Ohanian. @RealQaiQai has since become America’s favorite doll, with over 20k Twitter followers.

The world became obsessed with the blossoming relationship between Serena and Alexis when the couple announced their engagement in late 2016. In September 2017, Olympia was born, and by August 2018, the new parents were grandparents. Qai Qai the doll became an instant hit on Twitter and Instagram with her catchy clapbacks and beautiful pics from a doll’s point of view. However, while Olympia’s Twitter profile clearly explains that both mom and dad manage the account, the adoring fans of @RealQaiQai have no idea which superstar parent is the author behind the account.

Many believe it to be Alexis, the co-founder of Reddit, because of his background in tech and love of social media. Using Natural Language Processing and a Logistic Regression model, we can finally put this mystery to bed.

The Code

  1. Using the Twitter API

    The first step to solving this whodunit is to pull the tweet history from @alexisohanian, @serenawilliams, and @RealQaiQai. This was done through Twitter’s developer API.

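    # User 1
    username = "serenawilliams"
    # TWITTER is the authenticated Twitter API client; fetch up to 200 original tweets (no replies, no retweets)
    tweets = TWITTER.user_timeline(username, count=200, exclude_replies=True, include_rts=False, tweet_mode='extended')
    tweets_for_csv = [tweet.full_text for tweet in tweets]  # full tweet text, to be saved to a CSV file
    for j in tweets_for_csv:
        pass  # the loop body is cut off in the original post; it presumably wrote each tweet to the CSV file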
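
The same pull has to be repeated for all three accounts, so it can help to wrap it in a small helper. The sketch below is one way to do that, not the code used for this post. The names `fetch_tweet_texts` and `FakeClient` are hypothetical, and the helper assumes the client exposes a tweepy-style `user_timeline` method. `FakeClient` only exists so the example runs offline without API keys.

```python
from types import SimpleNamespace


def fetch_tweet_texts(client, username, count=200):
    """Return the full text of a user's recent original tweets.

    `client` is assumed to expose a tweepy-style `user_timeline` method.
    """
    tweets = client.user_timeline(
        screen_name=username,
        count=count,
        exclude_replies=True,   # skip replies
        include_rts=False,      # skip retweets
        tweet_mode="extended",  # ask for untruncated text in `full_text`
    )
    return [tweet.full_text for tweet in tweets]


class FakeClient:
    """Offline stand-in for the Twitter client, used only for this example."""

    def user_timeline(self, screen_name, count, **kwargs):
        sample = [f"{screen_name} tweet {i}" for i in range(3)]
        return [SimpleNamespace(full_text=text) for text in sample[:count]]


accounts = ["serenawilliams", "alexisohanian", "RealQaiQai"]
timelines = {name: fetch_tweet_texts(FakeClient(), name) for name in accounts}
```

With a real client in place of `FakeClient()`, `timelines` would map each account name to its list of tweet texts.
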
  2. Word to Vector with Basilica

    Once we have all the tweets, we use Basilica to convert the text to vectors. This process turns each tweet into a collection of numbers that can be understood by the computer. Once the tweets are vectors, the computer can analyze their speech patterns, word usage, and sentiment.

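    user1 = []  # one embedding per tweet for user 1
    for tweet in tweets:
        # BASILICA is the Basilica client; model='twitter' selects its tweet embedding model
        user1_embedding = BASILICA.embed_sentence(tweet.full_text, model='twitter')
        user1.append(user1_embedding)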
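
To make "turning text into numbers" concrete, here is a deliberately simple stand-in. It is not Basilica's model and not Word2Vec: it just hashes each word into one of a fixed number of buckets and normalizes the counts. The function names `toy_embed` and `cosine` and the dimension of 16 are assumptions chosen for illustration. The point is only that every tweet, whatever its length, ends up as a list of numbers of the same size.

```python
import hashlib
import math


def toy_embed(text, dim=16):
    """Map text to a fixed-length unit vector by hashing each word into a bucket."""
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % dim] += 1.0  # count words per bucket
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


tweets = ["Game day with Olympia", "game day with olympia", "Reddit AMA tonight"]
embeddings = [toy_embed(t) for t in tweets]
```

A learned embedding model places tweets with similar meaning close together, which this hashing trick cannot do. It only shows the shape of the data that the next step consumes.
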
  3. Apply Logistic Regression Model

    We then feed these vectors into a logistic regression model. We train the model to identify which tweets were written by Serena and which were written by Alexis. After training, we feed it the word-to-vector embeddings of Qai Qai's tweets. The model can then predict, for each tweet, whether it is more similar to Serena's tweets or to Alexis's.

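    import numpy as np
    from sklearn.linear_model import LogisticRegression
    # Stack both users' embeddings into one training matrix (one row per tweet)
    embeddings = np.vstack([user1, user2])
    # Label user1's tweets 1 and user2's tweets 0
    labels = np.concatenate([np.ones(len(user1)), np.zeros(len(user2))])
    log_reg = LogisticRegression().fit(embeddings, labels)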
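
For readers curious about what fitting a logistic regression involves, the sketch below trains one from scratch with plain gradient descent on made-up two-dimensional "embeddings". It is a teaching sketch under stated assumptions: scikit-learn uses different solvers and adds regularization, the functions `fit_logistic` and `predict` are hypothetical names, and the toy data has nothing to do with the real tweets.

```python
import math


def sigmoid(z):
    """Squash a score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # error = predicted probability minus true label
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the model thinks x belongs to user1, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Made-up 2-D vectors standing in for tweet embeddings
user1 = [[1.0, 0.2], [0.9, 0.1], [0.8, 0.3]]  # label 1
user2 = [[0.1, 0.9], [0.2, 1.0], [0.3, 0.8]]  # label 0
X = user1 + user2
y = [1] * len(user1) + [0] * len(user2)
w, b = fit_logistic(X, y)
```

Calling `predict(w, b, [0.95, 0.15])` then asks the same question the post asks of each Qai Qai tweet: which of the two authors does this vector look more like?
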
  4. Predict Parent Author

    And finally, we can compare the predictions for each tweet from @RealQaiQai to see which parent was predicted more often. Drum roll, please...

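    preds = []
    for tweet in tmp3:  # tmp3 holds the text of each @RealQaiQai tweet
        tweet_embedding = BASILICA.embed_sentence(tweet, model='twitter')
        # reshape(1, -1) turns the single embedding into a one-row matrix for predict()
        prediction = log_reg.predict(np.array(tweet_embedding).reshape(1, -1))
        preds.append(prediction[0])  # 1 = user 1, 0 = user 2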


    The parent behind @RealQaiQai is roughly twice as likely to be Serena as Alexis. The results don't definitively say that Serena Williams is the only author of @RealQaiQai, but it looks likely.
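
A "which parent was predicted more often" comparison boils down to counting the per-tweet predictions. The sketch below shows one way to do that tally. The helper name `tally_predictions`, the label mapping (1 for Serena, 0 for Alexis), and the example predictions are all assumptions for illustration, not the post's actual output.

```python
from collections import Counter


def tally_predictions(preds, names):
    """Count predictions per label and return each author's share of the total."""
    counts = Counter(int(p) for p in preds)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in names.values()}
    return {names[label]: counts.get(label, 0) / total for label in names}


names = {1: "Serena", 0: "Alexis"}   # assumed label mapping
example_preds = [1, 1, 0, 1, 0, 1]   # made-up predictions, not the post's data
shares = tally_predictions(example_preds, names)
```

Here `shares` holds the fraction of tweets attributed to each parent, which is the kind of ratio the conclusion above summarizes.
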
Audrey Taylor-Akwenye

Data Scientist, Educator, Entrepreneur