Twitter demographics: User Age Inference

Published in Towards Data Science, Feb 1, 2018

Social media data scientists use social network data to build products, but some social networks are harder to mine than others, and Twitter is one of them. Twitter is a continuous and abundant source of data, adopted globally and maintained by millions of users who share their interests and opinions with their digital communities. Data extracted from it can be used to study human behavior at large scale, and by organizations to analyze their impact on social media audiences. However, the data is incomplete: demographic information is almost always absent.

Enriching Twitter data with demographic features is of great importance for both industry and research applications.

Some efforts to classify Twitter users by age use tweet metadata (original tweet count, retweet rate, reply rate, etc.) as features, while others rely on linguistic features of the users' tweets; neither has been very successful. Linguistic models have been reported to perform best, but a linguistic approach makes the model language dependent, which makes it difficult to generalize a method for estimating user age.

In this post I share a supervised machine learning solution that takes a language-independent approach to the age inference problem. The approach has a tradeoff, however: it depends completely on having the user's friends list to make predictions.

Twitter is commonly regarded as an interest network because following an account is a ‘one way’ interaction. Taking this as a starting point, I applied the homophily principle:

“people who are close in age have similar interests as a result of age-related life events”

In other words: people of the same or similar age will, in general, follow the same accounts.

I was able to build a model that uses each user's friends as features to train a random forest classifier on an age-labeled dataset, and then infers any user's age class using only the accounts he or she follows.

Training Data

The training dataset consisted of ~7000 users who tweeted their age on their birthday. It was built by using a regex to extract the reported age, bringing it to its present value, taking that as ground truth, and labeling it with one of four classes: Teens (27.91%), Twenties (49.64%), Thirties (12.71%) and Forties Plus (9.73%).
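
As a rough illustration of that labeling step, here is a minimal sketch. The regex, the function names and the class cut points (20, 30, 40) are assumptions made for the example; the post does not show the pattern or the exact boundaries that were used.

```python
import re
from datetime import date

# Hypothetical pattern: the regex actually used is not shown in the post.
AGE_PATTERN = re.compile(
    r"\b(?:turning|turned|turn|cumplo|cumpliendo)\s+(\d{1,2})\b", re.IGNORECASE
)

def reported_age(text):
    """Return the age stated in a birthday tweet, or None if no age is found."""
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None

def present_age(age_at_tweet, tweet_date, today):
    """Add the full years elapsed since the birthday tweet to the reported age."""
    years = today.year - tweet_date.year
    if (today.month, today.day) < (tweet_date.month, tweet_date.day):
        years -= 1  # this year's birthday has not happened yet
    return age_at_tweet + years

def age_class(age):
    """Map an age to one of the four classes (cut points are assumed)."""
    if age < 20:
        return "Teens"
    if age < 30:
        return "Twenties"
    if age < 40:
        return "Thirties"
    return "Forties Plus"

tweet = "Hoy cumplo 19!"
age_now = present_age(reported_age(tweet), date(2016, 5, 10), date(2018, 2, 1))
print(age_now, age_class(age_now))
```

Because the tweet is posted on the birthday itself, the number of full years elapsed since the tweet date is exactly the number of birthdays the user has had since then.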

The age distribution of the training dataset is skewed towards younger ages, since older people tend not to tweet their new age on their birthday.

For each user in the dataset I downloaded the Twitter IDs of his or her friends (the accounts the user follows) using Twitter’s API.

I then used Sklearn’s TfidfVectorizer to automatically extract the features from each user's friends list. With the vectorizer parameters max_df and min_df it is possible to find adequate thresholds on the frequencies of the friends’ IDs (as if they were words in a text corpus).

It is important to use the friends’ IDs and not their screen names, since a screen name can change at any time.
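
To make the "friends list as a text corpus" idea concrete, here is a small sketch with made-up IDs. Each user becomes a document whose words are the IDs he or she follows. The toy data, the token_pattern setting and the threshold values are assumptions for illustration, not the values used in the actual model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: user -> IDs of the accounts that user follows (all IDs are invented).
friends_by_user = {
    "u1": ["101", "102", "103"],
    "u2": ["101", "102", "104"],
    "u3": ["101", "103", "105"],
    "u4": ["101", "102", "106"],
}

# One "document" per user: the followed IDs joined by spaces.
docs = [" ".join(ids) for ids in friends_by_user.values()]

vectorizer = TfidfVectorizer(
    token_pattern=r"\S+",  # treat every whitespace-separated ID as one token
    min_df=2,              # drop accounts followed by fewer than 2 users
    max_df=0.9,            # drop accounts followed by more than 90% of users
)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X.shape)
```

In this toy corpus the account followed by everyone carries no information about age and is removed by max_df, while accounts followed by a single user are removed by min_df, so only the IDs in between survive as features.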

Age Classification Model

After testing several models (SVC, Logistic Regression and Fully Connected Neural Networks) on the labeled data, the best classifier was a Random Forest Classifier combined with a random over sampler to balance the heavily imbalanced classes.
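
The sketch below shows what random over sampling followed by a random forest can look like. The post does not say which over sampler was used (the imbalanced-learn package offers a RandomOverSampler, for example), so the resampling here is written by hand with NumPy, and the data is random noise that only mimics the class imbalance. The function name random_oversample is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_oversample(X, y, seed=0):
    """Duplicate random rows of each class until all classes match the largest one."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for label in classes:
        members = np.flatnonzero(y == label)
        # Draw the missing rows with replacement from the class's own members.
        extra = rng.choice(members, size=target - len(members), replace=True)
        selected.append(np.concatenate([members, extra]))
    idx = np.concatenate(selected)
    return X[idx], y[idx]

# Synthetic, imbalanced toy data (features are random, labels only mimic the skew).
rng = np.random.RandomState(42)
y = np.array(["Twenties"] * 40 + ["Teens"] * 20 + ["Thirties"] * 10 + ["Forties Plus"] * 10)
X = rng.rand(len(y), 5)

X_bal, y_bal = random_oversample(X, y)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)

print(dict(zip(*np.unique(y_bal, return_counts=True))))
```

In a real evaluation the over sampling should be applied to the training portion only, so that duplicated rows never leak into the validation data.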

After running a cross-validated grid search for hyper-parameter optimization, the model scored strong training and validation metrics. The class labels in the reported results are:

0: Forties Plus, 1: Teens, 2: Thirties, 3: Twenties
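
A cross-validated grid search of this kind could look roughly like the following. The parameter grid, the macro-F1 scoring choice and the synthetic four-class data are assumptions for the sketch; the post does not list the hyper-parameters that were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the friend-ID feature matrix and the four age classes.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# Hypothetical grid: small on purpose so the example runs quickly.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross validation for every combination
    scoring="f1_macro",  # weights all classes equally, useful with imbalance
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```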

The metrics of the final model are quite good on all of the age classes except Twenties. This is expected: it is hard for the model to tell a nineteen-year-old from a twenty-year-old using only a vector of their interests, because those interests may overlap.

Model Shortcomings

This age inference model has two big shortcomings:

It relies on the follows lists seen in the training set. When predicting for a new Twitter ID, the “friend vector” may be almost or completely empty. One solution is to cut the friend distribution of each age class at thresholds that retain an adequate amount of information. However, since the training set is very limited, this solution has yet to be verified on larger datasets.
Because it uses only information about the accounts a user follows, the method is difficult to generalize: the training set may not represent every age class everywhere (e.g. teens in Mexico may not have the same interests as teens in Japan).
The dataset used to train and validate the model is a very small sample of Twitter users and is restricted to users in Mexico.
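
The empty "friend vector" problem can be detected before predicting. The sketch below checks how many of a new user's friends are known features; the function names and the min_known threshold are hypothetical and would need tuning on real data.

```python
def vocabulary_coverage(friend_ids, vocabulary):
    """Return (number of known friends, share of the user's friends that are known)."""
    known = [f for f in friend_ids if f in vocabulary]
    share = len(known) / len(friend_ids) if friend_ids else 0.0
    return len(known), share

def can_predict(friend_ids, vocabulary, min_known=2):
    """Only trust a prediction when enough friends map to training features."""
    known_count, _ = vocabulary_coverage(friend_ids, vocabulary)
    return known_count >= min_known

# Feature IDs learned from the training set (invented for the example).
vocabulary = {"102", "103", "207", "311"}

new_user = ["102", "311", "999", "555"]
print(vocabulary_coverage(new_user, vocabulary), can_predict(new_user, vocabulary))
```

With a fitted scikit-learn vectorizer, the keys of its vocabulary_ attribute could play the role of the vocabulary set used here.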

Another shortcoming is the difficulty of applying this method to large amounts of Twitter user data, since downloading the friends list of every account is time consuming and expensive.
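
A back-of-the-envelope estimate shows why. The default limits below (5000 IDs per request, 15 requests per 15-minute window) are assumptions based on the friends/ids endpoint of Twitter's v1.1 REST API as it was documented at the time; current limits may differ, so treat them as parameters.

```python
import math

def download_hours(friend_counts, ids_per_request=5000,
                   requests_per_window=15, window_minutes=15):
    """Rough time to fetch all friends lists with one token, counting full windows."""
    # Every user needs at least one request, plus one per extra page of IDs.
    requests = sum(max(1, math.ceil(n / ids_per_request)) for n in friend_counts)
    windows = math.ceil(requests / requests_per_window)
    return requests, windows * window_minutes / 60

# Example: 7000 users who each follow 800 accounts (friend counts are invented).
requests, hours = download_hours([800] * 7000)
print(requests, hours)
```

Under these assumptions a dataset the size of the training set already takes several days with a single token, which is why scaling the method to large audiences is costly.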

Conclusions:

In this post I shared a supervised approach to classifying Twitter users by age group. Like any other supervised approach, it has shortcomings and tradeoffs.

The solution consisted of:

Collect Twitter users who tweeted their age on their birthday and bring that age to its present value.
Use the Twitter API to request the Twitter IDs of each user's friends (the accounts they follow).
Preprocess all the data and group the users by age group label.
Build a random forest classifier, using a TF-IDF vectorizer to extract features from the users' follows.

The model behaved fairly well “in the wild” and served its purpose: inferring the age groups of Mexican audiences attending public events (e.g. marathons, concerts, expos).

Inferring age groups in social media (especially Twitter) remains an open and very challenging problem for ML, since no generalized method has been found. A mix of linguistic, metadata and follows information might improve on the performance of current state-of-the-art models.

In my opinion, the biggest challenges of age inference for Twitter users lie in creating an adequate dataset and in feature engineering.

If you have any questions or would like to know more about my work, don’t hesitate to contact me.