Photo by Tengyart on Unsplash

How can Videos be Used to Detect Your Personality?

A look into a time distributed deep bimodal approach to predict scores for the Big-5 Personality traits based on videos from the First Impression Challenge on Google Colab.

Metika Sikka
Towards Data Science
7 min read · Aug 4, 2020


Videos are the New First Impressions!

Think about the approximate number of video calls you have been a part of since March 2020. Now, compare it to the number of video calls you were a part of before that. I am sure the difference is huge for most of us. Meetings with family, friends, and colleagues have shifted to video calls.

Video calling has also made it possible for us to keep expanding our networks and meet new people while maintaining social distancing. Hence, it is not wrong to say that we are making quite a few personal as well as professional first impressions over video. Personality perception through first impressions can be quite subjective and can even lead to first impression bias. Of course, there are self-reported personality assessment tests, but they often suffer from social desirability bias. This gives us an opportunity to leverage AI to find a more objective approach to apparent personality analysis.

Keeping this in mind, the aim of this blog post is to show one such deep learning approach which uses videos to predict the scores for the Big-5 personality traits.

Created by Author

What are the Big-5 Personality Traits?

Most contemporary psychologists believe that there are 5 core dimensions to personality: Extraversion, Agreeableness, Openness, Conscientiousness, and Neuroticism, often referred to by the acronym OCEAN. Unlike many earlier theories, which treat personality traits as binary, the Big-5 personality trait theory asserts that each trait is a spectrum.

Let’s look at how each trait is characterized followed by a map of how some popular fictional characters would score on the Big-5 personality traits…

Created by Author using Character Images from their Wiki Profiles

An interesting aspect of the Big-5 personality trait theory is that these traits are independent but not mutually exclusive. For example, we can see in the above image that Sheldon Cooper (The Big Bang Theory) would score low on Extraversion but high on Neuroticism, Phoebe Buffay (Friends) would score low on Conscientiousness but high on Openness, and so on…

About the Data Set

The First Impressions Challenge provides a data set of 10k clips taken from 3k YouTube videos. The aim of the challenge was to understand how a deep learning approach can be used to infer apparent personality traits from videos of subjects speaking in front of the camera.

The training set comprised 6k videos, while the validation and test sets had 2k videos each. The average duration of the videos was 15 seconds. The ground truth labels for each video consisted of five scores, one for each of the Big-5 personality traits, each between 0 and 1. The labeling was done by Amazon Mechanical Turk workers. More information about the challenge and the data set can be found in this paper.

Video data is unstructured but rich with multimedia features. The approach explained in this blog post uses audio and visual features from the videos. The analysis and modeling was done on Google Colab. The code can be accessed on Github.

Distribution of the Ground Truth Labels

Created by Author

The graph on the left shows the distributions of personality scores in the training data set. It’s interesting to note that the distributions of the scores are quite similar and even symmetric about the mean. The reason for this symmetry could be that the scores aren’t self-reported. Self-reported personality assessment scores are usually skewed due to social desirability bias.

Extracting Visual Features

Videos consist of image frames. These frames were extracted from the videos using OpenCV. In apparent personality analysis, visual features include facial cues, movement of the hands, posture of the person, etc. Since the videos had an average duration of 15 seconds, 15 random frames were extracted from each video. Each extracted frame was then resized to 150 X 150 and scaled by a factor of 1/255.
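A minimal sketch of this step, assuming OpenCV’s Python bindings, is shown below. The helper name extract_frames, the sorting of the sampled indices, and the constants are illustrative and not taken from the original code.

import cv2
import numpy as np

NUM_FRAMES = 15
IMG_SIZE = 150

def extract_frames(video_path):
    # Open the video and pick 15 random frame indices (sorted to keep temporal order)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = sorted(np.random.choice(total, NUM_FRAMES, replace=False))
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        # Resize to 150 X 150 and scale pixel values by 1/255
        frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE)) / 255.0
        frames.append(frame)
    cap.release()
    return np.array(frames)  # shape: (15, 150, 150, 3)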

Created by Author using a Video from the First Impressions Challenge

Extracting Audio Features

The waveform audio was extracted from each video using an ffmpeg subprocess. An open source toolkit, pyAudioAnalysis, was used to extract audio features from 15 non-overlapping frames (keeping the frame step equal to the frame length in the audioAnalysis subprocess). These included 34 features along with their delta features. The output was a 1 X 68 dimensional vector for each frame, or a 15 X 68 dimensional tensor for the 15 audio frames.

The types of features extracted through pyAudioAnalysis include zero crossing rate, chroma vector, chroma deviation, MFCCs, energy, entropy of energy, spectral centroid, spectral spread, spectral entropy, spectral flux, and spectral rolloff.
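A rough sketch of the audio pipeline is given below, assuming the function names of recent pyAudioAnalysis releases (audioBasicIO.read_audio_file and ShortTermFeatures.feature_extraction); the ffmpeg flags and the helper name extract_audio_features are illustrative.

import subprocess
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def extract_audio_features(video_path, wav_path, num_frames=15):
    # Extract a mono wav track from the video with an ffmpeg subprocess
    subprocess.call(['ffmpeg', '-y', '-i', video_path, '-ac', '1', wav_path])
    rate, signal = audioBasicIO.read_audio_file(wav_path)
    # Frame length chosen so that 15 non-overlapping frames cover the clip
    frame_len = int(len(signal) / num_frames)
    # 34 short-term features plus their deltas = 68 features per frame
    features, _ = ShortTermFeatures.feature_extraction(
        signal, rate, frame_len, frame_len)
    return features.T[:num_frames]  # shape: (15, 68)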

Deep Bimodal Regression Model

The Keras Functional API with TensorFlow as the backend was used for defining the model. The model was defined in two phases: in the first phase, the image and audio features were processed with a bimodal time distributed approach; in the second phase, the sequential features of the videos were processed.

Keras has a time distributed layer which can be used to apply the same layer individually to multiple inputs, resulting in a “many to many” mapping. Simply put, the time distributed wrapper enables any layer to extract features from each frame or time step separately. The result: an additional temporal dimension in the input and the output, representing the index of the time step.

The audio features extracted via pyAudioAnalysis were passed through a dense layer with 32 units in a time distributed wrapper. Hence, the same dense layer was applied to the 1 X 68 dimensional vector of each audio frame. Similarly, each image frame was passed in parallel through a series of convolutional blocks.
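A sketch of this first phase is shown below. The input shapes and the 32-unit audio dense layer follow the description above, while the number of convolutional blocks and their filter sizes are assumptions made for illustration.

from tensorflow.keras.layers import (Input, TimeDistributed, Dense, Conv2D,
                                     MaxPooling2D, Flatten)

# Audio branch: the same 32-unit dense layer is applied to each 1 X 68 audio frame
input_aud = Input(shape=(15, 68))
x_aud = TimeDistributed(Dense(32, activation='relu'))(input_aud)

# Visual branch: the same convolutional blocks are applied to each 150 X 150 frame
input_img = Input(shape=(15, 150, 150, 3))
x_img = TimeDistributed(Conv2D(32, (3, 3), activation='relu'))(input_img)
x_img = TimeDistributed(MaxPooling2D((2, 2)))(x_img)
x_img = TimeDistributed(Conv2D(64, (3, 3), activation='relu'))(x_img)
x_img = TimeDistributed(MaxPooling2D((2, 2)))(x_img)
x_img = TimeDistributed(Flatten())(x_img)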

Created by Author

After this step, the outputs of the audio and visual models were concatenated. To capture the chronological or temporal aspect of the videos, the concatenated outputs were passed to a stacked LSTM model with a dropout and recurrent dropout rate of 0.2. The output of the stacked LSTM was passed to a dense layer with ReLU activation and a dropout rate of 0.5. The final dense layer had 5 output units (one for each personality trait), with sigmoid activation to keep the predicted scores between 0 and 1.
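Continuing the sketch above, the second phase could look as follows. The LSTM unit counts and the size of the intermediate dense layer are assumptions; the dropout rates, the sigmoid activation, and the 5 output units follow the description.

from tensorflow.keras.layers import Concatenate, LSTM, Dense, Dropout
from tensorflow.keras.models import Model

# Fuse the two modalities at each time step, then model the temporal dynamics
x = Concatenate()([x_img, x_aud])
x = LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(x)
x = LSTM(64, dropout=0.2, recurrent_dropout=0.2)(x)
x = Dense(64, activation='relu')(x)
x = Dropout(0.5)(x)
# Five sigmoid units give one score in [0, 1] per Big-5 trait
output = Dense(5, activation='sigmoid')(x)
model = Model([input_img, input_aud], output)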

Generator Function

The biggest challenge was managing the limited memory resources. This was handled using mini-batch gradient descent. To implement it, a custom generator function was defined along the following lines:
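A minimal sketch of such a generator, reusing the hypothetical extract_frames and extract_audio_features helpers from the earlier snippets, is shown below; the original implementation may differ in its details.

import numpy as np

def data_generator(video_paths, labels, batch_size=8):
    # Loops over the data indefinitely, yielding one mini-batch at a time
    # so that only a few videos are held in memory at once
    while True:
        for start in range(0, len(video_paths), batch_size):
            batch_paths = video_paths[start:start + batch_size]
            batch_labels = labels[start:start + batch_size]
            X_img = np.array([extract_frames(p) for p in batch_paths])
            X_aud = np.array([extract_audio_features(p, p + '.wav')
                              for p in batch_paths])
            # Inputs for the visual and audio models are yielded in one list
            yield [X_img, X_aud], np.array(batch_labels)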

Note: The generator function yields the inputs for the audio and visual models in one list. Correspondingly, the model is defined by passing a list of two inputs to the Model class of Keras:

model = Model([input_img, input_aud], output)

Results

The model was compiled using the Adam optimizer with a learning rate of 0.00001 and trained for 20 epochs with a mini-batch size of 8. Mean squared error was taken as the loss function. A custom metric called mean accuracy was defined to evaluate the performance of the model. It was calculated as follows:
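Following the evaluation protocol of the First Impressions challenge, mean accuracy is one minus the mean absolute error between the predicted and ground truth scores:

Mean Accuracy = 1 − (1/N) Σᵢ |yᵢ − ŷᵢ|, where yᵢ is the ground truth score and ŷᵢ the predicted score.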

Here N is the number of input videos.
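In Keras, this metric and the training setup described above could look roughly as follows; the data split variable names and the steps-per-epoch values are illustrative.

from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam

def mean_accuracy(y_true, y_pred):
    # One minus the mean absolute error, averaged over the batch and the 5 traits
    return 1.0 - K.mean(K.abs(y_true - y_pred))

model.compile(optimizer=Adam(learning_rate=0.00001),
              loss='mse',
              metrics=[mean_accuracy])

# 6k training videos / batch size of 8 = 750 steps per epoch
model.fit(data_generator(train_paths, train_labels, batch_size=8),
          steps_per_epoch=750,
          validation_data=data_generator(val_paths, val_labels, batch_size=8),
          validation_steps=250,
          epochs=20)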

Overall, the model performed quite well, with a final test mean accuracy of 0.9047.

Created by Author

The table below shows the test mean accuracy for each of the Big-5 personality traits. The model shows similar performance for all 5 personality traits.

Created by Author

The Road ahead…

The results of the model can be further improved by increasing the frame sizes and the number of frames sampled, depending on the available processing power. NLP analysis of the video transcriptions could also be used to obtain additional features.

While automated apparent personality analysis has important use cases, care should be taken that algorithmic bias does not affect the results. The aim of such AI applications is to provide a more objective approach. However, such objectivity can only be achieved if bias is excluded at each stage, i.e., from data collection to the interpretation of results.

References:

[1] T. Giannakopoulos, pyAudioAnalysis(2015), https://github.com/tyiannak/pyAudioAnalysis

[2] A. Subramaniam, V. Patel, A. Mishra, P. Balasubramanian, A. Mittal, Bi-modal First Impressions Recognition using Temporally Ordered Deep Audio and Stochastic Visual Features (2016)

[3] C. Zhang, H. Zhang, X. Wei, J. Wu, Deep Bimodal Regression for Apparent Personality Analysis (2016)
