Data for Change
Crisis Text Line data scientists used a machine learning model to predict texter age based on the unique words that teen texters use. Learn how we used NLP for data augmentation on a dataset of 118 million text messages.
Content Warning // This post may contain upsetting or triggering content about mental health issues including anxiety and stress.

This article was written by Tiffany Meshkat with coauthor Kei Saito, and edited by Lili Torok. Contributions from Dua Shamsi, Shannon Green, and Jaclyn Weiser. This study was conducted at Crisis Text Line in partnership with Hopelab and Well Being Trust.
At a glance
- In setting out to learn about the unique mental health challenges facing teens during Covid-19, Crisis Text Line ran into an obstacle: only 21% of our texters directly disclose their age in a post-conversation survey. To get around this barrier, we applied machine learning to separate teen texters from adult texters and increase the scope of teen conversations we can analyze.
- We used natural language processing to predict texter age in order to learn about the mental health challenges of our young texters during Covid-19.
- The voice of teen texters is unique, and markedly different from that of adults. Our machine learning model could identify that texters were young based on the words they used in conversation with us, with 85% accuracy.
- While the majority of our teen texters were white and female (according to the survey), the model performed similarly well among non-white, non-female texters, with 83% accuracy.
- Teen texters are more likely to use abbreviations and words of agreement, and to mention their age.
- Adult texters are more likely to discuss financial troubles, drinking, homelessness, and mental health diagnoses.
2020 was a uniquely challenging year for people around the world due to the Covid-19 pandemic. Because 50% of Crisis Text Line texters who fill out our post-conversation survey are 17 years old or younger, we are in a unique position to discuss what this time has meant for the mental health of young people in particular, and so we launched a series of studies on the despair and resilience of teens in 2020. In our first post, we found that teens who texted Crisis Text Line in 2020 dealt with more grief, eating disorders, anxiety, and stress than in 2019. Teens were also 45% more likely to contact us late at night in 2020 than before. (Throughout the article, we use the terms young people and teens interchangeably to describe texters 17 years old or younger.)
Up until now, we were limited in our analysis because we had to rely on the subset of conversations where texters shared their age with us in an optional post-conversation survey.
However, our Crisis Counselors have explained to us that they can typically tell when they are texting with a teen, because the way teens text and the language they use are different from those of older texters.
For example, the below message makes it clear that a young person is texting:
_"I’m really stressed because i don’t understand anything going on in remote school. I can’t follow all the work i get assigned and my grades are going down. It’s really hard to get up and get dressed with nowhere to go. My mom has tried to make me get up and i just don’t want to. I can’t wait until this whole covid-19 thing is over."_ (We paraphrased this quote to protect texter identity.)
A Crisis Counselor could immediately tell that this is a young texter – and so could a computer, if not as precisely. This presented a great opportunity to use machine learning and data augmentation to approximate the age of our texters, in order to learn about teen mental health from a much larger dataset – almost all of the conversations on the Crisis Text Line platform instead of 21% of them. We saw this as an exciting challenge for our data scientists to take on and an opportunity to learn about the unique mental health challenges of young people in 2020.
Building the model
Step 1: We started by cleaning and wrangling our data.
We accessed the anonymized and scrubbed text message data from texters in crisis from 2018 to 2020. We used term frequency-inverse document frequency (TF-IDF) to convert the content of the text messages into numbers that our model could process. We removed double spaces; we also replaced numbers with the word "numeric_value" to search for large-scale trends in the use of numbers, rather than individual numbers. We chose not to remove stop words (words with little interpretable content, such as "the"), as removing them decreased our model’s predictive power (see the methodology section at the end for more detail on this choice). However, we did retain punctuation in order to capture any differences between youth and adult punctuation marks. We predicted that young people would use punctuation differently, and we believed this could be an additional unique identifier for teens.
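As a minimal sketch, the cleaning steps above might look like the following in Python (the function name and exact regular expressions are our illustration, not the production pipeline):

```python
import re

def clean_message(text: str) -> str:
    """Normalize a raw message before TF-IDF vectorization.

    Numbers become the token "numeric_value" so the model learns about
    number use rather than individual numbers; double spaces are
    collapsed. Stop words and punctuation are deliberately kept.
    """
    text = re.sub(r"\d+", "numeric_value", text)  # e.g. "14" -> "numeric_value"
    text = re.sub(r" {2,}", " ", text)            # collapse repeated spaces
    return text.strip()
```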
Step 2: We trained on conversation data from 2018 and 2019, then tested on conversations from 2020.
Between 2018 and 2020, our Crisis Counselors texted with 3.8 million people in crisis, but most texters don’t share their age and other demographic information. To train the model, we had to limit our sample to conversations where age was available through the voluntary post-conversation survey that, as noted above, ~21% of texters fill out after speaking with us. So, our sample sizes looked like this:
- Training set sample size: 450,000 crisis conversations from 2018 and 2019,
- Testing set sample size: 240,000 crisis conversations from 2020.
Split: across the 2018–2020 data, 45% of the labeled conversations were with teens and 55% with adults, so the two classes input into our model were nearly balanced.
Step 3: We used logistic regression because it performs well on text data with a large feature set and it is easy to interpret.
We used a logistic regression model to create a binary classifier which identifies if a texter is a teen or an adult, based on the label we were able to assign using our demographic survey. We first tested the model with just the text messages as the input.
We then tested adding a few other data inputs, including the active rescue number (the number of times emergency services were called to support a texter), number of conversations (the total number of times the individual reached out to us), conversation start hour, and certain issue tags more correlated with teens. In the end, the model was most predictive using both the number of conversations and the active rescue number.
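One common way to combine TF-IDF text features with numeric inputs like these is to append them as extra columns of the feature matrix. A minimal sketch with scikit-learn, using toy data and illustrative variable names (not the actual model code):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real inputs, for illustration only.
messages = ["i'm only numeric_value and school is so stressful",
            "i can't afford my bills and my job is at risk"]
num_conversations = np.array([[3], [1]])  # times the texter reached out
active_rescues = np.array([[0], [1]])     # emergency-service calls
labels = np.array([1, 0])                 # 1 = teen, 0 = adult

vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(messages)

# Append the numeric columns to the sparse TF-IDF matrix (in practice
# these columns would typically be scaled first).
features = hstack([text_features,
                   csr_matrix(num_conversations),
                   csr_matrix(active_rescues)])

model = LogisticRegression(max_iter=2000).fit(features, labels)
```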
Step 4: We evaluated performance across several demographics.
Our model has an accuracy of 85% and area under the curve (AUC) of 84%. The AUC score is the area under the blue Receiver Operating Characteristic (ROC) curve, which plots the true positive rates against the false positive rates. In short, the AUC score is a good indicator of how well the model predicts if the texter is a teen or an adult.
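For readers who want to reproduce these metrics, accuracy, AUC, and the ROC curve can all be computed from the model's predicted scores; a small sketch with scikit-learn on toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Toy ground-truth labels (1 = teen) and predicted P(teen) scores.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.55])

# Accuracy uses hard predictions at a 0.5 threshold; AUC uses the
# scores themselves, summarizing the whole ROC curve.
accuracy = accuracy_score(y_true, (y_score >= 0.5).astype(int))
auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
```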

The majority of Crisis Text Line texters who filled out the post-conversation survey and listed a gender are female (~70%), and ~50% of texters who listed a race are white. Given this majority population, we also tested our model on the self-identified non-female, non-white texter population. We included two tables at the bottom of the article with demographic breakdowns of our texters by gender and race. Texters can also choose to write in a response for gender, or choose the option "Prefer not to answer".
The model performed very well on the non-female, non-white texter population: it was 83% accurate and had an AUC score of 83%. This test shows that our model performed comparably well for this population, even though they are a minority of our texters. This is not always the case in academic studies; you can read more about this issue in articles from UC Berkeley, University of Virginia, and George Mason University. The non-female, non-white population of texters in 2020 was ~19,000 texters, compared to ~237,000 texters in our full 2020 model (which includes those who identify as female and white texters).
We also tested how well our model predicts teen conversations for specifically those texters who identify as "Hispanic, Latino, or Spanish origin" or "Black or African American". Our model had an accuracy of 84% and AUC score of 83% for Hispanic texters. It also performed well for the Black and African American texter population, with 83% accuracy and an AUC score of 83%.
How teens text differently according to our model
Our logistic regression model works by identifying the strongest indicators of either teen or adult texters. We can access those attributes to gain insights into the linguistic differences between these groups. The figures below list the top features (1 to 3 words) that most strongly distinguished teens from adults.
TOP 50 ATTRIBUTES OF TEEN TEXTERS

TOP 50 ATTRIBUTES OF ADULT TEXTERS

The most distinguishing words used to determine that a texter is a teen describe school-related topics, their age, or their parents. For adults, the top words primarily relate to their partner and children, as well as work and college. A few things stood out to us about the unique language used by teens and adults.
Teens:
- Surprisingly, some top features for teens include expressions that signal agreement, like "you too", "alright", and "mhm". According to our clinical team, teens respond more frequently to Crisis Counselors than adults do, and seem to feel more obligated to respond. This often results in teens sending more acknowledging messages, such as "ok", "mhm", and "oh", than adults.
- One of the top features for teens is "I’m only [numeric_value]". This includes the term "numeric_value", which we used to replace all numbers. Our clinical team interprets this as a teen texter speaking about their age in reference to something difficult. It is a common way for them to make sense of their situation: they feel their life shouldn’t be this difficult given that they are only a child.
- Teens use significantly more abbreviations, initialisms, and shortened words than adults. Figure 4 lists the top abbreviations used by teens.
Adults:
- Among the top 150 features associated with adults, we find money related worries are much more prominent than in teen conversations. Top money related features include "work", "insurance", "job", "afford", "bills", "financial", "money", and "career".
- Other top features associated with adults include drinking and homelessness. Alcohol-related words do not appear in the top 150 features for teens, which is interesting, as population-level surveys indicate that some teens do drink alcohol and that when they do, it tends to be binge drinking. This finding perhaps indicates that adults struggle more with alcohol, or are more likely to label it as a coping mechanism.
- Adults use more mental health diagnosis-related words, such as "disability", "illness", "depressed", and "bipolar". Additionally, adult conversations use more words surrounding professional services, such as "counseling", "meds", "medical", and "psychiatrist".
- Car-related words only appear in the top 150 features for adults and not in the teen list (i.e. "my car", "car", "driving", "drive").
Among these top 150 features, we see that teens use a number of acronyms, initialisms, and shortened words which adults typically do not use. We note that using abbreviations does not diminish the seriousness of the conversation; it simply reflects the linguistic choices teens make when texting. These are included in the figure below:
TOP 10 ABBREVIATIONS INDICATING A TEEN TEXTER

Most of the words listed in the figure above are abbreviations of common words like "Instagram" or "really", with the notable exception of CPS, which stands for Child Protective Services, a potential stressor in the lives of teen texters. We performed a qualitative analysis of the messages and found that conversations about CPS ranged from teens talking about their parents to teen parents concerned about their children.
We will use this augmented dataset to study the despair and resilience of young texters in 2020.
This is the second article in a series analyzing the mental health of teens in 2020. Our third and final article will be an analysis of what made teen texters feel better. We will use the machine learning model which we described in this article to increase the scope of teen conversations we can analyze. We will look for trends in this large dataset to find what coping strategies were used by teens to help them emerge from a difficult situation. We will continue to share what we learn on this blog in the coming weeks.
A note on data limitations
Data is never perfect; it provides a story based on an incomplete set of information. Crisis Text Line‘s data is no different. We think Crisis Text Line has an important perspective to add to the national conversation, but it’s important to note that our data is not representative of all people in the U.S., nor is it representative of what all people in crisis are experiencing. Issue data is reported by volunteer Crisis Counselors for approximately 95% of conversations, based on their interpretation of the conversation issues. Demographic data is self-reported by texters after a conversation, in a web survey. We have always used data to help us improve our service to texters in crisis, and regularly have third parties advise and verify that our processes are informed by best practices. We are engaging additional third parties to further review our data practices to ensure that they are proper, private, secure, and as rigorous as possible.
Demographic breakdowns in Crisis Text Line conversations:
SELF-REPORTED GENDER POPULATION OF TEXTERS

The following table lists the most common race responses from texters. As with the previous table, we only list responses selected by more than 0.4% of texters.
SELF-REPORTED RACE POPULATION OF TEXTERS

These were the parameters we used in the model:
- We allowed the model to classify based on 1, 2, or 3 words at a time (1–3 ngrams).
- We set the maximum document frequency to 0.7 in order to ignore words which appear in more than 70% of texts. We increased this parameter from 0.5 to 0.7 because conversations with teens make up nearly half of our data; a word that appears in approximately 50% of conversations could technically be a word that helps identify a teen’s voice.
- We set a minimum document frequency of 50, ignoring words which appear in fewer than 50 conversations. Together, the two document-frequency cutoffs discard both very common and very rare words.
- We increased the maximum number of iterations to 2000 to allow for more time for the solvers to converge.
Why we didn’t remove stop words: We chose not to remove stop words before inputting the messages into the model. Instead, we chose certain values for min_df (minimum document frequency) and max_df (maximum document frequency) to whittle down the corpus of words and remove rare words. This follows the results of a study from The Open University, which found that removing a pre-compiled stopword list negatively impacted the performance of their sentiment analysis classifier on sample Twitter data. They found the best approach was a dynamic method that removes the most infrequently occurring words in a corpus.
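Putting the parameters above together, a scikit-learn configuration sketch might look like this (the whitespace token pattern is our assumption for keeping punctuation attached to tokens; it is not necessarily the exact setup used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),    # classify on 1, 2, or 3 words at a time
    max_df=0.7,            # ignore terms in more than 70% of conversations
    min_df=50,             # ignore terms in fewer than 50 conversations
    stop_words=None,       # keep stop words (the default)
    token_pattern=r"\S+",  # split on whitespace so punctuation survives
)

classifier = LogisticRegression(max_iter=2000)  # extra iterations to converge
```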