The Most Popular Languages on Reddit after Analyzing ~1M Comments

The most popular language on Reddit (other than English) will surprise you. To build this chart I analyzed almost a million Reddit comments with Snowflake and a Java UDTF (in less than 2 minutes).

Felipe Hoffa
Towards Data Science


The most popular languages on Reddit, after analyzing 1M comments: English, German(!), Spanish, Portuguese, French, Italian, Romanian(!), Dutch(!)…

Surprising results, compared to the # of native speakers:

Top languages on Reddit — full image + updated Tableau interactive (by author)

German is the second most popular language on Reddit, which is probably a big reason why Reddit opened an office in Berlin this week:

Millions of German users flock to Reddit every day to find community and belonging with those who share similar interests, and with 47% user growth year-over-year, Germany is now home to our fifth-largest user base. Today, we’re officially saying “hallo” to our friends in the heart of Europe and opening our doors in Berlin with a dedicated team on the ground.

In this post we'll go through the results of analyzing almost a million Reddit comments with a Java UDTF in Snowflake, in less than 2 minutes. What other languages, regions, and topics have a high representation on Reddit?

Analysis

I started with almost a million Reddit comments outside the top 40 subreddits (find the exact rules for building this sample below). Once the analysis was done, it turned out that only ~3% of comments are written in a language other than English:

~97% of analyzed Reddit comments are in English

The ranking for the most popular languages had a surprise: The second most popular language on Reddit is German. The 3rd one is Spanish (which is less surprising), followed by Portuguese, French, Italian, Romanian, and Dutch:

The most popular languages on Reddit, other than English (image by author)

These results are surprising, as German is the 12th most popular language in the world. Even French is more popular than German as a language, but the German community found more traction on Reddit.

According to Wikipedia, these are the most spoken languages in the world; the list is quite different from what Reddit sees among its users:

Wikipedia: List of languages by total number of speakers

One language, many subreddits

German subreddits: Most German comments live in /r/de, the subreddit for Germany, followed by /r/Austria and the financial conversations in /r/Finanzen, and then by the German version of /r/me_irl: /r/ich_iel.

(image by author)

Spanish subreddits: The largest subreddit in Spanish is /r/argentina. This is surprising, as Mexico has 3 times Argentina's population, but /r/mexico has only a third of its comments.

(image by author)

Portuguese subreddits: The largest subreddit in Portuguese is /r/brasil, followed by /r/portugal. This is surprising, as Brazil has 20 times Portugal's population, but /r/brasil has only 1.5 times the comments of /r/portugal. The most popular Portuguese subs talk about confessions, uncensored content, soccer, investing, and sexuality.

(image by author)

French subreddits: It’s surprising to see /r/Quebec as the second most popular French sub, after /r/france.

(image by author)

And so on: The list continues

Subreddits representing multiple languages (image by author)
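
For reference, a per-language breakdown like the ones above can be pulled with a query along these lines. This is only a sketch, not necessarily the exact query behind the images, and it uses the reddit_sample_languages_udtf table built in the how-to section below:

-- Sketch: top subreddits for a given detected language (German here),
-- using the reddit_sample_languages_udtf table created in the how-to section.
select 'r/'||subreddit sub
, count(*) comments
from reddit_sample_languages_udtf
where lang = 'GERMAN'
group by 1
order by 2 desc
limit 10
;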

One subreddit, many languages

Some subreddits deal with content in multiple languages. A common mix is English plus another language, and then there's /r/serbia, which divides its comments between Bosnian, Croatian, and Serbian:

Subreddits with comments in multiple languages (image by author)
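
A query along these lines can surface such mixed-language subreddits (again a sketch, based on the same reddit_sample_languages_udtf table from the how-to below):

-- Sketch: subreddits whose comments split across more than two detected languages.
select 'r/'||subreddit sub
, count(distinct lang) languages
, count(*) comments
from reddit_sample_languages_udtf
group by 1
having count(distinct lang) > 2
order by 2 desc, 3 desc
limit 10
;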

How-to

Taking 1 million Reddit comments and analyzing their language with a Java UDF inside Snowflake wasn't hard, as these previous posts explain:

Queries and notes

Detect language with a Java UDTF

As Snowflake recently announced support for Java table UDFs (UDTFs), I decided to try them out. Surprisingly, the tabular UDF performed better than my previous scalar Java UDF. Find the details in a section below.

UDTFs take rows as input and return tables as output. This is the UDTF implementation that produced the results for this post:

create or replace function detect_lang_udtf(v varchar)
returns table(output_value varchar)
language java
imports = ('@~/lingua-1.1.0-with-dependencies.jar')
handler='MyClass'
target_path='@~/202110xx3.jar'
as
$$
import java.util.stream.Stream;

import com.github.pemistahl.lingua.api.*;
import static com.github.pemistahl.lingua.api.Language.*;

// Each output row carries one detected language.
class OutputRow {
    public String output_value;

    public OutputRow(String output_value) {
        this.output_value = output_value;
    }
}

class MyClass {
    // Build the lingua detector once and reuse it for every row.
    static LanguageDetector detector =
        LanguageDetectorBuilder.fromAllLanguages().withPreloadedLanguageModels().build();

    public MyClass() {
    }

    // Tells Snowflake the shape of the rows returned by this UDTF.
    public static Class getOutputClass() {
        return OutputRow.class;
    }

    // Called once per input row: detect the language and emit it as one row.
    public Stream<OutputRow> process(String input_value) {
        String x = detector.detectLanguageOf(input_value).toString();
        return Stream.of(new OutputRow(x));
    }

    // Called at the end of each partition; nothing left to emit.
    public Stream<OutputRow> endPartition() {
        return Stream.empty();
    }
}
$$;
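
Once created, the UDTF can be called through table(). A quick sanity check on a single string could look like this (the test sentence below is just an example, not a comment from the dataset):

-- Quick test of the UDTF on a single literal string (example input only).
select output_value
from table(detect_lang_udtf('Hallo, wie geht es dir heute?'));
-- Should detect GERMAN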

~Million comments sample

Rules:

  • Start with 14 days of Reddit comments, between 2020-12-01 and 2020-12-14 (see how to load Reddit comments on Snowflake).
  • Discard comments in the top 40 subreddits (no need to analyze those for language; they are mostly English).
  • Remove special characters, URLs, and links from comments.
  • Allow only comments with more than 200 characters.
  • Discard 90% of the remaining comments, keeping a 10% random sample.
  • Keep only comments in subreddits with more than 50 comments after the previous rules are applied (500 before sampling).
  • Total comments left to analyze: 938,131 across 3,860 subreddits.

create or replace temp table reddit_sample as

with top_subs as (
  -- The 40 subreddits with the most comments (mostly English, excluded below).
  select count(*), v:subreddit subreddit
  from reddit_comments_sample
  group by 2
  order by 1 desc
  limit 40
)

select *
from (
  select *, v:subreddit subreddit
  -- Remove urls, HTML entities, and bracketed [link text] from the body.
  , regexp_replace(v:body, 'http[^ ]*', '') body_clean1
  , regexp_replace(body_clean1, '&[a-z][a-z]+;', '') body_clean2
  , regexp_replace(body_clean2, '\\[[^\\]]*\\]', '') body_clean
  , count(*) over(partition by subreddit) c_sub
  from reddit_comments_sample
  where v:author not in ('[deleted]', 'AutoModerator')
  and subreddit not in (select subreddit from top_subs)
  and length(body_clean) > 200
  -- Keep a 10% random sample of the remaining comments.
  and uniform(0::float, 1::float, random()) < 0.1
)
-- Keep only subreddits with more than 50 comments after sampling.
where c_sub > 50
;

Analyze each comment for language with a UDTF

As an optimization, we only gave the first 200 characters of each cleaned-up comment to the function to analyze:

create table reddit_sample_languages_udtf
as
select *, output_value lang
from reddit_sample a, table(detect_lang_udtf(substr(body_clean, 0, 200))) x
;

Count English vs not

97% of analyzed Reddit comments are in English:

select count(*) comments
, ratio_to_report(comments) over() percent
, iff(lang='ENGLISH', 'English', 'Not English') language
from reddit_sample_languages_udtf
group by 3
order by 1 desc

UDF vs UDTF performance

My previous UDF took 4 minutes to analyze a million comments with an XL Snowflake warehouse — and the new UDTF was able to do the same in less than half the time.

Looking at the details, it turns out both the UDF and the UDTF consumed a similar amount of resources; the UDTF was faster because Snowflake parallelized the execution of these invocations better.

Don’t worry too much about this, as the Snowflake engineers are currently working on improvements for these cases. Java UDFs and UDTFs are currently in Preview mode in Snowflake, and you’ll soon be able to experience these improvements.

Performance of the Java UDF vs the UDTF on an XL warehouse: the UDTF achieved better parallelization under the current implementation (improvements in progress) (image by author)

-- Scalar Java UDF from a previous post.
select lang, count(*)
from (
  select *, detect_langs(substr(body_clean, 0, 200)) lang
  from reddit_sample a
)
group by 1
order by 2 desc
-- 4:06 UDF xl
;

-- Java UDTF defined above.
select lang, count(*)
from (
  select *, output_value lang
  from reddit_sample a
  , table(detect_lang_udtf(substr(body_clean, 0, 200))) x
)
group by 1
order by 2 desc
-- 1m58s UDTF xl
;

Filtering and formatting the results

This query takes the previous results, formats the labels, computes the ratio between detected languages within each subreddit, and excludes languages with less than 2% representation per sub:

select 'r/'||subreddit sub
, initcap(lang) language
, count(*) c
, ratio_to_report(c) over(partition by sub) ratio
, sum(iff(language!='English', c, 0)) over(partition by sub) total_not_english
, sum(c) over(partition by sub) total
from reddit_sample_languages_udtf
group by 1, 2
qualify ratio > .02
order by total_not_english desc, c desc, 1, ratio desc

Credits

Updates and comments

Discussion on reddit

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.
