Moderation pipeline for user-generated content

How a European tech giant moderates its content using Toloka's crowd

Magdalena Konkiewicz
Towards Data Science


Image by Gerd Altmann from Pixabay

Introduction

In recent years, the growth of AI and machine learning has come with an influx of automatic feed generators such as Microsoft Start. These tools constantly crawl websites and surface the latest and most relevant articles for each user. Analyzing the ocean of updates and news, not to mention personalizing it, is an idea that has been explored by startups and tech giants alike.

Many of those same platforms have empowered everyday users to become important content creators. But that hasn’t been without problems, including the need for fast and scalable content moderation.

In this article, you will learn how Yandex Zen, a content platform run by the Nasdaq-listed tech giant Yandex, uses the crowd to moderate its content.

Yandex Zen and the content moderation problem

As a product of one of the biggest European tech companies, Yandex Zen deals with a wide variety of material from professional media as well as user content. Thousands of authors register daily, and the number of content pieces they produce can reach six figures.

All that content needs to be moderated. Complicating the problem further, content formats range from regular articles to tweet-like messages, comments, and even videos.

Zen uses pre-trained machine learning algorithms to detect inappropriate content like hate speech, clickbait, unreliable medical advice, spam, and adult content. But that’s not enough.

Labeling data with the crowd

Machine learning algorithms sometimes get their predictions wrong, and relying on them too heavily leaves moderation teams vulnerable to inappropriate content slipping through to the public.

With that in mind, Yandex Zen uses machine learning algorithms to moderate content initially but sends difficult cases to be handled by real people. But the large volume of data means hundreds of thousands of content pieces need to be processed by humans on a daily basis. That’s impossible for in-house employees.

Yandex Zen, therefore, uses Toloka, a crowdsourcing platform designed to automate labeling. On the platform, they ask performers to complete a simple moderation task like the one in the screenshot below, in exchange for a small payment.

Image by author

Because Toloka has millions of performers around the world, Yandex Zen can moderate the overwhelming amount of content generated every day. Crowd availability even means the content is moderated almost immediately after it's released. Additionally, the platform gives requesters control over the crowd: performers can be filtered by the languages they speak, their location, or whether they hold a specific skill.

Crowdsourcing problems

From the beginning, Yandex Zen made it their primary goal to manage the crowd efficiently. They needed to select the right performers for their projects to maintain the high quality of moderation.

One solution they implemented was to add and regularly update control tasks. Control tasks have a known answer, serving to score the quality of performer answers. If a performer gives too many wrong answers, they may be removed from the project. Performers who score well may earn additional monetary bonuses.
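The control-task mechanism described above can be sketched in a few lines. This is a minimal illustration, not Toloka's actual implementation; the accuracy cutoffs and the function name `evaluate_performer` are assumptions for the sake of the example, as the article does not state the real thresholds.

```python
# Hypothetical accuracy thresholds -- the article does not state the real values.
ACCURACY_THRESHOLD = 0.7   # below this, the performer is removed from the project
BONUS_THRESHOLD = 0.95     # at or above this, the performer earns a bonus

def evaluate_performer(answers, control_answers):
    """Score a performer against control tasks with known correct labels.

    answers: dict of task_id -> label the performer gave
    control_answers: dict of task_id -> known correct label
    """
    # Only tasks that happen to be control tasks count toward the score.
    scored = [tid for tid in answers if tid in control_answers]
    if not scored:
        return "no_data"
    correct = sum(answers[tid] == control_answers[tid] for tid in scored)
    accuracy = correct / len(scored)
    if accuracy < ACCURACY_THRESHOLD:
        return "remove_from_project"
    if accuracy >= BONUS_THRESHOLD:
        return "pay_bonus"
    return "keep"
```

The essential design point is that control tasks are indistinguishable from regular tasks from the performer's point of view, so they measure real working quality rather than test-taking behavior.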

Because of the large volume of data and the constantly changing nature of user-generated content, the Yandex Zen team found itself creating new control tasks nonstop. That made the process difficult to manage and scale.

Ladder pipeline solution

Running several experiments led to the ladder solution shown in the image below.

Image by author

Every piece of content in the pipeline is first moderated by machine learning algorithms and sent to the general Toloka crowd if the AI system isn't sure about the label. To ensure crowd quality, the Yandex Zen team set up a smaller pool of more trusted workers, called moderators, who label the tasks later used as control tasks for the general crowd. Their labels are also used to build the exam pools performers must pass before being accepted onto the project.

The last rung on the ladder is expert moderators, who monitor the work of regular moderators by creating control tasks for them. This is a very small group of highly qualified and trusted workers, and their weekly answers let the Yandex Zen team audit the work of the moderators.
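The first rung of the ladder, routing each item either to an automatic decision or to the crowd, can be sketched as below. The confidence threshold and the callback signatures are assumptions for illustration; the article does not describe Zen's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical cutoff: the article does not give the real confidence threshold.
CONFIDENCE_THRESHOLD = 0.9

@dataclass
class Decision:
    label: str
    decided_by: str  # "model" or "crowd"

def moderate(item: str,
             model: Callable[[str], Tuple[str, float]],
             ask_crowd: Callable[[str], str]) -> Decision:
    """Route one content item through the first rung of the ladder.

    model(item) returns (label, confidence) from the pretrained classifier.
    If the model is confident, its label is accepted automatically;
    otherwise the item is escalated to the general Toloka crowd.
    """
    label, confidence = model(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision(label, "model")
    return Decision(ask_crowd(item), "crowd")
```

The same escalate-when-uncertain pattern repeats up the ladder: crowd answers that disagree with control tasks flag performers for review by moderators, whose work is in turn spot-checked by the expert moderators.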

The machine learning models themselves are also trained on crowdsourced data. The team samples 1% of the daily data and submits it to the crowd for labeling, so the machine learning models are always up to date.
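The 1% sampling step is simple but worth making concrete, since it is what keeps the models tracking content drift. A minimal sketch, assuming a plain uniform sample (the article does not say whether the sample is uniform or stratified):

```python
import random

SAMPLE_RATE = 0.01  # the article states that 1% of daily data is relabeled

def sample_for_relabeling(daily_items, rate=SAMPLE_RATE, seed=None):
    """Uniformly sample a share of the day's content for crowd labeling,
    so the training set keeps tracking how user-generated content changes."""
    rng = random.Random(seed)  # seedable for reproducible sampling
    return [item for item in daily_items if rng.random() < rate]
```

Fresh crowd labels on this sample then feed periodic retraining, closing the loop between the models and the crowd.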

The Yandex Zen pipeline delivers scalable results and lets the Yandex Zen team quickly identify quality concerns. Previous solutions were late in noticing quality control problems, only flagging them once performance had already degraded significantly.

The above solution has helped Yandex Zen manage content moderation efficiently and made it scalable. How do you deal with large and constantly changing data in your own projects? Are pre-trained ML models enough, or do you need a "human in the loop"? Feel free to share your thoughts in the comments.

Summary

This article has demonstrated how user-generated content moderation can be automated using the crowd. AI and machine learning pipelines usually consume significant amounts of time and effort. If you’re dealing with large and constantly changing data as in Yandex Zen’s case, that’s especially true.

For more details, watch this presentation by Natalia Kazachenko on the ladder approach she and her team developed. If you want to learn more about how to automate data labeling pipelines you can also join this data empowered community.

PS: I am writing articles that explain basic Data Science concepts in a simple and comprehensible manner on Medium and aboutdatablog.com. You can subscribe to my email list to get notified every time I write a new article. And if you are not a Medium member yet you can join here.
