Notes from Industry

Snorkel in the wild — weak supervision at Google, Intel, Apple and IBM

A synthesis of programmatic data labeling in industry.

Ernest Chan
Towards Data Science
14 min read · Aug 2, 2021


Sea turtle under water, from the perspective of a person snorkeling
Photo by Raimond Klavins on Unsplash

Feeling frustrated and apathetic from working with unlabeled data? Try Snorkel as a cure for your symptoms! Snorkel is a technique that promises to give you the labels you deserve. Stanford’s research in this area led to the Snorkel.ai start-up, which raised a $35M Series B in April 2021.

For many ML problems, there is plenty of data but little of it is labeled. For example, where would you get labels for whether a tweet mentions a merger of two companies or whether a product listing contains an illegal product? The conventional way to solve the issue is human labeling, which can be costly and slow. A quicker, more scalable, and sometimes more performant solution is to use Snorkel. The TLDR is Snorkel lets you generate labels with code, and code is faster than humans.

Although Snorkel is exciting, it’s not magic. Through research, I cultivated a balanced view of how the industry uses Snorkel, when it’s appropriate, and its limitations. This post synthesizes papers about Snorkel from Google, Apple, Intel, and IBM to provide a nuanced view of the technique.

We’ll cover

  1. An overview of Snorkel
  2. Generating labeling functions in practice
  3. Pragmatic considerations for the generative model
  4. When to Snorkel and when to human label

Snorkel at a High-Level

Feel free to skip this section if you already know how Snorkel works.

Imagine you frequently ask your three friends, Hannah, Wei, and Jason, to predict whether a movie will be a hit, and their guesses are better than chance. You’ve asked them about the Avengers movie, Parasite, and so on. Over time, you learn who votes similarly, how often each person declines to vote, and even how accurate each person is. You use this knowledge to combine your friends’ votes with a weighted average to get an estimate better than any single person. That’s Snorkel in a nutshell.

Diagram of the Snorkel pipeline. The main pieces of the offline Snorkel pipeline are labeling functions and the generative model.
Figure 1: Snorkel pipeline. Adapted from [1] by the author. The main pieces of Snorkel are labeling functions and the generative model.

Figure 1 shows the Snorkel pipeline with the two key concepts, labeling functions and the generative model, numbered. Snorkel runs as an offline process that labels training data, which can then be used to train a discriminative model for your ML task.

Labeling functions (LFs)

The labeling functions shown in Figure 1 parallel Hannah, Wei, and Jason from the example above. LFs are pieces of code that can output an estimated label or abstain when given a data point. For binary classification, the possible outputs are {-1, 0, 1}, where 0 means abstain.

The simplest LFs match user-defined patterns in your unlabeled input data, but more advanced LFs take advantage of external signals. For example, say the task is to identify customer emails that praise an employee. An LF can:

  1. Use a Python library to detect whether the email body contains a person’s name.
  2. If a name exists, match the name to one in the internal employee directory.
  3. Look for phrases like “thanks to”, “appreciations to” within 3 words of the name.

If all three conditions are met, output 1, else output 0 to abstain. The LF won’t be perfect, but it provides some signal for the task at hand. Because LFs are noisy, they are called a form of “weak supervision”.
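
As a rough sketch, such an LF might look like the following in plain Python, using the {-1, 0, 1} convention above. The employee directory set and the naive name extractor are hypothetical stand-ins; a real LF would call an NER library and an internal directory service:

    import re

    PRAISE, ABSTAIN = 1, 0  # convention from above: 0 means abstain

    # Hypothetical stand-in for an internal employee directory lookup.
    EMPLOYEE_DIRECTORY = {"Hannah Lee", "Wei Zhang", "Jason Park"}

    PRAISE_PHRASES = [r"thanks to", r"appreciations? to"]

    def extract_person_names(text):
        """Naive stand-in for a proper NER library:
        treat capitalized word pairs as candidate person names."""
        return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

    def lf_praises_employee(email_body):
        """Output 1 if the email appears to praise an employee, else abstain (0)."""
        for name in extract_person_names(email_body):          # condition 1
            if name not in EMPLOYEE_DIRECTORY:                  # condition 2
                continue
            for phrase in PRAISE_PHRASES:                       # condition 3
                # praise phrase within ~3 words of the employee's name
                window = rf"{phrase}\W+(?:\w+\W+){{0,3}}{re.escape(name)}"
                if re.search(window, email_body, flags=re.IGNORECASE):
                    return PRAISE
        return ABSTAIN

    print(lf_praises_employee("Huge thanks to Wei Zhang for fixing my billing issue!"))  # 1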

Labeling functions may be correlated, may conflict, and may cover different parts of the dataset. After LFs are applied to each unlabeled data point, the output is a label matrix that contains LF labels for each point. The next component, the generative model, is in charge of combining these noisy votes into a single label.

Generative Model

Thinking back to our movie example, the generative model is you because you synthesize the data from your friends. Without ground truth labels, the generative model can learn the accuracies and correlations of LFs, and combine them to approximate the ground truth. The output is probabilistic labels (between 0 and 1) that you can use to train a discriminative model. You can use the probabilistic labels directly if the downstream model supports a “noise-aware” loss function, or you can threshold the labels to get 0 or 1. There’s a whole bunch of math in [1] and [2] that describes how and why the generative model works.

Note: Snorkel currently only works for classification problems (both binary and multi-class).
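
To make the generative-model step concrete, here is a minimal sketch with the open-source snorkel package (assuming snorkel 0.9+; note that, unlike the {-1, 0, 1} convention above, the library uses -1 for abstain and 0/1 for the classes):

    import numpy as np
    from snorkel.labeling.model import LabelModel

    # Toy label matrix: 6 unlabeled points x 3 LFs (-1 = abstain in the library).
    L_train = np.array([
        [ 1, -1,  1],
        [ 0,  0, -1],
        [ 1,  1,  1],
        [-1,  0,  0],
        [ 1, -1, -1],
        [ 0,  1,  0],
    ])

    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=500, seed=123)

    probs = label_model.predict_proba(L_train)  # probabilistic labels for a noise-aware loss
    hard = label_model.predict(L_train)         # thresholded labels for a standard loss
    print(probs[0], hard)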

Wait, I don’t need labels anymore?

No, you still do. Snorkel reduces but doesn’t remove the need for high-quality sets of labeled data. Usually, you need a development set to produce LFs and a test set to test the performance of the final model. The main advantage of Snorkel is your large training set doesn’t need to be hand-labeled (assuming you’re doing supervised ML).

Generating Labeling Functions in Practice

Writing LFs is not always easy. This section describes how different companies come up with LFs to label their datasets, which could provide ideas for your problems.

LFs that use organizational resources

Examples of organizational resources at Google. Some are aggregate statistics, web crawlers and knowledge graphs.
Figure 2: Image from [3]. Organizational resources involved in LFs at Google.

LFs at Google heavily use organizational resources, which are sources of signal from other services or knowledge bases at Google [3, 4].

Some examples of how Google uses organizational resources are:

  • An internal named entity recognition model. For a task of classifying whether content mentions a celebrity, an LF uses the model to detect whether a “person” entity is in the text. If there is no “person”, output the label that says “no celebrity” (see the sketch after this list).
  • Aggregate statistics. For a content moderation task, a statistic is the number of times a user has been reported for policy-violating behavior.
  • Topic models that can operate over both text and images. Although too coarse-grained to detect the positive class, the models help label the negative class.
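
To make the first bullet concrete, here is a rough public-substitute sketch, with spaCy’s small English model standing in for Google’s internal NER service (an assumption for illustration only; the real LF queries an internal model):

    import spacy

    NO_CELEBRITY, ABSTAIN = -1, 0  # article convention: classes are -1/1, 0 abstains

    nlp = spacy.load("en_core_web_sm")  # assumes the model is installed

    def lf_no_person_no_celebrity(text):
        """If the NER model finds no PERSON entity, confidently label
        'no celebrity'; otherwise abstain and let other LFs weigh in."""
        if not any(ent.label_ == "PERSON" for ent in nlp(text).ents):
            return NO_CELEBRITY
        return ABSTAIN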

The authors describe two categories of organizational resources:

  • Servable: available to both offline LFs and online models as features.
  • Non-servable: available only to the offline Snorkel pipeline. These resources are slow or expensive to query, like an internal knowledge graph.

Although non-servable signals are only used for labeling, they improve the performance of a discriminative model trained only on servable features. The authors consider this a form of transfer learning. In these cases, you can use expensive, non-servable signals to improve your model without complicating or slowing down your online serving system!

Code-free LFs

Generating LFs often requires some level of domain knowledge, but those with the domain knowledge may not know how to code. To get around this issue, some companies have developed code-free interfaces that capture domain knowledge for LFs.

Intel made a spreadsheet-based interface for product analysts to express rules [5]. The task is to classify whether a tweet mentions a business scenario related to a customer. For example, whether a customer formed a partnership with another company.

Spreadsheet-based interface for domain experts to express LF rules.
Figure 3: Table from [5]. Spreadsheet-based interface for domain experts to express LF rules.

In an interface like the one in Figure 3, domain experts enter keywords or regexes along with a “polarity” that indicates whether the pattern argues for or against the business scenario. In the above example, both “partnering” and “delighted” are signs of the partnership scenario, but seeing “months ago” indicates a negative label. The interface captures the domain knowledge, and developers choose how to express it in LFs.
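
On the developer side, each spreadsheet row could be compiled into an LF roughly like this (the row format and the keyword-LF factory are illustrative assumptions, not Osprey’s actual implementation):

    import re

    PARTNERSHIP, NOT_PARTNERSHIP, ABSTAIN = 1, -1, 0

    # Illustrative rows exported from the analysts' spreadsheet:
    # (pattern, polarity), where +1 supports the scenario and -1 argues against it.
    SPREADSHEET_ROWS = [
        (r"\bpartnering\b", 1),
        (r"\bdelighted\b", 1),
        (r"\bmonths ago\b", -1),
    ]

    def make_keyword_lf(pattern, polarity):
        """Compile one spreadsheet row into a labeling function."""
        regex = re.compile(pattern, flags=re.IGNORECASE)
        def lf(tweet):
            if regex.search(tweet):
                return PARTNERSHIP if polarity > 0 else NOT_PARTNERSHIP
            return ABSTAIN
        return lf

    lfs = [make_keyword_lf(p, pol) for p, pol in SPREADSHEET_ROWS]
    print([lf("Delighted to announce we are partnering with Acme!") for lf in lfs])  # [1, 1, 0]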

To label data for chatbot training, IBM built a search interface for users to find chat logs that match a certain intent [7]. Building a chatbot involves detecting the user’s intent, like whether they’re asking about billing, trying to schedule an appointment, and so on.

Search interface to find chat logs belonging to a certain user intent. There is an input box for user queries and a list of chat log results.
Figure 4: Image from [7]. Search interface to find chat logs belonging to a certain user intent.

IBM’s system prompts the user to query logs that match an intent like “schedule”. The user enters a query like “schedule meeting time”. The system defines the top N results (set to 100) as the neighborhood but only shows k results (set to 10) to the user, sampled from the top, middle, and bottom of the neighborhood. The user then labels each of the 10 results as belonging to the intent or not. If more than 60% of them get the same label, all examples in the neighborhood (all 100 results) receive that label. Each neighborhood is considered an LF. This approach increases the number of weak labels at the expense of more noise.
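
A simplified sketch of that neighborhood-labeling step; the search function, the exact sampling scheme, and the threshold handling are assumptions based on the description above, not IBM’s code:

    def neighborhood_lf(query, search, ask_user, n=100, k=10, agreement=0.6):
        """Turn one user query into weak labels for a whole neighborhood of chat logs.

        search(query, n) -> top-n chat logs, ranked by relevance (assumed available)
        ask_user(log)    -> True/False: does this log match the intent?
        """
        neighborhood = search(query, n)

        # Show k results drawn from the top, middle, and bottom of the ranking.
        third, mid = k // 3, len(neighborhood) // 2
        shown = (neighborhood[:third]
                 + neighborhood[mid:mid + third]
                 + neighborhood[-(k - 2 * third):])
        votes = [ask_user(log) for log in shown]

        frac_positive = sum(votes) / len(votes)
        if frac_positive >= agreement:
            return {log: 1 for log in neighborhood}   # label whole neighborhood positive
        if frac_positive <= 1 - agreement:
            return {log: -1 for log in neighborhood}  # label whole neighborhood negative
        return {}                                     # no consensus: abstain on everything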

Apple developed UIs to help engineers identify slices of data that require more or better weak labels [6]. All of their tasks are text-based. Unfortunately, the paper doesn’t go into much detail about the UIs or the specific ML tasks.

Notice the three examples of code-free LF generation operate on text data. It’s easier to design higher-level interfaces that operate on text than other unstructured data like images or video. These sorts of interfaces are still an active area of research.

Class Imbalance

Class imbalance is a good incentive to consider using Snorkel since hand labeling random samples is expensive. If your positive class makes up 1% of the data, labeling 100 random samples will only give you one positive label!

Several of the papers dealt with class imbalance problems, where the positive class ranges from 0.05% to ~10% of the dataset. In certain cases, Snorkel was adapted to handle the imbalance.

For some problems, it’s easier to write LFs that identify the positive class with high precision than to write LFs that identify the negative class. As an example, it’s not too hard to identify obvious forms of hate speech (profanity, references to certain topics, etc.). But does the lack of profanity mean the content is not hate speech? And how do you detect borderline cases?

Without LFs that label the negative class and borderline examples, you’ll only label a small portion of your dataset, made up of obvious positive examples. Your end-model might have high precision but will have low recall, and it may overfit to the obvious kinds of positive examples.

Label propagation

Google got around the issue by employing a graph algorithm called label propagation. They had a large set of human-labeled data alongside their unlabeled data and used the technique to transfer information from the labeled data to the unlabeled data [4], as illustrated in Figure 5.

Visual representation of label propagation. At first only a small number of nodes are labeled. As the label propagation algorithm iterates, more and more neighboring nodes get labeled
Figure 5: Image from [8]. Visual representation of label propagation.

In the paper [4], each unlabeled data point gets a score that’s the weighted combination of its neighbors’ labels. The weights are the similarity between the unlabeled point and each labeled point, where similarity is either Jaccard similarity or based on a distance metric. LFs then output a certain label if the score exceeds a predetermined threshold. This technique helped create high-recall LFs that complement the high-precision ones.
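
Here is a rough sketch of that scoring rule, using Jaccard similarity over feature sets (the data layout and the 0.7 threshold are illustrative, not the paper’s values):

    def jaccard(a, b):
        """Jaccard similarity between two feature sets."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def propagation_score(x_features, labeled_points):
        """Similarity-weighted combination of labeled neighbors' labels.

        labeled_points: list of (feature_set, label) pairs with label in {-1, 1}.
        """
        weights = [jaccard(x_features, feats) for feats, _ in labeled_points]
        total = sum(weights)
        if total == 0:
            return 0.0
        return sum(w * y for w, (_, y) in zip(weights, labeled_points)) / total

    def lf_label_propagation(x_features, labeled_points, threshold=0.7):
        """High-recall LF: output the positive label when the propagated score
        clears a predetermined threshold, otherwise abstain (0)."""
        return 1 if propagation_score(x_features, labeled_points) > threshold else 0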

For certain tasks, LFs on the output of label propagation significantly improved the generative model’s labels. Label propagation is an example of a non-servable resource that improves the end-model’s performance.

Synthetic datasets

Intel handled class imbalance in their multi-class problem, where positive classes occur at rates as low as 0.05%, by generating a balanced synthetic dataset for each class. For a class C, they treated examples belonging to “not C”, along with examples that don’t belong to any class, as negative examples.
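
A small sketch of that construction for a single class, assuming examples arrive as (example, class_name) pairs with None for examples that match no class:

    import random

    def balanced_dataset_for_class(examples, target_class, seed=0):
        """Build a balanced binary dataset for one class.

        examples: list of (x, weak_label), where weak_label is a class name or None.
        Negatives come from other classes plus examples that match no class.
        """
        positives = [x for x, y in examples if y == target_class]
        negative_pool = [x for x, y in examples if y != target_class]  # includes None
        rng = random.Random(seed)
        negatives = rng.sample(negative_pool, min(len(positives), len(negative_pool)))
        return [(x, 1) for x in positives] + [(x, -1) for x in negatives]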

Automatic LF generation

Writing LFs may require domain knowledge, but what if your engineers don’t have that expertise? A likely answer from engineers is “let’s create LFs with code”, and that’s what they did in the second paper from Google [4].

Most of their features are categorical, and they used frequent pattern mining over a labeled dataset to identify feature combinations that occur more often in one class than in the other. The patterns are conjunctions of feature values (X and Y), similar to what domain experts tend to generate. A pattern is used in an LF only if it meets certain precision and recall thresholds over the development set. Pattern mining is run over a single feature rather than across multiple features to reduce correlations between LFs.
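
A simplified sketch of the idea over a single categorical feature: keep only the values whose precision and recall on the labeled development set clear minimum thresholds, and wrap each surviving value in an LF (thresholds and data layout are illustrative):

    from collections import Counter

    def mine_value_lfs(dev_rows, dev_labels, feature, positive=1,
                       min_precision=0.8, min_recall=0.05):
        """Generate LFs from the values of one categorical feature.

        dev_rows: list of dicts mapping feature name -> value (labeled dev set).
        Each returned LF outputs `positive` when the value matches, else abstains (0).
        """
        n_pos = sum(1 for y in dev_labels if y == positive)
        counts = Counter((row[feature], y) for row, y in zip(dev_rows, dev_labels))

        lfs = []
        for value in {row[feature] for row in dev_rows}:
            true_pos = counts[(value, positive)]
            matched = sum(c for (v, _), c in counts.items() if v == value)
            precision = true_pos / matched if matched else 0.0
            recall = true_pos / n_pos if n_pos else 0.0
            if precision >= min_precision and recall >= min_recall:
                lfs.append(lambda row, f=feature, v=value: positive if row[f] == v else 0)
        return lfs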

This technique reduces the time to deploy Snorkel since it doesn’t require domain experts to hand-write LFs. Also, engineers without specialized domain knowledge can deploy the pipeline.

LF considerations

In general, your set of LFs should have high precision, high recall, and high coverage (i.e., it should label a large portion of your data). I haven’t found concrete thresholds for minimum precision, recall, and coverage; they probably depend on your dataset. Regardless, LF quality matters. The generative model estimates the accuracies of LFs, and Google has used these estimates to fix or remove low-quality LFs.

Assuming your LFs are high quality, having more of them, and a wider variety of them, usually leads to better performance.
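
If you use the open-source library, LFAnalysis is a quick way to check these properties: it reports per-LF coverage, overlaps, and conflicts, and adds empirical accuracy when you pass gold labels from a small dev set. A minimal sketch, using the library’s -1-for-abstain convention:

    import numpy as np
    from snorkel.labeling import LFAnalysis

    # Label matrix from applying 3 LFs to a small labeled dev set (-1 = abstain).
    L_dev = np.array([
        [ 1, -1,  1],
        [ 0,  0, -1],
        [ 1,  1, -1],
        [-1,  0,  0],
    ])
    Y_dev = np.array([1, 0, 1, 0])  # gold labels for the dev set

    # Per-LF coverage, overlaps, conflicts, and empirical accuracy vs. the gold labels.
    print(LFAnalysis(L=L_dev).lf_summary(Y=Y_dev))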

Consider these questions for creating LFs:

  1. What kind of organizational/external resources can you use in your LFs? Consider pre-trained models like those from HuggingFace or Amazon SageMaker.
  2. Can you develop an interface that lets non-coders specify LFs?
  3. Do your LFs miss out on essential parts of the dataset, like certain types of positive classes, negative classes, and borderline cases?

Pragmatic Considerations for the Generative Model

When is the generative model worth it?

LFs are powerful because they can scalably attach weak labels to our data. But, when do we need the generative model to combine them? New modeling pipelines create maintenance overhead, and we have to evaluate the costs vs. the benefits.

You should compare the results of the generative model to a simple average of the weak labels or a majority vote of the weak labels (Figure 6).

Evaluating Snorkel labels against labels for simple averaging
Figure 6: Evaluating Snorkel labels against simpler baselines. Diagram by author.
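
With the open-source library, that comparison takes a few lines: MajorityLabelVoter gives the unweighted baseline and LabelModel is the generative model. A toy sketch with made-up label matrices, assuming snorkel 0.9+:

    import numpy as np
    from snorkel.labeling.model import LabelModel, MajorityLabelVoter

    # Toy label matrices (-1 = abstain) and gold labels for a small dev set.
    L_train = np.array([[1, -1, 1], [0, 0, -1], [1, 1, 1], [-1, 0, 0],
                        [1, -1, -1], [0, 1, 0], [1, 1, -1], [0, -1, 0]])
    L_dev = np.array([[1, 1, -1], [0, 0, 0], [1, -1, 1], [-1, 0, 0]])
    Y_dev = np.array([1, 0, 1, 0])

    majority = MajorityLabelVoter(cardinality=2)
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=500, seed=123)

    # If the learned weights barely beat an unweighted vote, the extra pipeline
    # complexity may not be worth it.
    print("majority vote:", majority.score(L=L_dev, Y=Y_dev, metrics=["f1"], tie_break_policy="random"))
    print("label model:  ", label_model.score(L=L_dev, Y=Y_dev, metrics=["f1"], tie_break_policy="random"))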

A table from the Snorkel DryBell paper [3] compares labels from the generative model vs. labels from an unweighted average of the weak labels (Figure 7). They trained one discriminative model on each set of labels and compared their performance relative to a model trained on human-created labels. 100% in a cell means performance equal to the model trained on hand-labeled data.

Table that compares discriminative model performance on generative model labels vs. an average of weak labels. Scores are relative to a model trained on a hand-labeled dataset. The generative model provides a boost of 7.7% F1 score for topic classification task. The generative model provides a boost of 1.9% F1 score for the product classification task,
Figure 7: Table from [3]. Compares discriminative model performance on generative model labels vs. an average of weak labels. Scores are relative to a model trained on a hand-labeled dataset.

The generative model is a clear win for the topic classification task. But, for the product classification task, the model provides only a slight boost over “equal weights”.

The original Snorkel paper [1] discusses the performance of the generative model vs. majority vote as label density increases. Label density is the average number of weak labels assigned to each data point (remember, LFs can abstain). The generative model seems to provide an advantage in a medium label density regime and performs similarly to majority vote at low and high label densities. The reasoning is that when most examples have only one weak label (low label density), the generative model can’t do much to re-weight the labels. When label density is high and your LFs are better than chance, majority voting quickly converges to the optimal solution.
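
Label density is cheap to compute on your own label matrix, so it’s easy to check which regime you’re in (again using the library’s -1-for-abstain convention):

    import numpy as np

    # L: (num_examples x num_LFs) label matrix, with -1 meaning abstain.
    L = np.array([[1, -1, -1], [0, 0, -1], [1, 1, 1], [-1, -1, 0]])
    label_density = (L != -1).sum(axis=1).mean()  # avg. non-abstain labels per data point
    print(label_density)  # 1.75 here: quite low, so a simple majority vote is likely competitive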

A simple average may be good enough for your problem, and you may not need the generative model.

How scalable is the generative model?

The original version of Snorkel used a sampling approach that wasn’t very scalable. The current open-source version uses a matrix completion approach that is supposedly more scalable since it scales with the number of LFs rather than dataset size [9].

Google implemented a version of Snorkel in TensorFlow that’s more efficient than the original version and can run in a distributed fashion. If you have very large label matrices, see if the open-source version works for your needs. If not, see [3] for more details on the TensorFlow implementation.

Can we use the generative model for serving?

The generative model outputs probabilities. When should we use it as the production model versus a discriminative model trained on the probabilistic labels?

Here are some considerations:

  • Does the generative or discriminative model perform better on the test set?
  • Are there examples where all your LFs abstain? The generative model will also abstain in those cases, and that may not be acceptable. A discriminative model can learn correlations between the input data and the probabilistic labels — providing more coverage than the generative model.
  • The generative model operates on the outputs of your LFs. Are all your LFs servable in production? Do they meet your serving SLAs?
  • If you have complex LFs that call other services, perform fancy operations, etc., what are the deployment costs of putting those in an online production environment? Maybe it’s simpler to keep LFs in an offline pipeline and deploy a model that can operate over easy-to-compute features.

When to Snorkel and when to Human-label?

Often Snorkel-generated labels are compared to human-generated labels, and Snorkel is deemed better because it:

  • Scales better. Code is faster than humans, and sometimes the human labeling pipeline is so slow that it doesn’t meet the solution requirements.
  • Results in better models (sometimes).
  • Is more agile, in that it can respond quickly to changes in label definitions. For example, the business decides to include electric scooters in the category of “sports equipment”, and the product classification model needs updated labels. With Snorkel, developers can modify existing LFs and trigger the Snorkel pipeline to get updated labels. With human labeling, adaptation will be slower.

These are good reasons to opt for Snorkel, but it’s important to know there is often a “cross-over point” — the number of human-labeled examples at which a model trained with human labels outperforms a model trained with Snorkel labels.

Table of the cross-over point for 5 different classification tasks at Google.
Figure 8: Table from [4], green circle added by author. Cross-over point for five different classification tasks.

From Figure 8, the cross-over point for different classification tasks varies from 4K to 750K human-generated labels [4]. You won’t know the cross-over point without collecting enough human labels, but it’s useful to try to estimate it. Maybe the threshold is low, and human labeling is a better solution than Snorkel. Or maybe you choose to use Snorkel until you have enough labels to reach the cross-over point.

In the Intel paper [5], a rules-based plus human-labeling pipeline does better than Snorkel in two out of three classification tasks. The authors say the human-in-the-loop pipeline is infeasible since it imposes a scoring latency of 10 days and is expensive to maintain. Therefore, they compare Snorkel models with fully supervised models. Regardless, a slower but better-performing human labeling pipeline may better meet your business needs.

Lastly, there are more choices than Snorkel and human annotators for label-poor settings. Active learning and semi-supervised learning are other approaches to explore.

In Closing

High-quality training data is a critical asset because the other two requirements for ML, compute resources and modeling tools, are becoming commoditized. This excellent post by Eugene Wei postulates that TikTok’s success is due to its high-quality labels, not its models.

Snorkel can help you programmatically improve your critical asset, but there are caveats. For one, you still need representative, unbiased development and test datasets with “golden” labels. To help evaluate whether Snorkel is right for you:

  • Determine whether the LF approach makes sense for your problem. See LF considerations above.
  • Compare the generative model labels with simpler baselines.
  • Decide whether you need a discriminative end-model.
  • Lastly, compare the Snorkel pipeline against simpler pipelines with human labelers.

References

[1] Ratner, Alexander, et al. “Snorkel: Rapid Training Data Creation with Weak Supervision.” Proceedings of the VLDB Endowment, vol. 11, no. 3, 2017, pp. 269–82. Crossref, doi:10.14778/3157794.3157797.

[2] Ratner, Alexander, et al. “Data Programming: Creating Large Training Sets, Quickly.” 2016. arXiv:1605.07723.

[3] Bach, Stephen H., et al. “Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.” Proceedings of the 2019 International Conference on Management of Data, 2019. Crossref, doi:10.1145/3299869.3314036.

[4] Suri, Sahaana, et al. “Leveraging Organizational Resources to Adapt Models to New Data Modalities.” Proceedings of the VLDB Endowment, vol. 13, no. 12, 2020, pp. 3396–410. Crossref, doi:10.14778/3415478.3415559.

[5] Bringer, Eran, et al. “Osprey: Weak Supervision of Imbalanced Extraction Problems without Code.” Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning — DEEM’19, 2019. Crossref, doi:10.1145/3329486.3329492.

[6] Ré, Christopher, et al. “Overton: A Data System for Monitoring and Improving Machine-Learned Products.” 2019. arXiv:1909.05372.

[7] Mallinar, Neil, et al. “Bootstrapping Conversational Agents with Weak Supervision.” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9528–33. Crossref, doi:10.1609/aaai.v33i01.33019528.

[8] Needham, Mark. “Graph Algorithms in Neo4j: Label Propagation.” Dzone.Com, 8 Mar. 2019, dzone.com/articles/graph-algorithms-in-neo4j-label-propagation.

[9] The Snorkel Team. “Introducing the New Snorkel.” Snorkel, 14 Aug. 2019, www.snorkel.org/blog/hello-world-v-0-9#upgraded-labeling-pipeline.
