
Introduction
A few weeks ago, while doing my usual search for datasets for my personal projects, I came across the Brazilian Chamber of Deputies Open Data Portal, which contains a lot of data – including deputies’ expenses, party metadata, and more – all available through a nice API.
After a few hours of searching and inspecting, something very interesting caught my attention: a compilation of all the laws proposed by the parliamentarians, with their ‘ementas’ (concise summaries), author, year and, more importantly, their themes (health, security, finance, etc.) – categorized by the Chamber’s Documentation and Information Center (a literal translation of Centro de Documentação e Informação da Câmara).
A spark went off in my brain – "I’ll build a supervised classification pipeline to predict the theme of a law from its summary, exploring some infrastructure aspects of Machine Learning along the way, like data versioning with DVC or something like that." I quickly wrote a script and gathered an extensive dataset comprising over 60,000 laws, spanning the period from 1990 to 2022.
I’ve already worked a little with Judiciary and Legislative data, so I had a feeling that this task would not be hard. But, to make things even easier, I chose to classify only whether a law proposal (LP) is about "Tributes and commemorative dates" or not (binary classification). In theory, it should be easy, as the texts are very simple:

But, no matter what I tried, my performance did not rise above the ~0.80 mark on the F1 score, with a relatively low recall (for the positive class) of 0.5–0.7.
Of course, my dataset is highly imbalanced, with this class representing less than 5% of the dataset, but there was something more.
After some investigation – inspecting the data with regex-based queries and looking into the misclassified records – I found several examples that were incorrectly labeled. With my crude approach, I found ~200 false negatives, which represent ~7.5% of the "true" positives and 0.33% of the whole dataset, not to mention the false positives. See a few below:

These examples were rotting my validation metrics – "How many of them could exist? Will I have to search the errors manually?"
But then Confident Learning, materialized as the CleanLab Python package, came to save me.

What is Confident Learning?
Correctly labeling data is one of the most time-consuming and costly steps in any supervised machine-learning project. Techniques like crowdsourcing, semi-supervised learning, fine-tuning, and many others try to reduce the cost of collecting labels or the need for such labels in model training.
Fortunately, we are already a step ahead of this problem: we have labels assigned by professionals, probably government workers with adequate know-how. Yet my non-professional eyes, armed with a crude regex approach, could spot mistakes as soon as they broke my performance expectations.
The point is: How many errors are still in the data?
It’s not reasonable to inspect every single law – an automatic way of detecting incorrect labels is necessary, and that’s exactly what Confident Learning offers.
In summary, it uses statistics gathered from model probability predictions to estimate errors in the dataset. It can detect noise, outliers, and – the main subject of this post – label errors.
I’ll not go into the details of CL, but there is a very nice article covering its main points and a YouTube video from the creator of CleanLab talking about his research in the field.
Let’s see how it works in practice.
The data
The data was gathered from the Brazilian Chamber of Deputies Open Data Portal, containing law proposals (LP) from 1990 to 2022. The final dataset contains ~60K LPs.
A single LP can have multiple themes associated with it, like Health and Finance, and this information is also available in the Open Data Portal. To make it easier to handle, I’ve encoded the theme information by binarizing each individual theme into a separate column, as sketched below.
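Just to make the idea concrete, here is a minimal sketch of that binarization on toy data (column names are illustrative; the real preprocessing lives in the project’s repository):
import pandas as pd

# Toy data only – each LP carries a list of themes
df_toy = pd.DataFrame({
    "ementa": ["Institui o Dia Nacional do Livro.", "Dispõe sobre crédito rural."],
    "temas": [["Homenagens e Datas Comemorativas"], ["Finanças", "Agricultura"]],
})

# One binary column per theme
one_hot = df_toy["temas"].str.join("|").str.get_dummies()
df_toy_one_hot = pd.concat([df_toy[["ementa"]], one_hot], axis=1)
print(df_toy_one_hot)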
As previously mentioned, the theme used in this post is "Tributes and commemorative dates". I chose it because its ementas are very short and simple, so the label errors are easy to identify.
The data and the code are available in the project’s GitHub repository.
The Implementation
Our goal is to automatically fix every single label error in the "Tributes and commemorative dates" theme and finish this post with a nice, clean dataset ready to be used in a Machine Learning problem.
Set up the environment
All that’s needed to run this project are the classic ML/Data Science Python packages (Pandas, NumPy & Scikit-Learn) plus the CleanLab package.
cleanlab==2.4.0
scikit-learn==1.2.2
pandas>=2.0.1
numpy>=1.20.3
Just install these requirements and we’re ready to go.
Detecting Label Errors with CL
The CleanLab package natively comes with the ability to identify many types of dataset problems, like outliers and duplicate/near-duplicate entries, but here we’ll only be interested in the label errors.
CleanLab uses probabilities generated by a Machine Learning model, representing its confidence in an entry belonging to each class. If the dataset has n entries and m classes, these probabilities are represented by an n by m matrix P, where P[i, j] is the probability of row i being of class j.
These probabilities and the "true" labels are used in the CleanLab internals to estimate the errors.
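For our binary case, that is simply an n × 2 matrix whose rows sum to 1 – something like this toy example (the values are made up):
import numpy as np

# Toy probability matrix P for a binary problem (m = 2 classes):
# column 0 = "not a tribute/commemorative date", column 1 = "is one"
P = np.array([
    [0.95, 0.05],  # confidently negative
    [0.10, 0.90],  # confidently positive
    [0.55, 0.45],  # uncertain – a good candidate for closer inspection
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution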
Let’s practice:
Importing packages
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from cleanlab import Datalab
RANDOM_SEED = 214
np.random.seed(RANDOM_SEED)
loading data…
df_pls_theme = pd.read_parquet(
    '../../data/proposicoes_temas_one_hot_encoding.parquet'
)
# "Tributes and commemorative dates"
BINARY_CLASS = "Homenagens e Datas Comemorativas"
IN_BINARY_CLASS = "in_" + BINARY_CLASS.lower().replace(" ", "_")
df_pls_theme = df_pls_theme.drop_duplicates(subset=["ementa"])
df_pls_theme = df_pls_theme[["ementa", BINARY_CLASS]]
df_pls_theme = df_pls_theme.rename(
    columns={BINARY_CLASS: IN_BINARY_CLASS}
)
First of all, let’s generate the probabilities.
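One thing not shown in this post is the clean_pipeline object used below. As a reference, here is a minimal sketch of what it could look like, assuming the TF-IDF + Random Forest combination suggested by the imports above (the exact setup and hyperparameters in the repository may differ):
# Minimal sketch of `clean_pipeline` – TF-IDF features + Random Forest classifier.
# This is an assumption based on the imports above; the repository version may differ.
clean_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(max_features=10_000)),
    ("classifier", RandomForestClassifier(random_state=RANDOM_SEED, n_jobs=-1)),
])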
As mentioned in the CleanLab documentation, to achieve better performance it is crucial that the probabilities are generated on out-of-sample records (‘non-training’ data). This is important because models naturally tend to be over-confident when predicting probabilities on their training data. The most common way to generate out-of-sample probabilities for an entire dataset is to use a K-Fold strategy, as shown below:
y_proba = cross_val_predict(
    clean_pipeline,
    df_pls_theme['ementa'],
    df_pls_theme[IN_BINARY_CLASS],
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
    method='predict_proba',
    verbose=2,
    n_jobs=-1
)
NOTE: It’s important to be aware of the class distribution – hence the StratifiedKFold object. The chosen class represents less than 5% of the dataset, so a naive sampling approach could easily lead to poor-quality probabilities generated by models trained on badly balanced folds.
CleanLab uses a class called Datalab to handle its error-detection jobs. It receives the DataFrame containing our data and the label column’s name.
lab = Datalab(
    data=df_pls_theme,
    label_name=IN_BINARY_CLASS,
)
Now, we just need to pass the previously calculated probabilities to it …
lab.find_issues(pred_probs=y_proba)
… to start finding issues
lab.get_issue_summary("label")

And it’s as simple as that.
The get_issues("label") method returns a DataFrame with the metrics and indicators calculated by CleanLab for each record. The most important columns are ‘is_label_issue‘ and ‘predicted_label‘, representing, respectively, whether a record has a label issue and the possible correct label for it.
lab.get_issues("label")
We can merge this information into the original DataFrame to inspect which examples are problematic.
# Getting the predicted errors
y_clean_labels = lab.get_issues("label")[['predicted_label', 'is_label_issue']]
# adding them to the original dataset
df_pls_theme_clean = df_pls_theme.copy().reset_index(drop=True)
df_pls_theme_clean['predicted_label'] = y_clean_labels['predicted_label']
df_pls_theme_clean['is_label_issue'] = y_clean_labels['is_label_issue']
Let’s check a few examples:


To me, these laws are clearly associated with Tributes and Commemorative Dates; however, they are not appropriately categorized as such.
Nice! – CleanLab was able to find 312 label errors in our dataset. But what do we do now?
These errors could either be the object of manual inspection and correction (in an active-learning manner) or be corrected instantly, trusting that CleanLab did its job right. The former is more time-consuming but could lead to better results, while the latter is faster but could introduce new errors.
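For the instant-correction route, a minimal sketch on the merged DataFrame above would be:
# Overwrite the label wherever CleanLab flagged an issue
# (using the columns added to df_pls_theme_clean above)
issue_mask = df_pls_theme_clean['is_label_issue']
df_pls_theme_clean.loc[issue_mask, IN_BINARY_CLASS] = (
    df_pls_theme_clean.loc[issue_mask, 'predicted_label']
)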
Regardless of the chosen path, CleanLab reduced the labor from 60K records to a few hundred – in the worst case.
But there is a catch.
How can we be sure that CleanLab found all the errors in the dataset?
In fact, if we run the above pipeline again, but using the fixed labels as the ground truth, CleanLab will find more errors…
More errors, but hopefully fewer errors than the first run.
And we can repeat this logic as many times as we want: Find errors, fix errors, retrain the model with the new presumed better-quality labels, find errors again …

With the hope that, after a few iterations, the number of errors found will drop to zero.
Iteratively fixing errors with CleanLab
To implement this idea, all we need to do is repeat the process above in a loop. The code below does just that:
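Since the snippet itself isn’t reproduced here, below is a condensed sketch of the loop, reconstructed from the description that follows (names such as label_0 and N_ITERATIONS are illustrative; the repository version may differ):
N_ITERATIONS = 8

df_iter = df_pls_theme.reset_index(drop=True).copy()
df_iter["label_0"] = df_iter[IN_BINARY_CLASS]  # label column "fixed 0 times" (the original)
metrics_history = []

for i in range(N_ITERATIONS):
    current_label, next_label = f"label_{i}", f"label_{i + 1}"

    # 1. Out-of-sample probabilities using the current labels
    y_proba = cross_val_predict(
        clean_pipeline,
        df_iter["ementa"],
        df_iter[current_label],
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_SEED),
        method="predict_proba",
        n_jobs=-1,
    )

    # 2. Find label issues with CleanLab
    lab = Datalab(data=df_iter[["ementa", current_label]], label_name=current_label)
    lab.find_issues(pred_probs=y_proba)
    issues = lab.get_issues("label")

    # 3. Fix the flagged labels in a *new* column, keeping the previous ones
    issue_mask = issues["is_label_issue"].values
    df_iter[next_label] = df_iter[current_label]
    df_iter.loc[issue_mask, next_label] = issues.loc[issue_mask, "predicted_label"].values

    # 4. Store the usual classification metrics for later inspection
    y_pred = y_proba.argmax(axis=1)
    metrics_history.append({
        "iteration": i,
        "n_issues_found": int(issue_mask.sum()),
        "f1": f1_score(df_iter[current_label], y_pred),
    })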
Let’s review it.
In each iteration, the out-of-sample (OOS) probabilities are generated just as shown before: using cross_val_predict with StratifiedKFold. The probabilities of the current iteration are used to build a new Datalab object and find the new label issues.
The issues found are merged into the current dataset and fixed.
I chose to append the fixed labels as a new column instead of replacing the original one.

LABEL_COLUMN_0 is the original label, LABEL_COLUMN_1 is the label column fixed 1 time, LABEL_COLUMN_2 is the label column fixed 2 times, and so on…
In addition to this process, the usual classification metrics are also computed and stored for later inspection.
After 8 iterations (~16 min) the process is finished.
The Results
The table below shows the performance metrics computed during the process.

A total of 393 label errors were found in the dataset over the 8 iterations. As expected, the number of errors found decreased with each iteration.
It’s interesting to note that the process "converged" to a "solution" in only 6 iterations, staying at 0 errors in the last 2. This is a good indication that, in this case, the CleanLab implementation is robust and didn’t keep finding ‘accidental’ errors that could lead to oscillations.
Even though the number of errors represents only 0.6% of the dataset, the F1 score increased from 0.81 to 0.90, ~11%. This is probably due to the classes being highly imbalanced, as the 322 new positive labels represent ~12% of the original number of positive examples.
But was CleanLab really able to find meaningful errors? Let’s check a few examples to see if they make sense.
False negatives fixed

The texts above do indeed resemble tributes and commemorative dates, suggesting that they should have been categorized as such – point for CleanLab.
False positives fixed

We have a few errors in this case: the 2nd and 4th laws aren’t actually false positives. Not so good, but still OK.
I repeated this inspection several times, sampling newly ‘fixed’ laws, and, in general, CleanLab showed near-perfect performance in detecting false negatives but got a little confused with the false positives.
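The sampling itself is nothing fancy – something along these lines, reusing the illustrative names from the loop sketch above:
# Sample records whose label changed between the original column and the last iteration
changed = df_iter[df_iter["label_0"] != df_iter[f"label_{N_ITERATIONS}"]]
print(changed[["ementa", "label_0", f"label_{N_ITERATIONS}"]].sample(5, random_state=RANDOM_SEED))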
Even though we probably still don’t have a perfectly labeled dataset, I now feel way more confident about training a machine learning model on it.
Conclusion
For a long time, the machine learning field suffered from poor-quality models and a lack of computing power, but that time is gone. Now, the real bottleneck for most ML applications is data. Not raw data, but refined data: with good labels, well-formatted, without too much noise or outliers.
Because no matter how big and powerful a model is, or how much statistics and math you mix into your pipeline, none of it will save you from the most basic law of computer science: garbage in, garbage out.
And this project is a witness to this principle – I tested several models, Deep Learning architectures, sampling techniques, and vectorization methods, only to discover, in the end, that the problem was in the basics: my data was wrong.
In such a scenario, investing in data-quality techniques becomes a critical aspect of building successful ML projects.
In this post, we explored CleanLab, a package that helped us detect and fix wrong labels in our dataset. It not only enabled us to significantly improve the quality of the dataset, but it also did so in an automatic, reproducible, and cheap way – with no human intervention.
I hope this project helped you understand a little more about Confident Learning and the CleanLab package. As always, I’m not an expert in any of the subjects addressed in this post, so I strongly recommend further reading; see the references below.
Thank you for reading! 😉
References
All the code is available in this GitHub repository. Data used: Open Data Portal of the Federal Chamber (Open Data – Law nº 12.527). All images are created by the author, unless otherwise specified.
[1] Cleanlab. (n.d.). GitHub – cleanlab/cleanlab: The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. GitHub.
[2] Computing Out-of-Sample Predicted Probabilities with Cross-Validation – cleanlab. (n.d.).
[3] Databricks. (2022, July 19). CleanLab: AI to find and fix errors in ML datasets [Video]. YouTube.
[4] FAQ – cleanlab. (n.d.).
[5] Mall, S. (2023, May 25). Are label errors imperative? Is confident learning useful? Medium.
[6] Northcutt, C. G. (2021). Confident Learning: Estimating Uncertainty in Dataset Labels. arXiv.org.