Identifying Duplicates in Snowflake With Zingg

Using open source to get dimension tables right!

It is a truth universally acknowledged that customer tables in warehouses are messy, with multiple records pointing to the same customer. There are various reasons for this problem: data coming in from offline stores and online channels, guest checkouts, multiple internal customer systems like CRMs and customer service… In fact, this problem is not limited to customer tables alone. Supplier tables, locations, and other non-transactional data (dimensions, in traditional warehouse language) suffer from the same issue.

For those of us who rely on the warehouse for business decisions, these duplicates can completely throw off our metrics. How can we trust lifetime value if our customer tables do not have a solid customer ID? Segmentation, marketing attribution, personalization: what can we achieve without reliable dimension data? What about reverse ETL, where we feed this data back into our operational systems? If we pump our systems with duplicates from the warehouse, won't it mess up our day-to-day work?

Is the investment in the warehouse even worth it if we still cannot identify our core entities correctly?

On the surface, this looks like an easy problem to solve: surely we have email IDs that we can leverage? Unfortunately, people use work, personal, school, and other email IDs, so while this is a start, it does not solve the problem. Let us not even get started on the different ways in which we enter names, addresses, and other details on web and print forms.

Let us take a look at our customer table in Snowflake. The data in the table is loaded from this CSV.

Image by Author

The customer table does have an SSN column, but it is inconsistent in many cases, so we cannot rely on it. The table also has an identifier column, but multiple records belonging to the same customer still carry different IDs.

As an example, check the following two records belonging to customer Thomas George.

Image by Author

Or the following five records, all belonging to customer Jackson Eglinton.

Image by Author

We could build some similarity rules and use SQL or custom code to construct our identifiers and match these records. However, this soon gets complex as it has to cater to all the variations above. What if we use Snowflake's edit distance function? Or fuzzywuzzy, or some such library? Unfortunately, we are dealing with a beast here: knowing which pairs to compare, or to compute edit distance for, matters a great deal, otherwise we end up with a Cartesian join across multiple attributes (!!).

As an example, take a look at the number of comparisons we run into as the number of records increases 10-fold or 100-fold. Comparing every record against every other requires n × (n − 1) / 2 comparisons, so the work grows quadratically: a 100-fold increase in records means roughly a 10,000-fold increase in comparisons. The table below assumes that we are comparing on a single attribute. It is evident that scalability is a big challenge and needs to be planned for carefully.

Image by Author

Fortunately, open source has a solution (when does it not?). There is a tool called Zingg, built specifically to address this problem of entity resolution. (I need to put in a disclaimer here: I am the author :))

Let us see how we can use Zingg to resolve our customers and identify duplicates.

Installation is straightforward: we need binaries of Java, Apache Spark, and Zingg. Do NOT be intimidated if you are not a Java programmer or a distributed-programming geek writing Spark programs on petabyte-size clusters. Zingg uses these technologies under the hood, so for most practical purposes we can work off a single laptop or machine. Zingg is a learning-based tool; it trains on our data and does not transmit anything to an external party, so security and privacy are automatically taken care of when we run Zingg within our environment.

We need to tell Zingg where our Snowflake data is. For this, the Zingg config is set with our Snowflake instance and table details. Here is an excerpt of the configuration for our input CUSTOMERS table in Snowflake.

Image by Author
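For orientation, here is a minimal sketch of what such an input definition can look like in configSnow.json. The connection properties come from the Spark Snowflake connector, and the placeholder values are just that, placeholders to be swapped for your own account, credentials, and database details.

"data": [{
    "name": "customers",
    "format": "net.snowflake.spark.snowflake",
    "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfUser": "<user>",
        "sfPassword": "<password>",
        "sfDatabase": "<database>",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "<warehouse>",
        "dbtable": "CUSTOMERS"
    }
}]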

We also configure Zingg to write the output to the UNIFIED_CUSTOMERS table. This table does not exist in Snowflake yet, but Zingg will create it while writing its output, so we do not need to build it.

Image by Author
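Again as a rough sketch, the output definition mirrors the input pipe, pointing at UNIFIED_CUSTOMERS instead; the user and password properties are the same as for the input and are omitted here for brevity.

"output": [{
    "name": "unified_customers",
    "format": "net.snowflake.spark.snowflake",
    "props": {
        "sfUrl": "<account>.snowflakecomputing.com",
        "sfDatabase": "<database>",
        "sfSchema": "PUBLIC",
        "sfWarehouse": "<warehouse>",
        "dbtable": "UNIFIED_CUSTOMERS"
    }
}]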

Let us now specify which attributes to use for matching, and what kind of matching we want. As an example, the first name attribute is set to the FUZZY match type.

Image by Author

We do not wish to use the SSN for matching, so that we can see how well the matching performs on the other attributes, and we therefore mark that field as DO_NOT_USE. The other parts of the configuration are fairly boilerplate; you can check the entire configuration here.
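To make this concrete, here is a simplified sketch of the field definitions, using the match types described above. The column names (FNAME, SSN) are illustrative, and other required properties such as data types are omitted; the linked configuration has the full, exact version.

"fieldDefinition": [
    { "fieldName": "FNAME", "fields": "FNAME", "matchType": "FUZZY" },
    { "fieldName": "SSN", "fields": "SSN", "matchType": "DO_NOT_USE" }
]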

Zingg learns what to match (scale) and how to match (similarity) based on training samples. It ships with an interactive learner which picks out representative sample pairs that the user can mark as matches or non-matches. Let us now build the training samples from which Zingg will learn. We pass the configuration to Zingg and run it in the findTrainingData phase. This is a simple command-line execution.

zingg.sh --phase findTrainingData --conf examples/febrl/configSnow.json

Under the hood, Zingg does a lot of work during findTrainingData to spot the right representative pairs for building the training data. The uncertain pairs get written to zinggDir/modelId, as configured through the input JSON, but we do not need to worry about that. Once the job is finished, we go to the next phase and mark, or label, the pairs.
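Both zinggDir and modelId are plain keys in the same configuration file; a minimal sketch, with assumed values:

"zinggDir": "models",
"modelId": 100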

zingg.sh --phase label --conf examples/febrl/configSnow.json

The above phase brings up the interactive learner, which reads the work done by the findTrainingData phase and shows us record pairs to mark as matches or non-matches. This helps Zingg build machine learning models tailored to our data. This is what it looks like:

Image by Author

Zingg selects different kinds of pairs: clear non-matches, sure matches, as well as doubtful cases, so that a robust training set can be built. These records are selected after a rigorous scan of the input so that the model can generalize properly and the user does not have to hand-label every single variation across attributes. As an example, the following is an excerpt of Zingg output for our data.

Image by Author

The findTrainingData and label phases are repeated a few times until 30-50 positive pairs are marked. This should be good enough to train Zingg to run on millions of records with reasonable accuracy. Not every case needs to be fed to Zingg; the learner automatically selects representatives and generalizes from them. When unsure, one can always halt the learner, check Zingg's output, and come back to train a bit more.

In our simplistic case of only 65 examples, one round of findTrainingData and label is enough, so we stop here. Now that we have the labelled training data, we build the machine learning models by invoking the train phase. Internally, Zingg does hyperparameter search, feature weighting, threshold selection, and other work to build a balanced model: one that does not leave out matches (recall) AND does not predict wrong matches (precision).

zingg.sh --phase train --conf examples/febrl/configSnow.json

The above saves the models, and we can apply them to this and any other new data to predict matches. Retraining is not necessary as long as the schema, that is, the attributes to be matched and the input format, remains the same.

Let us now apply the models to our data and predict which records are indeed matches, or duplicates of each other.

zingg.sh --phase match --conf examples/febrl/configSnow.json

Peeking into Snowflake once the above has run shows us that a new table with the following columns has been created.

Image by Author

Zingg copies over the raw data, but adds three columns to each row of the output:

  • The Z_CLUSTER column is the customer ID Zingg assigns: matching or duplicate records get the same cluster identifier. This helps to group the matching records together.
  • The Z_MINSCORE column indicates the least that a record matched any other record in its cluster.
  • The Z_MAXSCORE column indicates the most that a record matched another record in its cluster.

Let us look at the records for customer Thomas George in the output. Both records get the same z_cluster, and no other record gets that ID. The scores are pretty good too, which means we can be confident of this match.

Image by Author

What happened to customer Jackson Eglinton? Here is what the output looks like:

Image by Author

Again, the five records get an identifier distinct from the other records in the table. On inspecting the scores, we see that the minimum score of two of the records is close to 0.69, which means that the confidence of these records belonging to the cluster is low. Rightly so: in one case, the street and address attributes are swapped; in the other, the last name differs from the other records in the cluster.

Based on our data, we can decide how to use the scores. We could choose a cutoff on either score to accept matches confidently and pipe the rest to another workflow, likely human review. We could also take an average of the scores and use that if our case warrants it.

In the most likely scenario, the output of Zingg is used further along in the data pipeline as the definitive source of entity data. It either gets picked up for transformations by dbt, or is streamed to the lakehouse and utilized for data science.

In either case, the dimensions are accurate and we have a unified view of our core entities which we can trust.

