“Is This You?” Entity Matching in the Modern Data Stack with Large Language Models

An experiment in productionizing LLMs

Jacopo Tagliabue
Towards Data Science


Feat. Avanika Narayan

“System identifying people with artificial intelligence, in the style of the matrix” [ image generated by DALL·E 2 ]

Entity Matching in the age of data warehouses

The data universe is at least as busy as the Marvel universe in producing one shiny new thing after another: since the rise of the “Modern Data Stack” (MDS) with Snowflake and dbt, a plethora of tools have been helping people connect and manage all their data sources within the MDS. Do you have customer data from Salesforce? Perfect! Do you have ads data from Google Analytics? Bring them in! Do you have employee data from Workday? The more the merrier!

To give a concrete example, imagine you’re merging supplier feeds for your forecasting and reporting function (e.g., feeds telling you how much inventory will be available for product X). While the exact shape and form of products differ between feeds, humans are generally able, upon inspection, to confirm that products A and B are the same, while product C is not (it’s a tripod, but clearly not the same item):

  • PRODUCT A: new-targus red tg-6660tr tripod with 3-way panhead 66 — meytg6660tr, produced by targus, price 31.0
  • PRODUCT B: targus red tg-6660tr tripod with 3-way panhead, produced by targus, price 29.98
  • PRODUCT C: new-black gray flexpod gripper tripod with ball head, produced by sunpack, no price available

Product meta-data changes in non-obvious, difficult-to-predict ways: prices may be slightly off, attributes appear and disappear, and data may be missing entirely for some items. While it’s certainly possible to write down rules and fuzzy-match your way out of it, doing so requires a non-trivial amount of time and code (to write, test and maintain).
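
To make the idea of “describing products in words” concrete, here is a minimal sketch of how product meta-data could be serialized into the plain-text descriptions a language model can consume. The serialize_product helper and its field names are hypothetical, for illustration only; the repo may serialize differently:

def serialize_product(product: dict) -> str:
    # Flatten a metadata dict into "field: value" text, skipping missing fields
    parts = []
    for field in ("title", "manufacturer", "price"):
        value = product.get(field)
        if value is not None:
            parts.append(f"{field}: {value}")
    return ". ".join(parts)

product_c = {
    "title": "new-black gray flexpod gripper tripod with ball head",
    "manufacturer": "sunpack",
    "price": None,  # no price available, as in the feed above
}

print(serialize_product(product_c))
# title: new-black gray flexpod gripper tripod with ball head. manufacturer: sunpack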

In a galaxy far, far away, a new breed of NLP models has learned (some) nuances of natural language (as well as facts about our world!) by learning to predict the next word in a sentence (e.g., “the cat is on the ___”) over 45 terabytes of natural language web text (e.g., Reddit, Wikipedia and more). Surprising pretty much everybody, the very large neural networks that can solve our cat completion turned out to have learned how to solve a wide range of other tasks as well:

What if we told you that the same technology can also solve entity matching?

In particular, somebody in the NLP universe recently recognized that entity resolution is among the many tasks these models can solve: describe two items to a model, then ask it to complete a sentence like “Are Item A and Item B the same item? _____”!

In this post, we (Jacopo + Avanika) share an open source repo that implements an entity resolution pipeline in the MDS, powered by GPT-3, a large language model (LLM): no knowledge of Machine Learning or MLOps is required, as the pipeline runs on dbt in SQL as a Snowflake external function. The only Python we need (~20 lines) powers an AWS Lambda that connects our warehouse to the LLM.

Aside from the fun of a one-day project, building with GPT-3 allowed us to glimpse inside its deeply flawed, tragically incomplete but fascinating digital brain. While the approach works surprisingly well, we discuss some current limitations of our setup in the hope that excited readers will start to make it better.

Clone the repo, give us a star on GitHub, and sing along!

Problem setup

It turns out that now that we have John Smith in Salesforce, Workday, Marketo and countless other sources, joining tables is hard: is John Smith from New York the same person as John P. Smith from Manhattan? In other words, the problem of entity resolution starts creeping in: data is useful only if joinable, but joins are not that deterministic anymore.

Entity resolution is the task of reconciling information between our feeds, in such a way that we can match identical products across feeds and mark the rest as unique: while our example features products (specifically, the Walmart-Amazon dataset, available in the open source deepmatcher repo under a BSD 3-Clause License), the same challenge arises with companies in Salesforce and Crunchbase, or employees in Salesforce and Workday.

Entity resolution is not a new problem of course, but it’s (sort of) a new problem in the MDS context, where previous solutions may not work as well: string manipulation and simple rule-based fuzzy matching are hard to express in SQL, and costly to build and maintain; machine learning solutions require labeling and modeling effort, and a set of skills that does not necessarily overlap with those of the MDS’s main personas, the data engineer and the analytics engineer.

[ If you want to get a sense of the complexity involved when doing things the “manual” way, this is a fantastic talk to get some perspective. ]

The piece of the puzzle we need to solve the problem without a dedicated ML pipeline comes from “Can Foundation Models Wrangle Your Data?” (fyi, “Foundation Model” is Stanford-ese for LLM); in particular, the paper shows that large language models, such as OpenAI’s GPT-3, can solve entity resolution when it is framed as a question-answering problem. While a full explanation of LLMs is way beyond the scope of this post, it’s useful to remember that LLMs may look intimidating, but they are just systems that take text as input and produce more text as output.

All LLMs do is generate text based on the text we feed them.

The trick to their usefulness lies in our language: English is so flexible that we can convert standard ML problems into questions / examples, and ask an LLM to generate the answer we seek as the “right text continuation”. For example:

  • We can solve a sentiment analysis problem by providing sentences with the corresponding labels as examples (“I hate the food here: sentiment is negative”), and then ask the LLM to generate the completion for a new sentence (“I loved the pizza: sentiment is _____”).
  • We can solve a machine translation problem by asking the model to “Translate this into French” and then waiting for it to generate the appropriate completion for the target sentence: “What rooms do you have available? _____” (as we did here).
  • Finally, we can also turn entity matching into the problem of (roughly) completing the sentence “Are these products the same? _____”, after describing those products in words (a minimal sketch of such a prompt follows the figure below).
An LLM (“FM” stands for Foundation Model) can tackle entity matching if we ask the right question [ original image from the arxiv paper, co-authored by one of the authors of this post ].
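
To make that last framing concrete, here is a minimal sketch of what the prompt-and-completion loop could look like in Python. It assumes the legacy (pre-1.0) OpenAI client and the text-davinci-002 completion model; the exact prompt wording and model used in the repo and the paper may differ:

import openai  # legacy (pre-1.0) OpenAI client

openai.api_key = "YOUR_OPENAI_KEY"  # placeholder

def build_prompt(serialized_a: str, serialized_b: str) -> str:
    # Frame entity matching as a yes/no completion question
    return (
        f"Product A is {serialized_a}. "
        f"Product B is {serialized_b}. "
        "Are Product A and Product B the same? Yes or No?"
    )

def are_the_same(serialized_a: str, serialized_b: str) -> bool:
    # Ask GPT-3 to complete the question, with temperature 0 for a stable answer
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=build_prompt(serialized_a, serialized_b),
        max_tokens=5,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip().lower().startswith("yes")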

If this is true, we “just” need a way to connect the analytics engineer working on the MDS to the inference API of GPT-3: once the setup is done, a flexible entity resolution algorithm becomes available as part of a standard dbt pipeline for data transformation. Snowflake external functions allow us to build this bridge and make the whole flow work like magic.

Connecting the pieces together

The bulk of the setup is connecting Snowflake to an AWS-powered endpoint (check the repo for details). Once that is done, the developer experience of the analytics engineer is the SQL-based DAG she knows and loves. When you type dbt run:

  • Preliminary transformations prepare the raw data and standardize product meta-data.
  • A final transformation invokes, in SQL, a function wrapping our endpoint.
  • When the endpoint is hit, a few lines of Python code prepare the meta-data as a question for GPT-3 and run it through the OpenAI API for a response (a minimal sketch of this handler follows the diagram below).
  • The parsed response becomes a boolean in a Snowflake table, indicating whether the pair of products in the row is actually the same item or not.
The functional workflow for the proposed implementation: the MDS on the left, and the intelligence supplied by GPT-3 through an API call, proxied by a Lambda. [ image by the authors ]
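
As promised above, here is a minimal sketch of the Lambda handler gluing Snowflake to GPT-3. Snowflake external functions exchange batches of rows as JSON, each row prefixed by its index; the are_the_same helper is the (hypothetical) GPT-3 call sketched earlier, and the actual code in the repo may differ in the details:

import json

def lambda_handler(event, context):
    # Snowflake sends {"data": [[row_index, arg1, arg2], ...]} in the body,
    # and expects {"data": [[row_index, result], ...]} back
    rows = json.loads(event["body"])["data"]
    results = []
    for row_index, serialized_a, serialized_b in rows:
        # are_the_same wraps the GPT-3 completion call sketched above
        results.append([row_index, are_the_same(serialized_a, serialized_b)])
    return {
        "statusCode": 200,
        "body": json.dumps({"data": results}),
    }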

Both the AWS Lambda talking to OpenAI and the Snowflake-to-endpoint connection are one-off setup steps that can be carried out by one data engineer in minutes. The actual entity resolution algorithm is then abstracted away behind a SQL function, ready to be used by analytics engineers with no knowledge of Python, AWS or Machine Learning:

SELECT
    external_functions.lambda.resolution(em.SERIALIZED_A, em.SERIALIZED_B) AS RESOLUTION,
    em.*
FROM
    matching_input AS em

It just works.

As forcefully argued by our friends Piero and Chris, “the future of machine learning will depend on it being in the hands of the rest of us”: the marriage between the declarative nature of SQL and the rise of declarative ML (e.g. Predibase) promises to bring machine learning systems to the analytics world. As the efficiency and portability of large language models improve, every part of the data stack will benefit from few-shot predictions.

Does it work?

The paper reports pretty encouraging accuracy after some prompting gymnastics, that is, after experimenting with how the question-answering problem should be framed, and which examples we should provide GPT-3 with:

Entity matching results over the Walmart-Amazon dataset, and several others, compared with standard industry systems. [ original table from the arxiv paper, co-authored by one of the authors of this post ].

While prompt tuning is still more art than science, it has a distinctive advantage over the equivalent tuning that would be required in a standard ML setting: ML tuning is a combination of Python and math, while prompt tuning requires only English. Analytics engineers are not required to pick up new languages or tools, but simply to adjust their questions (after all, “The limits of my language mean the limits of my world”).
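
To show what “tuning in English” looks like in practice, here is a sketch of a few-shot variant of the prompt from before: instead of touching any model code, we prepend a couple of worked examples before asking the actual question. The examples are made up here for illustration:

# "Tuning" happens by editing the worked examples, not the model
FEW_SHOT_PREFIX = (
    "Product A is title: targus red tg-6660tr tripod. manufacturer: targus. "
    "Product B is title: new-targus red tg-6660tr tripod 66. manufacturer: targus. "
    "Are Product A and Product B the same? Yes\n"
    "Product A is title: sunpack flexpod gripper tripod. manufacturer: sunpack. "
    "Product B is title: targus red tg-6660tr tripod. manufacturer: targus. "
    "Are Product A and Product B the same? No\n"
)

def build_few_shot_prompt(serialized_a: str, serialized_b: str) -> str:
    # Same question as before, now preceded by labeled examples
    return (
        FEW_SHOT_PREFIX
        + f"Product A is {serialized_a}. "
        + f"Product B is {serialized_b}. "
        + "Are Product A and Product B the same?"
    )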

If you want to see the system in action, the GitHub repo contains step-by-step screenshots and full instructions to reproduce the pipeline.

The final table in the warehouse: LABEL is the gold label from the original dataset, RESOLUTION is the answer from GPT-3, and SERIALIZED_A / B contain the product serializations that get sent to the LLM. [ screenshot by the authors ]

Limitations and future work

The obvious limitations of running a GPT-3 algorithm inside the MDS are cost and scale: if you need tens of thousands of predictions a day, the current flow will make your DAG slower and significantly more expensive. However, albeit not very practical yet, we found the exercise intriguing for the elegance of the approach and its alluring potential:

flexible, world-knowledge-savvy models whose magic powers can be evoked in SQL by practitioners not trained in ML.

Is there a whole set of functionalities at the intersection of MDS and LLMs waiting to be discovered and productionized at scale?

While the answer to that question is definitely outside the scope of this project, we are excited by the prospect of furthering the cross-pollination of DataOps, scalable engineering and Large Language Models:

“You could attach prices to thoughts. Some cost a lot, some a little. And how does one pay for thoughts? The answer, I think, is: with courage”.

See you, space cowboys

This post is a joint work by Avanika and Jacopo, and part of an ongoing scheme (by J) to trade quotes from Wittgenstein for tennis lessons (by A).

Of course, this post would not have been possible without this fantastic paper (by Avanika with Ines Chami, Laurel Orr and Christopher Ré), the extensibility of Snowflake, and OpenAI APIs.

If you are working (in industry or academia) at the intersection of data and LLMs, please do get in touch, as we plan what to do next!
