
Extracting Text Patterns with User Highlights with Pattern Induction


A step-by-step guide on how to quickly and accurately extract text from several documents with just a handful of user-highlighted examples

Extracting text patterns containing quarterly revenue earnings. Image by Author.

Do you remember the last time you spent countless hours finding the exact information you needed in hundreds or thousands of documents? Here are a few examples of where this commonly happens:

  • A financial analyst needing to extract the company’s revenues from the quarterly reports and market analyst reports, e.g. "revenue of 50 billion dollars" or "income: $15 billion".
  • A quality assurance professional extracting product numbers to solve customer complaints, e.g. "MR-9483" or "MR-2930".
  • A journalist trying to extract increases and decreases in criminal offenses from a dataset of FBI press releases, e.g. "offenses increased 6 percent" or "offenses decreased 5.5 percent".

Extracting concepts and key information from large bodies of text and documents is cumbersome and time-consuming. Our team at IBM has distilled years of research and engineering into Pattern Induction, an AI-powered tool designed to substantially speed up this process.

A high-level overview of Pattern Induction. Image by Yannis Katsis.

Given just a few extraction examples and additional user feedback, Pattern Induction learns the patterns underlying the provided examples and uses them to extract similarly patterned information from the input documents.

Pattern Induction: Extracting currency amounts from a set of documents using user-highlighted text examples. Image by Author.

In this article, we will discuss the following:

  • What is a pattern?
  • How to use Pattern Induction to extract patterns using user-provided highlighted text examples.
  • A supplementary section on how to set up an account on IBM Cloud to access Watson Discovery and then access the Pattern Induction feature.

What is a Pattern?

Let us first establish what a pattern is. A pattern is conceptually a sequence of tokens, or words, that exhibits a certain regularity. For instance, in a financial use case, consider phrases such as "income: $15 billion".

Instances like this follow a single pattern, which can be described as text that starts with the word "revenue" or "income", followed by a colon, followed by a currency amount.

In the context of how we designed Pattern Induction to read texts such as these, a pattern is a regular expression over tokens, where tokens in a pattern may:

  • come from dictionaries (e.g., dictionaries that are generated from the tokens in the text, such as one containing monetary scales).
  • be consistently an exact string literal.
  • be part of a well-known category of named entities (such as currencies, locations, etc.) identified using AI techniques.
How dictionaries, literals, and named entities capture tokens in text. Image by Author.

Using dictionaries, literals, named entities, and regular expressions, we can succinctly describe patterns using rules, which are sequences of such regular expressions over the tokens. The following is a rule describing the pattern underlying our financial use case:

where the currency amount is a named entity.
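To make the idea of a rule as a sequence of token-level matchers more concrete, here is a minimal Python sketch. It is only an illustration, not how Pattern Induction is implemented: the `SCALES` dictionary, the helper names, and the number check that stands in for a currency-amount named entity are all assumptions made for this example.

```python
import re

# Hypothetical dictionary of monetary scales (illustrative only).
SCALES = {"thousand", "million", "billion", "trillion"}


def tokenize(text):
    """Split text into number, word, and punctuation tokens."""
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)


def literal(*options):
    """Matcher for a token that must equal one of the given strings."""
    allowed = {option.lower() for option in options}
    return lambda token: token.lower() in allowed


def dictionary(entries):
    """Matcher for a token that must appear in a dictionary."""
    return lambda token: token.lower() in entries


def is_number(token):
    """Matcher for an integer or decimal number."""
    return re.fullmatch(r"\d+(?:\.\d+)?", token) is not None


# Rule: ("revenue" | "income") ":" "$" NUMBER SCALE
# The last three matchers are a crude stand-in for a currency-amount
# named entity, which a real system would detect with a trained model.
RULE = [
    literal("revenue", "income"),
    literal(":"),
    literal("$"),
    is_number,
    dictionary(SCALES),
]


def extract(text, rule=RULE):
    """Return every run of consecutive tokens that satisfies the rule."""
    tokens = tokenize(text)
    found = []
    for start in range(len(tokens) - len(rule) + 1):
        window = tokens[start:start + len(rule)]
        if all(matcher(token) for matcher, token in zip(rule, window)):
            found.append(window)
    return found


if __name__ == "__main__":
    sample = "Revenue: $20 billion and income: $15 billion, up from revenue of $18 billion."
    for tokens in extract(sample):
        print(tokens)
```

In this sketch, "Revenue: $20 billion" and "income: $15 billion" are extracted, while "revenue of $18 billion" is not, because the second matcher requires a colon. Writing and adjusting matchers like these by hand is exactly the kind of work the tool is meant to take off your plate.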

Journalists, financial analysts, criminal investigators, and other people from a non-technical background often find crafting these rules by hand challenging. Creating them usually requires a lot of trial and error, as well as some technical experience, such as a deeper understanding of linguistic concepts. Pattern Induction automatically generates these rules from the patterns present in the text examples provided by the user. By automating the rule-building step, Pattern Induction lets you focus on refining the extracted text instead.

How to Use Pattern Induction to Extract Patterns

In the previous section, we briefly explained the structure of patterns. In this section, we will walk you through how to use Pattern Induction to extract text patterns.

Pattern Induction is a human-in-the-loop system that combines the expertise of domain experts with automatic learning capabilities to quickly learn a high-quality extractor. The system lets humans quickly provide examples and give feedback on its suggestions, which leads to domain-specific results with high coverage and quality.

Let us walk you through a typical Pattern Induction workflow from the perspective of the user. For the sake of the example, we continue using the same scenario where our goal is to extract revenue information, e.g. "revenue of $20 billion", from financial documents.

Prerequisites: Before starting, please create a Pattern Induction project, by following the few easy steps outlined in the "Try out Pattern Induction" section towards the end of this blog.

STEP 1: Highlight a few examples. Once you have completed the prerequisites, start by highlighting a few strings that belong to the pattern you want to extract (see an example in Figure 1 below). Once you have provided enough examples (we recommend at least two for this version of the release), the system will learn the general pattern underlying the provided examples.

Tip: We encourage you to start with two examples and wait for the system to finish learning before you give feedback on the learned results and/or highlight more examples.
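To build some intuition for how two highlighted examples can already suggest a general pattern, here is a toy Python sketch. It is not the algorithm used by Pattern Induction. It simply assumes that all examples have the same number of tokens and generalizes them position by position into a literal, a small dictionary, or a number class. The function names and the `NUMBER`, `LITERAL`, and `DICT` labels are hypothetical.

```python
import re


def tokenize(text):
    """Split text into number, word, and punctuation tokens."""
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", text)


def is_number(token):
    return re.fullmatch(r"\d+(?:\.\d+)?", token) is not None


def induce_rule(examples):
    """Generalize same-length examples into one rule, position by position."""
    token_lists = [tokenize(example.lower()) for example in examples]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("this toy sketch only handles examples of equal token length")
    rule = []
    for column in zip(*token_lists):
        if all(is_number(token) for token in column):
            rule.append(("NUMBER", None))             # any number
        elif len(set(column)) == 1:
            rule.append(("LITERAL", column[0]))       # same token in every example
        else:
            rule.append(("DICT", frozenset(column)))  # one of the observed tokens
    return rule


def matches(rule, text):
    """Check whether a whole string satisfies the induced rule."""
    tokens = tokenize(text.lower())
    if len(tokens) != len(rule):
        return False
    for (kind, value), token in zip(rule, tokens):
        if kind == "NUMBER" and not is_number(token):
            return False
        if kind == "LITERAL" and token != value:
            return False
        if kind == "DICT" and token not in value:
            return False
    return True


if __name__ == "__main__":
    rule = induce_rule(["revenue: $20 billion", "income: $15 million"])
    print(rule)
    print(matches(rule, "Income: $7.5 billion"))
    print(matches(rule, "profit: $7 billion"))
```

With these two examples, the sketch accepts "Income: $7.5 billion" but rejects "profit: $7 billion", because "profit" was never observed in the first position. A rule learned from so few examples can be too narrow or too broad, which is why the feedback in the next step matters.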

Figure 1: User highlights a few examples. Image by Author.

STEP 2: Inspect the extractions found by the model and reply to the system’s suggestions. Once the system has processed the highlighted examples and learned a first version of the extractor, it updates the screen with two types of information (see Figure 2 below). First, it highlights in green all pieces of text extracted by the currently learned extractor for you to inspect. Second, it presents a list of yes/no questions for you to answer, which helps it understand your intent and correct any wrong extractions.

Tip: We encourage you to answer as many questions as possible (ideally all), as these questions have been strategically chosen by the system to help differentiate between potential patterns that you may want to extract.
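One way to picture why these questions are informative is the following toy Python sketch. It is an assumption-laden illustration, not the tool's actual question-selection logic: the candidate patterns are hypothetical regular expressions, and the sketch asks about the texts on which those candidates disagree, since a yes/no answer for such a text rules some candidates out.

```python
import re

# Hypothetical candidate patterns the system might be deciding between.
CANDIDATES = {
    "colon form only": re.compile(r"(revenue|income): \$\d+ (million|billion)"),
    "colon or 'of' form": re.compile(r"(revenue|income)(:| of) \$\d+ (million|billion)"),
    "any keyword": re.compile(r"\w+(:| of) \$\d+ (million|billion)"),
}


def pick_questions(candidates, texts, k=2):
    """Return up to k texts on which the candidate patterns disagree the most."""
    scored = []
    for text in texts:
        votes = [pattern.fullmatch(text) is not None for pattern in candidates.values()]
        yes = sum(votes)
        disagreement = min(yes, len(votes) - yes)  # 0 when all candidates agree
        if disagreement > 0:
            scored.append((disagreement, text))
    # Stable sort: ties keep the order in which the texts were given.
    scored.sort(key=lambda pair: -pair[0])
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    texts = [
        "revenue: $20 billion",
        "revenue of $18 billion",
        "profit of $3 million",
        "offenses increased 6 percent",
    ]
    for question in pick_questions(CANDIDATES, texts):
        print(f'Should "{question}" be extracted? (yes/no)')
```

In this example, all candidates agree on "revenue: $20 billion" and on "offenses increased 6 percent", so asking about them would teach the system nothing. The two remaining texts separate the candidates, so your answers to them narrow down which pattern you actually mean.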

Figure 2: System returns a few suggestions for the user to verify. Image by Author.

STEP 3: Wait for a while. Once the system learns an accurate extractor (composed of a small number of patterns), it will inform you accordingly.

Figure 3: The system reports that an accurate extractor has been learned. Image by Author.

STEP 4: Review extracted examples. To ensure the accuracy of the extractions, you can click on the "Review examples" pane and inspect the list of extracted examples. If you identify any mistakes or missing extractions, you can provide additional examples and/or feedback by repeating steps 1–3 above.

Figure 4: User reviews extracted patterns. Image by Author.

STEP 5: Save your pattern. If everything looks correct, you can proceed to the final stage of the process: saving the learned patterns for future use. Type a name for your pattern in the top-left corner and then click the "Save pattern" button in the top-right corner. When saving the pattern, select a field to enrich, such as "text".

STEP 6: Visualize the pattern extractions in the context of the original documents. If you navigate to the "Improve and Customize" tab in the top-left corner, you will see a search bar. Pressing the "Enter" key in the search bar displays a list of passages (Figure 5).

Figure 5: Viewing the passages in the "Improve and customize" tab

You may select "View passage in document" for any one of the search results, and on the bottom-right corner, selecting "Open advanced view" will bring up the original PDF document. Selecting any one of the saved patterns will then highlight the extractions directly in the document (Figure 6).

Figure 6: The revenue phrases from the pattern we just created are highlighted in the context of the document.

Supplementary section: Try out Pattern Induction

Follow these easy steps to try Pattern Induction:

STEP 1: Create an IBM account and set up a Watson Discovery project as described below: Sign up for an IBM account on Watson Discovery and then navigate over to your cloud dashboard: https://cloud.ibm.com. Click on the "Create a resource" button on the top right corner of the screen.

Figure 7: Homepage of your cloud account. Image by Author.

Search for "Watson Discovery" on the left, and click on the service titled "Watson Discovery". Select a plan that suits you, e.g., premium or plus.

Figure 8: Creating a Watson Discovery Service. Image by Author.

After creating the service, navigate to https://cloud.ibm.com/resources. Here, you can view the recently created service, as shown below. Click on your "Watson Discovery" service and click on the button "Launch Watson Discovery". This will redirect you to the service where you can create a project for your extraction task.

Figure 9: List of resources. Note the "Watson Discovery" services under the section "Services and software". Image by Author.

To create a project, provide a project name, select "Document Retrieval" as project type (see Figure 10), and click "Next". Complete the steps to upload your dataset.

Figure 10: Select "Document Retrieval" as your project type. Image by Author.

Once the documents are uploaded, we recommend enabling the Smart Document Understanding feature for this tutorial. Click to manage your dataset (see the top-left corner in Figure 11). Select the "Identify fields" tab, and then select "Pre-trained models". Confirm the choice by selecting "Submit", and apply the changes by clicking the top-right button "Apply changes and reprocess".

Figure 11: Enabling the Smart Document Understanding feature

STEP 2: To follow along, download one of the following datasets:

  • From the demo, you may try to extract revenues and cash flows, "revenues of $2.3 billion" or "cash flow of $45 billion", from the IBM Press Release Dataset. Click here.
  • Challenge yourself with the FBI press release dataset and extract increases and decreases of percentages related to crimes of different types, "offenses were up 5 percent" or "offenses decreased 6 percent". Click here.

After the data upload is complete, navigate to the "Improve and Customize" screen, where you can access Pattern Induction by clicking on "Patterns" under "Teach domain concepts" (see Figure 12).

Figure 12: Accessing Pattern Induction. Image by Author.

Click on "Create" to create a new pattern, select documents to create patterns from (or let the system randomly select documents out of your document collection), and then hit "Next" (see Figure 13). This will navigate to Pattern Induction, where you may start creating patterns.

Figure 13: Select documents to create patterns with. Image by Author.

Conclusion

In this article, we introduced you to Pattern Induction, a tool that helps users quickly and accurately extract text patterns using highlighted text examples. Pattern Induction requires very little effort to jumpstart the process of extracting text. It also does not require the user to write a single line of code.

Supplementary resources

If you are applying Pattern Induction to your documents and you are looking for a more comprehensive user guide and best practices on using Pattern Induction, please check out the next post:

Authors: Dr. Maeda Hanafi, Dr. Yannis Katsis, Dr. Yunyao Li, Dr. Bikalpa Neupane

