Introducing Ingestum

An extensible, scalable, free and open source unified ingestion framework to make it easier to create and use NLP programs

Walter Bender
Towards Data Science


Image by author.

The market for NLP and other language-based AI is exploding, with an estimated growth of over 20% annually [1]. As an AI company ourselves, we quickly realized that simply accessing unstructured content was a significant technical barrier.

According to a 2015 study by Gartner, 80% of enterprise data is locked away in unstructured documents [2]. Much of this unstructured data is relatively easy to access, but McKinsey has identified the final 20% as particularly challenging to extract and make machine-readable [3]. The promise of NLP rests on getting to that data with as little friction as possible, and we never want to turn away a customer because they have the “wrong” files.

When we looked for an answer, we found the market for ingestion software to be highly fragmented, with dozens of niche specialists and no single solution that fit our needs. For a young startup with a lean budget and big dreams, the lack of a solution posed a significant challenge and an additional hurdle right out of the gate. And as developers with deep roots in the open source community, we were disappointed that we couldn’t find an open source solution.

So we built our own — an extensible, scalable, and easy-to-use unified content ingestion framework. We call it Ingestum™ (“ingest’em”). And because we don’t want other projects or startups to be held back either, today we’re releasing it under a Free/Libre open source license for the world to use.

We recently returned to study the market in greater depth, surveying over 170 potential suppliers, and made some surprising discoveries:

  • The market is extremely fragmented.
  • There are very, very few “pure players” focused solely on ingestion.
  • There are at least a dozen AI suppliers who market their ingestion platform separately.
  • There are many dozens of AI suppliers who tie their ingestion to their AI offer.
  • A common approach is to turn everything into a PDF, then apply optical character recognition (OCR).
  • Almost to a company, the ingestion is proprietary, forcing each new company that enters the market to reinvent the wheel, slowing down innovation and creating a barrier to entry for dynamic, younger companies.

This doesn’t just affect developers. When choosing an AI product, users are locked into the available ingestion software. It can be difficult to kick the tires on ingestion ahead of time, and while many AI companies have excellent in-house solutions, they may still struggle with complex PDF structures or other documents, hindering their efficacy. Not to mention the rapidly-proliferating data formats from new social media or video conference apps. Some companies we came across gloss over the challenges, speaking of “data preparation”. But if it were simple, quickly and reliably ingesting unstructured data wouldn’t still be an unsolved problem.

Similarly, in talking with customers, we have learned that problems with data management are a recurring theme. One customer maintaining an enterprise Pharmacovigilance platform memorably told us “When I look at our safety platform, it’s embarrassing how high I’d need to count as to how many vendors, tools, connection points, that we have. It’s this gigantic, sprawling architecture diagram. It’s embarrassing. I won’t even show it.”

In this proliferation of products, we identified a potential solution. We realized that our ingestion engine framework needed to be, above all, scalable and extensible. When faced with an unfamiliar format, we can create a new Modifier in as little as three hours that is then able to vacuum up every document of that type. We wanted something that required as few point solutions as possible, that could be a fully comprehensive ingestion platform — the last ingestion solution you would ever need.

Now for the central question — why, then, are we open sourcing it? If so many companies are struggling with ingestion, and we believe we have a secret sauce, why are we posting the recipe online for everyone to see (along with, as most recipe blogs these days, a lengthy background essay explaining its origins)?

One reason is our confidence in our core technology. We think that our AI platform is differentiated enough by its Language Intelligence that we don’t need an edge in ingestion. Another reason is that we know the strengths of open source projects. Ingestion is not our core business, and while it only takes a few hours to create a new Modifier, as our company scales it would be nice to have a vastly expanded library of Modifiers to reach for when we encounter a new document type, in addition to all the added benefits of attracting a vibrant developer community to continuously refine and expand upon the product.

But finally, we are big believers in open source. Like most engineering teams, we have relied heavily on standing on the shoulders of the giants who have gone before us. Ingestum has, among its parts, Python components like Beautiful Soup, Camelot, PDFMiner, Pyexcel, Twython, Python-tesseract, and Deep Speech. We are deeply appreciative of the communities of developers that have created these components, thus making our work easier, and we wish to return the favor. Moreover, today, most AI firms are not leveraging these great projects for ingestion. Ingestum is the first free/libre open source framework to bring them together for that purpose.

Much of our engineering team has come together through the open source community as well. I co-founded Sugar Labs, a collaborative free/libre open-source software learning platform for children, as well as Music Blocks, a collection of tools for exploring fundamental musical concepts in a fun way. Through my work in Sugar Labs, I met Martín Abente Lahaye, the lead engineer on Ingestum. Martín and Juan Pablo both contribute to the GNOME project as members of the GNOME Foundation.

We all deeply believe in the mission of open source. We are extremely proud of our work creating Ingestum, and it’s our hope that it lives up to the excellent tradition of FOSS products.

Ingestum — from the Latin for “to ingest” — was designed to address three challenges:

  • to facilitate the writing of scripts to extract unstructured content from arbitrary sources and formats;
  • to provide a framework for extracting content from the diverse universe of source formats; and
  • to allow for the integration with both Python scripts and services at many levels of granularity.

The Ingestum approach, detailed in our documentation [4] [5], has six major concepts:

  1. Sources: source files or data streams that are converted into Ingestum Documents, extensible JSON-encoded formats for further processing;
  2. Documents: the intermediaries to which Modifiers are applied; for example, tabular data, free-text data, or Collection documents;
  3. Modifiers: specific operations applied to all or part of an input Document, returning an output Document;
  4. Pipes & Pipelines: a Pipe is a sequence of Modifiers and a Pipeline is a collection of Pipes; the tried-and-true UNIX® approach;
  5. Conditionals: logical conditions for applying a Modifier selectively; useful with complex unstructured data to, e.g., extract only tables, or only text without tables;
  6. Manifests: expressed in JSON, these describe Sources and Pipelines and their parameters, simplifying command-line invocations of Ingestum.
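The flow among these concepts can be sketched in plain Python. The class and method names below are purely illustrative, not Ingestum’s actual API; consult the documentation [4] for the real interfaces:

```python
# Illustrative sketch of the Document / Modifier / Pipe concepts.
# All names here are hypothetical, not Ingestum's real API.

class Document:
    """An intermediate, JSON-encodable document produced from a Source."""
    def __init__(self, content):
        self.content = content

class Modifier:
    """A single operation: takes a Document, returns a new Document."""
    def apply(self, document):
        raise NotImplementedError

class StripWhitespace(Modifier):
    def apply(self, document):
        return Document(document.content.strip())

class Lowercase(Modifier):
    def apply(self, document):
        return Document(document.content.lower())

class Pipe:
    """A sequence of Modifiers applied in order, UNIX-pipe style."""
    def __init__(self, modifiers):
        self.modifiers = modifiers

    def run(self, document):
        for modifier in self.modifiers:
            document = modifier.apply(document)
        return document

pipe = Pipe([StripWhitespace(), Lowercase()])
result = pipe.run(Document("  Hello, Ingestum!  "))
print(result.content)  # hello, ingestum!
```

A Pipeline, in turn, is just a collection of such Pipes, and a Manifest is a JSON description that names the Sources and Pipelines to run, so the same logic can be invoked from the command line without writing code.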

These components form a unified framework that has been used for the ingestion of sources as diverse as PDF datasheets from a casualty insurance company, research papers from ProQuest and PubMed, XML documents from US government agencies, email threads, Twitter feeds, XLS/XLSX files from a pharmaceutical company, YouTube and Vimeo videos, and audio recordings of meetings. And now you can use the Ingestum framework on your unstructured data, or your customers’ unstructured data.

We think Ingestum makes a compelling sales argument for anyone in the AI space: it is free and very easy to install, prototype, test, and deploy. By its nature, processing unstructured data requires some experimentation and iteration. Ingestum shines at this since the process is broken down into small steps. Today, there are dozens of example Pipelines in the Ingestum repository on GitLab covering a broad range of sources. These provide a starting point for building new, custom pipelines. (The plethora of Pipeline examples simplifies developer onboarding, as it is easier to modify an existing Pipeline than to create a new one.) The Modifier library has broad coverage and adding new Modifiers is typically a matter of just a few hours of effort. Ingestum can be made available through a services framework such as FastAPI and incorporated into low-code and no-code environments.

Ingestum is not perfect. The default OCR library may struggle with some handwritten documents. The default speech-to-text transcription can be improved. But it is easy to drop a preferred library in through Ingestum’s plugin mechanism. Ingestum’s modular architecture means Modifiers can be added for all sorts of document and stream types and formats, and the platform will grow as it is deployed.

We believe Ingestum will be both a competitor and an anti-competitor in the ingestion market. By this we mean: it is free/libre open source software, so you can freely evaluate it and add it to your existing workflows. And conversely, if you believe (as we do) that Ingestum can play the central role in your ingestion process, the software you use today can be added to Ingestum as a plugin. Code in C++ or Java modules? No worries, drop them in with a Python binding. If you have proprietary ingestion code you don’t want to publish, the LGPL does allow you to add your code as a plugin in your implementation. Ingestum can be a bridge between unstructured data of any type and existing solutions, or feed AI processing directly. We believe all boats will rise if AI firms contribute plugins to the framework.
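As a sketch of what “drop it in as a plugin” could look like, an existing extraction routine — whether proprietary Python code or a C++/Java module behind a Python binding — can be wrapped in a Modifier-style adapter. Again, the names here are hypothetical illustrations, not Ingestum’s real plugin API:

```python
# Hypothetical adapter pattern: wrap existing (possibly proprietary)
# extraction code so it can participate in a Modifier-style pipeline.
# Names are illustrative, not Ingestum's actual plugin interface.

def legacy_extract(raw_bytes):
    """Stand-in for existing code, e.g. a C++ extractor behind a binding.
    Here it just decodes bytes and normalizes line endings."""
    return raw_bytes.decode("utf-8").replace("\r\n", "\n")

class LegacyExtractorModifier:
    """Adapter exposing legacy_extract through a Modifier-like interface:
    it consumes a document and returns a transformed document."""
    def apply(self, document):
        return {"content": legacy_extract(document["content"])}

modifier = LegacyExtractorModifier()
output = modifier.apply({"content": b"line one\r\nline two"})
print(output["content"])
```

The point of the pattern is that the wrapped code never needs to be published: only the thin adapter conforms to the framework’s interface, which is what makes the LGPL arrangement described above workable.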

Our vision for Ingestum is to advance AI by making unstructured text easily available for natural language processing (NLP), facilitating the creation of a knowledge fabric that enables further enrichment by AI technology. We believe that Ingestum will be a boon to future AI projects by removing the greatest initial barrier, lowering the cost of innovation. Download Ingestum to see for yourself.

To learn more about Ingestum or to sign up for our webinar on April 27th, go here.

[1] Natural Language Processing Market by Component, Type (Statistical, Hybrid), Application (Automatic Summarization, Sentiment Analysis, Risk & Threat Detection), Deployment Mode, Organization Size, Vertical, and Region — Global Forecast to 2026 (2020), Markets and Markets

[2] A. Dayley, D. Logan, Organizations Will Need to Tackle Three Challenges to Curb Unstructured Data Glut and Neglect (2015), Gartner Research

[3] M. Friesdorf, M. Hedwig, F. Niedermann, Best-in-class digital document processing: A payer perspective (2019), McKinsey & Company

[4] https://sorcero.gitlab.io/community/ingestum/

[5] git clone https://gitlab.com/sorcero/community/ingestum.git


CTO of Sorcero. Founding member of MIT Media Lab. Co-founder of One Laptop Per Child. Co-founder of Sugar Labs. Developer of Music Blocks.