
Unifying Metadata for MLOps


Recently, I wrote about some of the barriers to achieving an MLOps workflow. I am now going to focus on Metadata, an example of a potential barrier, and illustrate how unifying data between Researchers and Engineers facilitates the adoption of MLOps.

Just as the DevOps movement was assisted by tools such as Containerization, Infrastructure as code and Continuous Integration, MLOps will require specific tools to enable its widespread adoption. I believe that Multi-model DBs are well on their way to being one of the vital tools for MLOps and I will provide an evaluation of ArangoDB as one such available tool.

Why the separation of data?

Thinking about the two major processes for the respective teams helps us to understand exactly why we have separation of data in these systems.

Research Teams

Researchers aim to leverage large amounts of labelled data, to generate the Deep Learning models which are now synonymous with Machine Learning. This data needs to be as close to the business use as possible, else performance in training will be significantly different to performance in production. This means that training sets are generally quite simple, from the schematic perspective, often utilising a single or batched tensor input (be it an image, data value, audio…) and a target label or tensors to be output by the model in optimal conditions.
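To make that schematic simplicity concrete, a research-side training set often reduces to little more than (input, target) pairs. A minimal, hypothetical sketch (the field names and values are made up for illustration):

```python
# Hypothetical sketch: a training set as a flat list of (input, target)
# pairs, illustrating how schematically simple research data tends to be.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    features: List[float]  # stand-in for an image / data value / audio tensor
    label: int             # target the model should output in optimal conditions

dataset = [
    Sample(features=[0.1, 0.9, 0.3], label=1),
    Sample(features=[0.7, 0.2, 0.5], label=0),
]

# Batching is then just stacking the inputs and targets:
batch_features = [s.features for s in dataset]
batch_labels = [s.label for s in dataset]
```

Note there is no business metadata here at all: just the tensor-like input and the label the network is optimised against.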

The simplicity of the inputs and outputs from a data perspective requires large volumes of data to learn the complexity of the required task. This is generally the job of the Neural Network itself, to determine attributes which lead to links between inputs and targets.

Development Teams

Teams running the business applications, or putting machine learning models into production generally have very different requirements for their data. They will still have the same input and output types, but additional attributes are required and these can be more important to business decisions than the machine learning output itself. Equally, other processes are involved, costs and client requirements need to be tracked, and a deeper understanding of the system is needed. This can be achieved through metadata, usually in relational databases, sitting behind the services, websites or systems.

Human Verification

More recently, companies have been using a combination of Machine Learning and Human verification in production, in order to reach performance levels which can actually be sold to customers. Often Machine Learning alone is not accurate enough, so confidence thresholds are set below which results are sent for manual Human checks before being returned to clients.
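That threshold routing can be sketched in a few lines. This is a hypothetical, minimal version: the threshold value, record shape and field names are illustrative assumptions, not any particular production system.

```python
# Hypothetical sketch: routing low-confidence predictions to human review.
CONFIDENCE_THRESHOLD = 0.9  # illustrative value; set per business case

def route(prediction: str, confidence: float) -> dict:
    """Return a result record, flagged for human verification when the
    model's confidence falls below the agreed threshold."""
    return {
        "prediction": prediction,
        "confidence": confidence,
        "needs_human_check": confidence < CONFIDENCE_THRESHOLD,
    }

auto = route("cat", 0.97)    # confident: goes straight to the client
manual = route("dog", 0.62)  # uncertain: queued for a human checker
```

The human's verdict on each flagged record is exactly the labelled data Researchers want back, which is why tying these records to a production-only database is such a loss.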

This information is obviously useful to research teams: it provides further labelled data, as well as the ability to detect data shift earlier, especially during growth stages. It also allows teams to check the results from ML solutions, and other business processes, against Human-level results. Often these results are tied to a production database system, which creates a wall between Researchers and the data they need to improve models.

Requirements of a unified database

So, in order to unify this data, we will need to combine the scale of the data needed for training deep learning models with the complexity needed by business applications, as well as having an easy link between production services (including Human verification) and Researchers’ datasets.

Separated Database pipeline example

In the above diagram, we have a number of different services (A, B, C and D), all of which push the common keys (k1, k2) for metadata across the pipeline in order to preserve information which might be useful at a later stage. This means that we often duplicate data we don’t need, simply passing it on, and it is very difficult or impossible to query across the different databases, which can be in different formats, regions or architectures.

Often, in the above architecture, it is very difficult to track the progress of an asset or piece of data through the workflow and gain insights from this tracking. It is also difficult for Researchers to combine information from multiple sources into a dataset which can be used for training or fine-tuning existing models.

Solution

Multi-model databases have been growing in popularity since around 2010, providing a good all-round solution to complex requirements which cannot be met by traditional Relational Databases. Redis is one example, originally providing a key-value store before expanding into document (JSON), property graph, streaming and time-series data. With specific reference to large-scale computer vision tasks, I have tested various Multi-Model DBs recently, and will focus on those results for the remainder of this article, but please be aware that there are several more options available for different requirements.

ArangoDB for Machine Learning

ArangoDB combines a Document store and Graph Database, built on top of RocksDB, and therefore also provides a key-value store by default. Scaling horizontally is achieved by moving connected data closer together, to avoid expensive cross-server queries as much as possible. This allows the solution to be very effective at high scale, while achieving all of the benefits of a Document store and Graph Database.

Document Store

Production systems often use a document store to hold data relevant to their usage, as an alternative to the traditional relational database: each document can carry many attributes, with usually only a few of them indexed or connected. Production systems generally deal with updates / inserts of a single object in the pipeline at a time, so relational DBs and document stores are equally effective in this use case.

GraphDB

Graph Databases have been around for a long time, and their usage started to take off in the 90s due to the requirements of the internet. Graphs are very efficient at modelling complex connections, where query times are massively reduced compared to what would normally involve massive index lookups in a relational format. This is exactly what we need in MLOps to understand our complex data formats, especially those passing through multiple systems, or using a complex ML Pipeline.

In ArangoDB, documents are stored in collections. As with any document store, each document is indexed by a _key attribute. Graph functionality is then added through a special type of collection called an Edge Collection, which indexes _from and _to attributes referencing collection/_key within the same database. This provides the requirements given by both teams above.
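As a rough mental model of that layout, here is a toy version in plain Python, with dicts standing in for collections. The collection names, keys and fields are hypothetical; in ArangoDB itself the _from and _to values really are strings of the form collection/_key.

```python
# Hypothetical toy model of ArangoDB's document and edge collections.
# Two document collections, each document indexed by its _key:
images = {"img1": {"_key": "img1", "path": "/data/img1.png"}}
labels = {"lab1": {"_key": "lab1", "value": "cat"}}

# An edge collection stores _from/_to references of the form
# "collection/_key", linking documents across collections:
tagged = [
    {"_from": "images/img1", "_to": "labels/lab1"},
]

def neighbours(doc_id, edges):
    """Follow edges out of doc_id, returning the referenced document ids."""
    return [e["_to"] for e in edges if e["_from"] == doc_id]
```

Engineers get keyed document lookups for metadata; Researchers get the edges to traverse when assembling datasets, all in one store.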

In Practice

Unified store pipeline example

Instead of passing data along the pipeline, as previously, we can create a unified data store to hold all of this information in a single place. Each service in the process then reads the information it needs based on a unique identifier passed in the request, and writes any updates which may be needed later in the pipeline back to the unified store. This both gives Engineers a view into the system as a whole and gives Researchers a single database from which to create datasets.
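A minimal sketch of that pattern, with a dict standing in for the unified store (the service names, fields and identifier are made up for illustration):

```python
# Hypothetical sketch: services share one unified store, keyed by a
# unique identifier, instead of passing data down the pipeline.
store = {}  # stands in for the unified database

def service_a(asset_id):
    """First stage: creates the record for this asset."""
    store[asset_id] = {"source": "ingest", "status": "received"}

def service_b(asset_id):
    """Later stage: reads what it needs and writes its updates back."""
    record = store[asset_id]
    record["status"] = "processed"

service_a("k1")
service_b("k1")
# Engineers and Researchers now query the same record via its id, "k1".
```

Only the identifier travels between services; the data itself stays in one place, which is what makes both pipeline-wide analysis and dataset creation tractable.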

Before Multi-Model databases it would have been difficult to leverage this information from both perspectives in a single solution. For researchers wanting to create interesting data sets with which to train models, graph database technologies make linking this information faster and easier. From their perspective, connections are more important. For engineers wanting to be able to understand and analyse the entire pipeline in terms of performance metrics, unique keys for metadata documents allow this to be tracked in a single place. From their perspective, document metadata and indexing is more important.

Performance

But, just how well does it perform at each of these tasks? Let’s start by comparing it as a graph database. I mostly took to comparing ArangoDB to DGraph during this project, since DGraph also seemed to meet a lot of the scaling and complexity requirements of the project.

The following are taken from Arango’s own performance pages, for two of the standard graph db performance metrics:

2018 performance metrics from ArangoDB.

N.B. In my opinion, the (ArangoDB MMFiles) metrics can be ignored: in my experience there is no reason not to use RocksDB, which is the default. I suspect these figures are simply left over from older images.

The benchmark query is Neighbors of Neighbors plus a filter, with a distinct count. I also ran this test over 2 "hops" (i.e. Neighbors of Neighbors), 3 "hops" and 4 "hops", additionally adding filters at various stages.
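For readers unfamiliar with the query, here is a hypothetical pure-Python equivalent of the Neighbors of Neighbors test: traverse a fixed number of hops, apply a filter, then count distinct results. The graph itself and the filter are invented for illustration.

```python
# Hypothetical pure-Python version of the "neighbors of neighbors"
# benchmark query: N-hop traversal, filter, count distinct.
edges = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}

def neighbors_of_neighbors(graph, start, hops=2):
    """Return the distinct set of vertices exactly `hops` steps from start."""
    frontier = {start}
    for _ in range(hops):
        frontier = {n for v in frontier for n in graph.get(v, [])}
    return frontier

# Distinct 2-hop neighbours of "a", filtered to exclude the start node:
result = {n for n in neighbors_of_neighbors(edges, "a") if n != "a"}
count = len(result)
```

In a relational schema each hop is another self-join on an edge table, which is why graph stores pull ahead as the hop count and filters grow.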

Graph Performance Metrics

Now let’s look at document collection performance metrics. Given how well ArangoDB performs as a GraphDB, you might expect it to perform fairly poorly here, but the results show that it actually massively outperforms MongoDB at read/write speed and often outperforms relational competitors. Aggregations, supposedly one of the strengths of the relational format, also perform similarly or better on ArangoDB.

My tests included single index lookups in large collections, as well as write syncs and aggregations.

Relational Performance Metrics

As you can see, this is an impressive set of results, in that ArangoDB performs almost as well as MySQL (a pure relational database) against these kinds of standard relational tests, and massively outperforms Document-only and Graph-only stores. Other relational formats were not compared, since they would not have given the full functionality required above, but comparing against an existing MySQL instance was useful for benchmarking as part of these tests.

Scaling

As I mentioned earlier, scaling is always a concern in Machine Learning databases. When data is often the most important business asset, you have to make sure that the solution you are building for storing data is correct for this year, and the following years according to projected scales. Replacing the solution repeatedly as the business (and hopefully therefore data) grow exponentially is a costly process, which can also have an effect on the morale of the engineers who build the solution, as well as researchers who have to continually build new tools.

Considerations

When considering a database solution, always ensure that you have thought about the opportunity cost, over time, of not testing out a potential product or solution at scale. At the same time, be realistic. What is the actual probability of reaching that scale in the next 2 years? If it’s fairly low, then it might be preferable to consider a solution which is easier to use for current scales, rather than using the wrong tools for the current size of the business. This is a difficult balancing act, and will be different for each business case, so be mindful of the options and have the testing and numbers to back up your decisions earlier rather than later.

Conclusion

ArangoDB was by a fairly large margin the best option for use as a single Unified Datastore behind a Machine Learning based Pipeline. The Multi-model approach allows users with completely different requirements and perspectives to use and share information and knowledge bases, helping to bring these separated teams together.

The main reason to choose ArangoDB over other options from the perspective of machine learning was the ease of use for complex requirements of different teams. ArangoDB had to perform incredibly well across multiple use cases in order to convince me that it was the correct choice, but it did this and gave a solution which I felt would cover the diverse use-cases which were required better than any other option.

I feel that multi-model databases are going to be vital in the coming years to aid with the complex and divergent datasets which a true widespread adoption of MLOps is going to require. If this is going to be a widely adopted tool, functionality that meets the needs of all parts of the business and enables each team to meet their goals will be crucial for success, as well as starting to bridge the gap between teams and break down the walls preventing MLOps.

