Sampling isn’t enough: profile your ML data instead

Production logging approaches for AI and data pipelines

By Isaac Backus and Bernease Herman

It’s 2020 and most of us still don’t know when, where, why, or how our models go wrong in production. While we all know that “what can go wrong, will go wrong,” or that “the best-laid plans of mice and [data scientists] often go awry,” complicated models and data pipelines are all too often pushed to production with little attention paid to diagnosing the inevitable unforeseen failures.

In traditional software, logging and instrumentation have been adopted as standard practice to create transparency and make sense of the health of a complex system. When it comes to AI applications, logging is often spotty and incomplete. In this post, we outline and compare different approaches to ML logging. Finally, we offer an open source library called WhyLogs that enables data logging and profiling in only a few lines of code.

What is logging in traditional software?

Logging is an important tool for developing and operating robust software systems. When your production system reaches an error state, it is important to have tools to better locate and diagnose the source of the problem.

For many software engineering disciplines, a stack trace helps to locate the execution path and determine the state of the program at the time of failure. However, a stack trace does not give insight into how the state changed prior to the failure. Logging (along with its related term, software tracing) is a practice in which program execution and event information are written to one or more files. Logging is essential to diagnosing issues with software of all kinds; it is a must-have for production systems.
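As a point of reference, here is a minimal sketch of conventional logging using Python’s standard logging module (the service and order details are invented for illustration):

```python
import logging

# Write timestamped, level-tagged events to a file so that the state
# changes leading up to a failure can be reconstructed afterwards.
logging.basicConfig(
    filename="service.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("payment-service")  # hypothetical service name

def charge(order):
    if order["amount"] <= 0:
        raise ValueError("invalid amount")

def process(order):
    logger.info("processing order id=%s amount=%s", order["id"], order["amount"])
    try:
        charge(order)
    except ValueError:
        # logger.exception records the message at ERROR level plus the stack trace
        logger.exception("charge failed for order id=%s", order["id"])
        raise

process({"id": 1, "amount": 25.0})
```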

How is data logging different?

Statistical applications, such as those in data science and machine learning, are prime candidates for requiring logging. However, due to the complexity of these applications, the available tools remain limited, and their adoption is much less widespread than standard software logging.

Statistical applications are often non-deterministic and involve many state changes. Because they must handle a broad distribution of states, strict logical assertions must be avoided, and machine learning software often never reaches an error state, instead silently producing a poor or incorrect result. This makes error analysis far more difficult, as maintainers are not alerted to the problem as it occurs.

When error states are detected, diagnosing the issue is often laborious. In contrast to explicitly defined software, datasets are especially opaque to introspection. Whereas software is fully specified by code and developers can easily include precise logging statements to pinpoint issues, datasets and data pipelines require significant analysis to diagnose.

While effective logging practices in ML may be difficult to implement, in many cases they are even more necessary than with standard software. In typical software development, an enormous number of issues can be caught before deploying to production by compilers, IDEs, type checking, logical assertions, and standard testing. With data, things are not so simple. This motivates the need for improved tooling and best practices in ML operations that advance statistical logging.

The generic requirements for good logging tools in software development apply equally well in the ML operations domain. These requirements include (but are of course not limited to) the following.

Logging requirements

  1. Ease of use
    Good logging aids in development by exposing internal functioning early and often to developers. If logging is clunky, no one is going to use it. Common logging modules in software development can be nearly as straightforward to use as print statements.
  2. Lightweight
    Logging should not interfere with program execution; it must therefore be lightweight.
  3. Standardized and portable
    Modern systems are big and complex, and we must be able to debug them. Logging requires multi-language support, and output formats should be standard and easily searched, filtered, consumed, and analyzed from multiple sources.
  4. Configurable
    We must be able to modify verbosity, output location, and possibly even formats, for all services without modifying the code (a sketch of this follows the list). Verbosity and output requirements can be very different for a developer or a data scientist than for a production service.
  5. Close to the code
    Logging calls should live within the code/service they refer to, and logging should let us very quickly pinpoint where the problem occurred within the service. Logging provides a systematic way to generate traces of the internal, logical functioning of a system.
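
To make the configurability requirement concrete, here is a minimal Python sketch in which verbosity and output location are pulled from environment variables rather than hard-coded (the variable names are illustrative, not a standard):

```python
import logging
import logging.config
import os

# Verbosity and destination come from the environment, not the code:
# a developer can run with LOG_LEVEL=DEBUG while production keeps INFO.
logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "std": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"}
    },
    "handlers": {
        "default": {
            "class": "logging.FileHandler",
            "filename": os.environ.get("LOG_FILE", "app.log"),
            "formatter": "std",
        }
    },
    "root": {
        "level": os.environ.get("LOG_LEVEL", "INFO"),
        "handlers": ["default"],
    },
})

logging.getLogger(__name__).debug("only emitted when LOG_LEVEL=DEBUG")
```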

Which approaches are available when it comes to ML logging?

Standard in-code logging

In data science, much can and should be done with standard logging modules. We can log data access and which steps (training, testing, etc.) are being executed, and model parameters, hyperparameters, and other details can be logged as well. Services and libraries focused on ML use cases (such as CometML) can expand the utility of such logging.

While standard logging can provide much visibility, it provides little to no introspection into the data.
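
As a sketch of what this looks like in practice, here is a hypothetical training pipeline instrumented with Python’s standard logging module (all names and values are invented):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("training")

# Standard logging records which stage ran and with which settings,
# but says nothing about the distribution of the data itself.
params = {"learning_rate": 1e-3, "batch_size": 64, "epochs": 3}
logger.info("loading training data from %s", "train.parquet")
logger.info("starting training with hyperparameters: %s", params)
for epoch in range(params["epochs"]):
    train_loss = 0.0  # placeholder for the real training step
    logger.info("epoch=%d train_loss=%.4f", epoch, train_loss)
```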

Pros

  • Flexible and configurable
  • Can track both intermediate results and data of low complexity
  • Allows reuse of existing non-ML logging tools

Cons

  • High storage, I/O, and computational costs
  • Logging format may be unfamiliar or inappropriate for data scientists
  • Log processing requires computationally expensive search, particularly for complex ML data
  • Lower data retention due to expensive storage costs; less useful for root cause analysis of past issues

Sampling

A common approach to monitoring the enormous volumes of data typical to ML is to log a random subset of the data, whether during training, testing, or inference. It can be fairly straightforward and useful to randomly select some subset of the data and store it for reference later. However, sampling-based data logging does not accurately represent outliers and rare events. As a result, important metrics such as minimum, maximum, and unique values cannot be measured accurately. Outliers and uncommon values are important to retain because they often affect model behavior, cause problematic model predictions, and may be indicative of data quality issues.
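A toy example makes the failure mode concrete: even a generous 0.1% sample of one million values will almost always miss a single pathological event, so the sampled maximum badly misrepresents the data:

```python
import random

random.seed(0)
# One million synthetic "latencies" plus a single extreme outlier.
data = [random.gauss(100, 10) for _ in range(1_000_000)]
data[12345] = 50_000.0  # a rare, pathological event

# Log a 0.1% random sample, as a sampling-based logger would.
sample = random.sample(data, k=1_000)

print(max(data))    # 50000.0 -- the outlier
print(max(sample))  # typically ~140: the outlier is almost certainly missed
```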

Pros

  • Straightforward to implement
  • Requires less upfront design than other logging solutions
  • Log processing identical to analysis on raw data
  • Familiar data output format for data scientists

Cons

  • High storage, I/O, and computational costs
  • Noisy signals and limited coverage; small sample sizes required to be scalable and lightweight
  • Not human-readable or interpretable without a statistical analysis step
  • Rare events and outliers will often be missed by sampling
  • Outlier-dependent metrics, such as min/max and unique values, cannot be accurately calculated
  • Output format is dependent on the data, making it more difficult to integrate with monitoring, debugging, or introspection tools

Data profiling

A promising approach to logging data is data profiling (also referred to as data sketching or statistical fingerprinting). The idea is to capture a human-interpretable statistical profile of a given dataset to provide insight into the data. A broad range of efficient streaming algorithms already exists for generating scalable, lightweight statistical profiles of datasets, and the literature is active and growing. However, there remain significant engineering challenges around implementing these algorithms in practice, particularly in the context of ML logging. WhyLogs, introduced below, is working to overcome these challenges.
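To illustrate the core idea, here is a minimal, hand-rolled streaming profile, not any particular library’s implementation, that tracks count, min/max, mean, and standard deviation in constant memory using Welford’s online algorithm (production-grade sketching libraries add approximate quantiles, uniqueness estimates, and frequent items):

```python
import math

class StreamingProfile:
    """Constant-memory profile of a numeric stream: count, min, max,
    mean, and standard deviation via Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.count += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

profile = StreamingProfile()
for x in (3.0, 50_000.0, 5.0, 4.0):   # the outlier is captured exactly
    profile.update(x)
print(profile.count, profile.min, profile.max, profile.mean)
```

Unlike a sample, this profile records the outlier’s effect on min/max exactly while storing only a handful of numbers, no matter how large the stream grows.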

Pros

  • Ease of use
  • Scalable and lightweight
  • Flexible and configurable via text-based config files
  • Accurately represents rare events and outlier-dependent metrics
  • Directly interpretable results (e.g., histograms, mean, std deviation, data type) without further processing

Cons

  • No existing widespread solutions
  • Involved mathematics and engineering problems behind the solution

Making data logging easy and uncompromising! Introducing WhyLogs.

The data profiling solution, WhyLogs, is our contribution to modern, streamlined data logging for ML. WhyLogs is an open source library with the goal of bridging the ML logging gap by providing approximate data profiling and fulfilling the five logging requirements above (easy, lightweight, portable, configurable, close to code).

The estimated statistical profiles include per-feature distribution approximations which can provide histograms and quantiles, overall statistics such as min/max/standard deviation, uniqueness estimates, null counts, frequent items, and more. All statistical profiles are mergeable as well, making the algorithms trivially parallelizable, and allowing profiles of multiple datasets to be merged together for later analysis. This is key for achieving flexible granularity (since you can change aggregation levels, e.g., from hourly to daily or weekly) and for logging in distributed systems.
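
To see why mergeability matters, consider the following toy sketch (deliberately simplified, and not the WhyLogs implementation): merging two hourly profiles reproduces exactly the statistics you would get by profiling the whole day at once:

```python
from dataclasses import dataclass

@dataclass
class Profile:
    count: int
    min: float
    max: float
    mean: float

    def merge(self, other: "Profile") -> "Profile":
        # Merging is exact: profiling hour by hour and merging gives
        # the same result as profiling the whole day in one pass.
        n = self.count + other.count
        return Profile(
            count=n,
            min=min(self.min, other.min),
            max=max(self.max, other.max),
            mean=(self.mean * self.count + other.mean * other.count) / n,
        )

hour1 = Profile(count=1000, min=0.5, max=120.0, mean=42.0)
hour2 = Profile(count=4000, min=0.1, max=50_000.0, mean=45.0)
daily = hour1.merge(hour2)
print(daily)  # Profile(count=5000, min=0.1, max=50000.0, mean=44.4)
```

Real profiles merge richer structures (sketches for quantiles and cardinality), but the principle is the same: because merging is associative, the aggregation level can be chosen after the fact.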

WhyLogs also supports features suitable for production environments, such as tagging, a small memory footprint, and lightweight output. Tagging and grouping features are key for enabling segment-level analysis and for mapping segments to core business KPIs.

Portable & Lightweight

Currently, there are Python and Java implementations, which provide Python integration with pandas/numpy and scalable Java integration with Spark. The resulting log files are small and compatible across languages. To validate the WhyLogs memory footprint and output binary size, we tested the Java implementation by running our profiler over several datasets and collecting JMX metrics.

Ease of use, Configurable & Close to code

WhyLogs can be easily added to existing machine learning and data science code. The Python implementation can be `pip` installed and offers an interactive command-line experience in addition to a library with an accessible API.
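
Here is a minimal sketch of profiling a pandas DataFrame (the whylogs API has evolved across releases; this assumes the v1-style `why.log` entry point, so consult the documentation for the version you install):

```python
# pip install whylogs
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "amount": [12.5, 30.0, 7.2, 50_000.0],   # note the outlier
    "category": ["a", "b", "a", "c"],
})

# Profile the dataframe: counts, min/max, approximate quantiles,
# uniqueness estimates, and more, computed per column.
results = why.log(df)
profile_view = results.view()

# Inspect the profile as a pandas summary table.
print(profile_view.to_pandas())
```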

More Examples

For more examples of using WhyLogs, check out the WhyLogs Getting Started notebook.

Powerful Additional Features

The full power of WhyLogs can be seen when it is combined with monitoring and other services for live data. To explore how these features pair with WhyLogs, check out the live sandbox of the WhyLabs Platform, which runs on a modified version of the Lending Club dataset and ingests WhyLogs data daily.

WhyLabs Platform screenshots capturing the model health dashboard and a feature health view. Image by author

Let’s make data logging a gold standard in production ML systems!

Data science, machine learning, and the technology surrounding them are developing at a breakneck pace, along with the scale of these operations and the number of people involved in them. With that rapid growth comes an inevitable explosion of problems. Best practices remain nascent in ML, but as has been the case with software and systems engineering, they must continue to grow and mature. Effective logging must take a primary role among best practices for operating robust ML/AI systems, and projects like WhyLogs will be required to address the unique challenges of these statistical systems.

Check out WhyLogs for Python here and Java here, or get started with the documentation. We love feedback and suggestions, so join our Slack channel or email us at support@whylabs.ai!

Thanks to Bernease Herman, my WhyLabs teammate, for co-authoring the article. Follow Bernease on Twitter.
