
Data sources are getting standardized; can analytics, data science, and ML keep up?

A stance on the opportunity for out-of-the-box insight for standard schemas

Ye olde Salesforce opportunity table! Image by author.

Over the last year leading product, then solutions at Sisu Data, I’ve interviewed over 50 analysts, BI leaders, and data scientists. Ask any of them about their biggest pain point and the answer is uniform: the prep, wrangling, transforming, and munging necessary to make analysis tenable and useful, from ETL jobs to reducing dimensionality to numericizing factor levels.

This is underscored by a CrowdFlower survey showing that 70% of data analyst and data scientist time is spent on data prep, and (no shock) that prep is perceived as the least enjoyable part of the job. In sum, it's a time-consuming, low-value necessary evil.

But I have hope: In the next few years, I believe this kind of data wrangling will dwindle to only 20% of the data workflow due to two trends.

  • The first, which is already occurring, involves the increasing standardization of data at both the source and the use case level.
  • The second, which is the opportunity on the horizon, involves the ML ecosystem's ability to deliver actionable insights out of the box on standard schemas.

With the second trend, both the analyst and the data scientist will increasingly be tasked with either the creation of systems for out-of-the-box insight, or the problem of translation, interpretation, and business user empowerment.

Data source standardization is enabling the automation of data prep

SaaS is standardizing schemas. Image by author.

There are two kinds of standardization occurring for business use cases. One is a stronger form of standardization at the source level. Another is a weaker form of standardization based on common data structures instead of sources. Both enable standardization of data prep.

Strong standardization: Dominant SaaS vendors are determining the base schema

If there’s one good thing that’s come out of SaaS monopolies, it’s data standardization.

For the past 20 years, the Salesforce Opportunity object has dominated the way companies look at pipeline, win rates, and net new revenue. As a result, its prevalence has defined a standard data format for sales and revenue operations across industries. In turn, this standardization has created an entire ecosystem of companies focused on minimizing friction and maximizing the value extracted from the Salesforce opportunity table.

We see similar trends in traffic acquisition (Google Ads, Facebook Ads) and financial transactions (Stripe). Even the last frontier – product and web analytics – is approaching a common standard, with companies like Heap, Segment, and Amplitude putting forth opinionated structures for transforming event-log data while gaining significant market share over the last few years.

Segment enables easy and painless event-level logging, with basic charts and visualization out of the box. Screenshot by author.

However, data standardization at the source level hasn't brought the boom in standard, useful analytics that one might hope for, due to both interpretability and technical problems. It sets up the framework for consistently insightful algorithms and tools to be applied, but outside of the most basic charts, we still require analysts to manually transform data to create reporting tools for the business. This is because the tools for transforming and analyzing the data, outside of the standard dashboard visualizations, still require advanced technical skill, even though the tasks are becoming repeatable. We're still one step removed from true insight.

The tools for transforming and analyzing the data, outside of the standard dashboard visualizations, still require advanced technical skill, even though the tasks are becoming repeatable

Weak standardization: Data structures, ETL, and data joins

If sources are becoming standard across use cases, the gap we need to close is standardizing the way we aggregate data and join it across disparate sources. For example, we might be able to analyze the Salesforce opportunity table on its own, but many teams would like to combine the data in their Salesforce table with product analytics (e.g. Heap, Amplitude, Segment), marketing attribution, and firmographic data in order to understand how feature usage relates to upsell opportunities or churn likelihood.

Fortunately, there's impressive work underway to rid ourselves of this friction. Standardization of ETL/ELT is occurring with companies like Fivetran, which just asks you to point at a data source and then ingests, loads, and (through a partnership with DBT) transforms it in an automated way. This ensures that joins across disparate data sources can also be done in a somewhat standard fashion.

Perhaps the weakest standards are with respect to the final transform for business users, or the dangling end of ETLT. For example, it’s still difficult to understand how to aggregate transactions to the right grain or along the right windows of time to satisfy a business user’s specific need.

But pseudo-standard data structures, even without a single standard data source, are making the dream of 1-click auto-prep possible. For example, when it comes to a standard marketing funnel analysis, listing every step of the funnel as a field, and denoting whether that step was reached with a timestamp, is rapidly becoming a best practice, which sets the stage for data to be transformed by business users.

Funnel analytics involve aggregating different steps of the funnel into timestamps, and attaching features to different elements on the funnel. Screenshot by author.
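To make that pattern concrete, here's a minimal pandas sketch (the user_id and event names are made up) of pivoting a raw event log into the funnel-as-timestamps shape, where each step becomes a column and a null means the step was never reached:

```python
import pandas as pd

# Hypothetical raw event log: one row per user action
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2, 3],
    "event":     ["visit", "signup", "purchase", "visit", "signup", "visit"],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-10",
        "2020-01-03", "2020-01-05", "2020-01-04",
    ]),
})

# Pivot so each funnel step becomes a timestamp column per user;
# NaT means the user never reached that step
funnel = (
    events.pivot_table(index="user_id", columns="event",
                       values="timestamp", aggfunc="min")
          .reindex(columns=["visit", "signup", "purchase"])
)
print(funnel)
```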

As another example, product event-log data comes from many possible sources but has roughly the same structure everywhere. Some companies will create their own Kafka stream from the backend; others rely on a third party like Segment. But event-log structures are largely the same, so one area ripe for friction reduction is the very common and tedious task of aggregating the right event data over the right timeframe. For example, an analyst might want the number of times each user clicked 'Submit' within 90 days, the number of times each user clicked 'Submit' in their first seven days, or the number of times each user commented in their first year.

With the infinite scalability of the cloud, why not automatically aggregate a number of different events across a number of different timeframes, without writing a single line of SQL?
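As a rough sketch of what that automation could look like (the column names here are hypothetical), a few lines of pandas can count each event type per user across several candidate windows after sign-up, no SQL required:

```python
import pandas as pd

# Hypothetical event log with each user's sign-up date already joined in
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "event":     ["submit", "comment", "submit", "submit", "comment"],
    "timestamp": pd.to_datetime(["2020-01-03", "2020-01-20", "2020-03-01",
                                 "2020-02-02", "2020-02-10"]),
    "signup_at": pd.to_datetime(["2020-01-01"] * 3 + ["2020-02-01"] * 2),
})
events["days_since_signup"] = (events["timestamp"] - events["signup_at"]).dt.days

# Count each event type per user inside several candidate windows
frames = []
for window in (7, 30, 90):
    counts = (events[events["days_since_signup"] <= window]
              .groupby(["user_id", "event"]).size()
              .unstack(fill_value=0)
              .add_suffix(f"_first_{window}d"))
    frames.append(counts)

feature_table = pd.concat(frames, axis=1).fillna(0).astype(int)
print(feature_table)
```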

If business data is becoming standardized, why is it still so frustrating to spit out an interesting insight?

As a recap, here are four places where standardization is occurring:

Source: Author

But even without every part of the analytics stack being standardized, building even the simplest models from our nice, high-value Salesforce opportunity table is frustratingly cumbersome.

The big opportunity for ML and compound modeling is to blend the analytics and the ETL together into a single, uniform step.

To illustrate the problem today, let's ask the simplest business question possible of our Salesforce opportunity table – what's correlated with a high rate of closed-won opportunities, so that I can focus on that subgroup? In 2020, I'd want to be able to pop this into a tool with a one-liner and get the answer right away.

In my case specifically, I wanted to run a quick LASSO on the table to understand which features seem to be most important. LASSO is a nice technique for this because its L1 regularization penalty shrinks noisy coefficients all the way to zero. In other words, given a bunch of numerical variables, LASSO will highlight the most influential ones. But I immediately ran into a number of frustrating things, one after the other, even though thousands of data scientists end up writing the same pre-processing logic everywhere:
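(A quick aside before diving into those frustrations: for readers who haven't used LASSO, here's a tiny synthetic sketch of that zeroing behavior in scikit-learn. The data is made up, but it shows the L1 penalty driving the coefficients of the pure-noise columns to exactly zero.)

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 2 informative columns, 8 pure-noise columns
X = rng.normal(size=(500, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_scaled = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(X_scaled, y)

# Only the two informative columns keep non-zero weights;
# the noise columns are shrunk to exactly zero
print(model.coef_.round(2))
```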

Ingest

Oftentimes, the dataset is too big to open in the first place, with many cloud tools failing at around a 10GB cap (our Salesforce table is small, but bear with me for a second). A data scientist must then sample the data in order to open it in a local Jupyter instance.

But sampling starts to break when the end result depends on having the entire dataset. For example, executives want to see total sales or total "closed won" counts across a period, and sampling in this case obviously leads to drastic undercounts. To reduce size in these cases, analysts will resort to building a data cube or running a GROUP BY. And unfortunately, the instant something gets into a cube format, there are few tools that can take advantage of recent advances in ML modelling.
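The GROUP BY workaround itself is mechanical. As a minimal sketch (the opportunities.csv file and the Stage/Amount column names are hypothetical), streaming the export in chunks keeps the aggregates exact without ever loading the full dataset:

```python
import pandas as pd

# Stream the (hypothetical) export in chunks and keep exact aggregates,
# so totals don't suffer from sampling undercounts
totals = None
for chunk in pd.read_csv("opportunities.csv", chunksize=1_000_000):
    partial = chunk.groupby("Stage")["Amount"].agg(["count", "sum"])
    totals = partial if totals is None else totals.add(partial, fill_value=0)

print(totals)  # exact counts and revenue per stage, no sampling needed
```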

Some tools, like Ray and its commercial backer Anyscale, are starting to focus on this part of the data stack, enabling near-infinite scalability on datasets of any size. It just depends on how much you want to pay.

Luckily, our Salesforce table is small, but I would have had a hard time even just obtaining the data if it was bigger.

Transform and aggregation

Let's say our Salesforce dataset is small enough to load in a notebook. Again, I'm specifically interested in understanding what's correlated with a high rate of closed-won opportunities, and would love to use LASSO for feature selection. But then we notice that the schema involves a whole host of non-numeric fields.

Ye olde Salesforce opportunity table! Source: Author.

So my natural next step is to one-hot encode the categorical features, which leads to problems of sparsity and multicollinearity, which in turn require even more munging. And that doesn't even touch the rich text fields. Though out-of-the-box packages make these problems easier to deal with, product managers and data-minded businesspeople won't have time to learn open-source packages like scikit-learn.
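For the curious, this is roughly the repetitive pre-processing being described; the toy opportunity table and its column names are invented, but the one-hot encode, scale, and LASSO steps are the logic every team ends up rewriting:

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Hypothetical slice of a Salesforce-style opportunity table
opps = pd.DataFrame({
    "Amount":     [5000, 12000, 800, 30000, 1500, 9000],
    "Segment":    ["B2C", "B2B", "B2C", "B2B", "B2C", "B2B"],
    "BI_Tool":    ["looker", "tableau", "looker", "none", "tableau", "looker"],
    "LeadSource": ["web", "event", "web", "partner", "web", "event"],
    "closed_won": [1, 0, 1, 0, 1, 0],
})

y = opps["closed_won"]
# One-hot encode the categoricals (this is where sparsity and
# multicollinearity creep in), then scale so the L1 penalty treats
# dollar amounts and 0/1 dummies comparably
X = pd.get_dummies(opps.drop(columns="closed_won"),
                   columns=["Segment", "BI_Tool", "LeadSource"])
X_scaled = StandardScaler().fit_transform(X)

model = Lasso(alpha=0.05).fit(X_scaled, y)
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs[coefs != 0].sort_values())
```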

Plus, the world is evolving such that "insights" will be on-demand with a few clicks, not multi-line scripts.

I’m imagining a point-and-click tool that is essentially "LASSO for Salesforce schema" or "LASSO for Python-series data". It’s likely already in production in some places, and I can’t wait for it to just become standard.

We haven't even gotten to exploring the event-log data I referred to earlier. For one thing, there's a near-infinite number of features one could generate. LASSO helps with selecting features from a list, but not with generating the list itself: Do I want to look at the number of logins in the first 7 days after sign-up? Or the total session duration in the first day after payment? Without any kind of machine-aided help in scanning for the features that matter most, I can only rely on my gut feel, or whatever biases my business users may already have.

Interpretation

After getting to this step, I get a vector of statistically significant coefficients.

GLM output from a student project. Screenshot by author.

But imagine translating this to your business stakeholders! Instead, I have to go back to SQL and make it much more interpretable. In my case, I identified one combination of features that had a very low p-value and a reasonable coefficient even after regularization: B2C = 1 and BI_Tool_looker = 1. Then, I simply looked up the conversion rates for the features that surfaced (in this case, Salesforce opportunities that were B2C converted especially well) and showed the conversion percentage.
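That translation step is itself mechanical. Here's a minimal sketch (the table is made up) of turning a surfaced dummy variable back into the win-rate comparison a stakeholder actually wants to see:

```python
import pandas as pd

# Hypothetical opportunity table: compare closed-won rates for the
# flagged subgroup (B2C) against the rest
opps = pd.DataFrame({
    "Segment":    ["B2C", "B2B", "B2C", "B2B", "B2C", "B2B", "B2C", "B2B"],
    "closed_won": [1, 0, 1, 0, 1, 1, 0, 0],
})

rates = (opps.groupby("Segment")["closed_won"]
             .agg(opportunities="count", win_rate="mean"))
print(rates)  # e.g. B2C opportunities close at 75% vs. 25% for B2B
```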

After all that work, and after scanning through all the p-values and coefficients, it was really just one insight that impressed the team:

  • B2C customers based in San Francisco using Looker and a new lightweight sales process are 3x more likely to get to late stages of our sales funnel.

All that for this? Oof! At this point, I’m just tired.

Early glimmers of hope

With all the standardization occurring in the data world, what if Data Science tools and vendors could match standard data with standard, multi-stage models? What if we could enable business users to go straight from a few standard data sources to insight?

One company that is marching in this direction is Alteryx, which enables the kind of recipe building that integrates ETL and multi-stage modelling into a DAG.

DAGs, like this one in Matillion, enable point-and-click data prep. Alteryx enables ML model creation, training, and productionalization as part of the DAG as well. Screenshot by author.

As seen above, changing the schema, creating the right fields, conducting an aggregation, and then running a logistic regression can all be done with drag-and-drop clicks. But in a world in which data is becoming standard and all business users must be data-driven, even a DAG may be overwhelming.

At Sisu Data (I lead solutions and data products at Sisu), we believe that even these recipes can be simplified. Our vision is that for certain data structures – product event logs, Salesforce opportunities – you shouldn't need a DAG or Python or even SQL to get insights. For the Salesforce table above, we have a multi-stage model that handles sparsity, collinearity, and text parsing in turn, and even makes the outputs highly interpretable.

As business users demand insights, not just reporting or charting, there's a giant opportunity for verticalized, opinionated workflows on standard data structures that go straight from source to insight with a few clicks.

As business users demand insights, not just reporting or charting, there’s a giant opportunity for a verticalized, opinionated stack on standard data structures that augments the path from sources to insight:

  • Ingest: the advent of cloud warehouses makes it so that ingesting TB-scale datasets no longer requires a team of overworked data engineers. Instead, this should be as simple as a business user's dialogue box.
  • Machine-aided aggregation: Aggregation of raw data like transactional event logs across windows of time should become recommendations, instead of a dizzying array of possibilities.
  • Machine-aided joins: With a few standard data sources, we can suggest interesting transforms and joins across different data sources, with the machine intelligently choosing the data that makes the most sense.
  • Modeling: For standard joins and schemas, instead of a hand-built multi-stage model, a solution can apply an 80/20 approach across multiple models to deliver interpretable insight in the best way possible.
  • Insight and interpretation: No more coefficients or variable rank importance charts. The output should be interpretable and in the pithy kind of insight that business users expect, like Facebook’s famous "7 friends in 10 days" insight.

Where does that leave the analyst and the data scientist? If standardization of sources leads to the standardization and automation of the process for gaining insight, data scientists will be free to go deep into the long tail of cases for which out-of-the-box insight no longer works. Analysts will increasingly find themselves in PM-like roles, shepherding the flow of insight and translating it into action. Both will be free of the "least enjoyable part of the job" and can add business value where bespoke work is most needed and most interesting.

