The analytical application stack

Opportunities and challenges in building "data apps"

Image by Author

Over the last few years, data infrastructure has evolved tremendously. Much has been written about the "modern data stack", and at this point there are an immense number of startups covering areas like data quality, data monitoring, reverse ETL, and similar.

However, one area where less attention has been paid, yet where there still seems to be significant opportunity, is the stack for building analytical applications. By "analytical application", I mean an end-user-facing application that natively includes large-scale, aggregate analysis of data in its functionality. These are sometimes also referred to as "data apps".

As an example, consider marketplace companies. Almost all marketplaces have a user interface that enables sellers to see aggregate data about their "stores". Many of these interfaces allow the seller to slice and dice this data by the dimensions that might be relevant to them, such as SKUs or the geographies of buyers. In certain cases, such an interface may also provide richer functionality, such as diagnosing why sales changed in a given timeframe or what the impact of a recent SKU launch was.

Shopify's Analytics Dashboard is one example of an "analytical application"

As data becomes a core part of virtually everything we do, this type of product use case is becoming increasingly common. When you click on a home listing on Zillow, a large number of calculations run to show you all of Zillow's data about that home, such as its "Zestimate". When you interact with Wealthfront's "Path" feature to help plan out your financial future, a complex set of forecasts and aggregations runs under the hood. The same is true when you click on "Who Viewed My Profile?" on LinkedIn. Stripe Sigma is essentially an embedded data analytics tool for Stripe's merchants.

Stripe Sigma allows merchants to write SQL and explore their business data

Yet, while there are an increasing number of examples of companies building products like this, the tooling to enable them is still nascent. Having spent a lot of time talking both to practitioners dealing with this challenge and to emerging startups trying to solve it, I wanted to give an overview of the current state of this market and the open gaps I see that still need to be solved.

Moving beyond embedded dashboards

At first glance, you might ask – "Isn’t this essentially the same thing as an embedded business intelligence dashboard? The Shopify example above is even called ‘Dashboard’! We’ve had tools to solve this for years."

Indeed, the traditional way of putting analytical functionality into applications has been embedded BI tools such as Looker Embedded Analytics and Tableau Embedded Analytics. Such product offerings are extremely successful, representing a very large portion of the revenue of these BI products (>30% by some estimates).

However, at the same time, these tools are almost universally disliked and are woefully inadequate for what most modern companies need – you would be hard-pressed to see Stripe decide to build Stripe Sigma around embedded Looker.

There are a few core reasons for this. First, embedded analytics products are typically delivered as iframes. While this approach is simple, it means you have extremely limited control over the analytical application. This poses challenges from a customization perspective (the frontend team is prevented from deeply integrating custom application logic with the data visualization logic) and from a design perspective (it is essentially impossible to make an iframe feel "native", resulting in an inelegant, low-quality product experience).

The fact that embedded analytics tools have historically been oriented around iframes is also illustrative of a broader issue plaguing this space – traditionally, data science and analytics teams have not been used to adopting rigorous software engineering standards (versioning, tests, reviews, incremental rollouts and rollbacks, etc.), yet they have been the owners of the embedded iframes. Product teams hate this because it often leads to broken product experiences they have little to no control over. The lack of attention paid to carefully building and maintaining "data products" is an overarching issue in the industry, one that tooling like dbt is starting to improve at the data infrastructure level. However, these principles have not really made their way to the analytical application layer yet. Put simply – the iframe is not the right contract between product engineering and data teams.

The second core issue with these embedded solutions is that their performance is poor, leading to constant loading bars and sluggish responsiveness (largely because they don't take advantage of many of the recent advances in cloud data infrastructure).

Finally, these products are ultimately highly constraining in terms of the end-user experience delivered – they will always look and feel like a BI tool, yet dashboards typically have limited utility, and the most interesting use cases for data almost always tie into a workflow.

This set of constraints is fundamentally incompatible with the types of use cases that many modern, data-driven companies (such as Zillow, Wealthfront, LinkedIn, and Stripe mentioned above) have. As such, for the past decade, most of these companies have been left to do everything themselves, from database optimization to SQL management to the visualization layer, and everything in between.

Analytical applications have unique constraints

Part of the reason it has been so challenging for companies to build all of this themselves is that, in many ways, the constraints of analytical applications are somewhat unique in the data space, making it difficult to simply repurpose traditional data tooling. This takes shape in a few ways.

  1. High concurrency – Because analytical applications ultimately serve end users, rather than internal employees or internal systems, you must support dramatically higher numbers of concurrent requests than most traditional data tools were designed for.
  2. Low latency – Customers typically expect fast, snappy product experiences. While your internal data scientists may be okay waiting 5–10 minutes for a query to run (though likely a bit grumpy!), this is a complete non-starter for many user-facing applications, especially if the data analysis is required as part of the core workflow that your product enables.
  3. Software engineering teams become stakeholders – Getting analytical applications to work requires tight collaboration between data teams (data platform, data engineering, data science) and the product teams (frontend, backend) actually delivering the product. Most product teams know nothing about large-scale data infrastructure, and they are a set of users most data tools were never designed around. Furthermore, these two personas are used to completely different standards and processes for deploying products – software teams typically follow a rigorous software development process, as discussed in the context of iframes above, whereas techniques like testing and versioning are only starting to be adopted in data. This can create a lot of friction.
  4. Analytical functionality must be integrated, not isolated – Ultimately, analytical applications involve data analysis becoming a core component of the workflows a product offers. It is rarely sufficient to simply show some visualizations and be done – the data analysis, visualization, and compute need to be deeply integrated into the product itself. This is quite different from the way most data tools work, where the notebook, the SQL editor, the DBMS, or similar are designed to do nothing besides analyze data. As an illustrative example – over the past decade, we have seen payments go from something that feels very "third party" (think of getting kicked out to a separate PayPal processing page when you wanted to buy something on eBay) to something that is deeply embedded into checkout and eCommerce flows (think of Shopify Pay). This has led to immense benefits in user experience and conversion rates. Data, similarly, shouldn't feel "bolted on" to a product.

Solving for these constraints typically requires doing substantial work across the database, the caching layer, the "middleware" layer between data and application, and the visualization layer. Luckily, we are beginning to see the emergence of tooling that helps address each of these categories.

Anatomy of the key layers of an analytical application, alongside some of the key players at each layer. There are also various "All in One" tools that aim to simplify the creation of simple data apps, though these tools have historically lacked the performance and flexibility needed to power core parts of externally facing applications. Image by Author.

Layers of an Analytical Application

The Database Layer

For the most part, you can't run an analytical application directly on a data warehouse like Snowflake, a "lakehouse" such as Databricks Delta Lake, or a traditional non-columnar relational database like MySQL. The former two categories were primarily designed for the lower-concurrency, "batch" operations you typically see associated with internal business intelligence and data science workloads, while the latter category is designed more for row-by-row queries (and write-heavy workloads) than large-scale analytical aggregations.

As a result, over the past 3–4 years we have seen the emergence of a number of "real-time" or "low-latency" in-memory databases that make tradeoffs more suited to analytical applications, such as Apache Druid (Imply), Apache Pinot (StarTree), ClickHouse (Altinity), Rockset, and Rill Data. Many of these tools emerged as in-house projects at companies dealing directly with the challenges of building analytical applications – for example, Pinot emerged at LinkedIn as a tool to simplify the complexity of powering "Who Viewed My Profile?". Among various other architectural differences, these products generally optimize for keeping data in memory and handling very high volumes of concurrent queries, whereas data warehouses like Snowflake optimize for lower-cost, on-disk storage. This blog post by StarTree does a good job of diving deeper into the pros and cons of different approaches for user-facing applications.

It is worth noting that some of these products also emphasize the "operational" or "real-time" element of their offering, by which they mean how quickly a new write to an upstream source database gets reflected in reads by the application. This propagation can take minutes to hours in a traditional data warehouse setup, but it can be a matter of seconds with these tools (assuming you add in other technologies like Kafka and Debezium). Such functionality is not always needed for analytical applications, but it is very important in situations where the user must react quickly to how the world is changing.
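
To make that real-time path a bit more concrete, here is a minimal, hypothetical sketch of what often sits between Debezium and a low-latency store: a consumer reads change events off a Kafka topic and upserts them into whatever system serves the application. The topic name, connection details, and the `upsert_into_realtime_store` helper are placeholders, and the code assumes the default Debezium JSON envelope (no unwrap transform).

```python
# Sketch only: consume Debezium change events from Kafka and forward them to a
# low-latency analytical store. Topic name, connection details, and the
# upsert_into_realtime_store helper are hypothetical placeholders.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.public.orders",           # hypothetical Debezium topic: server.schema.table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

def upsert_into_realtime_store(row: dict) -> None:
    """Placeholder for writing into Pinot/Druid/ClickHouse or a cache."""
    print("upserting", row)

for message in consumer:
    event = message.value
    if event is None:                     # tombstone message
        continue
    payload = event.get("payload", {})
    op = payload.get("op")                # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op in ("c", "u", "r"):
        upsert_into_realtime_store(payload.get("after", {}))
    # deletes ("d") would be handled here as well
```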

The Caching and Pre-Aggregation Layer

The next step in creating an analytical application typically involves building some kind of caching or pre-aggregation layer. This is essential if you want to minimize user-facing latency, and it also has cost-reduction benefits, as intelligent caching can significantly reduce the number of queries hitting your database. While some of the aforementioned low-latency databases natively include caching functionality (such as Rockset), others do not (such as Pinot). Furthermore, in many cases it is infeasible for a company to set up or fully rely on a low-latency analytical database, in which case a secondary caching layer is absolutely essential. Some examples of this include:

  1. You do not have the data engineering bandwidth to set up a secondary data store beyond Snowflake, but you want to launch some basic analytical application functionality on top of Snowflake or BigQuery.
  2. Your application needs to run on top of a relational rather than a columnar database, such as MySQL or Postgres, but it is trending towards including more and more "aggregate" queries (vs. pure row-level queries), and this is significantly slowing your application.
  3. It is infeasible from a cost perspective to keep all your data in an in-memory, low-latency database. You can only keep data from the past 24 hours in this "hot" environment, but users sometimes run queries that involve much older data.

Traditionally, there have been a few ways to solve caching and pre-aggregation, all of which have been quite cumbersome and generally poor solutions.

On the database side, one approach is to store pre-aggregations as additional rows in the database – for example, if your database stores data about page views by user, you might add secondary rows which are refreshed asynchronously on some cadence and represent "Trailing Month Total Page Views" per user. You might then update your query logic to preferentially just pull these aggregated values when they match the incoming query. Unfortunately, this approach has numerous issues, such as maintaining consistency between pre-aggregations and raw data, adding significant storage overhead in high-cardinality or high-dimensionality scenarios, and generally adding substantial complexity and maintenance overhead to both frontend and database logic.
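
As a rough illustration of this pattern (using SQLite and a hypothetical page-view schema purely for brevity), the rollup table below is rebuilt on a cadence by a scheduled job, and the application query path reads from it instead of scanning raw rows:

```python
# Illustrative sketch (hypothetical schema, SQLite for brevity): maintain a
# rollup table alongside raw events and refresh it asynchronously. The same
# pattern applies to MySQL/Postgres with a scheduled job.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, viewed_at TEXT, url TEXT);
    CREATE TABLE page_views_rollup (user_id INTEGER PRIMARY KEY, trailing_30d_views INTEGER);
""")

def refresh_rollup(conn: sqlite3.Connection) -> None:
    """Asynchronously scheduled job: rebuild trailing-30-day totals per user."""
    conn.execute("DELETE FROM page_views_rollup")
    conn.execute("""
        INSERT INTO page_views_rollup (user_id, trailing_30d_views)
        SELECT user_id, COUNT(*)
        FROM page_views
        WHERE viewed_at >= datetime('now', '-30 days')
        GROUP BY user_id
    """)
    conn.commit()

def trailing_month_views(conn: sqlite3.Connection, user_id: int) -> int:
    """Application query path: prefer the (possibly stale) rollup over raw rows."""
    row = conn.execute(
        "SELECT trailing_30d_views FROM page_views_rollup WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else 0
```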

On the application side, the other approach is to use something like Redis or Memcached (or a hosted version of these, such as AWS ElastiCache). While such technologies allow for the creation of key-value caches at the application layer, they tightly couple caching logic with frontend business logic. This introduces massive complexity to product development, as most frontend changes now require the team to reason through questions like – "Do I need to update the caching logic? Will this lead to stale reads? Do I need to adopt transactions to concurrently write to the cache and the DB? When must I go straight to the DB and invalidate the cache?"
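
For comparison, here is a minimal sketch of the cache-aside pattern with redis-py, using a hypothetical "sales by SKU" query and key scheme. Notice how TTLs, staleness, and invalidation all become application-level concerns that have to be revisited on every product change:

```python
# Sketch of the cache-aside pattern with redis-py (hypothetical query and key
# scheme). Staleness, TTLs, and invalidation all leak into application code.
import json
import redis  # pip install redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # accept up to 5 minutes of staleness

def store_sales_by_sku(store_id: str, run_query) -> dict:
    key = f"sales_by_sku:{store_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # possible stale read
    result = run_query(store_id)             # expensive aggregate query against the DB
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(result))
    return result

def on_order_written(store_id: str) -> None:
    # Every write path now has to remember to invalidate the right keys.
    cache.delete(f"sales_by_sku:{store_id}")
```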

Whichever of these approaches you take, the other challenge is that the onus is ultimately on the developer to figure out what specific things to cache or pre-aggregate. This is far from ideal, as the complexity and scale of most modern analytical applications are immense. Deciding which dimensions, values, and traits to cache, and what the optimal set of partial pre-aggregations is, is a task far better suited to a machine than a human, especially since the nature of queries in most applications naturally drifts and evolves over time.

Luckily, we are finally starting to see the emergence of a few tools that significantly simplify these issues, such as Cube.dev, Polyscale, Readyset, and TakeoffDB. While each of these tools differs slightly in the type of stack it targets, its exact functionality, and its technical approach, they all represent interesting ways of solving caching and pre-aggregation for modern analytical applications. If you're interested, Cube's blog post here is an excellent overview of some of the different approaches to caching and pre-aggregation and their associated tradeoffs, and the Noria paper is also good further reading in this space.

Middleware Layer

The next question to figure out is – how do you make the data itself readily accessible to the product team that needs to incorporate it into the application? This often ends up being a particularly tricky question, because it involves solving for the interface between the data team and the product team, who are often not used to speaking each other’s language.

The first issue here is simply creating an API that is easy for the product team to use. Ideally, such an API would include SDKs for popular frontend frameworks like React and Vue and remove the need to build and maintain complex backend web services just to manage the DB connection (this follows the broader trend towards the so-called Jamstack). Cube.dev, which I mentioned earlier, provides functionality in this vein. StepZen is another interesting company, making it much easier to build GraphQL endpoints on top of all of your data sources.
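
As a rough illustration of what such an API layer abstracts away, below is the kind of thin, hand-rolled endpoint that product teams otherwise end up building and maintaining just to expose a single aggregate query to the frontend. FastAPI and SQLite stand in for the real stack here; the route, schema, and SQL are hypothetical.

```python
# Without a middleware layer, product teams typically hand-roll endpoints like
# this one for every aggregate the frontend needs. All names are placeholders.
from fastapi import FastAPI
import sqlite3

app = FastAPI()

@app.get("/api/stores/{store_id}/sales_by_sku")
def sales_by_sku(store_id: str, days: int = 30):
    # Stand-in for the real warehouse/low-latency DB connection
    conn = sqlite3.connect("analytics.db")
    rows = conn.execute(
        """
        SELECT sku, SUM(amount) AS total_sales
        FROM orders
        WHERE store_id = ? AND ordered_at >= datetime('now', ?)
        GROUP BY sku
        ORDER BY total_sales DESC
        """,
        (store_id, f"-{days} days"),
    ).fetchall()
    conn.close()
    return [{"sku": sku, "total_sales": total} for sku, total in rows]
```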

The second issue is query management. Many data practitioners would be aghast to see the way SQL queries are managed by frontend teams – there is typically immense duplication, lots of incorrect logic, highly unoptimized queries, poor formatting, and a general lack of sophistication that leads to high implementation and maintenance costs. There is a lot of excitement right now around "metrics layer" tools such as Minerva and Transform, which help standardize metric logic within companies, and similar functionality will likely be needed at the interface between data teams and the product teams building analytical applications (it is unclear to me whether tools like Transform will end up serving both traditional data stack use cases and analytical application use cases, or whether these two segments will diverge). Cube is the only company I see today that provides robust data modeling and query management functionality specifically oriented around analytical applications. It is possible that companies going after "Headless BI" will also eventually address this.
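
To illustrate the general idea of a metrics layer (and only the idea – this is not the actual API of Transform, Minerva, or Cube), a toy version might define each metric's logic once and generate queries from those definitions, so that frontend code references metrics by name rather than copying SQL around:

```python
# Toy illustration of a "metrics layer": metric logic is defined once and
# referenced by name rather than re-implemented in each frontend query.
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql_expression: str      # aggregate expression, defined exactly once
    source_table: str

METRICS = {
    "gross_sales": Metric("gross_sales", "SUM(amount)", "orders"),
    "order_count": Metric("order_count", "COUNT(*)", "orders"),
}

def build_query(metric_name: str, dimension: str, store_id: str) -> str:
    m = METRICS[metric_name]
    # Parameterization is simplified for illustration; real tools also handle
    # quoting, joins, caching, and access control.
    return (
        f"SELECT {dimension}, {m.sql_expression} AS {m.name} "
        f"FROM {m.source_table} WHERE store_id = '{store_id}' "
        f"GROUP BY {dimension}"
    )

print(build_query("gross_sales", "sku", "store_123"))
```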

Visualization Layer

Last but not least, to build an analytical application you need to figure out how to visualize and create interaction patterns around data. As discussed, it is essential that such visualizations and interactions feel native to the product and are deeply embedded into it. In other words, the data cannot feel like a disjoint set of dashboards independent from the rest of the product – interacting with the data should influence workflows in the product and vice versa. (This is the type of interaction that is fundamentally impossible to achieve with embedded iframes.)

It feels like there is substantial room for improvement in the tooling for data visualization embedded deeply into products. The state of the art is, for the most part, various open source libraries such as Chart.js, D3.js, and Highcharts. In general, tooling in this space is either very simple but extremely limiting (ugly, weak customization, poor support for chart types) or very powerful but highly complex (a steep learning curve, hard to master, slow time to value). Indeed, D3 is so complex that even the company built by the creator of D3 is exploring various abstractions on top of it to simplify data visualization work in JavaScript.

There are also various traditional "BI-esque" products like Metabase that do allow rapid creation of interactive data apps, but these are really designed to solve the traditional, internal data exploration use case, not to power a modern analytical application. They suffer from the same problems we have discussed around performance, feeling native, and embedding deeply into an application, as well as not supporting SDLC principles like testing.

Topcoat Data is one interesting newer company in this space. Topcoat makes it extremely simple for a data scientist, or anyone familiar with dbt, to rapidly create full-fledged embedded dashboards in a highly self-service way. All the data modeling is done as an extension of dbt, common charting libraries are natively integrated, and custom CSS can be applied. This represents a quantum leap over using something like Looker Embedded. Note, however, that as of today this sticks primarily to the "embedded dashboard" use case, versus allowing for highly customized analytical applications.

"All in One" Tools

While we have now covered all of the key individual facets of building an analytical application and the various tools and techniques that can help enable them, there is a secondary class of products worth considering in this space which vertically integrate a few components of the stack and provide more of a simplified, "all-in-one" experience. Most of these products aim to make it as simple as possible to spin up a data app, ideally in a pseudo-self-service way for data scientists.

One example of this is Plotly Dash, which makes it quite easy to deploy a simple Flask web app based purely on a Python script that combines Python analytical functions with Plotly's visualization library. Streamlit is another beloved product in this category, focused on making it so that any data script can become a shareable web app in seconds, with zero understanding of frontend software engineering. Observable makes it easy to build JavaScript-based data exploration notebooks that double as shareable, interactive web apps.
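
To give a sense of just how little code these tools require, here is a minimal Streamlit sketch using synthetic data; running `streamlit run app.py` turns it into a shareable interactive app (the data and file name are placeholders for a real query against your analytics store):

```python
# Minimal Streamlit sketch (synthetic data) showing how a short script becomes
# an interactive app. Run with: streamlit run app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Daily sales explorer")

days = st.slider("Days of history", min_value=7, max_value=90, value=30)

# Synthetic stand-in for a real query against the analytics store
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "date": pd.date_range(end=pd.Timestamp.today(), periods=days, freq="D"),
    "sales": rng.integers(100, 1000, size=days),
})

st.line_chart(df.set_index("date")["sales"])
st.dataframe(df)
```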

There are also various products emerging that make it easy to create quick web-apps on top of data science notebooks – for example, Hex and Count let you turn data science notebooks into interactive data apps.

While these products are simple to use and offer awesome developer experiences, the challenge is that, for the most part, they are primarily useful for rapid prototyping and internal collaboration. It is very easy to quickly hack something together, but going from there to a fully functioning web app that can become a core component of your product is a huge jump. Performance will become an issue, as will deeper integration with the broader product. As such, while these products certainly have a clear use case and place, they don't really solve the problem of building analytical applications. With that said, there are some emerging companies that argue their architecture can support more "production-grade" use cases – one such example is BaseTen, which focuses primarily on ML-driven applications.

Opportunities and gaps in the market

Although the state of tooling in this market is already orders of magnitude better than it was just 5 years ago, in many respects it still feels like the early innings. Based on numerous conversations with practitioners in this space, the following feel like the most acute lingering pain points to me:

Caching and Pre-Aggregations

The pain that modern application teams feel when dealing with homegrown Redis and Memcached solutions, alongside custom-maintained pre-aggregations in the DB, is immense. These solutions add so much complexity that they frequently slow application development to a crawl. While companies building on top of newer tools like Rockset are often able to sidestep these issues, there is a massive market of companies without low-latency DBs, or that need to build on top of relational DBs like Postgres, that is left without a good solution.

Visualization

There needs to be tooling that better threads the needle between simplicity and power for custom visualization work within analytical applications. From a startup perspective, however, the challenge is that it has traditionally been difficult to monetize frontend frameworks compared to lower-level compute and storage layers. Nonetheless, better solutions here would be greatly appreciated by many of the companies I have spoken with.

Unification of Databases (Convergence of Metrics Infrastructure)

It increasingly looks like it may be technically possible to build databases that serve a broader range of use cases, removing some of the need for specialized data stores for BI vs. analytical applications vs. other workloads. Chris Riccomini, a distinguished engineer at WePay, has an interesting tweet thread about this here. Apache Iceberg is also pushing in this direction, allowing even traditionally "slow" storage systems like S3 to support extremely fast analytics thanks to more sophisticated metadata and index management.

It remains to be seen whether tools like Pinot or Druid will be able to become truly general-purpose, but advances in this direction could substantially reduce the sprawl and complexity you see in many data environments today, and bring infrastructure for analytical applications closer to other data use cases.

The Simplicity of a Prototyping Tool With the Power of a Custom Stack

Companies like Hex, Streamlit, and Plotly have built very elegant products. The question is – can we maintain the simplicity of these tools while preserving the power of a "DIY" stack in terms of performance and composability with broader application development frameworks? Preset's recent blog post touches on various aspects of this – perhaps an open source, community-oriented BI platform may be able to provide some of this as well.

Ultimately, it is unclear to me whether companies like these will try to move in the direction of powering full-fledged user-facing applications, rather than focusing on internal collaboration, prototyping, and dashboards or non-mission-critical user-facing applications. Alternatively, perhaps new full-stack analytical application companies will emerge that focus on applying these ease-of-use and time-to-value principles to mission-critical applications.

Solving the friction between product engineering and data

One of the things I like about Topcoat's approach is that it enables a totally self-service deployment of embedded dashboards by the data scientist, but in a way that is much better than embedded BI tools (not an iframe, fully customizable, built to leverage dbt, has a clean metrics layer, supports testing and the full range of software development principles, etc.), and is likely far more appreciated by a product team. This will not solve every analytical application use case, but there is a clear market for it.

Expanding on this, it feels like there is still opportunity to build tooling that continues to abstract the interface between the data platform and product engineering. The more these teams can operate self-service, without requiring substantial knowledge of each other's toolkit, the better. Cube.dev is the other product most clearly leading the way here today, but there is likely a lot more to build.


Ultimately, as more and more companies are built around data as a core competency, a higher and higher percentage of products will need to include large-scale analytical processing as part of their core user workflows. Data will increasingly become an essential, inseparable facet of the way we interface with the products we use, and as a result, the importance of tooling for building such analytical applications is going to increase dramatically. Such applications need to feel native, be performant at scale (high concurrency, high frequency, low latency), work well with the modern cloud data stack, richly blend application and data logic in a highly customizable way, and most importantly, be simpler to build.

If you have struggled with any of these issues, or are building a product addressing any facet of this space, I would love to talk to you – [email protected].

Thanks to Artyom Keydunov, Chris Riccomini, and Jon Natkins for providing feedback on this article.

