The world’s leading publication for data science, AI, and ML professionals.

Strategies for Optimising Enterprise-Level Data Consumption

Middleware service, Data Warehousing with ETL/ELT and MASA (Mesh Apps and Services Architecture)

Photo by geralt on Pixabay
Photo by geralt on Pixabay

Middleware service, Data Warehousing with ETL/ELT and MASA (Mesh Apps and Services Architecture)

As a consultant, it is quite frequently to meet such kind of requirements from customers:

  • We need to consume data from multiple information systems, so we have to call multiple APIs from them to get all the data, which is a big pain.
  • We do have all the data in one database, but we need to query dozens of tables to join them together to get the expected results.
  • We believe that we have a pretty good Data Architecture. The DB instance contains multiple databases for different purposes, and we design everything in 3-norm. However, it is still painful when we want to do some statistics over the entire system crossing multiple databases.

Use Cases

Photo by KOBU Agency on Unsplash
Photo by KOBU Agency on Unsplash

Usually, 3 types of requirements could result in the above problems.

BI (Business Intelligence) Dashboarding

This is probably the most common use case that hits the limitation.

While the company is still in the start-up phase, it might be relatively easy to acquire any kind of data very quickly. In this phase, even Microsoft Excel is quite enough for most of the requirements. Also, it is not recommended to have complex data architecture as it cost too much for a start-up.

However, suppose that the company grows well, more and more departments might be established, as well as more business areas are extended, the data persistence layer would be much more complicated. For example, there are multiple separated databases for the financial, shipment, IT departments. If the business is B2b, there may also be multiple databases for different customers. Of course, the number of tables in each database can be huge, too.

Now, the executive wants to have BI dashboards to support decision making. It can be seen that some high-level reports might need to involve a large number of tables in several databases.

Required in Business Flow

It is also common that some business scenarios require complex SQL queries as parts of the flow, such as validation.

For example, there are multiple information systems that the company relies on. In some specific business scenarios, we need to validate customers’ requests against several information systems.

In this case, it would be keener to optimise the company-wide data architecture because the performance will have impacts on the normal business flows.

Advanced Analytics

If the company is "data-driven", or its business transactions generate a large volume of data with high velocity, it is quite common to involve advanced analytics such as machine learning techniques to improve the business efficiencies, identify potentialities, etc.

In this case, it could be a nightmare to extract data from multiple systems for training a machine learning model or feeding correct data into deployed models.

Causes and Possible Solutions

Photo by Cesar Carlevarino Aragon on Unsplash
Photo by Cesar Carlevarino Aragon on Unsplash

The major difficulties/causes are:

  • The query itself could be extremely complex, which is consist of various techniques such as joining and concatenating tables, IF-ELSE determinations and lots of conditional filtering.
  • Since there are too many tables involved, the performance cannot be guaranteed at all.
  • For some of the DBMS, it might not be easy to run a query across multiple databases.

To my experience, there are 3 typical solutions for this corporation data architecture problem.

Middleware Services

Economy: ★★★★☆ Maintainability: ★☆☆☆☆ Flexibility: ★☆☆☆☆

Photo by zibik on Unsplash
Photo by zibik on Unsplash

This solution refers to building a middleware service to communicate multiple systems and concatenating/aggregating the results before return to the data-consuming entity.

The technology to achieve this can be various, which can be Python, Java, Node.JS, C#, PHP, etc. Based on different front-end data-consuming entities, these techniques have their pros and cons.

This solution can be the one that costs the least resources, as it is strongly requirement-driven. That is, whenever you need another complex query, you will either need to add more components to the existing middleware service or even develop a new one from scratch (this could happen because of the completely different interfaces used by the DBMS).

Therefore, the maintainability and the flexibility of this solution is the worst, although sometimes it is still the best solution if it is foreseeable that the requirements are not going to be expanded in short term.

Advantages:

  • Requirement-oriented, only develop for what you need
  • Relatively low cost

Disadvantages:

  • There is no "platform" established, so the new similar requirements in the future will need almost the same amount of effort.
  • There is no standard for this solution. How to implement it is solely depending on the developer.
  • Every request for the data will trigger the query to all the systems, so it increases the burden on all the existing systems/databases.
  • Difficult to maintain.

Enterprise Data Warehouse

Economy: ★★★☆☆ Maintainability: ★★★★☆ Flexibility: ★★★☆☆

Photo by emy on Unsplash
Photo by emy on Unsplash

Data Warehousing is probably the most popular solution in the industry at the moment. The basic idea is to establish a system that can extract that from multiple information systems, and then store all the data complying with data warehousing standards (star or snowflake schema) in a data warehouse.

An enterprise data warehouse usually includes the following components:

  • ETL/ELT data pipelines which extract the data from various data sources and then transform the data to load them into the data warehouse. With the development of Big Data concepts, ELT is becoming more and more popular than ETL in the industry, as well as a series of cloud providers such as Azure and a bunch of distributed/parallel computing engines such as Apache Spark.
  • Data Warehouse is yet another standard regarding how the data is stored in the DBMS. Unlike the regular transactional databases which focus on improving the inserting/updating/deleting performance, data warehouses are report-oriented which is optimised for reading. Therefore, it is very common to extract the data from multiple data sources and pre-aggregate them before storing them into the data warehouse, so the data can be consumed by simply query a few tables rather than dozens.

Advantages:

  • This is a "Non-invasive" solution. That is, the consumption requests are satisfied by the data warehouse, so there is no extra burden on the transactional systems.
  • Regardless of the types of data sources, there must be a corresponding technology to extract the data from them in the ETL/ELT process.
  • Data can be pre-transformed. For example, different systems may use different indicators for the genders of users (M/F, Male/Female, 0/1, etc.), but we can unify them in the transformation.
  • Data can be pre-aggregated. For example, the daily/weekly/monthly report can be generated in the ETL/ELT process, so the consumer can read the actionable data very efficiently.
  • With the SCD (Slowly Changing Dimension) design, the data warehouse can capture the history of the data sources evaluation. For example, if there is a product was moved to a new category, it is completely traceable in the data warehouse, but you’ll probably lose this change in the transactional DB.

Disadvantages:

  • There will be a delay for the data to be available in the data warehouse. For example, the data will be up to date until "yesterday" every day, if the ETL/ELT process is running daily.
  • If one of the data sources changed its data structure, the ETL/ELT needs to be changed accordingly, which may result in a lot of efforts, as well as the potential data discrepancy during the change was applying.
  • Rely on the data source features. For example, if the original data source doesn’t have reliable timestamps, the ETL/ELT process cannot achieve incremental loading (only load the new/changed records since the previous load) in an elegant way.

MASA (Mesh App and Service Architecture)

Economy: ★☆☆☆☆ Maintainability: ★★★★★ Flexibility: ★★★★★

Photo by Adrian Schwarz on Unsplash
Photo by Adrian Schwarz on Unsplash

MASA is relatively a new concept that was raised by Gartner. It is summarised from the experiences of a large number of Enterprises that are data-intensive such as Uber [1].

Surprisingly, the idea of MASA is closer to the first solution – Middleware Services. However, MASA architecture is focused on establishing an entire comprehensive API Mediation Layer between the Enterprise back-end systems (not customer-facing) and the front-end systems (internal/external customer-facing). Therefore, the major difference between MASA and the MIddleware services is that the former aims to build unified, systematic and managed services for different grained back-end systems.

MASA architecture consists of 3 major layers [2]:

  • Multigrained Services, which includes all the Enterprise back-end services. The "multigrain" means that the back-end services having different scales so that they have different abilities to support external data consumption. For example, ERP/CRM systems usually contain numerous sub-components that are designed to work closely with each other internally rather than providing data feed to external. Therefore, it is important to identify the granularity of the back-end services to apply different strategies.
  • API Mediation, which is responsible for mapping the requests from outer APIs (interfaces to data consumers) to inner APIs (interfaces to back-end services). During the delegation between the two API layers, it also needs to transform the data on the fly, ensure security and monitor usage and performance.
  • Multiexperience Apps, which are also called "Fit-for-Purpose" Apps. No matter who is the end-user of the data and what is the purpose, the frond-end should be designed to fulfil their requirements, as well as providing the data in the desired presentation.

Advantages:

  • MASA is truly a "Single Source of Truth" because it doesn’t replicate data in another form as data warehouses do.
  • Maximise the flexibility and agility of all the data consumption at the enterprise level. For example, it can achieve "self-service" data consumption in the whole company.
  • Theoretically no delay in data availability.

Disadvantages:

  • Expensive
  • Expensive
  • Expensive

Summary

Photo by Helloquence on Unsplash
Photo by Helloquence on Unsplash

In this article, I have explained my opinion regarding the optimisation of enterprise data platforms. In fact, all the 3 solutions might be utilised during the development of a data-driven company. That is, when a company is still in the start-up phase, there may not be any "optimised" solution at all. As some needs are raised, the company may start to fix it using whatever easier solutions due to the limited resources. Then, as the company grew to be more mature, a data warehouse might be introduced. I believe most companies may only need this solution and stay with the data warehouse for a very long time. Finally, if the company become a Giant in its industry, MASA will become the ultimate solution.

Join Medium with my referral link – Christopher Tao

If you feel my articles are helpful, please consider joining Medium Membership to support me and thousands of other writers! (Click the link above)

Reference:

[1] Searle et al. Top 10 Strategic Technology Trends for 2017: Mesh App and Service Architecture. Gartner Research. https://www.gartner.com/en/documents/3645328/top-10-strategic-technology-trends-for-2017-mesh-app-and

[2] B. Dayley. MASA: How to Create an Agile Application Architecture With Apps, APIs and Services. Gartner Research. https://www.gartner.com/en/documents/3980382/masa-how-to-create-an-agile-application-architecture-wit


Related Articles