
I have previously written about the need to redefine the current data engineering discipline. I looked at it primarily from an organizational perspective and described what a data engineer should and should not take responsibility for.
The main argument was that business logic should be the concern of application engineers (developers), while everything about data should be the data engineers’ concern. I advocated a redefinition of Data Engineering as "all about the movement, manipulation, and management of data".

Now, as a matter of fact, the logic that application engineers create is ultimately also represented as data. Depending on the angle we look at this from, that means we either have a technical gap or too much overlap at the intersection of data and logic.
So let’s roll up our sleeves and jointly take on the responsibility for maintaining the dependency between logic and data.
What exactly are data, information, and the logic in between?
Let’s go through some basic definitions to better understand that dependency and how we can preserve it.
- Data is the digitalized representation of information.
- Information is data that has been processed and contextualized to provide meaning.
- Logic is inherently conceptual, representing reasoning processes of various kinds, such as decision-making, answering, and problem-solving.
- Applications are machine-executable, digital representations of human-defined logic using programming languages.
- Programming languages are formal representation systems designed to express human logic in a way that computers can understand and execute as applications.
- Machine Learning (ML) is the process of deriving information and logic from data through logic (sophisticated algorithms). The resulting logic can be saved in models.
- Models are generated representations of logic derived from ML. Models can be used in applications to make intelligent predictions or decisions based on previously unseen data input. In this sense, models are software modules for logic that can’t be easily expressed by humans using programming languages.
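To ground these last two definitions, here is a minimal sketch (assuming scikit-learn and joblib are available; the data is made up) of how ML derives logic from data and how that derived logic is then itself saved as data:

```python
# Deriving logic from data (ML) and saving that logic as data (a model file).
from sklearn.linear_model import LogisticRegression
import joblib

# Source data: feature vectors and labels (illustrative values).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)         # derive logic from data
joblib.dump(model, "threshold_model.joblib")   # persist the derived logic as data

# An application can later load the model and apply the learned logic
# to previously unseen input.
restored = joblib.load("threshold_model.joblib")
print(restored.predict([[2.5]]))
```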
Finally, we can conclude that logic applied to source data leads to information or other (machine-generated) logic. The logic itself can also be encoded or represented as data – quite similar to how information is digitalized.
The representation can take the form of programming languages, compiled applications or executable images (like Docker), models generated by ML (like ONNX), and other intermediate representations such as Java bytecode for the JVM, LLVM Intermediate Representation (IR), or .NET Common Intermediate Language (CIL).
If we really work hard to maintain the relation between source data and the applied logic, we can re-create derived information at any time by re-executing that logic.
Now what does this buy us?
Business context is key to deriving insight from data
Data without any business context (or metadata) is by and large worthless. The less you know about the schema and the logic that produced the data, the more difficult it is to derive information from it.
Regrettably, we often regard metadata as secondary. Although the required information is usually available in the source applications, it’s rarely stored together with the related data. And this despite the fact that we know how extremely challenging and expensive it is, even with the help of AI, to reconstruct the business context from data alone.
Why is context lost?
So why do we throw away context, when we later have to reconstruct it at much higher costs?
Remember, I’m not only talking about the data schema, which is generally considered important. It’s about the complete business context in which the information was created. This includes everything needed to re-create the information from the available sources (source data or the source application itself, the schema, and the logic in digitalized form) and information that helps to understand the meaning and background (descriptions, relations, time of creation, data owner, etc.).
The strategy of keeping the ability to reconstruct derived data from logic is similar to the core principles of functional programming (FP) or data-oriented programming (DOP). These principles advise us to separate logic from data and let us transparently decide whether we keep only the logic and the source data or additionally cache the result of that logic for optimization purposes. Both ways are conceptually the same, as long as the logic (function) is idempotent and the source data is immutable.
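Here is a minimal Python sketch of that idea, with illustrative names and data: because the logic is a pure, idempotent function and the source data is immutable, recomputing the result and caching it are interchangeable.

```python
from functools import lru_cache

# Immutable source data: a tuple of (order_id, amount) records.
SOURCE_DATA = (
    ("o-1", 120.0),
    ("o-2", 80.0),
    ("o-3", 200.5),
)

def total_revenue(orders: tuple) -> float:
    """Pure, idempotent logic: the same input always yields the same output."""
    return sum(amount for _, amount in orders)

# Option 1: re-create the derived information at any time from source data + logic.
assert total_revenue(SOURCE_DATA) == total_revenue(SOURCE_DATA)

# Option 2: additionally cache (persist) the result, purely as an optimization.
@lru_cache(maxsize=None)
def cached_total_revenue(orders: tuple) -> float:
    return total_revenue(orders)

# Both ways are conceptually the same, because the logic is idempotent
# and the source data is immutable.
assert cached_total_revenue(SOURCE_DATA) == total_revenue(SOURCE_DATA)
```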

Now, I don’t want to add arguments to the discussion for or against the use of functional programming languages. While functional programming is increasingly used today, it was noted in 2023 that functional programming languages still collectively hold less than 5% of mind share.
Perhaps this is the reason why, at the enterprise level, we are still by and large only caching the resulting data and thus losing the business context of the source applications.
This really is a lamentable practice that data and application engineering urgently need to fix.
If we base our data and application architecture on the following principles, we stand a good chance of retaining business relevance as data flows through the enterprise.
Save and version all logic applied to source data

Idempotency
Referencing functional programming principles, applications in our enterprise architecture can and should act like idempotent functions.
These functions, when called multiple times with the same input, produce exactly the same output as if they were called only once. In other words, executing such an application multiple times doesn’t change the result beyond the initial application.
However, within the application (at the micro level), the internal processing of data can vary extensively, as long as these variations do not affect the application’s external output or observable side effects (at the macro level).
Such internal processing might include local data manipulations, temporary calculations, intermediate state changes, or caching.
The internal workings of the application can even process data differently each time, as long as the final output remains consistent for the same input.
What we really need to avoid are global state changes that could make repeated calls with the same input produce different output.
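A small Python sketch (with hypothetical record fields) of the difference: the first function may do whatever it wants internally but stays idempotent at the macro level, while the second leaks global state into its output and therefore does not.

```python
# Idempotent at the macro level: the same input always yields the same output,
# no matter how often the function is executed.
def enrich_customer(record: dict) -> dict:
    # Internal processing may use temporary state, local copies, or intermediate
    # calculations -- none of this is observable from the outside.
    scratch = dict(record)                  # local, temporary working copy
    scratch["name"] = scratch["name"].strip().title()
    return {"id": scratch["id"], "name": scratch["name"]}

# Not idempotent: a global counter leaks into the output, so repeated calls
# with the same input produce different results.
_calls = 0

def enrich_customer_badly(record: dict) -> dict:
    global _calls
    _calls += 1
    return {"id": record["id"], "name": record["name"].strip().title(), "run": _calls}

sample = {"id": 42, "name": "  ada lovelace "}
assert enrich_customer(sample) == enrich_customer(sample)
assert enrich_customer_badly(sample) != enrich_customer_badly(sample)
```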
Treat logic as data
The representations of our applications are currently managed by the application engineers. They store the source code and – since the emergence of DevOps – everything else needed to derive executable code in code repositories (such as Git). This is certainly good practice, but the relationship between the application logic that was actually applied and the specific version of data it was applied to is not managed by application engineering.
We don’t currently have a good system to manage and store this dynamic relationship between application logic and data with the same rigor we apply to data on its own.
Digital representation starts with data, not logic. The logic is encoded in the source code of a programming language, which is compiled into machine-executable applications (files containing machine-executable byte code). For the operating system, it’s all just data until an executable file is started as a program.
An operating system can easily start any application version to process data of a specific version. However, it also has no built-in functionality to track which application version has processed which data version.
We urgently need such a system at the enterprise level. It’s needed as urgently as databases were once needed to manage data.
Since the representation of the application logic is also data, I believe both engineering disciplines are called upon to take responsibility.
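As a thought experiment, here is a hypothetical Python sketch of such a system reduced to its core: a ledger that records which application version processed which data version. All names and fields are illustrative, not an existing product or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageRecord:
    """One entry in a hypothetical enterprise-wide lineage ledger."""
    application: str      # e.g. an image reference such as "billing-service:1.4.2"
    input_dataset: str    # identifier of the source data
    input_version: str    # version of the source data that was read
    output_dataset: str   # identifier of the produced data
    output_version: str   # version of the produced data
    executed_at: datetime

LEDGER: list[LineageRecord] = []

def record_execution(app: str, in_ds: str, in_ver: str,
                     out_ds: str, out_ver: str) -> None:
    """Append one execution to the ledger (append-only, never overwritten)."""
    LEDGER.append(LineageRecord(app, in_ds, in_ver, out_ds, out_ver,
                                datetime.now(timezone.utc)))

# The ledger can answer: which application version produced this data version?
record_execution("billing-service:1.4.2", "orders", "v17", "invoices", "v9")
print([r.application for r in LEDGER
       if r.output_dataset == "invoices" and r.output_version == "v9"])
```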
Actively maintain relationships between logic and data
There are two main approaches to how logic and its associated data are managed in systems today: application-centric, as practiced in application engineering, or data-centric, as practiced in data engineering.
The type of logic management mainly practiced in the enterprise today is application-centric.
Applications are installed on operating systems primarily using application packaging systems. These systems allow application versions to be pulled from central repositories, handling all necessary dependencies.
By default, the well-known APT (Advanced Package Tool) does not support installing multiple versions of one application at the same time. It’s designed to manage and install a single version only.
Since container technology emerged on Linux, application engineering has enhanced this model to better manage applications in isolated environments.
This allows us to install and manage several versions of the same application side by side.
In a Kubernetes cluster, for instance, executable Docker images are managed in an image registry. The cluster dynamically installs and runs any requested version of an application (a microservice, if you like) in an isolated pod. Data is then read from and written to a database or data system using persistent volume claims (PVCs).
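As a small illustration, assuming access to a cluster and the official Kubernetes Python client, the following sketch lists which application (image) versions are currently running. Note that it can only observe application versions; nothing here tells us which data versions those applications have processed.

```python
# Sketch: observe which application (image) versions are currently running.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when running inside a pod
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        # The image tag encodes the application version, e.g. "billing-service:1.4.2".
        print(pod.metadata.namespace, pod.metadata.name, container.image)
```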

While we do see advancements in managing the concurrent execution of several application versions, the dynamic relation of data and applied logic is still neglected. There is no standard way of managing this relationship over time.
Apache Spark, as a typical data-centric system, treats logic as functions that are tightly coupled to their source data. The core abstraction of a Resilient Distributed Dataset (RDD) defines such a data object as an abstract class with pre-defined low-level functions (map, reduce/aggregate, filter, join, etc.) that can be applied sequentially to the source data.
The chain of functions applied to the source data is tracked as a directed acyclic graph (DAG). An application in Spark is therefore an instantiated chain of functions applied to source data. Hence, the relationship of data and logic is properly managed by Spark in the RDD.
However, directly passing RDDs between applications is not possible due to the nature of RDDs and Spark’s architecture. An RDD tracks the lineage of logic applied to the source data, but it’s ephemeral and local to the application and can’t be transferred to another Spark application. Whenever you persist the data from an RDD to exchange it with other applications, the context of applied logic is again stripped away.
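A short PySpark sketch (with a hypothetical input file and column layout) makes this visible: the RDD carries its lineage, but that lineage is lost the moment the result is written out for other applications.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lineage-demo")

# Build a chain of transformations; Spark tracks the applied logic as a DAG.
source = sc.textFile("orders.csv")                            # hypothetical source file
parsed = source.map(lambda line: line.split(","))
large = parsed.filter(lambda fields: float(fields[2]) > 100)  # assumes amount in column 3

# The lineage (chain of applied logic) is visible on the RDD itself.
print(large.toDebugString().decode("utf-8"))

# But as soon as we persist the result to exchange it with another application,
# only the data is written -- the applied logic is stripped away again.
large.saveAsTextFile("large_orders_out")
```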
Unfortunately, both engineering disciplines cook their own soup. On one side, applications are managed in file systems, code repositories, and image registries maintained by application engineers. On the other side, data is managed in databases or data platforms, which allow application logic to be applied but are maintained by data engineers.
Neither discipline has come up with a good common system to manage the combination of data and applied logic. This relationship is largely lost as soon as the logic has been applied and the resulting data needs to be persisted.
I can already hear you screaming that we have a principle to handle this. And yes, we have object-oriented programming (OOP), which has taught us to bundle logic and data into objects. This is true, but unfortunately it’s also true that OOP didn’t completely deliver on this promise.
It never provided a good solution for persisting and exchanging objects between applications running in completely different environments. Object-oriented database management systems (OODBMS) never gained broad acceptance because of this restriction.
I think data and application engineering have to agree on a way to maintain the unit of data and applied logic as an object, but allow both parts to evolve independently.
Just imagine RDDs as a persistable abstraction that tracks the lineage of arbitrarily complex logic and can be exchanged between applications across system boundaries.
I described such an object as the abstraction ‘data as a product using a pure data structure’ in my article "Deliver Your Data as a Product, But Not as an Application".
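To make the idea a bit more concrete, here is a purely hypothetical Python sketch of such an exchangeable object: a reference to immutable data bundled with the versioned logic that produced it. The class and field names are illustrative and not taken from any existing system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicStep:
    """A reference to one versioned piece of applied logic (not the code itself)."""
    application: str   # e.g. "currency-converter"
    version: str       # e.g. "2.3.0"

@dataclass(frozen=True)
class DataProduct:
    """Hypothetical persistable unit: a data reference plus its full logic lineage."""
    dataset: str                          # where the (immutable) data lives
    dataset_version: str
    source_dataset: str                   # identifier of the original source data
    source_version: str
    lineage: tuple[LogicStep, ...] = ()   # chain of logic applied to the source

    def with_step(self, app: str, app_version: str,
                  new_data_version: str) -> "DataProduct":
        """Applying more logic yields a new product; the previous one stays immutable."""
        return DataProduct(self.dataset, new_data_version,
                           self.source_dataset, self.source_version,
                           self.lineage + (LogicStep(app, app_version),))

raw = DataProduct("sales_atoms", "v1", "pos-system", "v1")
enriched = raw.with_step("currency-converter", "2.3.0", "v2")
print(enriched.lineage)   # the applied logic travels with the data reference
```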
Note that this concept is different from completely event-based data processing. Event-based processing systems are architectures in which all participating applications communicate and process data only in the form of events. An event is a record of a significant change or action that has occurred within a system and is comparable to the data atoms described in the next section.
These systems are typically designed to consistently handle real-time data flows by reacting to events as they happen. However, processing at the enterprise level typically requires many more ways to transform and manage data. Legacy applications in particular may use completely different processing styles and can’t directly participate in event-based processing.
But as we’ve seen, applications can locally act in very different styles as long as they stay idempotent at the macro level.
If we adhere to the principles described, we can integrate applications of any kind at the macro (enterprise) level and prevent the loss of business context. We do not have to force applications to only process data atoms (events) in near real-time. Applications can manage their internal data (data on the inside) as needed and completely independent of data to be exchanged (data on the outside) with other applications.

Create source data in atomic form and keep it immutable
Now, if we are able to seamlessly track and manage the lineage of data through all applications in the enterprise, we need to have a special look at source data.
Source data is special because it’s original information that, apart from the initial encoding into data, has not yet been further transformed by application logic.
This is new information that cannot be obtained by applying logic to existing data. Rather, it must first be created, measured, observed, or otherwise recorded by the company and encoded as data.
If we save the original information created by source applications in immutable and atomic form, the data is stored in the most compact and lossless way and remains usable in the most flexible way.
Immutability
Immutability forces the versioning of any source data updates instead of directly overwriting the data. This enables us to fully preserve all data that has ever been used for application transformation logic.
Data immutability refers to the concept that data, once created, cannot be altered later.
Does that mean that we can’t change anything, once created?
No, this would be completely impractical. Instead of modifying existing data structures, new ones are created that can easily be versioned.
But isn’t most of the information in the enterprise derived from original information, instead of being created as new?
Yes, and as discussed, this derivation can best be tracked and managed by the chain of application versions applied to immutable source data.
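A minimal Python sketch of this versioning idea, with made-up keys and records: an "update" never overwrites anything; it only appends a new version, so every version ever used by transformation logic stays available.

```python
from datetime import datetime, timezone

# Append-only store: an "update" creates a new version instead of overwriting.
_versions: dict[str, list[dict]] = {}

def put(key: str, value: dict) -> int:
    """Store a new version of the record and return its (1-based) version number."""
    history = _versions.setdefault(key, [])
    history.append({"value": value, "recorded_at": datetime.now(timezone.utc)})
    return len(history)

def get(key: str, version: int | None = None) -> dict:
    """Read the latest version by default, or any earlier version explicitly."""
    history = _versions[key]
    return history[-1 if version is None else version - 1]["value"]

put("customer/42", {"name": "Ada", "city": "London"})
put("customer/42", {"name": "Ada", "city": "Cambridge"})    # a change is a new version
assert get("customer/42", version=1)["city"] == "London"    # history fully preserved
assert get("customer/42")["city"] == "Cambridge"
```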
Besides this core benefit of immutable data, it offers other benefits as well:
• Predictability
Since data doesn’t change, applications that operate on immutable data are easier to understand and their effects can be better predicted.
• Concurrency
Immutable data structures are inherently thread-safe. This enables concurrent processing without the complexities of managing shared mutable state.
• Debugging
With immutable data, the state of the system at any point in time is fixed. This greatly simplifies the debugging process.
But let me assure you once again: We do not have to convert all our databases and data storage to immutability. It’s perfectly fine for an application to use a conventional relational database system to manage its local state, for example.
It’s the publicly shared original information at the macro (enterprise) level which needs to stay immutable.
Atomicity
Storing data in atomic form is an optimal model for source data because it captures every detail of what has happened or become known to the organization over time.
As described in my article on taking a fresh view on data modeling, any other data model can be derived from atomic data by applying appropriate transformation logic. In concept descriptions of the Data Mesh, data as a product is often classified as source-aligned and consumer-aligned. This is an overly coarse classification of the many possible intermediate data models that can be derived from source data in atomic form.
Because source data can’t be re-created with saved logic, it’s really important to durably save that data. So better set up a proper backup process for it.
If we decide to persist (or cache) specific derived data, we can use it as a specialized physical data model to optimize further logic based on that data. In this setup, any derived data model can be treated as a long-term cache for the applied logic.
Check my article on Modern Enterprise Data Modeling for more details on how to encode complex information as time-ordered data atoms and organize data governance at the enterprise level. The minimal schema applied to encode the information enables extremely flexible use, comparable to completely unstructured data. However, it allows the data to be processed much more efficiently than its unstructured variant.
This maximum flexibility is especially important for source data, where we do not yet know how it will be further transformed and used in the enterprise.
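As an illustration of that minimal schema (the field names here are my own simplification, not the exact model from the referenced article), information can be encoded as time-ordered data atoms from which other models are derived by logic:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataAtom:
    """Minimal, immutable record: one fact about one entity at one point in time."""
    entity: str       # e.g. "customer/42"
    attribute: str    # e.g. "city"
    value: str
    recorded_at: datetime

atoms = [
    DataAtom("customer/42", "name", "Ada", datetime(2023, 1, 5, tzinfo=timezone.utc)),
    DataAtom("customer/42", "city", "London", datetime(2023, 1, 5, tzinfo=timezone.utc)),
    DataAtom("customer/42", "city", "Cambridge", datetime(2024, 3, 1, tzinfo=timezone.utc)),
]

# Any other model can be derived by applying logic, e.g. "current state per entity":
current: dict[tuple[str, str], str] = {}
for atom in sorted(atoms, key=lambda a: a.recorded_at):
    current[(atom.entity, atom.attribute)] = atom.value

print(current)   # {('customer/42', 'name'): 'Ada', ('customer/42', 'city'): 'Cambridge'}
```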
By adhering to the principles described, we can integrate applications of any kind at the macro or enterprise level and prevent the loss of business context.
If both engineering disciplines agree upon a common system that acts at the intersection of data and logic, we could better maintain business meaning throughout the enterprise.
- Application engineering provides source data in atomic form when consumers and their individual requirements are not yet known.
- Data and application engineering agree on a common system to manage the relationship between data and logic.
- Data engineering doesn’t implement any business logic, but leaves this to application engineers.
- Data engineering abstracts away the low-level differences between data streaming and batch processing as well as eventual and immediate consistency for data.
This modern way of managing data in the enterprise is the backbone of what I call universal data supply.