Managing Data as a Data Engineer – Part 3: Key Principles & Lessons Learnt

In the last two articles, we looked at how users view data and how data changes over time. In Part 3 of the Managing Data as a Data Engineer series, I am going to share some of the key principles and lessons I have learnt while managing data as a data engineer.

Don’t Be Blind

To manage the ever-changing scope of data, one of the first things to do is to stop working in the dark. As data changes over time for various reasons, those changes may break the system in different ways. It is important to monitor the data flows in the system and make sure they are working as intended.

To avoid being blind, we first define a set of metrics to measure from the data system. Examples include warehouse health metrics such as CPU load, memory load, the number of connections to the warehouse, query queue times and storage capacity. Other metrics, such as the 'freshness' of the data, the number of NULL values in a column and the number of duplicated rows, are also very useful indicators of whether a data system is healthy.
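As a minimal sketch of what such checks might look like, the snippet below expresses freshness, null-rate and duplicate checks as SQL wrapped in Python. The schema, table and column names (`analytics.orders`, `customer_id`, `order_id`, `updated_at`) and the `run_query` helper are hypothetical placeholders for your own warehouse and client library.

```python
from datetime import datetime, timezone

# Hypothetical data-health checks for a table called analytics.orders.
# `run_query` stands in for whatever warehouse client you use and is assumed
# to return a list of dict rows.
FRESHNESS_SQL = "SELECT MAX(updated_at) AS last_update FROM analytics.orders"
NULL_RATE_SQL = """
    SELECT COUNT(*) FILTER (WHERE customer_id IS NULL) * 1.0 / COUNT(*) AS null_rate
    FROM analytics.orders
"""
DUPLICATE_SQL = """
    SELECT COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_rows
    FROM analytics.orders
"""

def collect_health_metrics(run_query) -> dict:
    """Collect freshness, null-rate and duplicate metrics for one table."""
    last_update = run_query(FRESHNESS_SQL)[0]["last_update"]  # assumed tz-aware
    lag_hours = (datetime.now(timezone.utc) - last_update).total_seconds() / 3600
    return {
        "freshness_lag_hours": lag_hours,
        "null_rate_customer_id": run_query(NULL_RATE_SQL)[0]["null_rate"],
        "duplicate_rows": run_query(DUPLICATE_SQL)[0]["duplicate_rows"],
    }
```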

Using the metrics we identified, we can then set baselines to determine what we consider 'healthy'. Monitoring can then be operationalised by building alert systems that automatically notify stakeholders when something unexpected happens, so that issues get immediate attention and remediation can start quickly.

Channeling the alerts back to the right stakeholders is also important, so that communication is as frictionless as possible. For example, the product manager, data scientist and data engineer of the same product are alerted together when one of the product's data pipelines breaks. The stakeholder lists and the alerts should be reviewed from time to time, so that alerts keep reaching the right people and are not ignored.
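Continuing the sketch above, routing could be as simple as comparing metrics against baselines and posting to a Slack incoming webhook shared by the product's stakeholders. The baselines, pipeline name and webhook URL below are made up for illustration, not a prescription.

```python
import json
import urllib.request

# Hypothetical baselines and ownership mapping: each pipeline has a threshold
# per metric and a Slack incoming webhook shared by its PM, data scientist
# and data engineer. The URL below is a placeholder.
BASELINES = {"freshness_lag_hours": 6, "null_rate_customer_id": 0.01, "duplicate_rows": 0}
PIPELINE_OWNERS = {"orders_pipeline": "https://hooks.slack.com/services/T000/B000/XXXX"}

def alert_on_breach(pipeline: str, metrics: dict) -> None:
    """Notify the pipeline's stakeholders when any metric breaches its baseline."""
    breaches = {m: v for m, v in metrics.items() if m in BASELINES and v > BASELINES[m]}
    if not breaches:
        return
    payload = {"text": f":rotating_light: {pipeline} breached baselines: {breaches}"}
    req = urllib.request.Request(
        PIPELINE_OWNERS[pipeline],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```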

With a robust monitoring system and an effective feedback mechanism to the respective stakeholders, we will see improvements in data availability and reliability. Data engineers will be able to direct their effort more accurately, as less time is spent on troubleshooting. It also helps with capacity planning, as we can better anticipate workload changes.

Apply The 80/20 Rule

As the company grows and its data grows exponentially with it, we quickly find that we do not have enough resources to spend equal amounts of effort on all of the data pipelines. The 80/20 Rule comes in handy for directing our effort and optimising the data system more effectively.

The Pareto Principle states that 20% of your activities will account for 80% of your results. In the world of data, we found that this principle holds true. Some examples we observed:

  • 20% of the tables take up 80% of the storage space in the data warehouse
  • 20% of the tables are being used to power 80% of the critical business dashboards
  • 80% of the bottlenecks in the data warehouse are often caused by 20% of the queries/workloads

This principle is especially useful for directing optimisation work in the warehouse, such as freeing up storage space and improving query performance. Before doing any kind of optimisation work, we should always assess the magnitude of its impact. It is often not the biggest nor the most complex effort, but the most accurately targeted one, that delivers the greatest impact on the overall system. Optimisation done anywhere other than the bottleneck is a waste of time.
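As a rough illustration, a Pareto report over table sizes can be computed in a few lines once you have the sizes; how you obtain them depends on your warehouse (an information-schema or storage-metadata query, typically). The table names and byte counts below are invented.

```python
def pareto_report(table_sizes: dict[str, int], share: float = 0.8) -> list[str]:
    """Return the smallest set of tables that accounts for `share` of total storage.

    `table_sizes` maps table name -> bytes used; obtaining it is warehouse-specific.
    """
    total = sum(table_sizes.values())
    running, heavy_hitters = 0, []
    for name, size in sorted(table_sizes.items(), key=lambda kv: kv[1], reverse=True):
        heavy_hitters.append(name)
        running += size
        if running / total >= share:
            break
    return heavy_hitters

# Invented example: two tables dominate storage, so they are where the effort goes.
sizes = {"events_raw": 900, "orders": 500, "users": 80, "countries": 5, "configs": 2}
print(pareto_report(sizes))  # -> ['events_raw', 'orders']
```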

Keep It Simple

As guardians of the data warehouse, it is part of our responsibilities to organise data in the warehouse, as well as to design data permissions and user roles.

On a day-to-day basis, we process a number of data creation, data migration and data access requests as business requirements change or as users join and leave the company. As the company and its data evolve, even a simple task such as granting data access can become unnecessarily complicated if the data is not well organised: business users change roles, switch between projects, and data use cases change as the company's product matures.

For simple tasks such as data creation and granting data access, standardisation goes a long way. We do not need ten different ways of creating data assets and granting access; we need one way that is easy to understand and can be followed by the whole team managing the data warehouse. A standardised process around data creation reduces the steps data engineers need to perform, and lowers their cognitive load when creating new data assets and access.

After standardising, we can build a process around data creation and data access, and subsequently create tools to automate it. This reduces human error and also enables an audit trail.
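As a sketch of what such a standardised, automatable step could look like, the snippet below grants read access only to a predefined set of roles and records an audit entry. The role names and the Snowflake-style GRANT statements are assumptions chosen for illustration; your warehouse's syntax and your team's standard may differ.

```python
import getpass
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("access_grants")

# Hypothetical standard: access is only ever granted to a small set of
# predefined roles, never to individual users, and every grant is logged.
ALLOWED_ROLES = {"analyst_read", "engineer_write", "dashboard_service"}

def grant_read_access(run_query, schema: str, role: str) -> None:
    """Grant read access on a schema to a standard role and record an audit entry."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"{role!r} is not a standard role; see the access runbook")
    # Snowflake-style statements, shown purely as an illustration.
    run_query(f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role}")
    run_query(f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role}")
    log.info(
        "granted read on %s to %s by %s at %s",
        schema, role, getpass.getuser(), datetime.now(timezone.utc).isoformat(),
    )
```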

Remember That Data Is Heavy

Once users start consuming the data, it becomes hard to change. As more users, applications and dashboards consume the data, more dependencies build up. Any change we apply, even something as simple as renaming a column or changing a data type, risks breaking all of those dependencies. This is why data migration is always so slow and painful.

It is ideal to design the data model right from the start, before consumers start using it. If changes are required after the data has been pushed to production, make them as soon as possible, before more dependencies are built on top. For long-existing data assets, it is best to avoid introducing changes unless the risk of not doing so is high and the change is absolutely necessary.

More often than not, data migration is unavoidable. Products evolve so quickly that new requirements, and sometimes new tools, are inevitable. In such cases, continuous communication with stakeholders is important when changing the data or deprecating old data sets. Plan your migrations and make the transitions as smooth as possible for your stakeholders.
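One common way to smooth such a transition, shown here only as an illustrative sketch rather than the approach prescribed in this series, is to keep the old column name available through a compatibility view while consumers migrate. The table, column and view names below are made up.

```python
# Hypothetical migration: user_id is being renamed to customer_id on
# analytics.orders. Instead of breaking every consumer at once, expose a view
# that still carries the old column name, then deprecate it once stakeholders
# have migrated. Generic SQL, shown only as an illustration.
MIGRATION_STEPS = [
    # 1. Apply the rename on the underlying table.
    "ALTER TABLE analytics.orders RENAME COLUMN user_id TO customer_id",
    # 2. Recreate a compatibility view so existing dashboards keep working.
    """
    CREATE OR REPLACE VIEW analytics.orders_legacy AS
    SELECT customer_id AS user_id, order_id, amount, updated_at
    FROM analytics.orders
    """,
    # 3. Announce a deprecation date to stakeholders, then drop the view later.
]

def run_migration(run_query) -> None:
    """Apply the migration steps in order using the warehouse client."""
    for step in MIGRATION_STEPS:
        run_query(step)
```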

Build A Sustainable Knowledge Base

Data users have been found to spend more than 50% of their time just discovering and understanding data, to make sure they are using the right data. Documenting the data as it is created goes a long way, as it helps future users trust and use the data.

The keyword here is 'sustainable'. While it is important to create documentation for data, keeping that documentation alive and up to date is just as important. More often than not, data users create their own data dictionaries in Google Sheets or personal notes as they work with the data. However, these forms of documentation are hard to share with the rest of the company and difficult to maintain.

There are some popular open-source tools for creating data dictionaries. dbt, for example, lets users define data dictionaries in .yml files and generate a static web page to browse them through a friendly interface. A dbt repository can also be stored in git for version control, allowing multiple data users to collaborate easily when creating data models.

In recent years, a growing number of open-source projects have offered metadata engines and data discovery solutions, such as Amundsen and DataHub. These tools help data users document data dictionaries as well as other metadata. Users can also easily discover new and existing data sets and find all the relevant information in one place, improving their productivity when working with data.

Expect Change

Especially in growing startups, exponential growth in data is expected, and we have to anticipate change. It is always good to keep the question in mind: what if X changes in the future?

For important moving parts in a data system, we should centralise and track changes with a version control tool. Terraform, for example, stores infrastructure configuration as code, which we can keep in a git repository to track any changes made to our infrastructure resources. A dbt repository is another example: it centralises our transformed data assets, tracks changes, and lets us write dbt data tests to ensure that the data behaves as expected.

To manage changes caused by product changes, it is better to organise data in line with how business domains or products are structured in the company. Business users and roles change easily over time, but as the products evolve, the data they generate can always be tied back to a business domain. As the company and its applications evolve, how the data is organised needs to follow closely.


Communication Is Key

Last but not least, even though it is a cliché, communication is key. At the end of the day, data only becomes useful information if the end users and applications consume it correctly. Think of data as a means of communicating with the business users in the company, as well as with the end users of your company's products.

Business requirements are always changing, and those changes are reflected in the data generated. Context is important, and its implications are often not obvious from just looking at the data. This is why documentation is essential for communicating context.

Here are some examples of available tools that can help to organise data, automate processes, and communicate changes to stakeholders effectively:

  • Version control: git
  • Organise data models: dbt, Holistics
  • Data documentation: Amundsen, dbt, DataHub
  • Alerting system: Slack, PagerDuty, Email, SNS
  • Monitoring: NewRelic, DataDog, CloudWatch, any BI tool
  • CI/CD: Codefresh, GitLab, Jenkins

Communication is key whether you are a software engineer generating application data, a data engineer working with the data, or an end user consuming the data. It is important for these different counterparts to communicate so that they can understand, use and give feedback on the data correctly, to create the best results. This enables the company to build first-class products and to generate accurate business insights for critical decision-making. It is up to all of us, not just the data engineers, to make sure that the data pipelines are well-oiled and running smoothly.

Conclusion

Well, this is the final piece of the three-part Managing Data as a Data Engineer series. If you have missed the previous articles, here are the links to all three:

I hope that you will find them useful as you embark on your data journey. Thanks for reading!

