
Managing Data as a Data Engineer: Part 2 – Understanding Data Changes

Understand how data changes in a fast-growing company make working with data challenging

In the last article, we looked at how users view data and the challenges they face while using it. If you missed it, you can find it in Part 1.

As data engineers, we work with all kinds of data assets, ranging from data storage systems to data pipelines and data queries. If data were static, understanding and managing it would be easy. We would come up with a solution, and that solution would work forever. But that is rarely the case. Working with data is challenging because data is always changing and growing. With that, the solutions and the ways we understand and manage data need to change too. Today, let’s look at how data changes in a growing company.

Company Growth

Company growth is one of the direct causes of data changes. When the company is small, it probably only has a couple of products. Each of these products generates some user application data, which is stored in the production database. For analytics purposes, this data is piped into a data warehouse. Because there are only a few use cases for the data, the data in the data warehouse would probably be grouped into broad groups, such as general_analytics. This data model would work for a while, as it serves the purposes of the company's products and users.

As the company grows, new products and new features are added, which results in new data sets being generated in the production database. This data is also piped over to the data warehouse. But soon, we find that too much data is grouped under general_analytics, which makes understanding and searching for data harder. So we start splitting data into more product-specific groups.

As the company grows further, the data also grows more diverse. Some products grow larger, which results in a larger data set being stored. As the analytics team grows, they also find more use cases with the data, sometimes also deriving new data sets by combining the raw data sets. These data insights further drive the product growth within the company, improving the product and also allowing more cross-product synergies.

This cycle continues and further grows the data. As the company grows, the number of products grows, the teams working on the products also grow, and this continues to create richer and more diverse data sets.

User Growth

Another factor that directly drives data growth is the growth of the user base. As the user base of a product grows, the volume of data being generated also grows. This makes sense because each user activity in the application generates some data that is fed back to the product's backend system. The more users there are, the more application data is generated and fed back to the system. Hence, the data volume generated by the product grows roughly linearly with the user base.
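To make the roughly linear relationship concrete, here is a back-of-the-envelope sketch. The function name and all the numbers are hypothetical and only meant to illustrate the arithmetic; real workloads will vary by product and by user behaviour.

```python
def daily_volume_mb(active_users, events_per_user, bytes_per_event):
    """Estimate the data generated per day, in megabytes.

    All three inputs are illustrative assumptions, not measured values.
    """
    total_bytes = active_users * events_per_user * bytes_per_event
    return total_bytes / 1_000_000


# Hypothetical product: 10,000 daily users, 50 events each, 500 bytes per event.
before = daily_volume_mb(10_000, 50, 500)

# Doubling the user base doubles the daily volume if per-user behaviour is unchanged.
after = daily_volume_mb(20_000, 50, 500)

print(before, after)
```

If the events per user or the size of each event also grow with new features, the volume grows faster than the user base alone would suggest.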

Data Growth

Even if the user base and the product stay the same, data by itself grows over time. This is because we keep the data generated by the application long after it was created. Historical data is usually used for analytics to understand trends and user behaviour over time. It is also usually necessary to keep a record of each user's past activities in the application for tracking and auditing purposes. It is rare to clear the data on a daily basis or ‘refresh’ the database every day. Hence, data volume tends to grow over time as the company ages.
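A minimal sketch of this accumulation, assuming a constant daily volume and an optional retention window. The function name, the parameters and the figures are all hypothetical; they only illustrate why storage keeps growing when nothing is deleted.

```python
def stored_volume_gb(daily_gb, days_elapsed, retention_days=None):
    """Estimate total stored data in gigabytes after `days_elapsed` days.

    With no retention policy (retention_days=None), every day's data is kept.
    With a retention policy, only the most recent `retention_days` are kept.
    """
    if retention_days is None:
        kept_days = days_elapsed
    else:
        kept_days = min(days_elapsed, retention_days)
    return daily_gb * kept_days


# Hypothetical product generating 2 GB per day with a flat user base.
print(stored_volume_gb(2, 365))                      # everything kept for a year
print(stored_volume_gb(2, 365, retention_days=90))   # only the last 90 days kept
```

Because analytics and auditing usually need the history, the first case (no deletion) is the common one, and storage grows in step with the age of the company.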

Product Changes

When changes are introduced in the product, they sometimes also cause changes in the data being stored. Sometimes it is an additional field, like a flag that indicates whether a user is active. Sometimes it is a new state, like a PENDING state in addition to the existing SUCCESS and FAILURE states. Sometimes the changes are much more intrinsic, like the way loan interest is calculated: using a new formula, or computing the numbers under a different condition. These kinds of changes are harder to track, and have to be communicated across teams for users to understand and use the data correctly.
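As an illustration of the new-state case, here is a small sketch of how a downstream report could notice a state it was not written for, instead of silently miscounting it. The names KNOWN_STATES and summarise_states are hypothetical and not taken from any particular system.

```python
from collections import Counter

# States the downstream report was originally written against (assumption).
KNOWN_STATES = {"SUCCESS", "FAILURE"}


def summarise_states(states, known_states=KNOWN_STATES):
    """Count records per state and list any states the report does not know about."""
    counts = Counter(states)
    unknown = sorted(set(counts) - set(known_states))
    return counts, unknown


# The product team has started writing PENDING records.
counts, unknown = summarise_states(["SUCCESS", "PENDING", "FAILURE", "PENDING"])

if unknown:
    print(f"Unexpected states found: {unknown}")
```

A check like this does not replace communication between teams, but it can surface a product change early rather than after a report has gone wrong.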

Sometimes unintended product changes break downstream data applications. Inconsistent data types (e.g. a field changing from a string to an integer) and changed timestamp formats are some of the common data errors. These kinds of data changes are likely to introduce bugs in downstream applications. Changes in data types are best discussed and standardised across teams so that data flows smoothly across applications.
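One way to catch such errors early is to validate incoming records against a schema that the teams have agreed on. The sketch below uses only the Python standard library; the schema, the field names and the function find_schema_violations are hypothetical examples, and in practice a dedicated validation tool may be a better fit.

```python
from datetime import datetime

# Hypothetical schema agreed between the producing and consuming teams.
EXPECTED_SCHEMA = {
    "user_id": int,
    "status": str,
    "created_at": str,  # expected to be an ISO 8601 timestamp string
}


def find_schema_violations(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Check the timestamp format separately, since the type alone is not enough.
    created_at = record.get("created_at")
    if isinstance(created_at, str):
        try:
            datetime.fromisoformat(created_at)
        except ValueError:
            problems.append("created_at: not an ISO 8601 timestamp")
    return problems


# user_id arrives as a string and the timestamp uses a different format.
bad = {"user_id": "1", "status": "SUCCESS", "created_at": "01/03/2021 09:30"}
print(find_schema_violations(bad))
```

Running a check like this at the boundary between teams turns a silent downstream bug into an explicit, early failure.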

Team Growth

As data grows, the data engineering team and its functions grow with it. When the company’s data set is small, usually only one or two data engineers are required. The work of these data engineers revolves mostly around building ETL scripts and managing the data warehouse.

Roles of a small data engineering team

As data becomes larger, the data infrastructure grows, and the data engineers pick up different roles to serve the needs of the data users. For example, we have taken up the roles of data custodians and analytics engineers to support the data users more effectively. We have introduced engineering practices such as using git and storing queries as code to enable version control. We also help build data documentation so that users can find and understand data easily. As we work with many data storage systems, we also act as data consultants to different teams, helping them optimise queries and heavy data workloads.

We used to use only one server for running ETL scripts and one server for scheduling. Now, as we maintain hundreds of data pipelines and manage terabytes of data, we have switched to more scalable data systems. Our data infrastructure now consists of components such as Airflow with a Celery worker cluster, a Spark cluster, a bigger Redshift data warehouse, and various other data tools and storage systems such as S3 and dbt.

Roles of a modern data engineering team

As we now have a more diverse set of data tools to work with, each of us has begun to specialise in order to work more effectively. Some of the data engineers focus on serving analytics users with transformed data, some manage the data infrastructure and data pipelines, while others research better and more effective data tools. Instead of building single end-to-end data pipelines, we have changed the way we work and now create reusable modules that help build each part of a data pipeline better.
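To give a rough idea of what reusable modules can look like, here is a minimal sketch in plain Python. It is not our actual codebase: drop_nulls, rename_field and run_pipeline are hypothetical names, and a real pipeline would typically run on tools such as Airflow or Spark rather than on in-memory lists.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_nulls(field: str) -> Step:
    """Build a reusable step that removes rows where `field` is missing or None."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(field) is not None]
    return step


def rename_field(old: str, new: str) -> Step:
    """Build a reusable step that renames one field in every row."""
    def step(rows: List[Row]) -> List[Row]:
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return step


def run_pipeline(extract, steps, load):
    """Run extract, then each transformation step in order, then load."""
    rows = extract()
    for step in steps:
        rows = step(rows)
    load(rows)
    return rows


# The same steps can be reused across pipelines with different sources and sinks.
source = [{"id": 1, "amt": 10}, {"id": None, "amt": 5}]
sink: List[Row] = []
result = run_pipeline(
    extract=lambda: list(source),
    steps=[drop_nulls("id"), rename_field("amt", "amount")],
    load=sink.extend,
)
print(result)
```

The design choice here is that each step has the same shape (rows in, rows out), so steps can be tested on their own and recombined instead of being rewritten inside every end-to-end pipeline.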

Conclusion

In this article, we looked at some of the factors that cause changes in data, such as company growth, user base growth, data growth and product changes. As data grows, the data engineering team also evolves to serve the changing needs of data. Now that we understand how data changes, let’s look next at how to manage data in Part 3. Stay tuned!

Read Part 1: Understanding Users

