Notes from Industry
When I was told to lead the DataOps initiatives at work, I didn’t know where to begin. So I started with the easiest thing: googling it.
The definition of DataOps is a boring one. Here goes:
DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. — Wikipedia
Okay, so anything to do with processes, policies, and frameworks that shorten the analytics cycle, improve data quality, ensure data correctness, and emphasise collaboration and automation, is DataOps.
At first, it sounds overwhelming. The definition of DataOps is so vague and wide that it seems to cover everything under the data sun. I used to spend hours discussing and debating with my manager about what is and what is not DataOps. I needed to define DataOps clearly; how else was I supposed to lay out plans and roadmaps for the DataOps team?
Thankfully, after some time spent researching and bouncing ideas off my manager, I came to a better understanding of DataOps. What I’ll attempt to share here is my personal understanding of the work related to DataOps. While some specifics are unique to my workplace, I believe there are also general lessons that can be applied anywhere.
Part 1: Why DataOps?
In a nutshell, DataOps is a way to manage increasingly huge and complex data. If you only have a handful of data pipelines, you don’t need DataOps. An engineer can manage the pipelines easily, with or without a framework.
When you have more data, say hundreds of pipelines fed by hundreds of different sources, managing them is a lot trickier. If you want to scale quickly and efficiently, you need a coherent way to go about collecting, piping, and understanding data. You need some kind of framework, and that is DataOps.
Part 2: How to do DataOps?
I’m gonna try to explain this by giving you some of the common themes of DataOps-related work.
1. Find common patterns, generalise for each use case
Yes, no surprise here. As with coding, finding a good abstraction makes data easier to manage and less error-prone.
In a company undergoing high growth, the data volume and the number of data pipelines grow exponentially. The Data Engineering team does not grow anywhere near as fast as the data does, so how do we handle the growing demands?
While demand for new pipelines increases quickly, thankfully the number of distinct types of data problems is only a handful. By standardising and recognising the common patterns of data problems, we are able to build generic frameworks for working with data. When we create a generic framework that solves one type of data problem, we solve all data problems of the same type. This applies to any problem with growing demand, such as the creation of new data pipelines, new data models, managing data access, and so on. For each problem, look out for the common patterns, and design a solution for them.
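To make this concrete, here is a minimal, hypothetical sketch in Python of what a generic framework could look like: each source of the same problem type is described by a small config, and one generic function handles all of them. The field names and the `build_ingestion_pipeline` function are invented for illustration, not taken from any specific tool.

```python
# A hypothetical illustration of "solve one type of problem, solve them all":
# every batch-ingestion pipeline is described by a config, and a single
# generic function handles any source that fits the pattern.

from dataclasses import dataclass

@dataclass
class IngestionConfig:
    source_name: str           # e.g. "orders", "payments"
    source_uri: str            # where the raw data lives
    target_table: str          # destination table in the warehouse
    schedule: str              # cron expression
    load_mode: str = "append"  # "append" or "overwrite"

def build_ingestion_pipeline(config: IngestionConfig):
    """Return the steps of a generic batch-ingestion pipeline.

    Adding a new source of the same type is now a config change,
    not a new bespoke pipeline.
    """
    return [
        ("extract", f"read from {config.source_uri}"),
        ("validate", f"apply standard checks for {config.source_name}"),
        ("load", f"{config.load_mode} into {config.target_table}"),
    ]

# A new pipeline is just a new config entry.
orders = IngestionConfig(
    source_name="orders",
    source_uri="s3://raw/orders/",
    target_table="warehouse.orders",
    schedule="0 2 * * *",
)
print(build_ingestion_pipeline(orders))
```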
A good tip here is to start by listing out the use cases and examples before setting any rules or designing solutions. Listing out the use cases helps you visualise and recognise patterns in the existing problem space. Categorise each pattern or use case by placing it in a separate bucket. List out the attributes of the items in each bucket. Then design a rule that defines what goes into the same bucket. If the rules have to be complicated to determine which use case belongs to which bucket, chances are the categories have not been clearly defined. Go back to redesigning the buckets instead of bending the rules or creating a complicated classification rule. Rinse and repeat.
A good test of whether the use cases have been clearly defined is to try to classify a new data problem into one of the buckets. If all the new data problems fit easily into the buckets you have defined, congratulations! You have found the common patterns, and you can now proceed to design a solution for each of them. Basically, it is slicing and dicing the data patterns until you reach a good enough trade-off: a solution that is robust enough to handle any variation of the known data problems and covers most of the different use cases.
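As a rough sketch of the bucketing idea (the buckets, attributes, and request fields below are entirely made up), the classification rule can stay deliberately simple; anything that does not fit cleanly is a signal to redesign the buckets rather than complicate the rule:

```python
# Hypothetical example: classify incoming data requests into buckets
# using a deliberately simple rule. Requests that don't fit cleanly are
# flagged for a redesign of the buckets, not for a more complex rule.

BUCKETS = {
    ("batch", "internal"): "standard batch ingestion",
    ("batch", "external"): "third-party file drop",
    ("streaming", "internal"): "event stream ingestion",
}

def classify(request: dict) -> str:
    key = (request["frequency"], request["origin"])
    return BUCKETS.get(key, "UNCLASSIFIED - revisit the bucket design")

print(classify({"frequency": "batch", "origin": "external"}))
print(classify({"frequency": "ad-hoc", "origin": "internal"}))
```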
The hard part of this process is framing the problem right. Designing the solution is easier; if you don’t have a solution, chances are someone else on the internet does. But making sure that the problem is framed right takes more consideration and thinking. It is worth taking the time to do it, because that is how you ensure that the effort spent designing a solution is worthwhile.
2. Document, document, document
As the company grows, it is inevitable that teams across departments grow further apart, and it becomes harder to keep up the same level of team cohesion. While verbal communication can only do so much, written and recorded communication lasts forever. Documentation is the most underrated work in the organisation. Its importance only increases as the company scales, in size and in diversity. When done well, documentation frees up time, retains knowledge, and multiplies users’ productivity. It is the most effective way for colleagues across departments to build a shared understanding. As employees come and go, documentation helps onboard newcomers quickly and ensures that context is not lost when people leave.
With data, this rings even truer. Data is often complicated to understand, as there are many dimensions to it. It is more than just the columns and rows you see in the tables. To understand data, the user needs to understand the 5W1H of the data. Where did it come from? Who created it? When was it generated? How was it generated? What was the data created for? Who uses it? The answers to these questions may come from several parties, depending on which part of the data stream we are enquiring about.
If you don’t want to be stuck in the vicious cycle of asking and digging around for answers to simple questions about data all the time, data documentation goes a long, long way. A data discovery tool such as Amundsen is one of the best open-source tools available that aims to tackle the data understanding and discovery problem at its core.
Documenting the way we work, and how tools are used in the company, is essential when scaling teams. Writing good documentation is a huge time-saver, perhaps the biggest time-saver of all. When new users join, they can go through the documentation and learn about the guidelines and frameworks we have created. It cuts down the time spent hand-holding and repeating ourselves to onboard every newcomer, and it potentially saves users hours spent figuring things out from scratch and re-inventing the wheel. As users go through the documentation, they may also discover better ways of doing things. They can then give feedback to the original author and have a constructive discussion about improving the process. The documentation can be revised immediately to reflect the improved process, and every subsequent reader benefits from it.
As data engineers design frameworks and introduce newer and better tools to the company, it is also our responsibility to share guidelines and resources about them. Recorded workshops and useful reading materials are documented as well, so users can discover and learn from them. This encourages knowledge sharing and self-learning within the organisation: users can read about new things beyond their roles in their own time. With better knowledge, users and engineers create better-quality pipelines and data assets, following the outlined best practices. With improved data quality, the consumers of data also get a better and smoother experience when using the data.
Documenting data issues is also a practice that our team has adopted. It helps us articulate pain points, evaluate the impact, and build a clearer narrative of the problem. If anything, documenting helps us think more clearly. Whenever we tackle a new problem, we start by creating a one-pager. In the one-pager, the author first describes the current state by listing the pain points and the impact on users. The author then explores a few approaches to tackling the pain points and, for each approach, details how much effort is required and weighs its priority against the impact. A discussion is then held among the team to go through the document. This exercise walks the team through the author’s thought process and gets everyone on the same page about the problem. It provides a framework for teams to evaluate a problem and subsequently design solutions in a systematic and scientific manner. It also allows the team to leverage input from a larger audience when designing the best solution. The more complex a problem, the more important it is to document and circulate it with the team before designing a solution for it.
3. Reduce friction with data
Data has a lifecycle. As data gets created, it grows and changes over time. Eventually, the data gets deprecated when its source is deprecated. With data, change is always expected. Therefore, the pipelines that move the data, as well as our understanding of the data, need to be updated as its behaviour changes over time. To manage data changes, we design systems that reduce friction with data.
When a change is introduced, it could break the pipeline (data in the warehouse goes stale), or it may introduce changes in behaviour that the users are not aware of. Either way, the users will notice something funky going on in their reporting dashboards and come knocking on the data engineers’ doors asking questions. The data engineers then go digging for answers by knocking on the producers’ doors. It usually takes several rounds of back and forth between producers and users to fully understand what went wrong. Because of a lack of context, data engineers often do not fully understand the problem the users are seeing, or the change in the data’s behaviour introduced by the producers. This process is time-consuming and frustrating for both users and data engineers, and it interrupts their day-to-day workflow. When communication breaks down, trust in the data decreases, and the value of the data decreases with it. How can business users trust the insights generated from data when they can’t trust the data?
One attempt to alleviate this pain in communication is to design alerting systems that notify the relevant stakeholders whenever a pipeline incident occurs. Pipeline alerts are published to public channels built around a specific data topic, where both producers and users are present. Any user within the company who uses the same data is welcome to join the channel and subscribe to the alerts. This ensures that all stakeholders are notified when a pipeline failure occurs. Bugs and data issues can be discussed immediately and shared transparently with all the stakeholders in the channel. This improves communication between producers and consumers and shortens the feedback cycle, which in turn reduces data downtime. Open communication between producers and consumers also helps foster a culture where changes in data are communicated to data engineers and downstream users upfront, instead of after the fact.
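As an illustration of what such an alert could look like (assuming Airflow as the orchestrator and a Slack incoming webhook for the shared channel, neither of which is prescribed above), a failure callback can post every pipeline incident straight into the data topic channel:

```python
# Sketch of a pipeline-failure alert posted to a shared data channel.
# Assumes an Airflow-style on_failure_callback and a Slack incoming
# webhook; swap in whatever orchestrator and chat tool you actually use.

import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_data_channel(context):
    """Post a pipeline failure to the channel where producers and consumers sit."""
    message = {
        "text": (
            f":rotating_light: Pipeline `{context['dag'].dag_id}` failed "
            f"on task `{context['task_instance'].task_id}` "
            f"for run {context['ds']}. Details: {context.get('exception')}"
        )
    }
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# In the DAG definition:
# default_args = {"on_failure_callback": notify_data_channel}
```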
Other than open data channels, surfacing additional metadata and making it easily available also helps reduce friction when using data. Many times a day, data users ask data engineers repetitive questions about the data. When is the pipeline scheduled to run? When did the pipeline last run successfully? What are the sources of the data? This information, along with other useful metadata, can be published to a centralised platform where users can self-serve and discover it for themselves. It improves data users’ productivity, as answers are only one click away, and it frees data engineers from answering repetitive questions.
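A lightweight version of this, before adopting a full catalogue, could be a scheduled job that snapshots the answers to the most frequently asked questions into one central place. The fields and the input format below are invented for the example:

```python
# Hypothetical sketch: publish frequently asked pipeline metadata to a
# central table so users can self-serve instead of asking the data team.

from datetime import datetime, timezone

def collect_pipeline_metadata(pipelines) -> list:
    """Gather the answers to the questions users ask most often."""
    snapshot = []
    for p in pipelines:
        snapshot.append({
            "pipeline": p["name"],
            "schedule": p["schedule"],          # when is it scheduled to run?
            "last_success": p["last_success"],  # when did it last run successfully?
            "sources": ", ".join(p["sources"]), # what are the sources?
            "owner": p["owner"],                # who do I talk to?
            "snapshot_at": datetime.now(timezone.utc).isoformat(),
        })
    return snapshot

# In practice this would be written to a table or fed into a catalogue
# such as Amundsen; printing stands in for that here.
print(collect_pipeline_metadata([{
    "name": "orders_daily",
    "schedule": "0 2 * * *",
    "last_success": "2021-06-01T02:15:00Z",
    "sources": ["orders_db.orders"],
    "owner": "data-eng",
}]))
```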
4. Add checkers and blockers to govern data
Now that we have designed frameworks that fit common data use cases, documented resources and guidelines for communication, and enabled a quicker data feedback loop, the next challenge is getting data users to actually use the frameworks and tools available!
As we provide end users with the capabilities to perform actions in the data warehouse, we also want to govern how those capabilities are used and ensure that users are working with data safely and securely. According to Murphy’s Law, anything that can go wrong will go wrong. Therefore, we implement checks and blockers to guard against unauthorised actions. This allows us to protect the data assets against human error, to ensure that the frameworks are being used as intended, and to make sure that best practices are followed.
This is where an enforcement process helps. We design workflows with automatic checks embedded in the process, and controls that block an action whenever the user is not following the intended workflow.
For example, we implement checks to ensure that data assets are created according to the guidelines set. We have a centralised repository where we store and track all data transformation logic. Users raise a pull request to create new data assets or to change existing ones. Any change triggers automated checks to ensure that it follows the guidelines and standards set. Failing the checks blocks the user from creating the new data assets.
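As a simplified illustration (the file format, required fields, and naming rule here are invented, and PyYAML is assumed to be available), the automated check can be a small script that CI runs against the changed files in a pull request; a non-zero exit blocks the merge:

```python
# Hypothetical CI check: every changed data-asset definition must declare
# an owner and a description, and follow the naming convention.
# A non-zero exit code fails the check and blocks the pull request.

import re
import sys

import yaml  # assumes asset definitions are YAML files in the repo

REQUIRED_FIELDS = {"owner", "description"}
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_asset(path: str) -> list:
    errors = []
    with open(path) as f:
        asset = yaml.safe_load(f) or {}
    missing = REQUIRED_FIELDS - asset.keys()
    if missing:
        errors.append(f"{path}: missing required fields {sorted(missing)}")
    if not NAME_PATTERN.match(asset.get("name", "")):
        errors.append(f"{path}: asset name does not follow snake_case convention")
    return errors

if __name__ == "__main__":
    # CI passes the list of changed asset files as arguments.
    problems = [e for path in sys.argv[1:] for e in validate_asset(path)]
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```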
We’ve struggled while trying to enforce data documentation practices within the organisation. One thing we’ve learnt the hard way is that assigning data engineers to define data dictionaries for the data generated by the respective products does not work in the long run. Even though we data engineers spend most of our time working with data and pipelines, we are not the original creators of the data. Being the middle person who collects data dictionaries on behalf of each product team and documents them does not scale. As soon as the data engineer is pulled away from the product teams to work on other tasks, the data dictionaries drift and become stale. To make data documentation sustainable, data dictionaries need to be crowd-sourced from their rightful owners, the original creators of the data. To enforce data documentation, it has to be baked in as part of the requirements of the data creation process. This ensures that any data piped into the warehouse is documented.
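One way to bake documentation into the data creation process, sketched here with an invented schema format, is to reject any new table whose columns lack descriptions from the data’s owner:

```python
# Hypothetical check run when a product team registers a new table:
# every column must come with a description supplied by the data's owner,
# otherwise the table is not admitted into the warehouse.

def missing_descriptions(schema: dict) -> list:
    """Return the columns that lack a data-dictionary entry."""
    return [
        col["name"]
        for col in schema.get("columns", [])
        if not col.get("description", "").strip()
    ]

new_table = {
    "table": "payments.refunds",
    "columns": [
        {"name": "refund_id", "description": "Unique identifier of the refund"},
        {"name": "amount", "description": ""},  # would be rejected
    ],
}

undocumented = missing_descriptions(new_table)
if undocumented:
    raise ValueError(f"Undocumented columns, please add descriptions: {undocumented}")
```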
It is important that the data team is aware of the frameworks and systems in place and plays an active role in directing users to raise requests through the right channels. For example, when a user faces a data issue, we direct them to discuss it in the open data channel, where all the other stakeholders are present. This reduces redundant communication and miscommunication. When a user has a data access request, we direct them to raise it via the service desk, provide sufficient information, and get approval from the appropriate owners. For anything that is self-serviceable, the data engineers point users to the documentation links that will help them get up and running quickly.
To guard critical datasets against unexpected changes, we also add data tests to monitor them. Examples of data tests include uniqueness checks, nullity checks, distribution checks, and accepted range/value checks. When a data test fails, an alert is sent to notify stakeholders, who then clarify whether the change in the data is expected. The data tests are updated from time to time to reflect changes in data behaviour.
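Here is a minimal sketch of such tests written with pandas (dedicated tools such as dbt tests or Great Expectations cover the same ground more thoroughly); the table, columns, and accepted values are made up for the example:

```python
# Minimal data tests on a critical dataset: uniqueness, nullity,
# accepted values, and accepted range. A failure here would trigger
# an alert to the relevant stakeholders.

import pandas as pd

def run_data_tests(df: pd.DataFrame) -> list:
    failures = []
    # Uniqueness check: the primary key must not contain duplicates.
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    # Nullity check: amount must always be populated.
    if df["amount"].isnull().any():
        failures.append("amount contains nulls")
    # Accepted-values check: status must stay within the known set.
    if not df["status"].isin({"pending", "paid", "refunded"}).all():
        failures.append("status contains unexpected values")
    # Accepted-range check: amounts should never be negative.
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")
    return failures

df = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, None, 5.0],
    "status": ["paid", "pending", "shipped"],
})
print(run_data_tests(df))
```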
All the checks and blockers implemented should cover every entry point to the warehouse and be embedded into the users’ workflows. This ensures that the data warehouse is fully guarded and that there are no loopholes that allow users to bypass the process.
Conclusion
DataOps is really about identifying common issues within the workflows of data users and fixing them. It is about closing gaps, enhancing communication, and improving the reliability and ease of use of data. Depending on the data gaps in your organisation, the DataOps tasks may vary, but chances are they revolve around the common themes of:
- Finding common patterns, and generalising for each use case
- Document, document, document
- Reducing friction with data
- Adding checkers and blockers to govern data
I hope this article gives you an idea of what DataOps work looks like. 🙂