
5 Common Data Governance Pain Points for Analysts & Data Scientists

Understanding the guardrails that support innovation

Image: Headway (Unsplash)

Are you an analyst or data scientist at a large organisation?

Raise your hand if you’ve ever come across these head-scratchers:

  • Finding data felt like going on a Sherlock expedition.
  • Understanding data lineage was impossibly frustrating.
  • Accessing data became a showdown with the red tape monster.

Here’s a common quip I hear from citizen and professional analysts alike:

"Those data governance guys sure know how to make life interesting…"

It’s time to cut them some slack.

Drawing from my experience as an engineer and data scientist at one of Australia’s banking giants for half a decade, I’ve had the privilege of straddling both sides of this heated fence: being a hungry consumer of data while simultaneously standing as a gatekeeper for others.

Update: I now chat about analytics and data on YouTube.

In this article, I’ll do a three-part dive into…

  1. Fundamentals: How data flows through organisations. It’s messy!
  2. Understanding common pain points encountered by data users.
  3. Enlightenment: Appreciating how guardrails support innovation.

The third point is really important.

Organisations worldwide are scrambling to become data-driven firms. There is a constant tension between fostering innovation yet having appropriate controls in place to keep a company’s customers, staff and reputation safe.

As new data use cases arrive – and that’s all the time – data governance structures try to evolve in tandem. And it’s typically a struggle, because unbridled innovation has no natural speed limit.

An all-you-can-eat data buffet sounds good until your firm gets lobbed a multimillion dollar penalty from regulators for having customer data leaked onto the dark web.

Whoops, should’ve had those controls.

Circle of Life…for Data!

Data flows like a downstream river through an organisation.

From capture to consumption, the evolving data needs to be managed.

The data lifecycle. Image by author

Each stage of this data lifecycle has its own…

  • Stakeholders;
  • Unique risks;
  • Business and technical considerations;
  • Ethical conundrums;
  • Regulatory requirements…

…all of which need to be carefully governed.

Let’s briefly review each stage. I’ll draw some examples from my industry of banking and finance.

1. Capture

The first stage describes the birth of data within the firm.

The data could be created in an operational source system, such as a new customer name or address. It could be a Customer ID or Staff ID saved into the computer by our front line bankers at a branch. Or perhaps a calculated measure, like total profits or risk-weighted assets, that musters up data from a bunch of other dimension tables. It could even be external data, such as a central bank’s new cash rate that would have ripple effects on our savings and mortgage products.

During capture, it’s crucial to define what we’re capturing and to set expectations for data quality.

If a number represents a currency, we need to know whether it’s the US Dollar, Great British Pound, Chinese Yuan, and so on. Is it OK to have empty fields? Should it always have a certain number of digits, like post codes (zip codes)?

This then determines the controls we need to have in place to ensure that future data meets expectations. The purpose of data controls is to mitigate data risk.
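
To make this concrete, here’s a minimal sketch of what capture-time expectations could look like in code. It’s purely illustrative – the column names, the currency list and the four-digit post code rule are my own stand-ins, not any bank’s actual controls:

```python
import pandas as pd

# Hypothetical capture-time expectations for an incoming customer feed.
# Column names and rules are illustrative only.
VALID_CURRENCIES = {"AUD", "USD", "GBP", "CNY"}

records = pd.DataFrame({
    "customer_id": ["C001", "C002", None],
    "currency":    ["AUD", "usd", "AUD"],
    "post_code":   ["2000", "300", "3000"],
})

issues = []

# Completeness: customer_id must never be empty.
if records["customer_id"].isna().any():
    issues.append("customer_id contains empty values")

# Validity: currency must be one of the codes we agreed to capture.
if not records["currency"].isin(VALID_CURRENCIES).all():
    issues.append("unexpected currency codes found")

# Validity: Australian post codes are exactly four digits.
if not records["post_code"].str.fullmatch(r"\d{4}").all():
    issues.append("post_code values with the wrong number of digits")

print(issues or "All capture-time checks passed")
```

The point isn’t the code itself – it’s that expectations are written down and checked automatically at the moment data is born, rather than discovered downstream.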

Finally, privacy and compliance. If the data relates to our customers and people – data that’s potentially sensitive and could be used to identify them – do we even have the right consent to capture it?

Getting this wrong can destroy businesses down the line when a major hack occurs.

2. Process

Now we’re doing something with the data.

It could be moving the data from system to system without changing it, such as dumping a copy into the enterprise data lake for data scientists to consume later. (But leaders need to ensure the data doesn’t get stale!)

Or applying business rules to filter, aggregate, or otherwise transform the data – often integrating it with other data sources – to yield a refined output dataset that’ll sit in an enterprise data warehouse, all set for business consumption.

30,000 feet view of the enterprise data landscape. Source: Z. Dehghani at MartinFowler.com with amendments by author

In this stage, it’s important to consider the following:

  • Data quality: No errors are introduced during processing;
  • Traceability: Track data lineage so that changes over time can be clearly understood;
  • Efficiency: Design transformations strategically to reduce redundant ETL pipelines, minimising technical debt;
  • Accountability: Assign a designated data owner within the firm for the evolving data. This is crucial, as any gap in accountability along the data lineage can compromise data quality and lead to preventable risks.

Guess what? Large companies struggle with all of these.

Errors happen, and figuring out where can be a hassle – particularly when the right processes and tech aren’t in place to effectively track data lineage.
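
As a toy illustration of the idea (not any particular lineage tool), you can think of lineage as a directed graph of datasets and the transformations between them, which is what lets you trace an error back upstream. The dataset names below are invented:

```python
# Toy lineage graph: each dataset maps to the upstream datasets it was derived from.
lineage = {
    "mortgage_risk_report": ["customer_positions", "rba_cash_rate"],
    "customer_positions":   ["core_banking_extract"],
    "rba_cash_rate":        [],
    "core_banking_extract": [],
}

def trace_upstream(dataset: str) -> list[str]:
    """Return every upstream dataset that feeds into `dataset`."""
    upstream = []
    for parent in lineage.get(dataset, []):
        upstream.append(parent)
        upstream.extend(trace_upstream(parent))
    return upstream

# If the risk report looks wrong, these are the places the error could have crept in.
print(trace_upstream("mortgage_risk_report"))
# ['customer_positions', 'core_banking_extract', 'rba_cash_rate']
```

Enterprise lineage tooling does this at table or even column level across thousands of pipelines – which is exactly why it’s so hard to retrofit.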

In addition, most organisations have developed an unhealthy habit of letting a jumble of data pipelines build up, as individual teams work in silos to whip up a fresh ETL pipeline for each new project. (This is where enterprise data products come in.)

Finally, both assigning data owners and helping data users find them can be a perennial struggle – more on that later.

In recent years, firms like Microsoft have introduced all-in-one analytics platforms. These cloud solutions offer a seamless and unified experience for data ingestion, processing, analytics, and management, ultimately making working with enterprise data easier.

3. Retain

Here we’re looking at data storage. (But also backup and recovery!)

These might sound boring, but the implications of getting them wrong are disastrous. Some key considerations:

  • Availability. Saving the data is the easy part. Retrieving it, possibly in bulk or real-time – at scale – is difficult. Additionally, is the data seamlessly discoverable and accessible for all knowledge workers? Data on its own has little value; it’s the information and insights derived from data through human input and analytics that truly unlocks value.
  • Architecture. Do we store the data as tables in a schema-on-write data warehouse that’s great for the SQL junkies? How should the data be modelled? Or do we house the data as unstructured flat files in a schema-on-read data lake that offers more flexibility at the cost of performance for smaller datasets? How should we organise the partitions? Is real-time processing and analytics possible? Do we have adequate backup and recovery in case the platform goes down?
  • Privacy & Security. Is our data encrypted and securely stored against unauthorised access and hacks? Especially our protected and sensitive data that could be used to identify our customers and people. Do we have the right role-based accesses (RBAC) to ensure that only the right people have the right accesses? In regulated sectors, breaches in these areas frequently lead to substantial multimillion-dollar fines, plummeting shares and a significant blow to company reputation. It’s no joke – I know from experience.

4. Publish

Ah, the exciting part.

This is where we turn data into insights and publish them as information or reports for internal and external stakeholders to consume.

Data quality is crucial here because Garbage In = Garbage Out (GIGO).

You are what you eat, and the same goes for your reports and ML models.

Bad data leads to unreliable insights, which represent a major source of data risk in any organisation.

Since it’s nigh impossible to ensure good data quality for the huge mountain of data every company is saddled with, large companies typically employ a directed strategy of focusing their data quality checks on what are called Critical Data Elements (CDEs).

A company’s CDEs form the vital data they need to meet their customer, investor, and regulatory obligations, and they require the highest level of governance and scrutiny.

At the moment, my bank is tracking 1,500 CDEs that flow across 1,700+ systems. After a previous Australian government crackdown across the whole banking industry, we’ve set up an ‘operating system’ comprising multiple platforms that monitor these CDEs’ data quality 24/7 and automatically create incidents when things start slipping. It’s serious business.
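
Conceptually – and this is just a sketch of the idea, not our actual operating system – that kind of monitoring boils down to measuring each CDE against a quality threshold and raising an incident when it slips. The CDE name, threshold and figures below are made up:

```python
# Sketch of threshold-based data quality monitoring for a critical data element (CDE).
QUALITY_THRESHOLD = 0.98  # e.g. at least 98% of records must pass their checks

def monitor_cde(cde_name: str, records_checked: int, records_passed: int) -> None:
    score = records_passed / records_checked
    if score < QUALITY_THRESHOLD:
        # In a real platform this would open an incident and notify the data owner.
        print(f"INCIDENT: {cde_name} quality at {score:.1%}, below {QUALITY_THRESHOLD:.0%}")
    else:
        print(f"OK: {cde_name} quality at {score:.1%}")

monitor_cde("customer_residential_address", records_checked=1_000_000, records_passed=973_000)
# INCIDENT: customer_residential_address quality at 97.3%, below 98%
```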

What do I mean by data quality?

Here are some key data quality dimensions, in decreasing order of severity:

  • Completeness: Incomplete data implies that we might be missing vital information to keep the business running and meet compliance requirements. That’s how you go out of business.
  • Validity: While we have the records, they may not be valid – meaning they don’t align with the defined schema we established in Step 1. This includes real-world limitations (like a negative height) and table-specific rules (such as unique keys).
  • Accuracy: Data can be present and valid, yet flat out wrong. Imagine your bank accidentally depositing $10 million into your account by mistake. (Wouldn’t that be a pleasant surprise?)
  • Consistency: If you’re dealing with data that’s accurate and valid, but represented in multiple ways, you’re looking at consistency issues. This is a source of technical debt that prevents the organisation from wielding a single source of truth. A classic example is an address being written differently across various source systems. This creates a domino effect downstream, as the same customer might appear multiple times under the same (but differently represented) address. Damn.
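
As a rough illustration of how these dimensions translate into checks – with invented column names and rules, not a real framework – consider:

```python
import pandas as pd

# Illustrative customer records with deliberate quality problems.
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C002", None],
    "height_cm":   [172, -5, 181, 169],
    "address":     ["1 George St, Sydney", "5 Main Rd", "5 Main Road", "9 High St"],
})

# Completeness: missing identifiers mean missed obligations.
print("Missing customer_id:", customers["customer_id"].isna().sum())

# Validity: real-world limits (height can't be negative) and unique keys.
print("Invalid heights:", (customers["height_cm"] < 0).sum())
print("Duplicate keys:", customers["customer_id"].dropna().duplicated().sum())

# Consistency: the same address written as "Rd" and "Road" is the kind of thing
# that breaks a single source of truth.
normalised = customers["address"].str.lower().str.replace(r"\broad\b", "rd", regex=True)
print("Distinct raw addresses:", customers["address"].nunique(),
      "vs normalised:", normalised.nunique())
```

Accuracy is the odd one out: you generally need an external source of truth to verify it, which is why it’s the hardest dimension to automate.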

The other crucial consideration is compliance and the ethical use of data.

Does a particular data analytics project pass the pub test? Is it legal? Do we need consent to use customer data for a specific purpose?

A good example is a marketing analytics project whose outcomes might involve sending targeted offers to customers. However, those customers may not have given us the necessary consent to leverage our insights for marketing purposes at all.

Big data is being leveraged to understand customer preferences better than they do. Image by author

These are the sorts of things that could, and often have, landed firms in hot water with customers and regulators, resulting in major financial and reputation loss.

5. Archive

The last stage is the frequently overlooked archival or disposal of data, which marks the potential end of its life.

The general rule is that unnecessary data should be purged. However, certain data must be retained for regulatory purposes, often 7 years in the financial services industry in Australia.

The issue is that firms often struggle to ensure timely data purging, and extended retention simply amplifies data risks.
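
As a tiny sketch of the retention idea – the seven-year figure mirrors the rule of thumb above, but the dataset names and dates are invented:

```python
from datetime import date, timedelta

# Toy retention check: flag datasets that are past their retention period.
RETENTION = timedelta(days=7 * 365)  # roughly seven years

datasets = {
    "closed_account_statements_2012": date(2013, 1, 31),  # date of last business use
    "loan_applications_2021":         date(2022, 6, 30),
}

today = date(2025, 1, 1)
overdue = [name for name, last_used in datasets.items()
           if today - last_used > RETENTION]
print("Overdue for purging:", overdue)
# Overdue for purging: ['closed_account_statements_2012']
```

The hard part in practice isn’t the date arithmetic – it’s knowing which copies of the data exist, where they live, and who is accountable for actually deleting them.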

Picture this: a data breach occurs, and 20 years’ worth of customer data is stolen, even though 13 years of it should have been deleted after its use-by-date. This is a common scenario, and entirely preventable with good data governance.

Governance Gatekeepers versus Analyst Aspirations

I’m going to listen to you vent.

Having discussed the data lifecycle to gain an appreciation for the various governance challenges at every stage, let’s dive into some common grievances raised by analysts and data scientists on their quest for data!


"Why can’t I find the data I’m looking for? Am I blind?!"

Nope! Totally relatable. Most organisations have struggled to build up the infrastructure needed to stay afloat amidst an ever-expanding sea of data and the pressure to compete on analytical insights.

With decades’ worth of technical debt looming in the form of a grand mess of ETL pipelines into data warehouses, and a colossal sea of data dumped into modern data lakes, locating data has evolved into a Herculean task for anyone – and yes, even the governance officers themselves at times.

In line with firms worldwide, my bank is relying on a data products and data democratisation strategy to tackle these productivity-killers.


"Who is the data owner for these tables? Is this supposed to be easy?"

The answer lies in three parts. It’s challenging to…

  1. Assign who should own the data. As it flows throughout the organisation and becomes integrated with other sources, who should own it? Our firm’s Data Management experts have recently introduced a "Producer, Processor, and Consumer" framework. This approach segments the entire data journey – from capture to consumption – into distinct zones, ensuring that there is someone accountable for data in every stage of its existence. The foundation of a data-driven organisation begins with end-to-end data accountability.
  2. Convince people they should own the data. That’s because owning data means owning the risk to that data being improperly used, plus the extra labour involved in helping to govern that data. That’s a lot to ask for without much in return. (Data owners often don’t get paid an extra stipend for wearing this hat.) As a result, some natural data owners are reluctant to take on the role.
  3. Maintain the list of data owners. Colleagues come and go. People switch roles. And there’s just not much of an incentive for them to offboard their data owner role. This results in an already-overburdened governance team trying to keep tabs on who currently owns each of thousands of datasets across an organisation.

It’s a struggle.


"Why do governance stakeholders keep asking me to do so much paperwork to access data? Team resources are scarce and we have to spend so much effort on red tape!"

Because the goals of analysts itching to squeeze insights from data don’t naturally align with the need to remain compliant in the use of that data.

The data innovators are slamming on the accelerator, while the risk-conscious bunch has their foot on the brakes.

Take my bank, for example. Any attempt to play with sensitive customer data triggers the need for a fresh Privacy Impact Assessment to be performed. Yes, it takes time, and yes, it’s annoying, but it’s a legal requirement rooted all the way back to the watershed General Data Protection Regulation (GDPR) laws initiated in Europe and since emulated by privacy regimes worldwide.

And here’s the kicker: this PIA document serves as the frontline defence against key regulators knocking on our door with a jaw-dropping $50 million fine for each breach instance, because it demonstrates that we’ve considered the impact of our data use case before we went ahead with it.

Insurance sounds expensive…until you need it.


"Why are there so many conditions on my data discovery environment? It feels like I’m in a straitjacket. I just want a nice environment and all the data I want."

Because everything you do in your discovery environment entails risk.

The data could be leaked, unethically used, or even result in misguided insights that damage the firm.

That’s why the data governance bunch ask you to be very specific about your data and environment needs upfront, so they can get an understanding of the risk profile of your data discovery sandpit.

These playpens should be set up and organised according to use cases (use case segregation principle) and retired immediately after the project is completed.

It may sound harsh, but data discovery environments are notorious for…

  1. Scope creep, where project teams try to squeeze more use cases – hence more analysts and data – into the same environment. This means more people seeing more data they shouldn’t be able to see, which means data risk.
  2. Longevity creep, where teams keep extending the life of their environment, which increases the chances of their work supporting a business-as-usual (BAU) process without appropriate controls and governance. This means, yep you guessed it, more data risk.

It’s all about understanding and minimising data risk.


"Why is it so hard to get data out of a data discovery environment?"

Because what you can do with that data downstream is, quite frankly, a nightmare for the risk-conscious.

Imagine leaks, misuse, or data influencing decisions that adversely affect millions of customers, or land us into hot water with regulators.

It’s a minefield.

Popular all-in-one analytics platforms, which are heavily governed at large firms. Image by author

That’s why discovery environments are like guarded sandboxes, purely for experimentation. If you’re gunning for your insights to hit prime time – i.e. production-like processes that run BAU – brace for some intense vetting.

Locking down the data is a blanket control that mitigates a lot of data risk.

Final Words

It seems my answers to the above pain points only highlight the disconnect between data governance controls and the iterative work style demanded by analysts and data scientists.

And it’s extra easy to treat the governance crowd as the enemy when you’re under the deadline pump.

But without these guys, the fire on the stove will eventually engulf the kitchen. Your data governance teams ultimately ensure that…

  1. Your company’s data assets are managed and organised, so you can find and access what you’re looking for as the firm’s data assets scale. (Well, that’s the goal at least.)
  2. Guardrails are in place to ensure the appropriate use of data and prevent tragic data hacks that could destroy customer lives, and everyone’s bonus at the end of the year. (Yes, it’s true.)

To reflect even further, this tension between innovation and regulation is found in every endeavour, large or small. Unbridled leaps are often followed by a period of recovery and introspection, setting the stage for the next bound.

Bear with me.

At the global macro level, we see powerhouses like China and the United States let their big tech sectors soar sky high with little oversight, before reining in their entrepreneurs a generation later. At the individual level, we know the importance of proper recovery after strenuous physical exercise, a logical pit stop for an overworked body.

It shouldn’t come as a surprise that this yin and yang dynamic exists within the enterprise data landscape at work.

With an exponential surge in data volume worldwide, increasingly diverse data use cases, and the burgeoning computing power being harnessed to extract insights, it becomes imperative for companies to take a regular breather, perform some housekeeping, and ensure the data stack is properly managed and controlled.

There is generally a symphony of two competing goals at large firms:

First, leveraging data for offensive plays that enable the firm to compete more effectively. This attack angle is typically taken by an aggressive Chief Data Officer (CDO), Chief Digital Officer (the other CDO) or the broader analytics community.

Second, managing data defensively with a focus on compliance. The goal here is to solve business performance and regulatory problems by aligning processes and systems that streamline the flow of data across the firm, from the front-office and source systems into the back-end data platforms and reporting tools.

The cream of the crop get pretty good at nailing this fine balancing act: giving their analysts and data scientists the freedom to explore and innovate, all the while making sure the organisation stays on the up-and-up in its data usage – compliant and ethical in its use of sacred customer data.

What’s your experience with data governance?

Find me on Twitter & YouTube [here](https://youtube.com/@col_shoots) & [here](https://youtube.com/@col_invests), here & here.

My Popular AI, ML & Data Science articles

  • AI & Machine Learning: A Fast-Paced Introduction – here
  • Machine Learning versus Mechanistic Modelling – here
  • Data Science: New Age Skills for the Modern Data Scientist – here
  • Generative AI: How Big Companies are Scrambling for Adoption – here
  • ChatGPT & GPT-4: How OpenAI Won the NLU War – here
  • GenAI Art: DALL-E, Midjourney & Stable Diffusion Explained – here
  • Beyond ChatGPT: Search for a Truly Intelligence Machine – here
  • Modern Enterprise Data Strategy Explained – here
  • From Data Warehouses & Data Lakes to Data Mesh – here
  • From Data Lakes to Data Mesh: A Guide to Latest Architecture – here
  • Azure Synapse Analytics in Action: 7 Use Cases Explained – here
  • Cloud Computing 101: Harness Cloud for Your Business – here
  • Data Warehouses & Data Modelling – a Quick Crash Course – here
  • Data Products: Building a Strong Foundation for Analytics – here
  • Data Democratisation: 5 ‘Data For All’ Strategies – here
  • Data Governance: 5 Common Pain Points for Analysts – here
  • Power of Data Storytelling – Sell Stories, Not Data – here
  • Intro to Data Analysis: The Google Method – here
  • Power BI – From Data Modelling to Stunning Reports – here
  • Regression: Predict House Prices using Python – here
  • Classification: Predict Employee Churn using Python – here
  • Python Jupyter Notebooks versus Dataiku DSS – here
  • Popular Machine Learning Performance Metrics Explained – here
  • Building GenAI on AWS – My First Experience – here
  • Math Modelling & Machine Learning for COVID-19 – here
  • Future of Work: Is Your Career Safe in Age of AI – here
