Building a Data Engineering Center of Excellence

Essential components for building a functioning data engineering practice relevant to the current day at scale.

Richie Bachala
Towards Data Science

Data Engineering Center of Excellence
Image by Author

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice, why data engineering is becoming increasingly critical for businesses today, and how you can build your very own Data Engineering Center of Excellence!

I've had the privilege to build, manage, lead, and foster a sizeable high-performing team of data warehouse & ELT engineers for many years. With the help of my team, I have spent a considerable amount of time every year consciously planning and preparing to manage the month-over-month growth of our data and to address the changing reporting and analytics needs of our 20,000+ global data consumers. We built many data warehouses to store and centralize massive amounts of data generated from many OLTP sources. We've implemented the Kimball methodology by creating star schemas both in our on-premises data warehouses and in the ones in the cloud.

The objective is to enable our user base to perform fast analytics and reporting on the data, so that our analyst community and business users can make accurate data-driven decisions.

It took me about three years to transform teams (plural) of data warehouse and ETL programmers into one cohesive Data Engineering team.

I have compiled some of my learnings from building a global data engineering team in this post, in the hope that data professionals and leaders at all levels of technical proficiency can benefit.

Evolution of the Data Engineer

There has never been a better time to be a data engineer. Over the last decade, we have seen a massive awakening of enterprises now recognizing their data as the company's heartbeat, making data engineering the job function that ensures accurate, current, and quality data flows to the solutions that depend on it.

Historically, the role of the data engineer evolved from that of data warehouse developers and ETL/ELT (extract, transform, and load) developers.

Data warehouse developers are responsible for designing, building, developing, administering, and maintaining data warehouses to meet an enterprise's reporting needs. This is done primarily by extracting data from operational and transactional systems and piping it, using an extract, transform, and load methodology (ETL/ELT), to a storage layer like a data warehouse or a data lake. The data warehouse or the data lake is where data analysts, data scientists, and business users consume data. The developers also perform transformations to conform the ingested data to a data model with aggregated data for easy analysis.

A data engineer’s prime responsibility is to produce and make data securely available for multiple consumers.

Data engineers oversee the ingestion, transformation, modeling, delivery, and movement of data through every part of an organization. Data is extracted from many different data sources & applications. Data engineers load the data into data warehouses and data lakes, where it is transformed not just for the data science & predictive analytics initiatives (as everyone likes to talk about) but primarily for data analysts. Data analysts & data scientists perform operational reporting, exploratory analytics, and service-level agreement (SLA) based business intelligence reporting and dashboarding on the data provided. In this post, we will address all of these job functions.

The role of a data engineer is to acquire, store, and aggregate data from cloud and on-premises systems, both new and existing, supported by data modeling and a feasible data architecture. Without data engineers, analysts and data scientists won't have valuable data to work with; hence, data engineers are the first to be hired at the inception of every new data team. Depending on the data and analytics tools available within an enterprise, there are several options for a data engineering team's role profiles, structure, and responsibilities, which we will discuss in this post.

Data Engineering team

Software is increasingly automating the historically manual and tedious tasks of data engineers. Data processing tools and technologies have evolved massively over several years and will continue to grow. For example, cloud-based data warehouses (Snowflake, for instance) have made data storage and processing affordable and fast. Data pipeline services (like Informatica IICS, Apache Airflow, Matillion, Fivetran) have turned data extraction into work that can be completed quickly and efficiently. The data engineering team should leverage such technologies as force multipliers, taking a consistent and cohesive approach to the integration and management of enterprise data rather than relying on legacy siloed approaches to building custom data pipelines with fragile, non-performant, hard-to-maintain code. Continuing with the latter approach will stifle the pace of innovation within the enterprise and force the future focus onto managing data infrastructure issues rather than on generating value for your business.

The primary role of an enterprise Data Engineering team should be to transform raw data into a shape that's ready for analysis — laying the foundation for real-world analytics and data science application.

The Data Engineering team should serve as the librarian for enterprise-level data with the responsibility to curate the organization's data and act as a resource for those who want to make use of it, such as Reporting & Analytics teams, Data Science teams, and other groups that are doing more self-service or business group driven analytics leveraging the enterprise data platform. This team should serve as the steward of organizational knowledge, managing and refining the catalog so that analysis can be done more effectively. Let's look at the essential responsibilities of a well-functioning Data Engineering team.

Responsibilities of a Data Engineering Team

The Data Engineering team should provide a shared capability that cuts across the enterprise to support both the Reporting/Analytics and Data Science capabilities, providing access to clean, transformed, formatted, scalable, and secure data ready for analysis. The Data Engineering team's core responsibilities should include:

· Build, manage, and optimize the core data platform infrastructure

· Build and maintain custom and off-the-shelf data integrations and ingestion pipelines from a variety of structured and unstructured sources

· Manage overall data pipeline orchestration

· Manage transformation of data either before or after load of raw data through both technical processes and business logic

· Support analytics teams with design and performance optimizations of data warehouses
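To make the orchestration responsibility in the list above concrete, here is a minimal sketch of dependency-ordered scheduling using Python's standard library. The step names are hypothetical, and a real team would more likely rely on one of the orchestration tools mentioned earlier; this only illustrates the idea that each step runs after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps, each mapped to the steps it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_raw": {"extract_orders", "extract_customers"},
    "transform_star_schema": {"load_raw"},
    "refresh_dashboards": {"transform_star_schema"},
}


def run_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(PIPELINE)
print(order)
```

The two extract steps have no dependency on each other, so their relative order is not guaranteed; everything else is forced to follow the steps it depends on.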

Data is an Enterprise Asset.

Data as an Asset should be shared and protected.

Data should be valued as an Enterprise asset, leveraged across all Business Units to enhance the company's value to its respective customer base by accelerating decision making and improving competitive advantage with the help of data. Good data stewardship, together with legal and regulatory requirements, dictates that we protect the data we own from unauthorized access and disclosure.

In other words, managing Security is a crucial responsibility.

Why Create a Centralized Data Engineering Team?

Treating Data Engineering as a standard and core capability that underpins both the Analytics and Data Science capabilities will help an enterprise evolve how it approaches Data and Analytics. The enterprise needs to stop treating data vertically, based on the technology stack involved, as we often see, and move to a more horizontal approach of managing a data fabric or mesh layer that cuts across the organization and can connect to various technologies as needed to drive analytic initiatives. This is a new way of thinking and working, but it can drive efficiency as the various data organizations look to scale. Additionally, there is value in creating a dedicated structure and career path for Data Engineering resources. Data engineering skill sets are in high demand in the market; therefore, hiring outside the company can be costly. Companies must give programmers, database administrators, and software developers a career path to gain the needed experience with the skill sets described above by working across technologies. Usually, forming a data engineering center of excellence or a capability center would be the first step in making such progression possible.

Challenges for creating a centralized Data Engineering Team

The centralization of the Data Engineering team as a service approach is different from how Reporting & Analytics and Data Science teams operate. It does, in principle, mean giving up some level of control of resources and establishing new processes for how these teams will collaborate and work together to deliver initiatives.

The Data Engineering team will need to demonstrate that it can effectively support the needs of both Reporting & Analytics and Data Science teams, no matter how large these teams are. Data Engineering teams must effectively prioritize workloads while ensuring they can bring the right skillsets and experience to assigned projects.

Data engineering is essential because it serves as the backbone of data-driven companies. It enables analysts to work with clean and well-organized data, necessary for deriving insights and making sound decisions. To build a functioning data engineering practice, you need the following critical components:

Data Engineering Center of Excellence

The Data Engineering team should be a core capability within the enterprise, but it should effectively serve as a support function involved in almost everything data-related. It should interact with the Reporting and Analytics and Data Science teams in a collaborative support role to make the entire team successful.

The Data Engineering team doesn't create direct business value. Its value should come from making the Reporting and Analytics and Data Science teams more productive and efficient, to ensure delivery of maximum value to business stakeholders through Data & Analytics initiatives. To make that possible, the six key responsibilities within the data engineering capability center are as follows:

The 6 pillar core-responsibilities of a Data Engineering team.
Data Engineering Center of Excellence — Image by Author.

Let's review the 6 pillars of responsibilities:

1. Determine Central Data Location for Collation and Wrangling

Understanding and having a strategy for a Data Lake (a centralized data repository or data warehouse for the mass consumption of data for analysis). Defining the requisite data tables and where they will be joined in the context of data engineering, and subsequently converting raw data into digestible and valuable formats.

2. Data Ingestion and Transformation

Moving data from one or more sources to a new destination (your data lake or cloud data warehouse) where it can be stored and further analyzed, and then converting data from the format of the source system to that of the destination.

3. ETL/ELT Operations

Extracting, transforming, and loading data from one or more sources into a destination system to represent the data in a new context or style.
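The difference between ETL and ELT is mostly about where the transformation runs. The sketch below is an illustration only, using an in-memory SQLite database as a stand-in for the destination; the table names, columns, and sample rows are hypothetical. In the ETL path the rows are cleaned in application code before loading, while in the ELT path the raw text is loaded first and then transformed with SQL inside the destination.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from an OLTP export (all text).
SOURCE_ROWS = [
    ("1001", " Alice ", "19.99"),
    ("1002", "bob", "5.00"),
]


def etl(conn, rows):
    """ETL: clean the rows in Python first, then load the finished result."""
    cleaned = [(int(i), name.strip(), float(amt)) for i, name, amt in rows]
    conn.execute("CREATE TABLE orders_etl (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def elt(conn, rows):
    """ELT: load the raw text as-is, then transform with SQL in the destination."""
    conn.execute("CREATE TABLE orders_raw (id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER) AS id,
               TRIM(customer) AS customer,
               CAST(amount AS REAL) AS amount
        FROM orders_raw
        """
    )


conn = sqlite3.connect(":memory:")
etl(conn, SOURCE_ROWS)
elt(conn, SOURCE_ROWS)

query = "SELECT id, customer, amount FROM {} ORDER BY id"
etl_rows = conn.execute(query.format("orders_etl")).fetchall()
elt_rows = conn.execute(query.format("orders_elt")).fetchall()
print(etl_rows)
print(elt_rows)
```

Both paths end with the same cleaned table. The ELT path also keeps `orders_raw` in the destination, so the transformation can be rerun or changed later without going back to the source system.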

4. Data Modeling

Data modeling is an essential function of a data engineering team, although not all data engineers excel at it. It means formalizing relationships between data objects and business rules into a conceptual representation by understanding information system workflows, modeling required queries, designing tables, determining primary keys, and effectively utilizing data to create informed output.

I've seen engineers in interviews mess up more with this than with coding in technical discussions. It's essential to understand the differences between Dimension, Fact, and Aggregate tables.
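As a rough illustration of the three table types, here is a tiny star schema built in an in-memory SQLite database. The table names, columns, and sample data are hypothetical and far simpler than a production model: a dimension holds descriptive attributes, a fact holds measures plus keys pointing at dimensions, and an aggregate stores facts pre-summarized at a coarser grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension: descriptive attributes, one row per product.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT
    );

    -- Fact: one row per sale, holding measures plus keys to dimensions.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity    INTEGER,
        revenue     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Primer', 'Paint'), (2, 'Roller', 'Tools');
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 40.0),
        (2, 1, 1, 20.0),
        (3, 2, 5, 25.0);

    -- Aggregate: facts pre-summarized at a coarser grain (per category).
    CREATE TABLE agg_sales_by_category AS
    SELECT d.category,
           SUM(f.quantity) AS total_quantity,
           SUM(f.revenue)  AS total_revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category;
    """
)

summary = conn.execute(
    "SELECT category, total_quantity, total_revenue "
    "FROM agg_sales_by_category ORDER BY category"
).fetchall()
print(summary)  # [('Paint', 3, 60.0), ('Tools', 5, 25.0)]
```

Reports that only need category totals can read the small aggregate table instead of scanning and joining the full fact table each time.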

5. Security and Access

Ensuring that sensitive data is protected and implementing proper authentication and authorization to reduce the risk of a data breach.
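One common form of authorization is restricting which columns a role may read. The following is a minimal sketch of that idea, with hypothetical role names and columns; a real platform would normally enforce this in the warehouse itself through grants, views, or masking policies rather than in application code.

```python
# Hypothetical roles mapped to the columns each role may read.
ROLE_COLUMNS = {
    "analyst": {"order_id", "amount"},
    "support": {"order_id", "email"},
}


def authorized_view(row, role):
    """Return only the columns the role may see; unknown roles see nothing."""
    allowed = ROLE_COLUMNS.get(role, set())
    return {col: value for col, value in row.items() if col in allowed}


row = {"order_id": 1001, "email": "alice@example.com", "amount": 19.99}
print(authorized_view(row, "analyst"))  # {'order_id': 1001, 'amount': 19.99}
print(authorized_view(row, "guest"))    # {}
```

The sketch denies by default: a role that is not listed gets an empty result instead of the full row.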

6. Architecture and Administration

Defining the models, policies, and standards that govern what data is collected, where and how it is stored, and how such data is integrated into various analytical systems.

The six pillars of responsibilities for data engineering capabilities center on the ability to determine a central data location for collation and wrangling, ingest and transform data, execute ETL/ELT operations, model data, secure access, and administer an architecture. While all companies have their own specific needs with regard to these functions, it is important to ensure that your team has the necessary skill set to build a foundation for big data success.

Besides Data Engineering, the following are the other capability centers that need to be considered within an enterprise:

Analytics Capability Center

The analytics capability center enables consistent, effective, and efficient BI, analytics, and advanced analytics capabilities across the company. It assists business functions in triaging, prioritizing, and achieving their objectives and goals through reporting, analytics, and dashboard solutions, while providing operational reports and visualizations, self-service analytics, and the tools required to automate the generation of such insights.

Data Science Capability Center

The data science capability center is for exploring cutting-edge technologies and concepts to unlock new insights and opportunities, better inform employees, and create a culture of prescriptive information usage using Automated AI and Automated ML solutions such as H2O.ai, Dataiku, Aible, DataRobot, and C3.ai.

Data Governance

The data governance office empowers users with trusted, understood, and timely data to drive effectiveness while keeping the integrity and sanctity of data in the right hands for mass consumption.

As your company grows, you will want to make sure that the data engineering capabilities are in place to support the six pillars of responsibilities. By doing this, you will be able to ensure that all aspects of data management and analysis are covered and that your data is safe and accessible by those who need it. Have you started thinking about how your company will grow? What steps have you taken to put a centralized data engineering team in place?

Thank you for reading!

https://twitter.com/richiebachala

Distributed SQL, Data Engineering Leader @ Yugabyte | past @ Sherwin-Williams, Hitachi, Oracle