
So You Want to be a Data Engineer

You Better Understand Architecture and Business Analysis


Many people incorrectly assume that great data skills make great data engineers. It makes sense on the surface. After all, data engineers transform raw data from multiple sources into consumable packages. Hence, data skills. While various technical skills are needed by data engineers, they are more like prerequisites than differentiators. Data engineers need to grasp the entire business use case and design complete data solutions.

To get a better grasp on this, let me tell you a story about Sue. Sue’s story starts at a restaurant chain named Krispy Crabcakes with hundreds of stores nationwide. They want to increase sales, so they start email marketing. Customers sign up for free coupons on in-store kiosks, and coupons are sent in email campaigns every week or so. Being old school, they don’t trust cloud or SaaS. No, they do everything in house.

· IT manages the email lists and the mail server.

· Web developers create tracking images sourced in the emails to see how many people open them.

· Coupons are scanned in the point-of-sale system when used.

Not very efficient in today’s environment, but it has worked fine for a couple of months. Sales are up and lots of coupons are being redeemed. Management asks how email marketing is going, and someone in IT says, "great!" Of course, that’s not good enough for management. They ask how many emails have been sent, how many were opened, and how many coupons were redeemed. Reasonable questions if you ask me, but there’s a problem. No one ever thought about measuring the campaigns. They were just told to send out a bunch of coupons. Now, they’re in a pickle. Enter Sue, the data engineer.


Sue knows she needs to report out metrics on the email campaigns. She hasn’t even spoken to anyone who manages the technology, because she doesn’t care about that right now. Her goal is to understand how email marketing is used to impact the business. So, after speaking with three or four people, she finally gets a picture in her mind. She also understands the metrics management wants to see.

She finds the why and what of her project long before thinking about the how. In fact, she even draws a simple diagram with some notes. This diagram doesn’t really follow any industry standard, but it gets the job done and gives her a reference point for further discussion. She documents the business process in an easy-to-understand picture with a few bullet points.

Provided by the author

A great start to the project, if you ask me. Sue now understands the business process, so she can start to design a system. She also knows who to talk to when questions arise. But what did she make? On the left side, there are short, concise bullet points explaining the business process used by Krispy Crabcakes and the three metrics needed by management.

On the right is a simple map of relationships between the entities involved in the process. Arrows show the relationships, with notations on how many related entities might be found. For example, a coupon can be redeemed 0 or 1 time. A campaign will have at least one recipient, but no maximum number. By naming these entities, everyone can discuss the process using the same language.
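To make the two cardinality rules concrete, here is a minimal Python sketch. The class and field names (Coupon, Campaign, redeemed_at, recipients) are my own assumptions for illustration; they are not taken from Sue’s diagram.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Coupon:
    """Hypothetical coupon entity: it can be redeemed 0 or 1 time."""
    code: str
    redeemed_at: Optional[str] = None  # None means the coupon has not been redeemed

    def redeem(self, timestamp: str) -> None:
        # Enforce the "0 or 1" rule: a second redemption is rejected.
        if self.redeemed_at is not None:
            raise ValueError(f"coupon {self.code} was already redeemed")
        self.redeemed_at = timestamp


@dataclass
class Campaign:
    """Hypothetical campaign entity: at least one recipient, no upper limit."""
    name: str
    recipients: List[str]

    def __post_init__(self) -> None:
        # Enforce the "at least one" rule when the campaign is created.
        if not self.recipients:
            raise ValueError("a campaign needs at least one recipient")
```

Writing the rules down this way is optional, but it shows how notations on a diagram can later become checks in a system.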

Now that Sue fully understands the business process, it’s time to speak with IT and see what kind of technology is being used. She finds the IT manager and asks for an architecture diagram to better understand how the email campaigns are being implemented. As expected, no such diagram exists, and to make matters worse, no single person understands the full process. Sue has been doing this a long time, so she isn’t surprised. Nor is she worried.

After some digging, lots of coffee, and a little wine, Sue has found all the pieces of the puzzle. She combines her knowledge of the business with everything she learned from IT and web development to build an architecture diagram of the current technology. Once again, she doesn’t follow any particular diagramming standard. In fact, she often just pastes logos for the technology being used. She draws arrows with actions between the various components to show the flow of activity. Since everything starts with the customer, she puts them right in the middle.

Can you follow the diagram? Take a few minutes to figure out the flow.

Provided by the author

With the information Sue learned, she found four data sources. While no one knew how to track emails sent before, she learned about the mail server’s logging. All emails are logged with important metadata to comply with internal auditing requirements. As luck would have it, these logs include just enough information to identify emails sent by each store for specific campaigns.
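As a rough illustration of how such logs could be turned into a count of emails sent, here is a small Python sketch. The log line format, the field names (store, campaign, status), and the sample lines are all invented for this example; the real mail server’s format is not described in the story.

```python
import re
from collections import Counter

# Hypothetical log format, assumed only for this sketch:
# <timestamp> to=<address> store=<id> campaign=<name> status=<status>
LOG_LINE = re.compile(
    r"store=(?P<store>\S+)\s+campaign=(?P<campaign>\S+)\s+status=(?P<status>\S+)"
)


def count_sent(lines):
    """Count successfully sent emails per (store, campaign) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        # Skip lines that do not match or that record a failed delivery.
        if match and match.group("status") == "sent":
            counts[(match.group("store"), match.group("campaign"))] += 1
    return counts


# Made-up sample lines in the assumed format.
sample = [
    "2020-03-02T09:15:00 to=a@example.com store=0042 campaign=spring status=sent",
    "2020-03-02T09:15:01 to=b@example.com store=0042 campaign=spring status=sent",
    "2020-03-02T09:15:02 to=c@example.com store=0007 campaign=spring status=bounced",
]
sent_counts = count_sent(sample)
```

With the sample above, store 0042 ends up with two sent emails for the spring campaign, and the bounced email for store 0007 is ignored.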

Provided by the author

Now Sue is in a great place. She understands how the business uses email marketing to improve sales. She also pieced together all the nuts and bolts of the technical environment supporting everything. She can start thinking about solving the problem. Does she log in to a database or fire up her text editor? No, of course not. She now needs to design the architecture of her solution.

While she would much rather use cloud services or at least Kubernetes to manage things, Krispy Crabcakes prefers in-house technology and offers a Linux server for running scripts and scheduling cron jobs. That’s OK. Sue is a data engineer, not a technology evangelist. She designs an elegant solution using the most appropriate tools for the job.

Notice that her new diagram only includes the data sources from the previous one. Her solution doesn’t interact with many of the components in the IT infrastructure, so she only adds what is needed for this specific project.

Provided by the author

Sue also creates a database diagram to show what the data warehouse will look like. She has a single fact table with two dimensions. Over time, Sue expects the business to request more metrics and reports. She starts with a star schema that can be easily expanded, yet doesn’t add complexity to the current solution.
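For readers who have not built a star schema before, here is a minimal sketch using Python’s built-in sqlite3 module. The story does not name the two dimensions, so store and campaign are my assumption, as are all table names, column names, and the sample numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the sketch

# One fact table surrounded by two dimension tables (names are hypothetical).
conn.executescript("""
CREATE TABLE dim_store (
    store_id   INTEGER PRIMARY KEY,
    store_name TEXT NOT NULL
);
CREATE TABLE dim_campaign (
    campaign_id   INTEGER PRIMARY KEY,
    campaign_name TEXT NOT NULL
);
CREATE TABLE fact_email_activity (
    store_id         INTEGER NOT NULL REFERENCES dim_store(store_id),
    campaign_id      INTEGER NOT NULL REFERENCES dim_campaign(campaign_id),
    emails_sent      INTEGER NOT NULL,
    emails_opened    INTEGER NOT NULL,
    coupons_redeemed INTEGER NOT NULL,
    PRIMARY KEY (store_id, campaign_id)
);
""")

# Made-up sample rows, only to show the query shape.
conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "Store A"), (2, "Store B")])
conn.executemany("INSERT INTO dim_campaign VALUES (?, ?)", [(1, "Spring")])
conn.executemany(
    "INSERT INTO fact_email_activity VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 100, 40, 10), (2, 1, 50, 20, 5)],
)

# The three metrics management asked for, rolled up per campaign.
rows = conn.execute("""
SELECT c.campaign_name,
       SUM(f.emails_sent),
       SUM(f.emails_opened),
       SUM(f.coupons_redeemed)
FROM fact_email_activity AS f
JOIN dim_campaign AS c ON c.campaign_id = f.campaign_id
GROUP BY c.campaign_name
""").fetchall()
```

Adding a new dimension later, such as a date table, would mean one more key column on the fact table and one more join, which is the kind of easy expansion a star schema allows.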

Provided by the author

Finally, after all this, Sue starts implementing her solution. It seems like a lot of work to get to this point, but it is not. It takes her a day or so to write up the scripts and start scheduling jobs to collect data. Within a few days, she has production reports management can view from their browser. Everything works. Since she understands the business process, she is already recommending better metrics and ways to enable a feedback loop into the system. She can do this because she understands architecture and business analysis. In fact, Krispy Crabcakes already wants her to oversee better tracking of campaigns and individual emails.


Data engineers need to grasp the entire business use case and design complete data solutions. Too often we jump to SQL or Python before fully understanding the business processes we support. Sue, on the other hand, takes time to understand the business and current technology before designing a solution. This allows her to provide valuable insight to both management and IT. It allows her to design solutions that fit today and are flexible enough for tomorrow. If you want to be a data engineer, be like Sue.

