Making Data Useful

Data science… without any data?!

Why it’s important to hire data engineers early

Cassie Kozyrkov
Towards Data Science
6 min read · Nov 13, 2020

“What challenges are you tackling at the moment?” I asked. “Well,” the ex-academic said, “It looks like I’ve been hired as Chief Data Scientist… at a company that has no data.”

“Human, the bowl is empty.” — Data Scientist. Image: SOURCE.

I don’t know whether to laugh or to cry. You’d think it would be obvious, but data science doesn’t make any sense without data. Alas, this is not an isolated incident.

Data science doesn’t make any sense without data.

So, let me go ahead and say what so many ambitious data scientists (and their would-be employers) really seem to need to hear.

What is data engineering?

If data science is the discipline of making data useful, then you can think of data engineering as the discipline of making data usable. Data engineers are the heroes who provide behind-the-scenes infrastructure support that makes machine logs and colossal data stores compatible with data science toolkits.

Meme: SOURCE.

If data science is the discipline of making data useful, then data engineering is the discipline of making data usable.

Unlike data scientists, data engineers tend not to spend much time looking at data. Instead, they look at and work with the infrastructure that holds the data. Data scientists are the data-wranglers, while data engineers are the data-pipeline-wranglers.

Image: SOURCE.

Data scientists are the data-wranglers, while data engineers are the data-pipeline-wranglers.

What do data engineers do?

Data engineering work comes in three main flavors:

  1. Enabling data storage (data warehouses) and delivery (data pipelines) at scale.
  2. Maintaining data flows that fuel enterprise operations.
  3. Supplying datasets to support data science.
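
To make those three flavors a little more concrete, here is a toy sketch in Python. It is an illustration under loud assumptions, not how any real data engineering team works: the log format, the `RAW_LOGS` sample, and the function names are all hypothetical, and an in-memory SQLite table stands in for a data warehouse.

```python
import sqlite3

# Hypothetical raw machine logs in the form "timestamp,user_id,event".
RAW_LOGS = [
    "2020-11-13T09:00:00,u1,login",
    "2020-11-13T09:05:00,u2,login",
    "malformed line",
    "2020-11-13T09:07:00,u1,purchase",
]


def parse(line):
    """Turn one raw log line into a (ts, user_id, event) tuple, or None if malformed."""
    parts = line.split(",")
    return tuple(parts) if len(parts) == 3 else None


def load(conn, lines):
    """Store the parseable lines in a table (a stand-in for a data warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (ts TEXT, user_id TEXT, event TEXT)"
    )
    rows = [row for row in map(parse, lines) if row is not None]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def events_per_user(conn):
    """Supply an analysis-ready dataset: the number of events per user."""
    cursor = conn.execute(
        "SELECT user_id, COUNT(*) FROM events GROUP BY user_id ORDER BY user_id"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
loaded = load(conn, RAW_LOGS)      # storage and delivery
dataset = events_per_user(conn)    # a dataset a data scientist could pick up
```

Everything here fits in memory on a laptop, which is exactly why it looks easy. The same three steps at enterprise scale are a different job.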

Data science is at the mercy of data engineering

You can’t do data science if there’s no data. If you get hired to be head of data science in an organization where there’s no data and no data engineering, guess who’s going to be the data engineer…? You!

Exactly.

What’s so hard about data engineering?

Grocery shopping is easy if you’re just cooking something for your own dinner, but large scale turns the trivial into the Herculean — how do you acquire, store, and process 20 tons of ice cream… without letting any of it melt?

Similarly, “data engineering” is fairly easy when you’re downloading a little spreadsheet for your school project but dizzying when you’re handling data at petabyte scale. Scale makes it a sophisticated engineering discipline in its own right.

Image: SOURCE.

Scale makes it a sophisticated engineering discipline in its own right.

Unfortunately, knowing one of these disciplines in no way implies that you know anything about the other.

Should you learn both disciplines?

If you’ve just felt the urge to run off and study both disciplines, you might be a victim of the (stressful and self-defeating) belief that data professionals have to know the everything of data. The data universe is expanding rapidly — it’s time we started recognizing just how big this field is and that working in one part of it doesn’t automatically require us to be experts of all of it. I’d go so far as to say that it’s too big for even the most determined genius to swallow whole.

Working in one part of the data universe doesn’t automatically require us to be experts of all of it.

Instead of expecting data people to be able to do all of it, let’s start asking one another (and ourselves), “Which kind are you?” Let’s embrace working together instead of trying to go it alone.

But isn’t this an incredible opportunity to learn?

Maybe. It depends how much you love the discipline you already know. Data engineering and data science are different, so if you’re a data scientist who didn’t train for data engineering, you are going to have to start from scratch.

Building your data engineering team could take years.

This might be exactly the kind of fun you want — as long as you’re going in with open eyes. Building your data engineering team could take years. Sure, it’s nice to have an excuse to learn something new, but in all likelihood, your data science muscles will atrophy as a result.

As an analogy, imagine you’re a translator who is fluent in Japanese and English. You’re offered a job called “translator” (so far, so good) but when you arrive at work, you discover that you were hired to translate from Mandarin to Swahili, neither of which you speak. It might be stimulating and rewarding to take the opportunity to become quadrilingual, but do be realistic about how efficiently you’ll be using your primary training (and how terrifying your first performance review may be).

Who doesn’t love a good bad translation? Image: SOURCE.

In other words, if a company doesn’t have any data or data engineers, then accepting a role as Chief Data Scientist means putting your data science career on hold for a few years in favor of a data engineering career — that you might not be qualified for — while you build a data engineering team. Eventually, you’ll gaze proudly at the team you’ve built and realize that it no longer makes sense for you to do the nitty-gritty yourself. By the time your team is ripe for those cool neural networks or fancy Bayesian inference that you did your PhD on, you have to sit back and watch someone else score the goal.

Image: SOURCE.

Advice for data science leaders and those who love them

Tip #1: Know what you’re getting into

If you’re considering taking a job as a head of data science, your first question should always be, “Who is responsible for making sure my team has data?” If the answer is YOU, well, at least you’ll know what you’re signing up for.

Before taking a data science job, always ask about the *who* of data engineering.

Tip #2: Remember that you’re the customer

Since data science is at the mercy of data, merely having data engineering colleagues might not be enough. You might face an uphill struggle if those colleagues fail to recognize you as a key customer for their work. It’s a bad sign if their attitude reminds you more of museum curators, preserving data for its own sake.

Tip #3: See the bigger (organizational) picture

While it’s true that you’re a key customer for data engineering, you’re probably not the only customer. Modern businesses use data to fuel operations, often in ways that can hum along nicely enough without your interference. When your contribution to the business is a nice-to-have (and not a matter of your company’s survival), it’s unwise to behave as if the world revolves around you and your team. A healthy balance is healthy.

Tip #4: Insist on accountability

Position yourself to have some influence over data engineering decisions.

Before signing up for your new gig, consider negotiating for ways to hold your data engineering colleagues accountable for collaborating with you. If there are no repercussions to shutting you out, your organization is unlikely to thrive.


Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita