Author Spotlight

The Right Skill Set Is the One That Allows You to Pursue Your Interests

Barry Smyth reflects on his dual-path career in academia and industry and on the changing landscape for young data scientists

TDS Editors
Towards Data Science
11 min read · Sep 6, 2022


In the Author Spotlight series, TDS Editors chat with members of our community about their career path in data science, their writing, and their sources of inspiration. Today, we’re thrilled to share our conversation with Barry Smyth.

Photo courtesy of Barry Smyth

Barry Smyth (BSc, PhD, Hon. DTech (RGU), MRIA, EurAI) is a scientist and entrepreneur. He holds the Digital Chair of Computer Science at University College Dublin. Barry is a member of the Royal Irish Academy, a Fellow of the European Association for Artificial Intelligence, and a Founding Director of the Insight Centre for Data Analytics.

Barry’s research interests include machine learning, user modelling, personalization, and recommender systems, and he has published over 400 peer-reviewed articles on these topics. As an entrepreneur, he co-founded ChangingWorlds and HeyStaks, and he currently serves as a board member and/or advisor for several European startups in the field of artificial intelligence.

You launched a series of data-analysis posts back in January revolving around Wordle, the viral word game. What drew you to this topic, and why did you want to explore it in such depth?

To answer this question I need to give a bit of background about what I do. One of my teaching responsibilities is a course called Data Science in Practice. It is offered to third-year undergraduate data science students who have two options: they can take on a 5-month paid industry placement, working with an approved industry partner; or they can take my Data Science in Practice course and then have their summer months available to pursue their interests or a shorter internship. If they choose Data Science in Practice then they must devote 10 weeks entirely to my course: approximately 300+ hours of effort.

Having full-time access to students over the 10 weeks means that Data Science in Practice has been designed to be very ambitious and challenging, far from a traditional lecture-based course. In it, students work together in groups (typically two students per group) to implement a data science project of their own design. They work together on the design and coding, and produce a final class presentation with a detailed written report. During the 10 weeks, they also learn a host of complementary skills, from source-code management and code reviews to presenting their progress to the class during bi-weekly update meetings, as well as experiencing the pros and cons of collaborating with a project partner.

So that’s the background context. What about Wordle? Well, every year, for the first week of this module, we engage in what I like to call our Data Science Bootcamp. This is an intensive preparatory week in which I present the students with a sample data science project, from start to finish, to give them a sense of what is expected from their group efforts. Every year I try to create a new project on a contemporary topic. In the past, I have covered topics such as marathon running (linked to Eliud Kipchoge’s efforts to break the 2-hour barrier), Hollywood movies (is it true that movies are rarely as good as the books on which they are based?), and the COVID pandemic, among others.

During the bootcamp week, I describe how to take a topic from a vague project idea to a concrete set of suitable research questions, how to assemble an appropriate dataset, how to clean and analyze the data, and how to use the results of the analysis to carefully answer the research questions in a clear and compelling way.

This year, the topic of choice was Wordle, mostly because of how popular the game became and how well it lent itself to a data-driven analysis. As I developed the project for the bootcamp week, I tested out my ideas by publishing several articles about specific Wordle topics on TDS. It proved to be a very popular set of articles, and the responses helped to shape the project for my class.

As someone who’s published hundreds of peer-reviewed articles, why did you decide to write for a broader, less-specialized audience?

This turned out to be one of those happy accidents. Some time ago I started running marathons — my mid-life crisis, I guess — and I also wanted to get back into coding and performing some of my own personal research, instead of just supervising the research of others. As an academic, I spend a lot of time supervising PhD students, who usually get to do all of the fun technical work.

I figured that a good way to get back into the world of coding would be to start my own side project. As I set out on my marathon journey, I decided I would try to learn what I could about running by analyzing the race data published by various big city marathons, from Boston and New York to London and Dublin.

As my analysis progressed, I realized that it was revealing some interesting insights about how people run the marathon and I was eager to present my findings. I started writing up various studies and race reports on Medium under my own publication, Running with Data, and later via TDS. Soon my data-science side project became one of my main academic research topics, as I deepened my research and applied what I knew from the world of machine learning to what I was finding out about sports science and marathon running. My blog posts led to dozens of new academic research papers.

I found that I really enjoyed writing for a more general audience and the immediacy of the feedback received through Medium and TDS. It was a great complement to the slower pace of academic publishing (it can take months for a new article to be reviewed and to appear in print).

My Medium articles also allowed me to reach an audience that I might otherwise miss: the running public, journalists and writers, and sports scientists. In fact, it proved to be a very effective way to raise the profile of my new interest in sports analytics and to make connections to others in the field that have since led to some very productive and enduring collaborations.

What kinds of problems do you find the most interesting these days, whether in your academic work or in your other projects?

I’m not sure if there is an easy-to-define set of questions or problems that I am drawn to, really. Broadly speaking my (current) data science interests fall into three main areas:

  1. My traditional academic interests focus on machine learning and recommender systems, usually with applications in e-commerce or media.
  2. Recently my foray into sports analytics has led me to adapt my machine learning interests to the field of sports science, and I am now involved in several projects exploring various ways that data science and machine learning can impact how we train and race. It has been great to be able to apply machine learning ideas in this novel domain, and there is clearly a huge appetite for it among the sports science field and the wider marketplace.
  3. Additionally, I am always on the lookout for other topics, often with a public-interest angle. These rarely start out as formal research projects but sometimes they take me in some very interesting directions. For example, recently I became interested in some aspects of number theory and published several TDS posts on reciprocals and prime numbers. In the process, I also discovered some novel integer sequences that were accepted by the On-Line Encyclopedia of Integer Sequences (OEIS), an important number-theory resource.

In summary, I’m drawn to questions or problems in a variety of topics, big and small, and often pursue projects just because they seem to be fun or because they allow me to try out some new coding ideas or visualization techniques. It is one of the greatest privileges of being an academic that I can explore new technologies and fields in this way.

Your career has been anchored in the academic world, but you’ve also worked extensively with industry entities. What do you see as the benefits of this dual-track path?

To answer this let’s go back to the 1990s. I was a junior academic working in artificial intelligence (AI) and machine learning. I was teaching classes, building a research group, writing grants, and publishing the results of my research.

At the time, the dot-com boom was in full swing and I was interested in exploring the potential to commercialize some of the technologies we had been developing in the lab. I was working on personalization technologies to help websites and applications to learn about, and adapt to, the needs and preferences of their users. It occurred to me that not only was there a clear market need for such technologies, but commercializing them would also provide me with an opportunity to evaluate them in the lives of real users, rather than the more limited lab-based evaluations that were possible in an academic setting.

I co-founded a company called ChangingWorlds with one of my early PhD students, and together we built personalization products for media/TV companies and mobile operators. This was before the iPhone, so the web looked very different than it does today, but the experience was very rewarding. I learned a lot about creating a successful company, hiring the right people, building teams, and productizing complex machine learning technologies.

I was fortunate to be able to remain an academic and continued to share my time between the commercial world and academia. One of the most surprising aspects of the whole experience was the value that this dual existence brought to both my business and academic life. For example, my academic experience helped ChangingWorlds demonstrate the impact and value of our products and services to customers because we knew how to set up and run large-scale user-trials; this was long before A/B testing was discovered by the competition.

At the same time, my academic life benefited hugely from the business side, because I was able to produce research papers with real-world experiments and large-scale evaluations, which really helped my research to gain recognition. Additionally, I came to understand that some of the most interesting and challenging problems existed in industry, which helped to shape my future research at a time when industry was increasingly looking to academia as an important innovation partner. I’ve always maintained a balance between fundamental and applied research, but the latter has been greatly enriched by working with industry partners large and small.

On a related note, many data and ML professionals face the choice between industry and academia early on in their careers. What questions should they ask themselves before making a decision?

The nature of the “academia vs. industry” debate has been changing over the last decade or so, maybe even longer. It used to be true that people pursued a PhD (in computer/data science) primarily because they were interested in an academic career. In my opinion, this is no longer the norm, and most PhD candidates appear to be driven by the industry research opportunities that a PhD now offers. Indeed, some of the best research in the world is happening inside industry research labs, as evidenced by the volume of industry-led research at the leading AI and ML conferences.

When it comes to the questions that data science and ML students and professionals should ask themselves, I think that the focus should be on developing the right skill set to pursue a career track that interests them. Sometimes we over-index on the technical skills (coding, mathematics, statistics), but these days the collaboration, software engineering, and communication skills are also critically important. Data scientists need to be able to work well with others (often as part of interdisciplinary teams), they need to follow best-practice software engineering principles, and they need to be able to communicate the results of their work in a clear and compelling way.

I’d like to see students exposed to much more of this during their undergraduate training. It is why I designed my Data Science in Practice course the way that I did and I think the students all agree that, challenging as it is, they all learn a lot about what it means to be a professional data scientist.

As someone with a long career in tech- and data-related fields, where do you hope to see the field moving in the next couple of years?

We can expect to see the continued commoditization of data science and machine learning. Not so long ago, if you wanted to use machine learning you had to code the algorithms from scratch, and manage all of the training and model building. Not anymore. Today, the state of the art is usually just an API call away. The downside of this is that it has led to a proliferation of machine learning ‘solutions,’ but not always using the guardrails that are required to ensure that models are fairly trained and correctly deployed.

For example, it’s all too easy for models to over-fit to their training data and under-perform on new data, and models that are trained with biased datasets will necessarily produce biased results. This is not new, but it could become a big problem in an increasingly ML-driven world if the power of machine learning and data science is tarnished by some poorly conceived failures to follow best practices.
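As a minimal illustrative sketch of the over-fitting point (not code from the interview), the toy example below uses made-up, synthetic data in which 30% of the labels are deliberately flipped. A model that simply memorizes its training points scores perfectly on that training data but does noticeably worse on fresh data, while a much simpler rule holds up better. The function names, noise level, and sample sizes are all assumptions chosen for illustration.

```python
import random

def make_data(n, rng, noise=0.3):
    """Return (x, label) pairs; the true rule is x > 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:  # simulate noisy, imperfect labels
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def predict_threshold(x):
    """Simple rule: a single fixed cut-off that ignores the label noise."""
    return x > 0.5

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)     # hypothetical training set
test = make_data(2000, rng)     # hypothetical unseen data

train_acc_1nn = accuracy(lambda x: predict_1nn(train, x), train)
test_acc_1nn = accuracy(lambda x: predict_1nn(train, x), test)
test_acc_simple = accuracy(predict_threshold, test)

print(f"memorizing model, training data: {train_acc_1nn:.2f}")
print(f"memorizing model, new data:      {test_acc_1nn:.2f}")
print(f"simple rule, new data:           {test_acc_simple:.2f}")
```

The gap between the first two numbers is the warning sign: a score measured on the training data says little about how a model will behave once deployed, which is why held-out evaluation is one of the guardrails in question.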

I also believe that companies will come to realize that incorporating data science and ML technologies into their products and services is not the silver bullet that it may at first seem. Hiring a capable data science team is challenging and expensive, and companies new to the field may not have a full understanding of the right requirements, making early hires a risky proposition.

In traditional software engineering, maintaining and developing a codebase has always been important. In data science research, reproducibility is key, and within the research community we are seeing an increased emphasis being placed on open datasets and code availability—all in the service of greater reproducibility. One interesting challenge in this regard will be how we manage datasets in the future. We are used to source-code management tools such as Git and GitHub, but these are not well suited to the management of evolving datasets. We are starting to see the beginnings of a new category of data management and collaboration tools, which I think will prove to be an important area for the future.
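One simple idea that can help with tracking an evolving dataset is a content fingerprint: a hash that changes whenever the data changes. The sketch below is an illustration only, using Python's standard `hashlib` and `json` modules; the `dataset_fingerprint` helper and the sample rows are hypothetical and do not describe any particular tool.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of the rows, independent of row order."""
    # Serialize each row with sorted keys so equal rows give equal strings,
    # then sort the strings so that reordering rows does not change the hash.
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Made-up sample data, for illustration only.
v1 = [{"runner": "a", "finish_minutes": 215},
      {"runner": "b", "finish_minutes": 242}]
v2 = list(reversed(v1))                               # same rows, reordered
v3 = v1 + [{"runner": "c", "finish_minutes": 198}]    # one new row

print(dataset_fingerprint(v1) == dataset_fingerprint(v2))  # same content
print(dataset_fingerprint(v1) == dataset_fingerprint(v3))  # content changed
```

Recording a fingerprint like this alongside the code that produced a result is one lightweight way to state exactly which version of a dataset an analysis used.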

To learn more about Barry’s work and stay up-to-date with his latest articles, follow him here on Medium and on Twitter, or explore his academic work on Google Scholar. Here’s a sample of some of Barry’s excellent TDS contributions on topics like Wordle, number theory, and public health data:

Feeling inspired to share some of your own writing with a wide audience? We’d love to hear from you.

This Q&A was lightly edited for length and clarity.
