Reflecting Back on a Decade of Practicing Data Science

Journey, thoughts & interests, and a possible future

Jaco du Toit
Towards Data Science


Betty’s Bay (Image by author)

How it started

I first discovered machine learning in 2012 during my master’s degree in Electrical and Electronic Engineering through conversations with professors and colleagues in our research lab. I was fascinated by what I learned and decided to take online courses that included the popular Andrew Ng and Daphne Koller material on coursera.org. Andrew’s machine learning course stood out as the most engaging, practical, and enjoyable. This opened up a new and exciting world of endless possibilities for X to Y mappings.

The online courses left me with more questions than answers: Is this how the mind actually functions? Is there a universal learning algorithm that can achieve human-like intelligence? What are learning and inference, really, and how can they be expressed mathematically? I was also interested in the inspiration behind these algorithms and their potential to be exploited psychologically. To gain more insight, I read books and research papers by authors such as Jeff Hawkins, Geoffrey Hinton, Daniel Kahneman, Dan Ariely, V. S. Ramachandran, Claude Shannon, Pedro Domingos, Steven Pinker, and Ian Goodfellow.

By 2017/2018, I had a good understanding of the basic learning algorithms, Neural networks in particular. But after extensive discussions with my supervisor and much consideration, I decided to shift my focus to Bayesian methods, right at the peak of the Deep learning trend. This shift changed my course from weights, neurons and activation functions to uncertainty. And much of it! I read more books and research papers by authors such as David Barber, Kevin Murphy, Christopher Bishop, Tom Minka, John Winn, Judea Pearl, Norman Fenton, David MacKay, Daphne Koller, and Edwin Thompson Jaynes. I still find this field intellectually stimulating, as it requires a different way of thinking compared to other machine learning approaches, often leading to introspection.

What was interesting over the past 10 years?

As an industry professional, I have spent the past decade practicing data science and implementing machine learning algorithms across various industries: electricity utilities, digital and social media, fintech, and mobile telecommunications. My aim has always been to achieve more with less and to share the learnings. Some of the most impactful algorithms I’ve applied (and variations thereof) include PCA, Logistic Regression, Naïve Bayes, Gaussian Mixture Models, Gaussian Processes, Latent Dirichlet Allocation, and simple Neural Networks (sequential and feed-forward). The versatility and broad applicability of the Gaussian distribution for unsupervised, semi-supervised, and supervised tasks continue to impress me.
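
As a concrete illustration of that versatility, here is a minimal sketch (my own, assuming scikit-learn rather than any tool mentioned in this article) of a Gaussian Mixture Model doing double duty: unsupervised clustering and density-based anomaly scoring on synthetic data.

```python
# Minimal sketch: a Gaussian Mixture Model used for both unsupervised
# clustering and density-based anomaly scoring (synthetic data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic clusters of 2-D points.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

clusters = gmm.predict(X)           # unsupervised cluster assignments
log_density = gmm.score_samples(X)  # per-point log-likelihood under the fit

# Points with unusually low density can be flagged as anomalies.
threshold = np.percentile(log_density, 1)
anomalies = X[log_density < threshold]
print(clusters[:10], anomalies.shape)
```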

I enjoyed working on engineering-type problems more than on customer-facing problems, partly due to the physics that underlie the data generation processes. Analysing and interpreting human behaviour from data or metadata was complex, outside of my expertise, and apophenia was a “tool” that I was not comfortable using.

During my time working at digital and social media companies, I found Deep learning to be highly applicable, largely due to the reasons outlined in Chapter 11 of Ian Goodfellow’s book on Deep learning. From my experience, this was a “safe and effective” form of data science, particularly for text and image problems. There are plenty of tools, example code, online courses, and research papers available, and apart from the tedious process of data preparation, it was often like choosing from a candy store before going to the movies. However, I had to be familiar with the underlying theory of the tools I was using, especially when pre-built solutions were not sufficient. Nando de Freitas once compared it to playing with Lego, which is an accurate analogy and the most valuable takeaway from my experience with Deep learning: breaking complex problems down into smaller, manageable parts, an age-old wisdom.
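
To show what that “Lego” feel looks like in practice, here is a minimal sketch of my own (using the Keras Sequential API, which the article does not prescribe) of a small feed-forward classifier assembled from standard building blocks.

```python
# Minimal sketch of the "Lego-like" composability of deep learning tooling,
# using the Keras Sequential API (an assumed choice, not one from the article).
# A small feed-forward binary classifier on toy data.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")  # synthetic labels

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),    # one reusable "block"
    tf.keras.layers.Dense(16, activation="relu"),    # stack another
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```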

In all of the industries where I worked, I found Bayesian modelling to be appropriate for a wider range of problems compared to other methods. These problems often involved limited data, missing data, the need to incorporate domain knowledge, the need for custom models, the need to incorporate uncertainty, and the need for probabilistic and causal interpretation. However, it was not always easy to convince management to invest in such methods. Barriers to adoption included a lack of specialized skills, limited relevant online resources, and longer model development times. Developing Bayesian models meant burning the midnight oil. Despite these challenges, I managed to make Bayesian modelling work for some problems, but for most I had to keep a straight face while “panel beating” the data. Things changed once I discovered Infer.NET, a framework for developing bespoke Bayesian models (open-sourced in 2018). It took a lot of effort to get familiar with the framework, but it was well worth it!
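
Infer.NET itself is a C#/.NET framework, so as a language-agnostic illustration of the kind of problem it addresses, here is a minimal sketch in Python (my own, not the article’s) of a conjugate Bayesian update that folds a domain-knowledge prior into an estimate from very little data, with the uncertainty carried through.

```python
# Minimal sketch (not Infer.NET) of Bayesian modelling with limited data:
# a conjugate Normal prior on an unknown mean is combined with a handful of
# noisy observations, giving a posterior that quantifies remaining uncertainty.
import numpy as np

# Domain knowledge: the quantity is believed to be around 50 +/- 10.
prior_mean, prior_var = 50.0, 10.0**2

# Only three noisy measurements are available (known noise std of 5).
data = np.array([62.0, 58.0, 61.0])
noise_var = 5.0**2

# Conjugate Normal-Normal update for the unknown mean.
n = len(data)
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mean = post_var * (prior_mean / prior_var + data.sum() / noise_var)

print(f"posterior mean: {post_mean:.1f}, posterior std: {np.sqrt(post_var):.1f}")
```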

The field of Bayesian modelling is intellectually satisfying as it frequently uncovers hidden shortcomings in our own reasoning and perceptions, which is surprising considering the basic principles (and philosophy) at its core. The results obtained can often be counter-intuitive, unexpected, and have a significant impact. It’s also intriguing to note that many widely-used machine learning algorithms can be understood as special cases derived from a more general Bayesian perspective. Why limit oneself when one can have it all (approximately)?
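
To make that last claim concrete with one standard example (my own illustration, not from the article): ridge regression, i.e. L2-regularised least squares, falls out as maximum a posteriori estimation under a Gaussian prior on the weights,

$$
\hat{\mathbf{w}}_{\text{MAP}}
= \arg\max_{\mathbf{w}} \, p(\mathbf{w} \mid \mathbf{y}, X)
= \arg\min_{\mathbf{w}} \sum_{i=1}^{N} \big( y_i - \mathbf{w}^{\top} \mathbf{x}_i \big)^2 + \lambda \lVert \mathbf{w} \rVert_2^2,
\qquad \lambda = \frac{\sigma^2}{\tau^2},
$$

where the likelihood is $y_i \sim \mathcal{N}(\mathbf{w}^{\top}\mathbf{x}_i, \sigma^2)$ and the prior is $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$: the ridge penalty is simply the negative log of the Gaussian prior, and the squared loss the negative log-likelihood.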

Over the past decade, I observed many advancements in technology driven by the contributions of AI researchers and open-source communities, as well as the support provided by the industry in terms of infrastructure. This has made model development and data management faster, cheaper, and more accessible, which in turn has made it possible to label more data and train models with high computational costs almost “instantly”. While many of us were focused on joining Pandas dataframes, organizations like OpenAI, DeepMind, and others created ground-breaking text-to-image, text-to-video, and language models that have become widely used. Who knows the limits of ChatGPT yet? Additionally, it’s worth mentioning the impressive accuracy achieved by AlphaFold and the historic (bittersweet) AlphaGo championship. Makes me wonder which field of study will ultimately provide the most accurate understanding of human intelligence and consciousness.

Like many others, I have encountered unusual job titles such as “Big Data Engineer.” I once overheard a recruiter ask, “How big should the candidate be?” The comment, though figurative, highlights how poorly defined these emerging roles are and how their responsibilities keep evolving and overlapping, which has made both hiring for and applying to these positions challenging. The Drew Conway Venn diagram, which was widely popular, currently appears to be a fitting depiction of the complexity of the data science role. However, I still believe that the Ouroboros would be a more apt representation, as it symbolizes the cyclical and self-referential nature of the field.

What worked well?

Data science has become an integral part of many organizations’ operations, but in my experience effective leaders overseeing such initiatives were scarce. Developing a strong support structure and business model for data science projects requires a deep understanding of the various components involved, including balancing the needs of the business with the latest technology trends. From what I have observed, the most successful leaders in this field had a practical approach and were well-versed in both the industry and the science. Success in data science projects often depended on the ability to balance project intent, necessary skills, subject matter expertise, and systems and data. Unfortunately, those who were overly ambitious or overly focused on the latest technology trends often struggled, while those who lived by principles such as Occam’s Razor and Gall’s Law tended to be more successful, often with a bit of luck as well!

Seeing the financial impact that a model or solution, whether developed by yourself or by others, can make on an organization is a truly satisfying experience. One approach that has worked for me is to start small and allow enough time for development and refinement cycles while keeping the end-to-end product in mind throughout the process. This includes keeping well-documented findings, presenting information in a clear and easy-to-understand manner, seeking feedback from peers, actively collaborating with subject matter experts across different business units, approaching the problem from different perspectives, experimenting with different techniques, asking many questions, and carefully listening to user feedback. The process of developing a prototype and bringing it to production is a challenge in itself, but it’s important to remember that even the best solution will be of little use if people don’t understand how to use it. Something not easily resolved without making compromises.

What’s next?

In the coming years, I believe the focus of AI will shift from creating artificial general intelligence to developing systems that perform well on specific tasks and handle edge cases reliably, making them more practical and trustworthy for human consumption. I think we’ll see more realistic valuations of AI companies, more focus on enterprise-level data as a product, more focus on A/B testing using the Bayesian methodology, the integration of blockchain technology to allow customers to monetise their data, and more advancements in multi-modal models.
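
For readers unfamiliar with the Bayesian A/B-testing workflow mentioned above, here is a minimal sketch of my own, with made-up conversion counts: Beta posteriors over two conversion rates and a Monte Carlo estimate of the probability that variant B beats variant A.

```python
# Minimal sketch of a Bayesian A/B test (illustrative numbers, not real data):
# Beta posteriors over two conversion rates, compared by Monte Carlo sampling.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical results: (conversions, visitors) per variant.
conv_a, n_a = 120, 2400
conv_b, n_b = 145, 2380

# Beta(1, 1) uniform priors -> Beta posteriors over each conversion rate.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

prob_b_better = (post_b > post_a).mean()
expected_lift = (post_b - post_a).mean()
print(f"P(B > A) ~ {prob_b_better:.3f}, expected lift ~ {expected_lift:.4f}")
```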

Personally, I think more effort is needed now than ever before to make the field of Bayesian modelling (causal modelling in particular) practical and attractive to students, researchers, scientists, and industry practitioners. An “Andrew Ng style Bayesian inference course”, if you will. I stress this point especially after what happened during the 2020–2022 period, which appears to be the biggest biological misstep in humanity’s history. It seems that many Covid studies are plagued by statistical fallacies such as Simpson’s Paradox and Berkson’s Paradox, and by other statistical manipulation that influenced the Covid narrative (see examples here). It’s quite frustrating to see this happening, especially when the proper use of statistics and probability is crucial to understanding and addressing a pandemic. I mean “How dare you?”.
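
For readers unfamiliar with Simpson’s Paradox, here is a minimal sketch with made-up numbers (a classic textbook-style illustration, not taken from any Covid study) showing how a treatment can look worse overall yet better within every subgroup once a confounder, here age group, is taken into account.

```python
# Minimal sketch of Simpson's Paradox with illustrative numbers: the drug has
# a higher recovery rate within each age group, yet a lower rate overall,
# because it was given mostly to the higher-risk (older) group.
import pandas as pd

df = pd.DataFrame({
    "group":     ["young", "young",   "old",  "old"],
    "treatment": ["drug",  "control", "drug", "control"],
    "recovered": [81,       234,       192,    55],
    "total":     [87,       270,       263,    80],
})

# Per-group recovery rates: the drug wins in both age groups.
by_group = df.assign(rate=df["recovered"] / df["total"])
print(by_group[["group", "treatment", "rate"]])

# Aggregated rates: the control arm appears better overall.
overall = df.groupby("treatment")[["recovered", "total"]].sum()
overall["rate"] = overall["recovered"] / overall["total"]
print(overall)
```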

While data science skills and data science leaders are in high demand and scarce, I hope to see more support provided by organisations for professional development in the field. I guess one way to tackle this would be to encourage more collaboration between industry and academia, as well as more initiatives to develop and retain skilled professionals.

Labelled data will continue to be important, so we will probably see organisations offering incentives for capturing and managing clean and reliable data. This should be interesting!

This is my personal experience practising data science over the past decade. These reflections, opinions and “forecasts” are my own. Please reach out and do let me know what you’ve experienced!


PhD in channel coding and machine learning practitioner. Passionate about learning and sharing.