Great Books for Data Science

Duncan McKinnon
Towards Data Science
10 min read · Jan 14, 2019

10 non-technical books that will get you excited about data science

There is no shortage of books that promise to teach data science. Most of these books read like college textbooks, with a wealth of technical material prefaced by a short conceptual introduction. While there is plenty of demand for these technical skills, really good data scientists also need a mix of conceptual understanding and curiosity before they can make the most of their tools.

Each of these books approaches the concept of data science differently: some are about what data science is, some are about how professional data scientists think, some introduce perspectives that may change how you view the world, and one even contains the full history of information.

Each time I came across one of these books, it profoundly impacted my thinking in ways I wouldn’t have expected. At the very least I hope this list can serve as a useful starting point:

1. Weapons of Math Destruction

by Cathy O’Neil

Weapons of Math Destruction is all about what goes wrong when people use the tools of data science without a firm conceptual understanding of their roles and responsibilities. Data science is a SCIENCE, and poor procedures, unclear objectives and individual biases always lead to completely unviable and often destructive outcomes.

After reading this book, you begin to see the unintended weaponization of data everywhere. While it may induce some anxiety to see how poorly implemented the world’s most powerful data systems are, none of us can fix the problems we fail to recognize.

This book should be considered mandatory reading for anyone who wants to develop or manage data systems.

2. The Information

by James Gleick

This book begins at the beginning: the emergence of information as a transferable entity. The Information tracks every step in the development of data, from tribal drums to the written word to today’s massive data centers.

Personally, the concept of information was not something I had deeply questioned before reading this book, but just recognizing the role that information plays in the existence of basic life forces you to the point of existential crisis. After reading this book I had to learn more about information theory, which led me to The Idea Factory, Claude Shannon, and the true basis of the modern world. Not totally related to data science, but definitely worth reading.

3. Everybody Lies

by Seth Stephens-Davidowitz

To be honest, when I saw the title of this book I found it a little off-putting. It was clearly created to be eye-catching, like a self-help book for the masses. It’s one book that shouldn’t be judged by its cover. From page one it offers some of the most fascinating insights into the possibilities of good data science that have yet been explored.

Seth Stephens-Davidowitz uses examples from his time as a data scientist at Google to show how search information alone can be used to get at deeper subconscious truths about our collective society. The point isn’t that people lie as individuals, but that in aggregate our beliefs about ourselves can be shown to be highly inflated when set against the evidence in simple online searches and comments.

If the goal of data science is to find ways to capture, model and then make use of systemic consistencies (like physicists and engineers have been doing up to this point), then it is vital to have accurate representations of the environment, which is really what this book is about.

Underneath overly idealized models, individual biases and expectations, there is the reality of the system, hiding in the data.

4. Dataclysm

by Christian Rudder

While I was an undergrad, OkCupid and Tinder were two of the companies I thought would be most exciting to work for. The data sets of these online dating giants may be rivaled only by Facebook in terms of the fundamental sociological information they contain. At these specific companies there is also incredible potential to use data and feedback systems to directly improve outcomes for customers while gaining insights into the patterns of human behavior along the way.

This book is basically those insights.

Written by Christian Rudder, co-founder and head of analytics at OkCupid, Dataclysm is about how data has been used, what we have learned and what happens next. One of the best things about this book is the use of visualizations. Creating representations of information that are intuitive and easy to digest is a huge (and often overlooked) side of data science.

This book inspires by both showing and telling what you can do with data.

5. Big Data

by Viktor Mayer-Schönberger and Kenneth Neil Cukier

Big data is a buzzword these days. The authors of this book are well aware of the hype, but rather than write this off as another fad, they work to explain the legitimate opportunities that new information technologies have created. Big data has become a cliché because people and businesses use the term in such an ethereal way, so it’s easy to forget to ground the concept of ‘data’ that is somehow ‘bigger’ in real-world explanations and meaningful definitions.

This book goes beyond background, existing applications and definitions to explain the incredible potential that is contained in massive data sets, and how data scientists will someday be developing systems and automation that drive businesses, social policy and entire nations.

6. Algorithms to Live By

by Brian Christian and Tom Griffiths

Algorithms to Live By is a book that anyone could benefit from. Not only are the technical concepts incredibly powerful, but the book focuses on the accessibility and practicality of each algorithm when navigating daily life. People with a CS background probably have the most to gain from this book. While courses on algorithms, data structures and discrete mathematics teach how to use the tools of a software engineer, this book makes CS all the more accessible by explaining where the tools come from, why they are incredibly useful, and how to begin applying them in daily life.

While the book doesn’t teach you how to write a merge sort in C, it will give you real-world examples of how libraries and search engines continually sort their data sets, the compromises made to ensure fast searches, the trade-offs of space and time complexity, and an intuitive explanation for the hard theoretical limits to optimal searching and sorting. This is all covered in one chapter of this pretty incredible book.
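For readers who have not met merge sort before, here is a minimal sketch in Python (rather than C). It is an illustration written for this list, not an example taken from the book, and the function names `merge_sort` and `merge` are my own choices.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a new sorted list, leaving the input untouched."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    # Sort each half recursively, then merge the two sorted halves.
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5, 6]))
```

This version shows the space and time trade-off in miniature: it runs in O(n log n) time, which matches the lower bound for any comparison-based sort, but it pays for that with O(n) extra memory for the merged lists.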

Every data scientist should have a thorough understanding of algorithms, data structures and discrete mathematics, but whether you have the technical knowledge or not, this book will stretch your thinking.

7. The Signal and the Noise

by Nate Silver

The Signal and the Noise is probably one of the most popular statistics books around. ‘The signal in the noise’ is a metaphor that is often used in data science: identifying the relevant information ‘signal’ that is correlated with the solution of a given problem from within a ‘noisy’ data set or system. The world is full of distractions, and many of the things that end up affecting our decision making are diverting our attention away from indicators that are more closely correlated with our objectives.

For example, Silver used to work in sabermetrics, creating useful statistics relating to baseball strategy. In the past, baseball scouts would go out looking for new players armed only with the intuition that comes with experience. These scouts focused on the skills of players that seemed immediately evident in their relative size, speed or smarts. For the past century, teams have relied on scouts to help them field winning teams, based on the assumption that the intuition of experienced scouts is the key to finding the best players. Over the past few decades, professional statisticians and sabermetricians have been displacing whole scouting departments. It turns out that the scouts, and by extension the baseball teams, were being led astray by ‘noise’ (relying on the intuition of the scouts, who were exposed to only a limited set of observations). Teams that rely on statistical judgements when fielding their teams have proven far better at achieving their objective: winning games.

This is just the tip of the iceberg. The Signal and the Noise flows seamlessly through political strategy, social dynamics, gambling, Monte Carlo methods, and the basics of game theory (with a close look at the implications of the ultimatum game, iterated prisoner’s dilemma and tragedy of the commons).

The insights presented and the underlying mindsets (being a fox) taught in The Signal and the Noise are indispensable.

8. Complex Adaptive Systems

by John Miller and Scott Page

Complex Adaptive Systems is probably the most important book in this list. It isn’t the easiest book or the most exciting, but the perspective it assumes is incredibly powerful. A complex adaptive system is one in which knowing the behaviors of, or constraints on, the component parts does not allow you to make actionable predictions about the behavior of the whole. In complex adaptive systems there is an emergent complexity that makes the whole greater than the sum of its parts. There are countless examples of systems that fit these criteria: individual cells are complex adaptive systems, individual people are complex adaptive systems, individual cities are complex adaptive systems and of course the whole biosphere is a complex adaptive system.

Being able to identify complex adaptive systems as a ubiquitous form in the Universe is one thing, but this book is really about modeling and dissecting these systems to get at their inner structure. Whether we are aware of it or not, data scientists are at the front lines when it comes to dealing with these complex systems, so having the tools to both recognize and model these systems will only become more important as the problems we deal with get increasingly complex.

The concepts in this book come thick and fast, so it’s worth looking at another source to strengthen understanding. I recommend Signals and Boundaries, which approaches these same ideas from a slightly different perspective.

9. Antifragile/Incerto

by Nassim Nicholas Taleb

I chose Antifragile because it was the book that I found most impactful, but each of the 4 books in Nassim Nicholas Taleb’s Incerto series is transformative in its own right. They loosely reference one another, but you would still be well served as a data scientist if you chose to only read Antifragile.

The Incerto series is essentially a tirade against how modern systems deal with uncertainty and the weakness of the assumptions that underlie behaviors and systems. It comes from a place of bold experience, careful observation and tons of discipline. Taleb lays out a worldview in this series, complete with motives, background, justifications, counterexamples and anecdotes all in support of a central thesis. It is the work of a perfectionist, totally baffled by the weakness and stupidity built into the modern world.

I might someday write 4 separate articles about these books alone, but Antifragile was the one that really resonated with me in terms of data science. Antifragility is when systems become more robust and resilient as they are exposed to greater disorder. Given that the book is called Antifragile (a concept Taleb seems to have conjured out of thin air in order to explain his worldview), I will leave the more detailed definitions to the author.

While nature demonstrates antifragility at every level, fragility is almost an inbuilt assumption of modern systems. I studied physics in school, and what most intrigued me was chaos theory, or nonlinear dynamics. When you get deeply into studying the physical world, it seems that chaos underlies all emergent order. How do intricate ordered systems like the human body develop under these circumstances? The answer is that the successful propagation of any ordered system is a function of its antifragility.

Once the modern world finally catches up to Nassim Nicholas Taleb, antifragility and the other concepts in the Incerto series (e.g. black swans, power-law distributions, recognizing randomness or noise) will be HUGE for data science and engineering. Better to get a head start.

10. Superforecasting

by Philip E. Tetlock and Dan Gardner

One of the most important applications for data science and statistics is in forecasting. In a fast-moving world, people, businesses and society as a whole will live and die by the ability to accurately predict future events and respond appropriately. Superforecasting is about the skills and habits of people who are capable of developing incredibly accurate predictions, consistently. It presents these strategies in the context of a famous experiment where teams competed to predict future events. Tetlock’s team was made up of volunteers from around the US and Canada: people with no special qualifications, just time and interest in the subject. In spite of their lack of credentials, Tetlock was able to outperform top analytics firms and the CIA using only these crowd-sourced predictions.

Through a mix of psychological insights and character studies, Tetlock reveals the traits and practices that make anyone capable of predicting future events with great precision and accuracy. When I began reading this book I was a little disappointed by the limited scope of the cases chosen (mostly from his study on volunteers), but the book ends up doing an excellent job of exposing and explaining the strategies that make someone a superior forecaster.

While the formal statistical methods are kept to a minimum, the mindset and heuristics that allow someone to make actionable predictions are vital to any data scientist.
