Five books every data scientist should read that are not about data science
I wrote my first line of R code in 2010 for a class at the University of Washington (UW). I was hooked once I realized how much more powerful coding is than spreadsheets. Over the past decade, I witnessed the term ‘data science’ come into widespread use and saw the rise and fall of buzzwords like big data, business intelligence, analytics, and now artificial intelligence.
My class at UW was ‘computational finance,’ which easily filled a sizable lecture hall the way deep learning classes do today. At the time, the financial crisis was fresh in everyone’s mind. So was the not-too-subtle message for engineers: if you want a well-paying job, go into finance and become a quant. Data science carries much the same message today. The concept of using math directly in business operations is intriguing, not just for decision support but to make real-time decisions. However, the financial crisis also laid bare the inadequacy of even the most sophisticated models to cope with the lion of chaos that is the real world.
At the core of the financial crisis, many believe, is a Nobel Prize-winning differential equation: the Black-Scholes options pricing model. The model was used to gauge the risk of an enormous number of investments by people who did not understand its inherent limitations and implicit assumptions. This technical blindness created the conditions for catastrophic economic damage.
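To make the point about implicit assumptions concrete, here is a minimal sketch of the textbook Black-Scholes price for a European call option with no dividends. The function and parameter names are my own, chosen for illustration. Notice what the formula takes for granted: a single constant volatility, a constant interest rate, and log-normally distributed prices. None of these is guaranteed to hold in a real market.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Textbook Black-Scholes price of a European call (no dividends).

    Assumes (illustrative names, not from any library):
      spot   - current price of the underlying
      strike - option strike price
      rate   - constant risk-free rate, continuously compounded
      vol    - constant annualized volatility
      years  - time to expiry in years
    """
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)


if __name__ == "__main__":
    # The same option priced under two volatility guesses.
    print(round(black_scholes_call(100, 100, 0.05, 0.20, 1.0), 2))
    print(round(black_scholes_call(100, 100, 0.05, 0.40, 1.0), 2))
```

The whole model fits in a dozen lines, which is part of its appeal and part of the danger. The answer is only as good as the single volatility number you feed it, and that number is itself a guess about the future.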
Today, aspiring data scientists are encouraged to learn a mind-boggling array of modeling techniques. Each method, linear regression for example, carries its own set of philosophical assumptions that you implicitly accept by using it (knowingly or unknowingly). This has created a population of new workers ready to deploy models without understanding what is actually happening under the hood. Instead of addressing the technical blindness problem, the young community engages in proxy arguments about tools (R vs. Python!).
To help address this problem (from which I also suffer), I am presenting a short reading list of books that will philosophically prepare a data scientist. They will also prompt questions about the technical assumptions of models before they are deployed. This list is not exhaustive, and the book topics range from fun to intense. There is a heavy influence of financial engineering because, more than any other discipline, it has given rise to the general-purpose data scientist.
1) Incerto: This is a collection of writings by Nassim Taleb, the most famous of which is ‘The Black Swan’ and the best of which, IMO, is ‘Antifragile.’ Taleb is the greatest modern thinker on risk, uncertainty, and the problems with quantitative modeling. He is also a Twitter troll known for calling out people he labels ‘intellectual yet idiot’ (IYI). By background, he is an immigrant derivatives trader turned mathematical philosopher. You will either love him or hate him because he will consistently challenge your assumptions in all of his writing. If he writes anything, you should put it on your reading list immediately.
2) Fortune’s Formula: The story of the birth of a formula (the Kelly Criterion), developed at Bell Labs in the 1950s and put to work by researchers at MIT, that claims to be behind an enormous amount of financial success. You will learn about the father of information theory (Claude Shannon) and the beginnings of the card counting shenanigans that later became famous in Ed Thorp’s ‘Beat the Dealer.’ Thorp is now considered the godfather of quantitative hedge funds. Most importantly, this book shows how a good model cannot be ignored forever but bad ones can burn you. The story is also one of the first times in history where computer science and mathematics teamed up to solve a real-world problem (it just happens to be for gambling). This story foreshadows the data science industry 60 years before its creation.
3) Chaos: Making a New Science: The detailed history of the youngest of sciences. It is both a history of chaos theory and an accessible review of the topic. This book will give the reader an understanding of the limitations of our ability to model the real world. Many of the deep learning models being developed and deployed today cannot be genuinely understood due to the nature of non-linear processes. This book will help you comprehend these limitations. Also, the comprehensive review of the life and work of Benoit Mandelbrot alone makes this a must-read for any data scientist. James Gleick is a fantastic author and has many other excellent books you can add to your reading list.
4) Dark Pools: The story of a programmer who changed stock market trading forever. Today, prediction models are deployed in the world of high-frequency trading, where decisions are made at nanosecond speeds. This book walks through the creation of this hidden but powerful ecosystem. The fantastic thing about this story is that it illuminates how a great many problems can be solved when you know some code. It also demonstrates that creating real value means doing something truly innovative and not relying on existing assumptions. Sometimes you have to be a little crazy to solve a hard problem.
5) The Theory That Would Not Die: The history of Bayes’ formula and Bayesian statistics, as well as its rival, frequentist statistics. It is both a history of statistics and a plain-language review of critical technical topics, which makes this book vital. You will learn about some of the greatest minds in history, like Pierre-Simon Laplace and R.A. Fisher, along with how their philosophies shaped the world’s approach to data for centuries.
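For readers who have never seen Bayes’ formula in action, here is a minimal sketch of a single Bayesian update. The function name, parameter names, and the numbers in the example are my own illustrative assumptions, not taken from the book. The point is philosophical as much as mechanical: the answer depends on the prior belief you bring to the data.

```python
def bayes_posterior(prior, true_positive_rate, false_positive_rate):
    """Probability a hypothesis is true after one positive piece of evidence.

    Applies Bayes' formula:
        P(H | E) = P(E | H) * P(H) / P(E)
    where P(E) = P(E | H) * P(H) + P(E | not H) * (1 - P(H)).
    All names here are illustrative.
    """
    evidence = true_positive_rate * prior + false_positive_rate * (1.0 - prior)
    return true_positive_rate * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: a rare condition (1% prior), a test that catches
    # 95% of real cases and raises a false alarm 5% of the time.
    print(round(bayes_posterior(0.01, 0.95, 0.05), 3))
    # Same test, but a 50% prior belief before seeing the result.
    print(round(bayes_posterior(0.50, 0.95, 0.05), 3))
```

With the 1% prior, a positive result from a seemingly accurate test still leaves the hypothesis more likely false than true. Change the prior and the same evidence tells a very different story.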
These five books, while not exhaustive, will help to build a philosophical foundation for a data scientist working on real-world problems. Do not make the same mistakes the quants made a decade ago. Seek to understand techniques and models philosophically, not just mechanically, and our profession will become invaluable.