Exploring the Data Jungle

A free eBook on discovering, taming, and studying the data you need to solve real problems

Brian Godsey
Towards Data Science
Aug 23, 2017

Some people believe that data comes in simple, organized tables: numbers and text stacked in neat little rows, each value separated from its neighbor by a comma. Others believe that all data values are 100% correct because they originated from an autonomous data collection machine that never makes mistakes. Still others believe that every database, straight out of the box, can store anything and everything and make it easily retrievable in a fraction of a second. In a perfect world, these beliefs might align with the truth, but in this world they are far from it.

Data in the wild is unkempt and unruly. It’s not always where you want it to be. It may exist in an odd format. It may have missing or incorrect values. It may be skewed or not representative of the population you wish to study. There may be far too much of it to manage. It may not exist. A good data scientist is no stranger to problems like these; they come with the territory. To avoid or solve these problems and others like them, it helps to be familiar with data in its many locations, formats, and qualities.

The three chapters in this collection each offer a perspective on what you might find when you go looking for data. The first chapter, from my own book Think Like a Data Scientist, describes the world of data as a wilderness worthy of exploration and meticulous investigation. The roles of data in the modern world have grown to the point that they can no longer be ignored, so we need to prepare for the many ways and forms in which we might find the data we want. The second chapter, from Practical Data Science with R by Nina Zumel and John Mount, gives a thorough introduction to the many ways you can inspect the data you have. Although the viewpoint is R-specific, its directives and suggestions generalize to understanding your data comprehensively with any statistical software. The third chapter, from Real-World Machine Learning by Henrik Brink, Joseph W. Richards, and Mark Fetherolf, gives a thoughtful blueprint for preparing real-world data for machine learning. Data from the wild usually isn’t ready to be fed to a highly intelligent, but coldly deterministic, algorithm without a little cleaning, organizing, and dressing-up.

Data isn’t always approachable. It can be messy, wrong, or hard to access. But despite all of that, it can still answer real business questions and solve meaningful problems. This collection of chapters shows you how to approach data in the wild for maximum insight and benefit.

Brian Godsey, Ph.D., is a mathematician, entrepreneur, investor, and data scientist, whose book Think Like a Data Scientist is available in print and eBook now. — briangodsey.com

