Data Science for Startups: Blog -> Book

Published in Towards Data Science · 4 min read · Jun 2, 2018

There are a number of compelling reasons for data scientists to write books. I wanted to better understand new tools, and to document some of my past experience in industry. Kirill Eremenko also claims that writing makes you happier, more understanding, and more productive.

My initial approach for accomplishing this goal was to use Medium to publish a series of blog posts on Towards Data Science. My first article had a great response, and I got useful feedback on the follow-up articles that I shared. Once I had made good progress on this series, I was inspired to use “Blog Based Peer Review” to author a book. This approach was pioneered by Noah Wardrip-Fruin, a professor at UC Santa Cruz, when authoring his book “Expressive Processing”. I didn’t take this formal approach to peer review, but I did incorporate feedback when converting blog posts to book chapters.

The result of this process is the book “Data Science for Startups”. It is available for free online in multiple formats, and the print version is available royalty free. Code for all examples included in the book is available on GitHub, and I’m also releasing the source code for the book.

In this post I’ll discuss the tools used to author the book, build the book, and self-publish, as well as the writing process that I used.

Tooling

For authoring content, I used Medium to write the initial drafts of chapters as blog posts. I like using Medium for writing as opposed to other editors, because I don’t have to worry about formatting, there is a built-in spell checker, and it’s easy to include code snippets. After using the platform for a bit, I found that it provides a good flow for authoring. There are some features that are missing that would be useful when authoring a book, such as being able to reference figures, code blocks, or bibliography entries, but it does provide a good starting point.

For building the book I used the bookdown package, an R library that converts R Markdown files to PDF, EPUB, and web formats for publishing a book. Since it’s based on R Markdown, you can use R code to generate visualizations as part of the book compilation process. There was a bit of work involved in converting Medium posts to R Markdown (Rmd) files, which I’ll discuss in more detail in the next section. In the past, I used LaTeX to write my dissertation. For this book, I found bookdown much easier to use, without losing much control over the layout of content. When building a PDF with bookdown, knitr first runs the R code and produces Markdown, Pandoc then translates that Markdown to a TeX file, and LaTeX compiles the TeX file into the output PDF.
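To make the build step concrete, here is a minimal sketch of a script that renders the three output formats mentioned above. It is not the build script used for the book: the helper names and the index.Rmd entry file are assumptions for illustration (index.Rmd is the usual bookdown convention), and it expects R and bookdown to be installed. The only bookdown pieces it relies on are the render_book function and the pdf_book, epub_book, and gitbook output formats.

```python
import subprocess

# bookdown output formats for the PDF, EPUB, and web versions of a book.
FORMATS = {
    "pdf": "bookdown::pdf_book",
    "epub": "bookdown::epub_book",
    "web": "bookdown::gitbook",
}


def render_command(target, index_file="index.Rmd"):
    """Return the Rscript invocation that renders one output format.

    index_file is an assumed entry point, not something stated in this post.
    """
    r_call = f'bookdown::render_book("{index_file}", "{FORMATS[target]}")'
    return ["Rscript", "-e", r_call]


def build_book(targets=("pdf", "epub", "web")):
    """Render each requested format in turn, stopping on the first failure."""
    for target in targets:
        subprocess.run(render_command(target), check=True)
```

Calling build_book() from the book’s root directory would run one render per format. Keeping the command construction separate from the subprocess call makes it easy to print the commands and inspect them before running anything.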

For publishing, I used Kindle Direct Publishing (KDP), which now has a paperback option. KDP provides good tools for reviewing the content of the book interior, a useful cover designer, and a wide variety of book sizes. It only takes a few days before your book shows up on the Amazon marketplace. I also explored using CreateSpace and Lulu, but found that KDP had the best pricing for good print quality.

Writing

I had a good idea of what content I wanted to cover with the book, which I outlined in my introduction post. However, I didn’t have a fully realized plan for which systems I would build and then cover in the book. Ideally, I wanted to make a complete game analytics platform, and then discuss different portions of this system in different chapters. In reality, I built MVPs of parts of a data platform and used publicly available data sets to reduce the amount of system building that I needed to do before authoring chapters. When writing a technical book, it’s useful to know which of the following approaches you’re taking to author content:

Building a system and then documenting how it works
Writing a textbook that provides an introduction to different topics

I started with the first approach, and then shifted to the second approach later on in the writing process. I should have determined which approach to take from the start and stuck with it.

Since I already had a good outline of what I wanted to cover, I focused on writing each of the chapters sequentially. Here’s the process I used:

Create a chapter outline
Write code for tutorials in the chapter
Create visualizations and code snippets
Write the text portion of the chapter

I found that the second step generally took the most time, especially when exploring new tools, such as working with Google Datastore.

Once I published posts on Medium, I then needed to convert the posts to the Rmd format. To accomplish this, I copied the text from the post into a new file, added section and subsection headers, added code block identifiers, replaced hyperlinks with footnotes, and added images using the include_graphics function in knitr. I also spot-checked the resulting chapter output, and modified spacing to remove orphans and other text artifacts. Finally, I fixed any typos highlighted on Medium or mentioned in responses.
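I did this conversion by hand, but two of the steps are mechanical enough to script. The sketch below is only an illustration under some assumptions: it assumes the post text is already in Markdown form, with links written as [text](url) and images as ![caption](path), and the function names are hypothetical. It rewrites links as Pandoc inline footnotes (the ^[...] syntax) and turns images into R chunks that call knitr’s include_graphics.

```python
import re

# Built at runtime so this listing never contains a literal code fence.
FENCE = "`" * 3

# Markdown image: ![caption](path)
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# Markdown hyperlink: [text](http...), skipping anything that starts with "!"
LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^)\s]+)\)")


def image_to_chunk(match):
    """Turn an image reference into an R chunk that calls include_graphics."""
    caption = match.group(1).replace('"', "'")  # keep the chunk header valid
    path = match.group(2)
    return (
        f'{FENCE}{{r, echo=FALSE, fig.cap="{caption}"}}\n'
        f'knitr::include_graphics("{path}")\n'
        f"{FENCE}"
    )


def link_to_footnote(match):
    """Keep the link text and move the URL into a Pandoc inline footnote."""
    return f"{match.group(1)}^[{match.group(2)}]"


def markdown_to_rmd(text):
    """Convert images first, then hyperlinks, and return the Rmd text."""
    text = IMAGE.sub(image_to_chunk, text)
    return LINK.sub(link_to_footnote, text)


if __name__ == "__main__":
    sample = "See [bookdown](https://bookdown.org) for details.\n\n![Build flow](images/flow.png)\n"
    print(markdown_to_rmd(sample))
```

Section headers, code block identifiers, and the spacing fixes still need a manual pass, since those depend on how each chapter reads once it is rendered.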

My recommendations when writing a technical book are to lock down tooling early on if possible, and to allocate much more time than you think will be necessary to write a book, especially if you’re going to be learning something new along the way. Overall, I found the experience rewarding and would encourage more data scientists to give it a shot!

Written by Ben Weber