A Tale of Two Nerds
Imagine two data scientists. You give each of them a dataset and present them with a problem. The first listens to you, takes notes, and says they will ‘work on this and circle back with you’.
The second data scientist pulls out their laptop, sits next to you (or shares their screen), and says ‘let’s solve this problem right now’. As quickly as the two of you can talk through ideas, they code it in real time and graph the results. Within 20 minutes you have an answer, graphs, and a new understanding of your data.
Which nerd would you want to work with? Which nerd would you want to hire? Which nerd do you want to be?
Why it’s important to learn how to code quickly and without reference.
Why should you want to learn how to code fast? Because often, speed trumps quality. Having a decent answer right now is usually better than having a great answer next week.
Speed also requires a certain level of competence. Stopping after every line of code to google "how to select a column in Pandas" is a fantastic way to disrupt your train of thought. If you can immediately turn your ideas into code, you can quickly iterate and develop better approaches to a problem.
Sure, you can just as easily write fast, terrible code that produces inaccurate results. But let’s keep it real: if you’re going to fail, it’s better to do that fast too! Faster failing just means you have more time to succeed!
In fact, fear of failing is the very thing that makes people afraid to code in front of others in the first place. We all hate it when our peers see our flaws, and the code that comes out of our fingers seems like a little representation of our thoughts. If the code fails, it feels like a failure of our intelligence.
I’m here to promote the idea of detaching yourself from your ego and coding in front of other people, as fast as possible. Your code is not you. Just like the words you are reading now, your code is just words on a screen. And I have a big secret to share with you! Coding for data science is mostly easy.
Coding for data science is easy
Shhhh….
Whatever you do, don’t let the HR people in on our secret! Data Science salaries are high at the moment, and I would hate to spoil that for everyone.
But yeah, coding for data science really isn’t that hard.
Is the neural network or support vector machine model that so-and-so just made complicated? Sure. But sometimes the actual model is only 3–10 lines of code. Let’s not forget about the 700 lines of code found just above the model! This is the code that takes data in a raw format and then cleans, encodes, combines, aggregates, and more.
This is 99% of what you do in data science. For every minute spent building a model and feeling smart, you get 2 hours of taking ugly data and turning it into slightly less ugly data.
So what are the 10 things we need to know?
These 10 things will get you through most situations. Even super complex-looking data science scripts can normally be built from combinations of these items:
- basic selection (choosing and renaming columns, distinct)
- filtering data
- filtering with operators (and, or, etc.) or by other means (top, head, etc.)
- combining columns through concatenation (e.g. first name + last name) and arithmetic (e.g. price * quantity)
- joining and appending data
- aggregating data (sum, mean, median, standard deviation, etc.)
- grouping and aggregating data
- aggregating at different levels of grouping (e.g. ungrouping, or window type functions)
- if statements to set flags or Boolean values in your data
- making a handful of simple graphs (bar, line, scatter)
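To make the list concrete, here is a rough sketch of what each item can look like in pandas. The two tiny tables (`orders` and `regions`) and all of their values are made up for illustration, and the column names are assumptions; this is one way to do each step, not the only way.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
orders = pd.DataFrame({
    "first": ["Ada", "Ada", "Grace", "Linus"],
    "last": ["Lovelace", "Lovelace", "Hopper", "Torvalds"],
    "item": ["disk", "cable", "disk", "mouse"],
    "price": [50.0, 5.0, 50.0, 20.0],
    "quantity": [1, 4, 2, 1],
})
regions = pd.DataFrame({
    "last": ["Lovelace", "Hopper", "Torvalds"],
    "region": ["EU", "US", "EU"],
})

# 1. basic selection: choose and rename columns, distinct values
items = orders[["item", "price"]].rename(columns={"price": "unit_price"})
distinct_items = orders["item"].drop_duplicates().tolist()

# 2. filtering on one condition
expensive = orders[orders["price"] > 10]

# 3. filtering with operators, and taking the top rows
big = orders[(orders["price"] > 10) & (orders["quantity"] >= 1)]
top2 = orders.sort_values("price", ascending=False).head(2)

# 4. combining columns: text concatenation and arithmetic
orders["name"] = orders["first"] + " " + orders["last"]
orders["total"] = orders["price"] * orders["quantity"]

# 5. joining and appending
joined = orders.merge(regions, on="last", how="left")
stacked = pd.concat([orders, orders.head(1)], ignore_index=True)

# 6. aggregating a whole column
grand_total = joined["total"].sum()

# 7. grouping and aggregating
by_region = joined.groupby("region", as_index=False).agg(
    total=("total", "sum"),
    avg_price=("price", "mean"),
)

# 8. window-style aggregation: put each group's total on every row
joined["region_total"] = joined.groupby("region")["total"].transform("sum")

# 9. setting a Boolean flag from a condition
joined["is_large"] = joined["total"] >= 50

# 10. a simple graph (needs matplotlib installed, so left commented out):
# by_region.plot.bar(x="region", y="total")
```

Each step is a line or two, which is the point: once these are in your fingers, longer scripts are mostly these same moves chained together.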
How do you learn how to do these 10 things really fast without Google?
The solution: Gauntlets!
Gauntlets are tiny little drills that I have been doing for years, and now I’m going to share them with you. Here is how they work.
You start in your programming language / IDE of choice and have a blank template that looks something like this:
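A minimal sketch of such a blank template in Python, assuming prompts that follow the ten items above. The wording of each prompt is illustrative, not the author's exact template.

```python
# Gauntlet template: fill in each step from memory.

# import libraries and load the mpg dataset

# select a few columns and rename one

# list the distinct values of a column

# filter rows on one condition

# filter rows with and / or, then take the top n rows

# combine two columns (concatenate text, multiply numbers)

# join to a second table, then append rows

# aggregate a column (sum, mean, median, standard deviation)

# group by a column and aggregate

# add a group-level aggregate to every row (window-style)

# set a flag column with an if / else condition

# draw a bar, a line, and a scatter plot
```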
This template covers most of the 10 things I listed earlier. All you need to do is:
- Bring in a dataset. We will be using the free and popular MPG dataset from R (for both the Python and R templates).
- Just below each commented line in the template, give it a shot from memory. Try to get all the way through the gauntlet.
- If you can’t figure one out, take a peek at the filled-in template for your language. Then type it in.
That’s it.
You do this Monday – Friday, right at the beginning of the day when you first start work. You do it before you look at your emails or join a meeting.
The first time you do one of these you might have to look at the filled-in template for every section, and that’s OK! But I promise you, within 2 weeks you will have each section memorized. And if you can do these 10 things from memory, then you can combine them to pretty much do anything.
The Code
Here are the filled-in templates in both Python and R. For Python you will need to download the mpg dataset here. For R, the dataset ships with the tidyverse (in the ggplot2 package), so there is no need to download it.
Python Code
R Code
Wrapping it up
Don’t let my gauntlet templates hold you back! Once you have these 10 things nailed down, you can keep expanding the list of things you do each day.
Even if you don’t program for data science, but do traditional programming instead, you could design similar gauntlets. I could easily imagine things any programmer might want to add to a gauntlet to learn a new language. Some examples:
- operators: and, or, equals, assignment, iteration
- loops: for, while, do while
- switches
- functions
- classes: create a class, override parameters, get / set values in an instance, etc.
- basic recursion (a function that calls itself)
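As a rough idea of what a filled-in general-programming gauntlet could look like, here is a sketch in plain Python. Every name in it (`describe`, `area`, `Animal`, `Dog`, `factorial`) is made up for illustration. Two interpretation choices are assumptions on my part: Python has no do-while or classic switch, so a dictionary lookup stands in for the switch, and "override" is shown as a subclass overriding a method.

```python
# operators: and, or, equals, assignment
x = 7
in_range = x > 0 and x < 10
is_edge = x == 0 or x == 10

# loops: for and while
total = 0
for n in range(1, 5):
    total += n

countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1

# switch: a dictionary lookup with a default stands in for a switch statement
def describe(code):
    names = {200: "ok", 404: "not found"}
    return names.get(code, "unknown")

# functions: positional and default arguments
def area(width, height=1):
    return width * height

# classes: create a class, override a method, get / set a value on an instance
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):
    def speak(self):  # overrides Animal.speak
        return f"{self.name} barks"

pet = Dog("Rex")
pet.name = "Max"  # set a value on the instance

# basic recursion: a function that calls itself
def factorial(n):
    return 1 if n <= 1 else n * factorial(n - 1)
```

As with the data science version, the goal is to type all of this from memory, not to write anything clever.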
Best of luck surviving the gauntlet!