Slowing Down Makes You Faster: Here's How
TL;DR
If you introduce poorly written code and don't refactor your work, you are invisibly slowing yourself down and making your life harder, whether you realise it or not. Refactoring is fun: it makes your code faster, it makes you faster at writing code, it produces more maintainable code, and once you've had to clean up after yourself, it teaches you to write better code in the first place.
Tech debt is invisible glue, and as you add more of it to your codebase with each unchecked commit, your progress will eventually grind to a sticky halt. It happens every single time, no matter how much you try to outrun it.
Learning to refactor starts with learning to rewrite existing code without changing its outputs while improving its legibility, performance, simplicity, maintainability, or any combination of the above. Even a single pass of refactoring takes software from "poorly written once and never touched again" to something markedly easier to extend, roughly from a 2 to a 7 in quality. Everyone (including yourself in a few days, when you encounter your own previously written code) will love you for it.
If you plan on working on a piece of software for more than a couple of days, refactoring needs to be part of your story. The one caveat: if you're writing a script for a couple of hours and will throw it away afterwards, feel emboldened to spaghetti-code and live freely. It might be a fun exercise to take a small script you've previously written and give some of these refactoring tips a spin!
Let's get started…
What is Refactoring?
People smarter than me have already defined this for us; see Martin Fowler's wise words below.
"A disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior."
Introduction
Refactoring is an unloved habit that pays dividends forever, and I wish more Data Science investors incorporated it into their portfolio of programming habits. It's a staple of any good software engineer's habits, but it doesn't pop up all that often in the data practitioner world, and I want to change that, starting with this blog.
The other week at work, some simple refactoring (which I'll walk you through below) yielded a 113x performance improvement, taking a data pipeline's processing time from ~85 hours down to 45 minutes.
This isn’t a claim on how much faster your code will become, it was merely the experience I had with little effort that I feel is a story worth sharing.
It was also achieved with simple work that took an hour or so, and there is still room for improvement; there's no telling how much faster it could get. Refactoring might win your code many more orders of magnitude, or you might not have a performance problem at all, only a headache problem when reading what you wrote last week. Refactoring generally helps with both.
Limits
I want to note that this is a Frankenstein blog: a poor man's version of Raymond Hettinger's talk (link at the bottom), a case study, a tutorial, and my rambling thoughts. Raymond's talk changed my experience of refactoring and problem solving from a chore to a love. I've noted his talk in the "Smarter People Than Me" section at the end of this page. Please go watch it and follow him on Twitter; he's part of the core Python developer team, he puts lots of wisdom and love out into the world, and he needs to be protected at all costs.
My claim of a 113x speed improvement comes from rewriting a function for a colleague at work the other week and, of course, has no bearing on how amazing refactoring will be for your situation. I'm also writing everything in Python, so my styling, behaviors, and recommendations are biased towards Python and pandas.
The Ball Game
Drawing inspiration from Jeremy Howard and Rachel Thomas and their teaching style: instead of starting from foundations and building up, it's important to see the end result first so that there's an obvious outcome to work towards. Show me a whole match of AFL before teaching me how to kick a football, so to speak.
We can then break down the thinking, steps, and work in between to get there and why this is important along the way.
Our Case Study Goal
So our situation involves a sample of two data frames:
- A reference dataset: containing information about a feature in the main dataset, its bins and a special value we care about.
- A main dataset: values we need to fit within the reference bins, attaching the special value to those binned values.
So for each observation of each feature in the main dataset, there will be a set of entries in the reference dataset containing bins to put those main dataset values into and a corresponding special value for that bin. Our goal is to connect each main dataset entry to a bin which is then connected to our special value and then have that all available back in the main dataset.
Note that we’ve only shown the solution for a single variable but solving for 1 feature solves for n features by wrapping our solution into a loop etc.
The end result we're looking for is the main dataset with each value's bin and its corresponding special value attached alongside it.
Now that we have our stated end goal, let's look at the code…
The Code
See the three gists below: the first sets up the dummy DataFrames, the second is the old code that took ~85 hours to run on the full dataset (approximately 3,000 features and a few million rows), and the third is a refactored rewrite that produces the same output. We'll be working with a mock sample here, but the changes are the same. I've also included comments on the sins I don't like and the reasons I think the refactored lines are nicer to manage.
We’re going to solve the problem for the simplest situation (one feature) of which could then be looped/extended for as many features as exist in the main dataset. I find it effective to solve the simplest and smallest directional leap you can when working through a problem and then slowly extending and iterating to your result.
The above is fairly self-explanatory: the reference DataFrame is ref_df and the main DataFrame is main_df, the two datasets described earlier.
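The original method itself is too gnarly to reproduce in full, but a sketch in its spirit looks something like this (illustrative only, not the actual code): hand-rolled binning, row by row, driven through an apply.

```python
# Illustrative only: the *style* of the original method, not the real gist.
def get_out_val(x):                      # sin: "x" tells the reader nothing
    out_val = None
    if pd.isna(x) == True:               # sin: redundant "== True"
        # sin: three operations crammed into a single line
        out_val = ref_df[pd.isnull(ref_df["START"])]["my_var"].iloc[0]
    else:
        for i in range(len(ref_df)):     # sin: hand-rolled, row-by-row binning
            if ref_df["START"].iloc[i] <= x:
                if x <= ref_df["END"].iloc[i]:
                    out_val = ref_df["my_var"].iloc[i]
    return out_val

# sin: a needless lambda wrapping a named function
main_df["my_var_slow"] = main_df["feature_1"].apply(lambda x: get_out_val(x))
```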
That's the flavor of the scary original method I worked through the other week. Take a moment and think about whether you can understand it, or how its authors tried to achieve our stated goal. Don't worry if it feels awkward or hard to understand: it is, and it is. We'll rewrite the problem from scratch, and also talk about ways we could modify the code in place if we were stuck with the method in its current form. You'll notice I've written sins in as comments beside the code.
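Here's the shape of the rewrite as a sketch, continuing from the setup above (the real gist lives in the colab notebook linked below); the line groupings match the breakdown later in this post.

```python
# Create bins from the reference edges (column names assumed, per the setup)
edges = ref_df["START"].tolist()
edges.append(ref_df["END"].iloc[-1])
bins = pd.IntervalIndex.from_breaks(edges)

# Add bins to the reference DataFrame
ref_bins = pd.cut(ref_df["END"], bins=bins)
ref_df = ref_df.assign(bin=ref_bins)

# Add bins to the main DataFrame
main_bins = pd.cut(main_df["feature_1"], bins=bins)
main_df = main_df.assign(bin=main_bins)

result = main_df.merge(ref_df[["bin", "my_var"]], on="bin", how="left")
```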
Above is the shape of the final ten-liner that achieves the same stated goal. I've included some of the behaviors I prefer, compared to the old code, as comments on the lines.
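And as promised, one feature extends to n features with a simple wrapper; a hedged sketch (the helper name is mine, not from the notebook):

```python
# Hypothetical wrapper: repeat the single-feature solution for each feature.
def attach_special_values(main_df, ref_df, feature):
    f_ref = ref_df[ref_df["feature"] == feature]
    edges = f_ref["START"].tolist() + [f_ref["END"].iloc[-1]]
    bins = pd.IntervalIndex.from_breaks(edges)
    lookup = pd.Series(f_ref["my_var"].values, index=bins)  # bin -> special value
    main_df[f"{feature}_special"] = pd.cut(main_df[feature], bins=bins).map(lookup)
    return main_df

for feature in ["feature_1"]:  # in the real pipeline: every feature present
    main_df = attach_special_values(main_df, ref_df, feature)
```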
Also, check out the colab notebook if you'd like to see the thinking behind those ten lines; I've linked it in the resources at the end.
So how do we go from the second gist to the third?! It seems like a big leap at first glance, and it is without the story in between, but I promise there's simplicity in the magic.
Break the Problem Down
All programs and problems can generally be broken down into inputs, the processing of those inputs, and some outputs. Setting the stage like this at the beginning of your problem solving and refactoring is a fantastic way to clear your mind and distill the problem at hand. Let's quickly do that for our case study, as it reveals the steps we take in our few lines of code.
Once you break a problem down into simple inputs, outputs, and the steps in between, you can progressively work through each step and arrive at your destination. Trying to do everything at once is too hard, and you don't have the mental registers for it anyway (hint: check out Raymond's wonderful talk, where he covers mental registers and the feeble limits of human thought).
Inputs
- Reference DataFrame
- Main DataFrame
Processing
- Create Bins | Lines 1–4
- Add bins to reference DataFrame | Lines 6–8
- Add bins to main DataFrame | Lines 10–12
- Merge datasets on bin categories. | Line 14
Outputs
- A Single Dataframe with reference special values connected to binned reference values
Thinking Patterns & Behaviors
This feels a lot more manageable than trying to eat the original method whole. I can set out the problem in my head much more clearly with the above to reference, rather than trying to interpret the original method and implementation. If you're feeling game, start unit testing and keep distilling your code to further improve what you've rewritten.
Doing this over and over, distilling and refining your code, is refactoring in essence. As long as you're improving the ease of reading, the performance, the simplicity, or the ease of use, you're on the right path. Code is read more often than it's written, so optimize for the reader, not for you, the writer. Be empathetic: often it'll be you doing the reading in a week, having forgotten you wrote it in the first place.
Iterate & Adapt
The best place to start is to make small but directional improvements to the code as it is. This could be for any benefit: legibility, simplicity, brevity, using libraries instead of hand-writing a process, and so on. Let's look at a small example below. Compare lines 1–5 and 7–11, the original and refactored versions of the same snippet from the original large method.
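Since the snippet was embedded as a gist, here's a reconstruction of its shape; the exact code differed, but the names follow the change list below.

```python
def get_val(x):  # before: a reconstruction, not the exact snippet
    # get value
    if pd.isna(x) == True:
        out_val = ref_df[pd.isnull(ref_df["START"])]["my_var"].iloc[0]
    return out_val

def get_val(my_df):  # after: the same check, refactored
    # A missing input takes the special value stored against
    # the reference row whose START is null
    if pd.isna(my_df):
        return ref_df.loc[ref_df["START"].isnull(), "my_var"].iloc[0]
```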
I’ll quickly list the changes below:
- "out_val =" → "return out_val"
- "if pd.isna(x) == True:" → "if pd.isna(my_df):"
- x → my_df
- Improved comment
These are nice for the following respective reasons:
- It signals to the reader that this is the desired output, and it returns from the function as soon as your desired value is found instead of continuing to evaluate code that isn't required
- If you can do the same thing with less code (removing == True), then do! Python has "truthy" and "falsy" behavior that makes for clean syntax like this. Read this stackoverflow for more info if it's new to you.
- Use better variable names. my_df isn't perfect, but it at least signals the datatype; "x" is so ambiguous that I'd have to go find the declaration or debug it to know what datatype this variable is.
- Write a comment that means I don't have to read the code if I don't want to. My comment isn't perfect, but it's a step in a better direction than the previous one.
I know the above is a bit of a spot-the-difference, but there are some subtle improvements. Making small adjustments like this may not feel magical or powerful, but when you iterate and iterate over the same piece and distill it down to something concentrated, it's often quite nice code! We're not there yet with the example above, but I wanted to show a small slice.
Demolition Tactics
This is an example of a complete rewrite, where none of the old code has been kept and a totally fresh solution has been built by breaking the problem down and going back to basics. Sometimes, if you're working on a problem you're confident in solving, it's quicker to demolish what was there and start fresh. Be careful: this often isn't the best path. Usually someone has thought long and hard about what they've written, and it's worth taking the time to understand who came before you and their lines of thinking. In this scenario, however, there were too many garden paths to go down and too much complexity for a problem that, when you take a step back, is actually quite simple and benefits from a rewrite.
Try It Out, Follow Along
Go look at the colab notebook for how I worked through this solution and produced the third gist. Continue reading if you'd like to see how I'd modify the original method in place if I wasn't game enough to rewrite it from scratch and was instead making directional improvements to the code quality.
Refactoring Tips & Tricks
Referencing the original method, let's look at the small sins that, summed together, make the code difficult to understand, slow, and hard to change or maintain. For each sin I spot, I'll also say what to do about it.
- Write useful method and variable names; they should be short, memorable, and informative. "x" is not a good variable name…
- Write docstrings for your methods. I like Google's docstring style, with its arguments section and clear indentation. Docstrings power most autodoc tools like Sphinx, and they power the "??" magic command in Jupyter notebooks, which is so helpful when you want to peek at a method. The original method has no docstring, and it certainly needs one given how complicated it is.
- Use variables from an appropriate scope. If you find yourself writing global variables all over the place and referencing variables from other parts of your codebase, try to restructure and encapsulate your code so that relevant pieces of data live close to the functions that need them. Otherwise you likely have spaghetti code and promiscuous variables, with no idea how they're being used elsewhere or by other functions. This method uses global variables unnecessarily.
- Make things as simple as possible. If you don't have to write something to achieve the same result, don't, unless it adds context or information that isn't clear otherwise.
- Do one thing at a time. There is too much going on in the original method's lookup line: a call to pd.isnull based on the "START" column, then pulling "my_var" from the output of that DataFrame, then an iloc to grab the first row. That's too much in one line, and for readability it can be broken into multiple steps (see the short example after this list). Play code golf if you want to write complicated one-liners; save your colleagues headaches by writing simple lines of code they enjoy reading.
- Return early from a function if you can; it informs the reader and is faster. If you've found your answer in a method and don't need to do any more work, return straight away. You can write multiple return statements in a method; the first one encountered exits that method's stack frame, so you won't keep processing for an answer you already have. It also makes things clear to the reader: on a given code path, they can stop reading at the return, whereas if a variable is merely assigned, they must keep reading to find where it is finally returned.
- Each method should do one thing and do it well. If you find yourself writing a method with 5+ if/else decision points, you're likely doing more than one thing, and you should break your problem into a few distinct steps. Software engineers call this a breach of the Single Responsibility Principle, or SRP. Write many functions that each do one thing, and compose them if you want to orchestrate several objectives. Deeply nested if/else statements are hard to debug and understand; you're far better off writing many single-purpose methods you can synthesize together than one humongous method that does it all at once. Start sweating when your methods grow past 10–15 lines and rethink what you're writing so that it's easier to read. There are always exceptions to this rule, but I'll bet it's right more often than it's wrong.
- Don’t write a line of code so long it has to be split over 3+ lines in order to read it. You likely are doing too many things at once and again, you could break down the problem into steps and have clear and distinct steps. Its also very very difficult to read and understand, you’ve gifted headaches to everyone that follows you which often is yourself. You don’t really want to give yourself headaches do you? There are of course exceptions to this but for the most part, multi line lines of code are generally not good signs.
- Almost everything you ever want to do with data has already been built into wonderful libraries like pandas and Spark. Don't rewrite binning logic if you don't have to; pandas is faster and smarter than anything you're likely to hand-roll. The original method attempts exactly that, and it was orders of magnitude slower than letting pandas do it, and far harder to maintain as well.
- Lambdas are mercurial, divisive creatures: wonderfully useful and smart to have in your wheelhouse, but once wielding the lambda hammer, many a programmer starts to see everything as a lambda nail, especially combined with the map, filter, and reduce functions. Lambdas are meant to be anonymous functions, used once and never referenced again. If you find yourself passing in named functions and doing funny things like a pd.apply inside another pd.apply, you're wonderfully creative, but again you're gifting headaches to the next person who comes across your work. Simplify your problem into steps, and if something needs to be done multiple times, create loops/generators to iterate over. Iterators are wonderful things; Raymond Hettinger has many resources and talks on the subject.
- Finally, write useful comments and references within your code so the next person's life is easy. Take this code as an example: I'm sure it was written in a flurry within a couple of hours, but here I am spending many hours refactoring it and writing a blog about it, and you're spending time reading about it. Much more time is invested in reading and understanding code than in writing it, so make sure you optimize for the reader. It might take you longer to write, but empathy is a wonderful trait, and code is prose through which to express it. You'll often be thanking yourself anyway, so even if you're selfish, it's still in your best interests.
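To make "do one thing at a time" concrete, here's the dense lookup from the original method, decomposed into steps (a sketch with assumed column names):

```python
import numpy as np
import pandas as pd

# A tiny frame where a null START flags the "missing value" row (assumed shape).
ref_df = pd.DataFrame({"START": [np.nan, 0.0, 10.0], "my_var": [0.99, 0.11, 0.45]})

# One dense line: three operations the reader must unpack at once.
my_val = ref_df[pd.isnull(ref_df["START"])]["my_var"].iloc[0]

# The same thing as three obvious steps:
missing_rows = ref_df[ref_df["START"].isnull()]  # rows flagged as "missing"
special_values = missing_rows["my_var"]          # their special values
my_val = special_values.iloc[0]                  # take the first one
```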
Further Blogs
I’d love feedback and comments on what was enjoyable or not useful in this blog as to help drive further writing. I’m thinking of doing a small tutorial on unit testing which makes refactoring even more enjoyable. You have a test harness to cover your butt and you have provable evidence that given the same inputs, you get the same outputs. You can edit and refactor to your hearts content with much higher confidence that you haven’t introduced strange behavior. I’m open to any line of thinking that may have been brought up from this piece, any comments are welcome and encouraged.
Know that this is only one way of solving the problem; feel free to write your own solution, or rewrite a solution to another problem you have. My solution still has room to be refactored and reduced. Refactoring is never finished, and there are always ways to simplify, distill, improve, and refine your work!
Smarter People Than Me & Extra Resources
Special thanks to ColJung for encouraging me to start writing. I always appreciate his guidance, encouragement, questions, and camaraderie. I highly recommend his writing; it's wonderful stuff and is in large part the inspiration for me starting a blog.
Here’s a full running notebook you can reference if you’d like to see the above gists running with some extra commentary:
This is Collin, the planet-brained colleague I love working with, whom I will likely forever pester for knowledge and company
This is Raymond’s Talk that made me love refactoring & made me think differently about solving problems.
This is Joel, he’s a fantastic writer and his top blogs such as the one on encoding are brilliant.
Martin Fowler is another great writer I thoroughly enjoy, who has written extensively about refactoring
Here is his page on refactoring!
Jeremy Howard & Rachel Thomas are my two heroes. In the fast.ai Deep Learning for Coders course, Rachel encourages everyone to start a blog and begin writing, so I partly attribute the start of this blog to her as well. I want to share them with anyone and everyone, as they are brilliant. They made fast.ai, among other things, and it's fantastic. For all things ethical and practical in the neural net world, go there and gorge on the brilliant content they've produced.