Python tips for someone transitioning from Excel

Ji Wei Liew
Towards Data Science
9 min read · Oct 20, 2020


A discussion on best practices for greater productivity

Photo by Mika Baumeister on Unsplash

Excel is one of the most frequently used applications for data analysis. However, due to its clear limitations (e.g. performance, row and column limits), it may not be sufficient to cope with modern demands. Python fanatics have often recommended using the pandas library as an alternative to Excel to automate/speed-up data analysis. However, it may not necessarily lead to time savings if one is not familiar with the library and some of the best practices. This article aims to highlight some of the challenges in data analysis using Excel and how the use of python (the pandas library, along with some additional libraries) and the adoption of best practices can help to increase overall efficiency. Some basic knowledge of python and pandas is assumed.

The typical data analyst may be sandwiched between the business function and the application owner. As a result of this segregation of responsibilities, there may be no single individual who understands all use cases of the data. The responsibility then lies on the ardent data analyst to slice-and-dice the data, form hypotheses based on inputs and look for counterexamples to refute the hypotheses. This iterative process requires the data analyst to manually and repetitively filter, sort, group, join, pivot, vlookup and split the data in Excel. These steps can potentially lead to manual errors, especially when operations do not commute (i.e. results depend on the order of operations). The key to resolving this is to fundamentally change the approach: instead of mechanically manipulating the data to prepare the output, one should think strategically and prepare a program/script that generates the output. For me, this was the greatest motivation to use python, as the sheer manual and mind-numbing task of repeating a series of actions just didn’t sit well with me, and Excel-based macros were just too slow.

This article covers some useful ideas that one can leverage to save time and potentially avoid repetitive work via the use of python, specifically the pandas library.

Some caveats

Using pandas would have little value-add when:
(a) the data size is rather small (fewer than 50,000 rows),
(b) very few (fewer than 10) files are involved,
(c) the preparation of the output data is a one-off exercise that is unlikely to be repeated, manually or otherwise.

Excel would be the preferred tool in these circumstances, depending on your familiarity with python and pandas.

Python and pandas have very limited graphical user interfaces, hence most may find them much less intuitive to use compared to Excel. For example, reordering columns or checking whether a value exists are trivial operations in Excel, but they require a few lines of code in python and may not be obvious to a beginner dabbling in python and pandas for the first time.
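As a rough sketch of what those few lines look like (the column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'QTY': [1, 2], 'PRICE': [9.5, 3.2], 'FRUIT': ['Apples', 'Pears']})

# Reorder columns by selecting them in the desired order
df = df[['FRUIT', 'QTY', 'PRICE']]

# Check whether a value exists in a column
has_apples = (df['FRUIT'] == 'Apples').any()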

On IDEs, Text Editors…

While there are fancy tools (PyCharm and Jupyter notebooks) available, these were not available or feasible (the tool is available, but there is insufficient RAM to use it) in the projects which I have been deployed to. Hence, I have mostly used just the command prompt or PowerShell to run the code. Any text editor with syntax highlighting should be fine; I use Notepad++.

1. Read files once, pickle them and read pickles in future.

As trivial as this sounds, the first task of the data analyst is to open and read the data files. The time taken for pandas to read an Excel file as a dataframe is longer than the time taken for Excel to open the same file.¹ Reading data files as dataframes can take minutes when files are large (especially Excel files). If the data set needs to be read every time the script is executed, the repeated process of executing and editing the script will substantially erode efficiency gains. The way to overcome this bottleneck is to read the data once and write it out as a pickle (i.e. .pkl). It is about 100 times faster² to read pickles. Further efficiency gains can be obtained by taking a sample (e.g. ~10,000 rows) of the data and writing it as a pickle. One should also note that reading flat files (.csv and delimited text files) is much faster than reading Excel files; therefore, where possible, one should retrieve/request the files in such flat file formats, assuming these data files are not meant to be opened by users in Excel.
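A minimal sketch of this workflow (the file names are illustrative):

import pandas as pd

# First run: read the slow Excel source once and cache it as a pickle
df = pd.read_excel('trades.xlsx')
df.to_pickle('trades.pkl')

# Optionally cache a ~10,000-row sample for quick iterations on the script
df.sample(n=10000, random_state=0).to_pickle('trades_sample.pkl')

# Subsequent runs: read the pickle instead
df = pd.read_pickle('trades.pkl')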

2. Plan functions and code structure well

“I will do this right, this time,” said every developer when starting on a brand new project.

At the start of every project, the natural inclination would be to write hard-coded scripts to quickly generate the necessary reports. During this phase, there could be a single .py file with one massive function which creates one single report.³ Resist the urge to generate output reports hastily because long functions are hard to maintain, and technical debt accumulates very quickly.

Because readability counts, each function should represent one logical operation. The example below shows the recommended approach to read, clean and modify data, and to multiply two columns to give the result in another column.

Example 1.1 (recommended):

def read_trade_data():
    return pd.read_excel('data.xlsx')

def clean_trade_data(df):
    # PRICE turns out to be text with a comma as the decimal point.
    df['PRICE'] = df['PRICE'].str.replace(',', '.').astype('float64')
    return df

def add_col_total(df):
    df['TOTAL'] = df['QTY'] * df['PRICE']
    return df

def df_output():
    df = read_trade_data()
    df1 = clean_trade_data(df)
    df2 = add_col_total(df1)
    return df2

Example 1.2

def df_output():
    df = pd.read_excel('data.xlsx')
    df['PRICE'] = df['PRICE'].str.replace(',', '.').astype('float64')
    df['TOTAL'] = df['QTY'] * df['PRICE']
    return df

While Example 1.1 looks more verbose, there are several benefits. Because of the segregation, df_output is far more readable in 1.1, as one does not have to read through all the data-cleaning code before reaching the code that modifies the data. Moreover, if other dataframes are expected to have a similar decimal format, one can reuse the clean_trade_data function in 1.1, whereas one would resort to copying and pasting code in 1.2. Also, if an error were to occur in df_output, it is easier in 1.1 to step into the smaller functions in the python shell to debug.

Optional and keyword arguments should be used to provide flexibility to the data analyst; bite-size functions that do similar actions should be avoided.

If there is only one version of data.xlsx expected this should suffice:

def read_trade_data():
    return pd.read_excel('data.xlsx')

However, if there are multiple versions of data.xlsx which might need to be analyzed individually, it would make sense to write a function so that these files can be read quickly from the command line instead of having to edit the code each time.

Example 2.1 (Negative example):

def read_trade_data1():
    return pd.read_excel('data1.xlsx')

There will be many such functions, each mapping to a different file (e.g. data2.xlsx, data3.xlsx, …), and it can become really confusing over time.

Example 2.2 (Recommended):

def read_trade_data(file=None):
    d_file = {1: 'data.xlsx',
              2: 'data1.xlsx'}
    file = file if file else d_file[max(d_file.keys())]
    return pd.read_excel(file)

Example 2.2 has several advantages:
(a) all the files are in the same dictionary (readability counts!),
(b) it reads the latest dataN.xlsx if nothing is passed as the file argument,
(c) it allows one to pass the path to the file explicitly if the file has not been added to the d_file dictionary.

Many coding tutorials include code similar to Example 2.1 as the context of the tutorial may not justify the use of the recommended best practice.

Some of these functions can be made sufficiently generic such that they can be applied to other projects. For example, it is usually very helpful to output a summary of the data, i.e. the number of unique values for each column and a small sample of the data. Such functions should then be written as methods of the dataframe class (so that they can be called as a method, df.summarize(), instead of mymodule.summarize(df)). Furthermore, if such a function applies generically to any dataset, it should be refactored into a separate python file and imported into the namespace of the main project file.
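A minimal sketch of what such a summarize method could look like (the name and output format are just my own choices):

# pypo.py (sketch)
from pandas.core.base import PandasObject

def summarize(df, n=5):
    # Number of unique values per column, followed by a small sample of rows
    print(df.nunique().to_string())
    print(df.head(n).to_string())

PandasObject.summarize = summarize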

Here’s one of the functions in my toolkit (pypo.py).

There are times when it is faster to view and perform analysis in Excel. But it can be a hassle to navigate to the file and then double-click on it. This function upgrades the to_excel() method by opening the Excel file after it has been written.

# pypo.py
from pandas.core.base import PandasObject

def to_excelp(df, *arg, **kw):

    def xlopen(path):
        import os
        import win32com.client as w32
        from pywintypes import com_error
        # Opens the file (.csv, .xls, .xlsx) in Excel
        xl = w32.Dispatch('Excel.Application')
        try:
            wb = xl.Workbooks.Open(path)
        except com_error:
            print('Check if file exists in current working directory…', end='')
            if path in os.listdir():
                print('found!')
                path = os.path.join(os.getcwd(), path)
                wb = xl.Workbooks.Open(path)
            else:
                print('not found! Please check file path!')
                wb = None
        finally:
            xl.Visible = True
        return path, xl, wb

    df.to_excel(*arg, **kw)
    path, xl, wb = xlopen(arg[0])
    return path, xl, wb

PandasObject.to_excelp = to_excelp

Add this to the namespace of the main working file:

# main.py
import sys
sys.path.insert(1, 'C:/Documents/pyprojects')
import pypo

While scripting in the command line interface, one can simply type df.to_excelp('test.xlsx') to write a dataframe as an Excel file which will be opened once written. The function also returns the Excel application xl and workbook wb objects, which can be used subsequently (to automate formatting and creation of tables in MsExcel perhaps?).
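For instance, a rough sketch of that follow-up automation via the Excel COM object model (the worksheet index and formatting steps are illustrative):

path, xl, wb = df.to_excelp('test.xlsx')

ws = wb.Worksheets(1)         # first sheet of the freshly written workbook
ws.Rows(1).Font.Bold = True   # bold the header row
ws.Columns.AutoFit()          # resize columns to fit their contents
wb.Save()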

Accessors are available to extend pandas, and while these can simplify the code in pypo.py, they would result in a slightly more verbose syntax (i.e. df.myfunc.to_excel()) in the main file. Recall that “flat is better than nested”.

3. Filtering and Selecting data

The intuitive approach is to write functions which return dataframes after filtering and/or selecting the relevant rows or columns. However, appending or concatenating these dataframes is slow. The best practice found so far is to write functions which return boolean masks instead. Masks can be combined using binary operations, and getting the row count becomes simpler⁴, albeit less intuitive.

Example 3.1: Returning a dataframe

# Returns a dataframe
def apples(df):
    return df[df['Fruits'] == 'Apples']

df_apples = apples(df)

The intuitive approach is faster if further transformations (e.g. merge, sort, filter) are not expected.

Example 3.2: Returning a boolean mask

# Returns a boolean mask
def select_apples(df):
    return df['Fruits'] == 'Apples'

mask_apples = select_apples(df)

To get the number of rows of apples: sum(mask_apples)
To get the dataframe containing apples: df[mask_apples]

This approach avoids the setup cost of a dataframe until it is really needed. Operations using boolean masks are much faster than operations on dataframes.
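For example, masks combine cleanly with the &, | and ~ operators before any dataframe is built (the Price column here is hypothetical):

mask_apples = df['Fruits'] == 'Apples'
mask_cheap = df['Price'] < 2

n_cheap_apples = sum(mask_apples & mask_cheap)    # row count without building a dataframe
df_cheap_apples = df[mask_apples & mask_cheap]    # dataframe only when it is needed
df_not_apples = df[~mask_apples]                  # everything except apples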

4. Optimizing

Avoid explicitly looping through rows or columns of a dataframe whenever possible; that is the whole point of pandas. In most situations, np.where, np.select, pd.cut, np.vectorize, and df.apply should do the trick. Some of these methods are essentially still loops but are generally faster than explicit loops.
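A minimal sketch of a few vectorized equivalents (the column names and thresholds are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'QTY': [5, 120, 40], 'PRICE': [1.5, 0.8, 2.0]})

# Vectorized column arithmetic instead of looping over rows
df['TOTAL'] = df['QTY'] * df['PRICE']

# Conditional column with np.where
df['ORDER_SIZE'] = np.where(df['QTY'] >= 100, 'bulk', 'retail')

# Multiple conditions with np.select
df['PRICE_BAND'] = np.select(
    [df['PRICE'] < 1, df['PRICE'] < 2],
    ['low', 'mid'],
    default='high')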

5. Avoid vlookups in Excel, merge tables instead

Excel’s vlookup function was never intended to be used to merge 2 tables. Quoting docs.microsoft, vlookup “searches for a value in the first column and returns a value in the same row from another column in the table array”. When the first column of the table array contains the lookup value more than once, the first matching row is used and subsequent matches are ignored. Therefore, use pandas.merge instead and deal with the duplicate entries, if any. If vlookup is preferred, ensure that the lookup values are unique.
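A minimal sketch of the merge (the data is illustrative; the validate argument raises an error if the lookup table has duplicate keys):

import pandas as pd

trades = pd.DataFrame({'ISIN': ['A1', 'B2', 'A1'], 'QTY': [10, 5, 7]})
prices = pd.DataFrame({'ISIN': ['A1', 'B2'], 'PRICE': [1.5, 2.0]})

# Surface duplicate keys explicitly instead of silently taking the first match
duplicates = prices[prices['ISIN'].duplicated(keep=False)]

merged = trades.merge(prices, on='ISIN', how='left', validate='many_to_one')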

Vlookup looks to the right of the table; looking ‘left’ requires a combination of choose() and array formulae, which can be error-prone if one is not familiar with the behavior of array formulas. Some may avoid looking ‘left’ by inserting a helper (duplicate) column, which introduces some form of redundancy.

The motivation for the article is to consolidate the author’s learnings across various projects so that the reader will not have to go through a similar amount of pain when he/she is handling voluminous amounts of data at work. At the time of writing, there is a copious amount of information available on the Internet on data science, big data, etc. However, there is little guidance available pertaining to best practices when implementing an open-source framework for data analysis.

[1] To some extent, it is not a fair comparison as Excel files are designed to be opened in MsExcel and pandas does a lot of magic under the hood to prep the data as a dataframe.

[2] It takes 30 minutes to read 1 sheet from each of 12 Excel files, totalling 1.7 million rows. Reading the pickle equivalent takes 15 seconds.

[3] If you are smiling at this because you understand or have seen such horror, that is good. If you are wondering what is wrong with this, please read on.

[4] sum(mask), as True values are evaluated as 1 in python.
