The world’s leading publication for data science, AI, and ML professionals.

How to Load Excel Files with Hidden Rows and Columns into Pandas

Using the openpyxl package to ignore hidden rows and columns when reading Excel files as Pandas DataFrames

Photo by Mika Baumeister on Unsplash
Photo by Mika Baumeister on Unsplash

I learned something new recently – that it is possible to exclude hidden rows and/or columns when reading Excel files as Pandas DataFrames, all thanks to the openpyxl package, and I’d like to share it with you.

It is not uncommon to have rows or columns of an Excel file hidden. Less-relevant columns may be hidden to keep a spreadsheet neater; or rows may be hidden to deliberately ignore duplicate data.

Whatever the reasons are, it may be of value to keep them hidden when loading a file as a Pandas DataFrame, as it may affect subsequent data wrangling and exploration.

Unfortunately, the traditional way of reading Excel files as Pandas DataFrames – using the pandas.read_excel() method – does not facilitate that.

In this post, we will explore how we can identify hidden rows and columns of an Excel file using the Openpyxl package and hence exclude them when loading it as a Pandas DataFrame.

The dataset

We will be using a small subset of the titanic training dataset publicly available on Kaggle. For simplicity, I have limited the number of rows of data to 20. I have also deliberately hidden Columns F ("Age"), I ("Ticket") and K ("Cabin"), as well as Rows 6, 11, 16 and 21.

Preview of dataset in Excel (Image by Author)
Preview of dataset in Excel (Image by Author)

Why pandas.read_excel() does not work?

Typically, we use Pandasread_excel() method to load a dataset in Excel as a Pandas DataFrame. Let’s see what happens when we do that for our titanic dataset containing hidden rows and columns.

>>> import pandas as pd
>>> df = pd.read_excel("data/titanic.xlsx")
>>> df
Dataset as Pandas DataFrame with all rows and columns (Image by Author)
Dataset as Pandas DataFrame with all rows and columns (Image by Author)

By default, the read_excel() method reads in data from all rows and columns from the specified Excel file. In other words, it does not exclude the hidden rows and columns.

Installation of openpyxl

To load as Pandas DataFrames without hidden rows and columns, we can use the openpyxl package, a Python library to "read/write Excel 2010 xlsx/xlsm/xltx/xltm files".

Since openpyxl is not a standard Python built-in library, you will first need to install it. Open a command line window and type the following command:

>>> pip install openpyxl

Loading Excel file and worksheet using openpyxl

To open an existing Excel file using the openpyxl package, we use the openpyxl.load_workbook() method, specifying the name of the path where the Excel file is stored.

>>> import openpyxl
# Open an Excel workbook
>>> workbook = openpyxl.load_workbook("data/titanic.xlsx")

This creates a Workbook object which, according to the documentation, is the "top-level container for all document information". This object contains many attributes pertaining to the input file, including the .sheetnames attribute which returns the list of the names of all worksheets in the workbook.

# Create a list of names of all worksheets in `workbook`
>>> sheet_names = workbook.sheetnames
# Create a `Worksheet` object 
>>> worksheet = workbook[sheet_names[0]]

In our titanic.xlsx file, we only have one worksheet named "train", so we get the sheet name by taking the first element of the sheet_names list. Next, we create a Worksheet object from the Workbook object.

Finding indices of hidden rows

Similarly, the Worksheet object contains attributes pertaining to the specified worksheet. To find indices of all hidden rows, we make use of the .row_dimensions attribute of the Worksheet object, like this:

# List of indices corresponding to all hidden rows
>>> hidden_rows_idx = [
        row - 2 
        for row, dimension in worksheet.row_dimensions.items() 
        if dimension.hidden
    ]
>>> print(hidden_rows_idx)
[4, 9, 14, 19]

Notice that we need to take row - 2 instead of just row because we want to find indices corresponding to the Pandas DataFrame, not the Excel file.

Finding names of hidden columns

To find names of all hidden columns, we first use the .column_dimension attribute of the Worksheet object:

# List of indices corresponding to all hidden columns
>>> hidden_cols = [
        col 
        for col, dimension in worksheet.column_dimensions.items() 
        if dimension.hidden
    ]
>>> print(hidden_cols)
['F', 'I', 'K']

This generates a list comprising uppercase alphabets that correspond to the hidden columns of an Excel worksheet. Thus, we need to somehow convert the hidden_cols list into a list of names of the hidden columns. To do so, we use Python’s built-in library, string, and its .ascii_uppercase attribute:

# List of indices corresponding to all hidden columns
>>> hidden_cols_idx = [
        string.ascii_uppercase.index(col_name) 
        for col_name in hidden_cols
    ]
# Find names of columns corresponding to hidden column indices
>>> hidden_cols_name = df.columns[hidden_cols_idx].tolist()
>>> print(hidden_cols_name)
['Age', 'Ticket', 'Cabin']

Disclaimer: Do note that using string.ascii_uppercase assumes that there are at most 26 columns. If there are more than 26 columns, the code will need to be modified.

Finally, once we have the indices for the hidden rows and names for the hidden columns, the rest is simple. To exclude those hidden rows and columns, we simply use Pandas’ .drop()method.

# Drop the hidden columns
>>> df.drop(hidden_cols_name, axis=1, inplace=True)
# Drop the hidden rows
>>> df.drop(hidden_rows_idx, axis=0, inplace=True)
# Reset the index 
>>> df.reset_index(drop=True, inplace=True)
>>> df
Dataset as Pandas DataFrame without hidden rows and columns (Image by Author)
Dataset as Pandas DataFrame without hidden rows and columns (Image by Author)

Putting it all together

Here’s a code snippet that puts together the above codes with some simple refactoring:

Wrapping it up

So, there you have it – a Python code that allows you to read an Excel file containing hidden rows and/or columns as is, as Pandas DataFrames.

This particular use-case only scratches the surface of what the openpyxl package offers. For more information about the openpyxl package, check out its documentation here.

The codes shown in this post can also be found as a notebook at this GitHub repo.


Hello! I’m Zeya. Thanks for reading this post. If you’ve found it useful, do let me know in the comments. I welcome _ discussions, questions and constructive feedback too. Feel free to follow me on Medium or reach out to me via LinkedIn or [Twitter](https://twitter.com/zeyalt). Have a great day!


Related Articles