Data Wrangling / Machine Learning / Python

The Many Ways To Re-skin An Intercept

Methods to add an intercept column to a dataset

Karsten
Towards Data Science
6 min read · Jan 24, 2022


Image by Karsten Cao

If you are anything like me, you have undoubtedly encountered linear regression on your data science journey. Maybe to understand the model you decided to code one from scratch, or at least with a few common libraries such as Pandas or NumPy. I somehow keep forgetting the step of adding a constant column of 1’s onto my datasets for the intercept, and I end up having to look up the process only to find a different method from before. So that you and I never have to search too far, here are several ways to add an intercept column onto an existing dataset.

The methods are followed by a bonus time and memory evaluation section at the end for those who are curious. pd.concat was found to be the most memory efficient and produced a Pandas DataFrame, while np.concatenate was among the fastest and produced a NumPy array.

1. Creating the Data

I started with a Pandas DataFrame, as it works with all of the methods provided. A NumPy array can also be used, but only with the NumPy methods.

Here we have the data represented in [2] a Pandas DataFrame and [3] a NumPy 2D Array. Images by Karsten Cao
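The original setup is shown in the images; as a stand-in, a minimal sketch with random values and hypothetical column names might look like this:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the data in the images; shape and names are illustrative
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.random((5, 3)), columns=['a', 'b', 'c'])  # [2] the Pandas DataFrame
X_np = X.to_numpy()                                            # [3] the NumPy 2D array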

2. “Re-skinning” the Intercept

Below are the 7 methods I encountered for adding an intercept column to an existing dataset. They fall into three categories: allocate-first, concatenations, and inserts. Each result is a Pandas DataFrame or NumPy 2D array with the intercept in the leftmost column, ready for some linear algebra and machine learning.

The general output of any of the methods. The Pandas DataFrame is on top, and the NumPy array is below. Image by Karsten Cao

3. General Discussion

In this section, I will walk through each method and give a high-level description of each process.

The first method allocates a matrix of ones in the shape of the original dataset with one extra column. The non-intercept positions are then replaced with the data from X. This may feel slow, as most of the 1’s are generated only to be overwritten. The output is an np.array.

# Allocate a ones matrix one column wider than X, then overwrite everything after the first column
tempX = np.ones((X.shape[0], X.shape[1] + 1))
tempX[:, 1:] = X

The second method uses concatenation. It generates an intercept column, converts that into a DataFrame object and then concatenates the column with the original dataset. This method’s memory usage was quite surprising as it barely used more memory than the original dataset. The output is a pd.DataFrame.

# Build a one-column DataFrame of 1's and concatenate it to the left of X
pd.concat([pd.DataFrame(np.ones(X.shape[0])), X], axis=1)
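Note that pd.concat with axis=1 aligns on the index, so this assumes X has a default RangeIndex; the new intercept column also keeps the default label 0 unless you rename it.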

The third, fourth, and fifth methods are concatenation methods using NumPy. Other than the difference in parameters needed, these are just three different approaches to appending an intercept onto a dataset. Each creates a new intercept column before combining it with the data into a new output. The outputs are np.arrays. There are differences between them, but they are not explored in this article (one small difference is noted after the snippets below).

np.c_[np.ones(X.shape[0]), X]
np.hstack([np.ones((X.shape[0], 1)), X])
np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
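One small difference worth noting: np.c_ happily accepts the 1-D np.ones(X.shape[0]) and upgrades it to a column, while np.hstack and np.concatenate require the explicit 2-D (n, 1) shape from np.ones((X.shape[0], 1)).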

The sixth method is an insert. It specifies the index to insert at, the value to insert, and the axis orientation. It uses np.array(X) to convert the DataFrame to a NumPy array. This is similar to Python’s built-in list.insert, but it can cause some confusion because a single scalar value is broadcast to fill the whole column. You could similarly pass [1] as the third argument for the exact same output. The output is an np.array.

# Insert a column at index 0 along axis=1; the scalar 1 is broadcast down the column
np.insert(np.array(X), 0, 1, axis=1)
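As a quick sanity check of that broadcasting behavior (a minimal sketch, assuming the X from section 1):

a = np.insert(np.array(X), 0, 1, axis=1)
b = np.insert(np.array(X), 0, [1], axis=1)
assert (a == b).all()  # the scalar and the one-element list produce identical outputs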

The seventh method is an insert and swap. It makes a copy so that the original DataFrame is left untouched. An intercept is added by assigning a scalar to a new column, which Pandas broadcasts into a column filled with that scalar, much like the sixth method. The columns of the DataFrame are then pulled out, reordered so the intercept comes first, and passed into the reindex method. The output is a pd.DataFrame.

# Copy so the original DataFrame is untouched
tempX = X.copy()
# Assigning a scalar broadcasts it into a full column of 1's (added as the last column)
tempX['intercept'] = 1
# Rotate the column order so 'intercept' moves from the back to the front
columns = list(tempX.columns)
columns[0], columns[1:] = columns[-1], columns[0:-1]
tempX = tempX.reindex(columns=columns)  # reindex returns a new DataFrame, so assign it

These were the several ways that I found to add an intercept to an existing dataset using Pandas or NumPy. Since this process is only done once per dataset and just to prepare data for a model like Linear Regression, any of these options are perfectly suitable for a person just getting into Machine Learning and Data Science.

But what if milliseconds and megabytes matter? Keep reading.

4. The Nitty Gritty

Some might not be satisfied after being told that “any of these options are perfectly suitable”. In this section I explore the time and space efficiency of each method and compare their performances.

Methodology

I used a Pandas DataFrame and a NumPy array with (100_000, 100) randomly initialized values.
To avoid Python caching the arrays and matrices, I ran and timed each method individually and in isolation. A base() function was created to initialize and calculate the constant memory usages of the dataset and the intercept column. In addition, the base memory usage was used to verify which variables were being cached.
All the intercept methods were placed in a function, which was wrapped in a memory profiler to measure incremental data usage. The memory usage below was calculated using the profiler decorator.
To time each function, I created a time decorator that looped through each intercept method, calculated the elapsed time, and logged the results.
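The full harness isn’t reproduced in the article; here is a minimal sketch of a timing decorator along the lines described, where the function name and iteration count are illustrative and memory was measured separately (e.g. with memory_profiler’s @profile):

import time
from functools import wraps
import numpy as np

def timed(iterations=10):
    # Hypothetical decorator: run the wrapped method repeatedly and log the elapsed time
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            for _ in range(iterations):
                result = func(*args, **kwargs)
            print(f"{func.__name__}: {time.perf_counter() - start:.3f}s over {iterations} runs")
            return result
        return wrapper
    return decorator

@timed(iterations=10)
def intercept_concatenate(X):
    return np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)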

Results

Constant Memory Usage for recurring calculations and to establish a baseline. Image by Karsten Cao
Memory usage and time evaluation (s) for each method. Image by Karsten Cao

Here are a few things that I noticed:

  • The initial data usage difference between Pandas and NumPy was negligible.
  • With 10 iterations, the np.array had a smaller initial overhead and ran faster than the DataFrame on the applicable methods, except for np.concatenate.
  • Surprisingly, the second method using pd.concat was the most memory efficient, at only 323.9 MB of RAM usage. It seemed to merge the pointers into the returned DataFrame instead of allocating new space for a copy. Unfortunately, it also took the longest, at 2–5 times the time of np.concatenate.
  • The fastest method varied by iterations. With 10 iterations, np.concatenate, np.hstack, and np.insert tied at 0.097 seconds. With 100 iterations, the NumPy concatenation methods (3, 4, 5) were in a three-way tie, within ±1% of each other.
  • If you want the output to be a DataFrame, the seventh method with an insert and reindex was surprisingly fast with 0.801 seconds for 100 iterations and 0.130 seconds for 10 iterations. Otherwise np.concatenate was consistently fast across both dataset types.

Conclusion

All in all, I would recommend sticking to whichever data object you started with, or whatever fits your use case. After all, if you’re coding in Python then maybe what is most important to you is your time to code and think, instead of your ability to shave off milliseconds by min-maxing the process of adding an intercept to your dataset.

I know I will just be using the one that I can remember.

Thank you for reading! Feel free to check out the code here!

