Get into shape! 🐍

Shaping and reshaping NumPy and pandas objects to avoid errors

Jeff Hale
Towards Data Science


Shape errors are the bane of many folks learning data science. I would bet money that people have quit their data science learning journey due to frustration with getting data into the shape required for machine learning algorithms.

Having a stronger understanding of how to reshape your data will spare you tears, save you time, and help you grow as a data scientist. In this article, you’ll see how to get your data in the shape you need it. 🎉

Athens has many rows and columns. source: pixabay.com

Doing it

First, let’s make sure we’re using similar package versions. Let’s import the libraries we’ll need under their usual aliases. All code is available here.

import sys
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

If you don’t have the libraries you need installed, uncomment the following cell and run it. Then run the imports cell again. You may need to restart your kernel.

# !pip install -U numpy pandas scikit-learn

Let’s check our package versions.

print(f"Python: {sys.version}")
print(f'NumPy: {np.__version__}')
print(f'pandas: {pd.__version__}')
print(f'scikit-learn: {sklearn.__version__}')
Python: 3.8.5 (default, Sep 4 2020, 02:22:02)
[Clang 10.0.0 ]
NumPy: 1.19.2
pandas: 1.2.0
scikit-learn: 0.24.0

Dimensions

A pandas DataFrame has two dimensions: the rows and the columns.

Let’s make a tiny DataFrame with some hurricane data.

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972]
))
df_hurricanes
     name  year
0    Zeta  2020
1  Andrew  1992
2   Agnes  1972

You can see the number of dimensions for a pandas data structure with the ndim attribute.

df_hurricanes.ndim
2

A DataFrame has both rows and columns, so it has two dimensions.

Shape

The shape attribute shows the number of items in each dimension. Checking a DataFrame’s shape returns a tuple with two integers. The first is the number of rows and the second is the number of columns. 👍

df_hurricanes.shape
(3, 2)

We have three rows and two columns. Cool. 😎

The size attribute shows us how many cells we have.

df_hurricanes.size
6

3 * 2 = 6

It’s easy to get the number of dimensions and the size from the shape attribute, so that’s the one to remember and the one we’ll use. 🚀
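As a quick check of that claim, here is a minimal sketch that derives the number of dimensions and the size from the shape tuple alone. It rebuilds the same tiny DataFrame so it can run on its own; the variable names ndim_from_shape and size_from_shape are just illustrative.

```python
import math

import pandas as pd

# Same tiny hurricane DataFrame as above
df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972]
))

shape = df_hurricanes.shape

# The number of dimensions is the length of the shape tuple
ndim_from_shape = len(shape)

# The size is the product of the entries in the shape tuple
size_from_shape = math.prod(shape)

print(ndim_from_shape == df_hurricanes.ndim)  # True
print(size_from_shape == df_hurricanes.size)  # True
```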

Let’s make a pandas Series from our DataFrame. Use just the brackets syntax to select a column by passing the name of the column as a string. You get back a Series.

years_series = df_hurricanes['year']
years_series
0 2020
1 1992
2 1972
Name: year, dtype: int64
type(years_series)
pandas.core.series.Series

What does the shape of a pandas Series look like? We can use the Series shape attribute to find out.

years_series.shape
(3,)

We have a tuple with just one value, the number of rows. Remember that the index doesn’t count as a column. ☝️

What happens if we use just the brackets again, except this time we pass a list containing a single column name?

years_df = df_hurricanes[['year']]
years_df
   year
0  2020
1  1992
2  1972
type(years_df)
pandas.core.frame.DataFrame

My variable name might have given away the answer. 😉 You always get back a DataFrame if you pass a list of column names.

years_df.shape
(3, 1)

Take away: the shape of a pandas Series and the shape of a pandas DataFrame with one column are different! A DataFrame has a shape of rows by columns and a Series has a shape of rows. This is a key point that trips folks up.
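To make the difference concrete, here is a small sketch that puts the two selections side by side and converts between them. The to_frame and squeeze methods are not used elsewhere in this article; they are one way, among others, to move between a Series and a one-column DataFrame.

```python
import pandas as pd

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972]
))

as_series = df_hurricanes['year']    # single brackets: Series
as_frame = df_hurricanes[['year']]   # list of column names: DataFrame

print(as_series.shape)  # (3,)
print(as_frame.shape)   # (3, 1)

# Series -> one-column DataFrame
series_to_frame = as_series.to_frame()

# One-column DataFrame -> Series (only the columns axis is squeezed)
frame_to_series = as_frame.squeeze(axis='columns')

print(series_to_frame.shape)  # (3, 1)
print(frame_to_series.shape)  # (3,)
```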

Now that we know about finding the shape in pandas, let’s look at using NumPy. pandas is built on top of NumPy.

Columns. source: pixabay.com

NumPy

NumPy’s ndarray is its core data structure — we’ll just refer to it as an array from here on. There are many ways to create NumPy arrays, depending upon your goals. Check out my guide on the topic here.

Let’s make a NumPy array from our DataFrame and check its shape.

two_d_arr = df_hurricanes.to_numpy()
two_d_arr
array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)
type(two_d_arr)
numpy.ndarray
two_d_arr.shape
(3, 2)

The shape returned matches what we saw when we used pandas. Pandas and NumPy share some attributes and methods, including the shape attribute.

Let’s convert the pandas Series we made earlier into a NumPy array and check its shape.

one_d_arr = years_series.to_numpy()
one_d_arr
array([2020, 1992, 1972])
type(one_d_arr)
numpy.ndarray
one_d_arr.shape
(3,)

Again, we see the same result in pandas and NumPy. Cool!

Rows and columns. source: pixabay.com

The problem

Things get tricky when an object expects data to arrive in a certain shape. For example, most scikit-learn transformers and estimators expect to be fed their predictive X data in two-dimensional form. The target variable, y, is expected to be one-dimensional. Let’s demonstrate how to reshape with a silly example where we use year to predict the hurricane name.

We’ll make x lowercase because it has just one dimension.

x = df_hurricanes['year']
x
0 2020
1 1992
2 1972
Name: year, dtype: int64
type(x)
pandas.core.series.Series
x.shape
(3,)

Same goes for our output variable, y.

y = df_hurricanes['name']
y
0 Zeta
1 Andrew
2 Agnes
Name: name, dtype: object
type(y)
pandas.core.series.Series
y.shape
(3,)

Let’s instantiate and fit a LogisticRegression model.

lr = LogisticRegression()
lr.fit(x, y)

And you get a ValueError. The last lines read:

ValueError: Expected 2D array, got 1D array instead:
array=[2020. 1992. 1972.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Let’s try to follow the error message’s instructions:

x.reshape(-1, 1)

Reshaping is great if you passed a NumPy array, but we passed a pandas Series. So we get another error:

AttributeError: 'Series' object has no attribute 'reshape'

We could change our Series into a NumPy array and then reshape it to have two dimensions. However, as you saw above, there’s an easier way to make x a 2D object. Just pass the columns as a list using just the bracket syntax.

I’ll make the result a capital X, because it will be a 2D object, and a capital letter is the statistical naming convention for a 2D array (also known as a matrix).

Let’s do it!

X = df_hurricanes[['year']]
X
   year
0  2020
1  1992
2  1972
type(X)
pandas.core.frame.DataFrame
X.shape
(3, 1)

Now we can fit our model without errors! 😁

lr.fit(X, y)
LogisticRegression()
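For comparison, the other route mentioned earlier (turn the Series into a NumPy array, then reshape it) can be sketched as follows. The name X_arr is just illustrative; the resulting 2D array has the same (3, 1) shape as the one-column DataFrame.

```python
import numpy as np
import pandas as pd

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972]
))

x = df_hurricanes['year']   # 1D pandas Series, shape (3,)

# A Series has no reshape method, so convert to a NumPy array first
X_arr = x.to_numpy().reshape(-1, 1)

print(type(X_arr))   # <class 'numpy.ndarray'>
print(X_arr.shape)   # (3, 1)
```

Either form of the two-dimensional data should be acceptable to a scikit-learn estimator, since what it checks is the number of dimensions.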

Reshaping NumPy arrays

If our data were stored in a 1D NumPy array, then we could do what the error message suggests and turn it into a 2D array with reshape. Let's try that with the data we saved as a 1D NumPy array earlier.

one_d_arr
array([2020, 1992, 1972])
one_d_arr.shape
(3,)

Let’s reshape it!

hard_coded_arr_shape = one_d_arr.reshape(3, 1)
hard_coded_arr_shape
array([[2020],
       [1992],
       [1972]])
hard_coded_arr_shape.shape
(3, 1)

Passing a positive integer gives that dimension that size. So now our array has the shape (3, 1).

However, it’s a better coding practice to use a flexible, dynamic option. So let’s use -1 with .reshape().

two_d_arr_from_reshape = one_d_arr.reshape(-1, 1)
two_d_arr_from_reshape
array([[2020],
       [1992],
       [1972]])
two_d_arr_from_reshape.shape
(3, 1)

Let’s unpack that code. We passed a 1, so the second dimension (the columns) got a 1.

We passed -1 for the other dimension. That means NumPy computes that dimension’s size so that the array holds all the original data.

Think of -1 as fill in the blank to make a dimension so that all the data has a home. 🏠

In this case you end up with a 2D array with 3 rows and 1 column. -1 took on the value 3.

It’s a good practice to make our code flexible so that it can handle however many observations we throw at it. So instead of hard-coding both dimensions, use -1. 🙂
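Here is a minimal sketch of that flexibility. The helper name to_column is hypothetical, made up for this illustration; the point is that the same reshape(-1, 1) call works for any number of observations.

```python
import numpy as np


def to_column(arr):
    """Reshape 1D data into a single column: n rows, 1 column."""
    return np.asarray(arr).reshape(-1, 1)


for n in (3, 5, 10):
    values = np.arange(n)
    print(values.shape, '->', to_column(values).shape)
# (3,) -> (3, 1)
# (5,) -> (5, 1)
# (10,) -> (10, 1)
```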

source: pixabay.com

Higher dimensional arrays

The same principle can be used for reshaping higher dimensional arrays. Let’s reshape our two-dimensional array into a three-dimensional array and then into a four-dimensional array.

two_d_arr
array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)
two_d_arr.shape
(3, 2)
three_d_arr = two_d_arr.reshape(2, 1, 3)
three_d_arr
array([[['Zeta', 2020, 'Andrew']],

       [[1992, 'Agnes', 1972]]], dtype=object)

Use -1 to indicate which dimension NumPy should compute so that all the data has a home.

arr = two_d_arr.reshape(1, 2, -1, 1)
arr
array([[[['Zeta'],
         [2020],
         ['Andrew']],

        [[1992],
         ['Agnes'],
         [1972]]]], dtype=object)
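The same idea scales to larger arrays. Below is a small sketch with a made-up array of 24 integers (not part of the hurricane data), showing how NumPy fills in the -1 dimension and that only one dimension can be left unknown.

```python
import numpy as np

arr = np.arange(24)   # 24 values, shape (24,)

# NumPy computes the -1 dimension: 24 / (2 * 4) = 3
three_d = arr.reshape(2, -1, 4)
print(three_d.shape)  # (2, 3, 4)

# Only one dimension can be left for NumPy to compute
two_unknowns_failed = False
try:
    arr.reshape(-1, -1)
except ValueError as err:
    two_unknowns_failed = True
    print(f'ValueError: {err}')
```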

Note that if the reshape dimensions don’t make sense, you’ll get an error. Like this:

two_d_arr.reshape(4, -1)
two_d_arr
--------------------------------------------------------------------

ValueError: cannot reshape array of size 6 into shape (4,newaxis)

We have six values, so we can only reshape the array into a shape whose dimensions hold exactly six values.

In other words, the sizes of the dimensions must multiply to six. Remember that -1 is like a wildcard that becomes whatever integer makes the product work. Here no integer does, because 4 does not divide 6 evenly.
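One way to check a target shape before reshaping is to multiply its entries and compare the result with the array's size. The helper fits below is hypothetical, written only for this illustration, and it assumes the target shape contains no -1.

```python
import math

import numpy as np

two_d_arr = np.array(
    [['Zeta', 2020], ['Andrew', 1992], ['Agnes', 1972]],
    dtype=object,
)


def fits(arr, new_shape):
    """Return True if new_shape (no -1 entries) holds exactly arr.size values."""
    return math.prod(new_shape) == arr.size


print(fits(two_d_arr, (2, 3)))        # True
print(fits(two_d_arr, (6, 1)))        # True
print(fits(two_d_arr, (1, 2, 3, 1)))  # True
print(fits(two_d_arr, (4, 2)))        # False
```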

Predicting

Scikit-learn expects a 2D array for most predictions.

Say you have one single sample in a list that you want to use to make a prediction. You might naively think the following code will work.

lr.predict(np.array([2012]))

It doesn’t. ☹️

ValueError: Expected 2D array, got 1D array instead:
array=[2012].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

However, we can follow the helpful error suggestion to make a two-dimensional array with reshape(1, -1).

lr.predict(np.array([2012]).reshape(1, -1))
array(['Zeta'], dtype=object)
np.array([2012]).reshape(1, -1).shape
(1, 1)

You’ve made the first dimension (the rows) 1 and the second dimension (the columns) match the number of features, which is also 1. Cool!

Don’t be afraid to check the shape of an object — even just to confirm it is what you think it is. 🙂
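The two suggestions in the error message are easier to tell apart with more than one value. Here is a short sketch using made-up numbers: the three features in one_sample are hypothetical and not part of the hurricane model above.

```python
import numpy as np

# One sample described by three features (hypothetical values)
one_sample = np.array([2012, 3, 150])
as_row = one_sample.reshape(1, -1)
print(as_row.shape)     # (1, 3): 1 row, 3 columns

# Three samples of a single feature
one_feature = np.array([2020, 1992, 1972])
as_column = one_feature.reshape(-1, 1)
print(as_column.shape)  # (3, 1): 3 rows, 1 column
```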

While we’re on the topic of reshaping for scikit-learn, note that the text vectorization transformers such as CountVectorizer behave differently than other scikit-learn transformers. They assume you have just one column of text, so they expect a 1D array instead of a 2D array. You might need to reshape. ⚠️
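A rough sketch of that situation follows. The description column and its text are invented for this example; the idea is to flatten a one-column 2D array to 1D before handing it to CountVectorizer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical text column, made up for illustration
df_text = pd.DataFrame(dict(
    description=['hurricane zeta', 'hurricane andrew', 'tropical storm agnes']
))

# A one-column DataFrame converts to a 2D array of shape (3, 1)
two_d_text = df_text[['description']].to_numpy()
print(two_d_text.shape)  # (3, 1)

# Flatten to 1D so the vectorizer receives one string per document
one_d_text = two_d_text.ravel()
print(one_d_text.shape)  # (3,)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(one_d_text)
print(counts.shape)  # (3, 6): 3 documents, 6 distinct words
```

Selecting the column with single brackets, df_text['description'], would give a 1D Series directly and avoid the reshape.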

Other ways to make a 1D array

In addition to reshaping with reshape, NumPy's flatten and ravel both return a 1D array. The differences are in whether they create a copy or a view of the original array and whether the data is stored contiguously in memory. Check out this nice Stack Overflow answer for more info.
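The copy versus view difference can be seen in a few lines. This is a small sketch using np.shares_memory, which is not mentioned elsewhere in this article, on a simple contiguous array; ravel only returns a view when the memory layout allows it.

```python
import numpy as np

arr = np.array([[2020], [1992], [1972]])

flat = arr.flatten()  # always a copy
rav = arr.ravel()     # a view when possible, as it is here

print(flat.shape, rav.shape)         # (3,) (3,)
print(np.shares_memory(arr, flat))   # False
print(np.shares_memory(arr, rav))    # True

# Writing through the view changes the original array
rav[0] = 2021
print(arr[0, 0])  # 2021
print(flat[0])    # 2020
```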

Let’s look at one other way to squeeze a 2D array into a 1D array.

Squeeze to reshape. source: pixabay.com

Squeeze out unneeded dimensions

When you have a multi-dimensional array but one of the dimensions doesn’t hold any new information you can squeeze out the unnecessary dimension with .squeeze(). For example, let's use the array we made earlier.

two_d_arr_from_reshape
array([[2020],
       [1992],
       [1972]])
two_d_arr_from_reshape.shape
(3, 1)
squeezed = np.squeeze(two_d_arr_from_reshape)
squeezed.shape
(3,)

Ta da!
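By default squeeze removes every dimension of length 1. Here is a short sketch, using a made-up array, of squeezing all such dimensions, squeezing just one with the axis argument, and what happens when the chosen axis is longer than 1.

```python
import numpy as np

arr = np.arange(3).reshape(1, 3, 1)
print(arr.shape)  # (1, 3, 1)

# Remove every dimension of length 1
all_squeezed = np.squeeze(arr)
print(all_squeezed.shape)  # (3,)

# Remove only the first dimension
first_squeezed = np.squeeze(arr, axis=0)
print(first_squeezed.shape)  # (3, 1)

# An axis with more than one item can't be squeezed out
bad_axis_failed = False
try:
    np.squeeze(arr, axis=1)
except ValueError as err:
    bad_axis_failed = True
    print(f'ValueError: {err}')
```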

Note that the TensorFlow and PyTorch libraries play nicely with NumPy and can handle higher dimensional arrays representing things like video data. Getting the data into the shape the input layer to your neural network requires is a frequent source of errors. You can use the tools above to reshape your data into the required dimensions. 🚀
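As a rough illustration, here is how reshape could add a batch dimension and a channel dimension to a tiny, made-up grayscale clip. The target layout (batch, frames, height, width, channels) is an assumption chosen for this sketch; the layout your model needs depends on the framework and the input layer.

```python
import numpy as np

# Hypothetical clip: 10 frames of 4 x 4 grayscale pixels
frames = np.zeros((10, 4, 4))
print(frames.shape)  # (10, 4, 4)

# Assumed layout: (batch, frames, height, width, channels)
# -1 lets NumPy work out the number of frames
batched = frames.reshape(1, -1, 4, 4, 1)
print(batched.shape)  # (1, 10, 4, 4, 1)
```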

Wrap

You’ve seen how to reshape NumPy arrays. Hopefully future code you see will make more sense and you’ll be able to quickly manipulate NumPy arrays into the shapes you need.

If you found this article on reshaping NumPy arrays to be helpful, please share it on your favorite social media. 😀

I help people learn how to data things with Python, pandas, and other tools. If that sounds cool to you, check out my other guides and join my 15,000+ followers on Medium to get the latest content.


Happy reshaping! 🔵🔷


I write about data things. Follow me on Medium and join my Data Awesome mailing list to stay on top of the latest data tools and tips: https://dataawesome.com