Get into shape! 🐍
Shaping and reshaping NumPy and pandas objects to avoid errors
Shape errors are the bane of many folks learning data science. I would bet money that people have quit their data science learning journey due to frustration with getting data into the shape required for machine learning algorithms.
Having a stronger understanding of how to reshape your data will spare you tears, save you time, and help you grow as a data scientist. In this article, you’ll see how to get your data in the shape you need it. 🎉
Doing it
First, let’s make sure we’re using similar package versions. Let’s import the libraries we’ll need under their usual aliases. All code is available here.
import sys
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
If you don’t have the libraries you need installed, uncomment the following cell and run it. Then run the import cell again. You may need to restart your kernel.
# !pip install -U numpy pandas scikit-learn
Let’s check our package versions.
print(f"Python: {sys.version}")
print(f'NumPy: {np.__version__}')
print(f'pandas: {pd.__version__}')
print(f'scikit-learn: {sklearn.__version__}')
Python: 3.8.5 (default, Sep 4 2020, 02:22:02)
[Clang 10.0.0 ]
NumPy: 1.19.2
pandas: 1.2.0
scikit-learn: 0.24.0
Dimensions
A pandas DataFrame has two dimensions: the rows and the columns.
Let’s make a tiny DataFrame with some hurricane data.
df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972]
))
df_hurricanes
You can see the number of dimensions for a pandas data structure with the ndim attribute.
df_hurricanes.ndim
2
A DataFrame has both rows and columns, so it has two dimensions.
Shape
The shape attribute shows the number of items in each dimension. Checking a DataFrame’s shape returns a tuple with two integers. The first is the number of rows and the second is the number of columns. 👍
df_hurricanes.shape
(3, 2)
We have three rows and two columns. Cool. 😎
The size attribute shows us how many cells we have.
df_hurricanes.size
6
3 * 2 = 6
It’s easy to get the number of dimensions and the size from the shape attribute, so that’s the one to remember and the one we’ll use. 🚀
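As a quick check of that claim, here is a minimal sketch that derives the number of dimensions and the size from shape alone. It rebuilds the same hurricane DataFrame so it runs on its own. The names n_dims_from_shape and size_from_shape are illustrative, not pandas attributes.

```python
import math

import pandas as pd

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972],
))

shape = df_hurricanes.shape

# The number of dimensions is the length of the shape tuple.
n_dims_from_shape = len(shape)

# The size is the product of the entries in the shape tuple.
size_from_shape = math.prod(shape)

print(n_dims_from_shape == df_hurricanes.ndim)
print(size_from_shape == df_hurricanes.size)
```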
Let’s make a pandas Series from our DataFrame. Use just the brackets syntax to select a column by passing the name of the column as a string. You get back a Series.
years_series = df_hurricanes['year']
years_series
0 2020
1 1992
2 1972
Name: year, dtype: int64
type(years_series)
pandas.core.series.Series
What does the shape of a pandas Series look like? We can use the Series shape attribute to find out.
years_series.shape
(3,)
We have a tuple with just one value, the number of rows. Remember that the index doesn’t count as a column. ☝️
What happens if we use just the brackets again, except this time we pass a list containing a single column name?
years_df = df_hurricanes[['year']]
years_df
type(years_df)
pandas.core.frame.DataFrame
My variable name might have given away the answer. 😉 You always get back a DataFrame if you pass a list of column names.
years_df.shape
(3, 1)
Take away: the shape of a pandas Series and the shape of a pandas DataFrame with one column are different! A DataFrame has a shape of rows by columns and a Series has a shape of rows. This is a key point that trips folks up.
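If you already hold one of the two objects and need the other, you don't have to go back to the original DataFrame. The sketch below assumes a standalone Series with the same values as years_series and converts in both directions with to_frame and squeeze.

```python
import pandas as pd

years_series = pd.Series([2020, 1992, 1972], name='year')

# Series -> one-column DataFrame: shape goes from (3,) to (3, 1).
years_df = years_series.to_frame()

# One-column DataFrame -> Series: shape goes from (3, 1) back to (3,).
back_to_series = years_df.squeeze(axis='columns')

print(years_series.shape, years_df.shape, back_to_series.shape)
```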
Now that we know about finding the shape in pandas, let’s look at using NumPy. The pandas library is built on top of NumPy.
NumPy
NumPy’s ndarray is its core data structure — we’ll just refer to it as an array from here on. There are many ways to create NumPy arrays, depending upon your goals. Check out my guide on the topic here.
Let’s make a NumPy array from our DataFrame and check its shape.
two_d_arr = df_hurricanes.to_numpy()
two_d_arr
array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)
type(two_d_arr)
numpy.ndarray
two_d_arr.shape
(3, 2)
The shape returned matches what we saw when we used pandas. Pandas and NumPy share some attributes and methods, including the shape attribute.
Let’s convert the pandas Series we made earlier into a NumPy array and check its shape.
one_d_arr = years_series.to_numpy()
one_d_arr
array([2020, 1992, 1972])
type(one_d_arr)
numpy.ndarray
one_d_arr.shape
(3,)
Again, we see the same result in pandas and NumPy. Cool!
The problem
Things get tricky when an object expects data to arrive in a certain shape. For example, most scikit-learn transformers and estimators expect their predictive X data in two-dimensional form. The target variable, y, is expected to be one-dimensional. Let’s demonstrate how to reshape with a silly example where we use year to predict the hurricane name.
We’ll make x lowercase because it has just one dimension.
x = df_hurricanes['year']
x
0 2020
1 1992
2 1972
Name: year, dtype: int64
type(x)
pandas.core.series.Series
x.shape
(3,)
Same goes for our output variable, y.
y = df_hurricanes['name']
y
0 Zeta
1 Andrew
2 Agnes
Name: name, dtype: object
type(y)
pandas.core.series.Series
y.shape
(3,)
Let’s instantiate and fit a LogisticRegression model.
lr = LogisticRegression()
lr.fit(x, y)
And you get a value error. The last lines read:
ValueError: Expected 2D array, got 1D array instead:
array=[2020. 1992. 1972.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Let’s try to follow the error message’s instructions:
x.reshape(-1, 1)
Reshaping is great if you passed a NumPy array, but we passed a pandas Series. So we get another error:
AttributeError: 'Series' object has no attribute 'reshape'
We could change our Series into a NumPy array and then reshape it to have two dimensions. However, as you saw above, there’s an easier way to make x a 2D object. Just pass the columns as a list using just the bracket syntax.
I’ll name the result with a capital X because it will be a 2D object, and a capital letter is the statistical naming convention for a 2D array (also known as a matrix).
Let’s do it!
X = df_hurricanes[['year']]
X
type(X)
pandas.core.frame.DataFrame
X.shape
(3, 1)
Now we can fit our model without errors! 😁
lr.fit(X, y)
LogisticRegression()
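The other route mentioned above, converting the Series to a NumPy array and then reshaping it, can be sketched as follows. The name X_arr is just an illustrative choice. An array of this shape should be accepted by fit in the same way as the one-column DataFrame.

```python
import pandas as pd

df_hurricanes = pd.DataFrame(dict(
    name=['Zeta', 'Andrew', 'Agnes'],
    year=[2020, 1992, 1972],
))

x = df_hurricanes['year']  # 1D pandas Series, shape (3,)

# Convert to a NumPy array first, then reshape it into a single column.
X_arr = x.to_numpy().reshape(-1, 1)

print(type(X_arr), X_arr.shape)
```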
Reshaping NumPy arrays
If our data were stored in a 1D NumPy array, then we could do what the error message suggests and turn it into a 2D array with reshape. Let's try that with the data we saved as a 1D NumPy array earlier.
one_d_arr
array([2020, 1992, 1972])
one_d_arr.shape
(3,)
Let’s reshape it!
hard_coded_arr_shape = one_d_arr.reshape(3, 1)
hard_coded_arr_shape
array([[2020],
       [1992],
       [1972]])
hard_coded_arr_shape.shape
(3, 1)
Passing a positive integer sets that dimension to that size. So now our array has the shape (3, 1).
However, it’s a better coding practice to use a flexible, dynamic option. So let’s use -1 with .reshape().
two_d_arr_from_reshape = one_d_arr.reshape(-1, 1)
two_d_arr_from_reshape
array([[2020],
       [1992],
       [1972]])
two_d_arr_from_reshape.shape
(3, 1)
Let’s unpack that code. We passed a 1, so the second dimension (the columns) got a size of 1.
We passed a negative integer for the other dimension. That means the remaining dimension becomes whatever shape is needed to make it hold all the original data.
Think of -1 as fill in the blank to make a dimension so that all the data has a home. 🏠
In this case you end up with a 2D array with 3 rows and 1 column. -1 took on the value 3.
It’s good practice to make our code flexible so that it can handle however many observations we throw at it. So instead of hard-coding both dimensions, use -1. 🙂
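To make the benefit concrete, here is a small sketch with a hypothetical helper, to_column, that works for arrays of any length, next to a hard-coded shape that only works for one length.

```python
import numpy as np


def to_column(arr):
    """Return a 1D array as a 2D array with a single column."""
    return arr.reshape(-1, 1)


short = np.array([2020, 1992, 1972])
longer = np.arange(10)

print(to_column(short).shape)
print(to_column(longer).shape)

# A hard-coded shape only fits one particular length.
try:
    longer.reshape(3, 1)
except ValueError as err:
    print(f'ValueError: {err}')
```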
Higher dimensional arrays
The same principle can be used for reshaping higher dimensional arrays. Let’s reshape our two-dimensional array into a three-dimensional array and then into a four-dimensional array.
two_d_arr
array([['Zeta', 2020],
       ['Andrew', 1992],
       ['Agnes', 1972]], dtype=object)
two_d_arr.shape
(3, 2)
three_d_arr = two_d_arr.reshape(2, 1, 3)
three_d_arr
array([[['Zeta', 2020, 'Andrew']],
       [[1992, 'Agnes', 1972]]], dtype=object)
Use -1 to mark the one dimension whose size NumPy should compute so that all the data has a home.
arr = two_d_arr.reshape(1, 2, -1, 1)
arr
array([[[['Zeta'],
         [2020],
         ['Andrew']],
        [[1992],
         ['Agnes'],
         [1972]]]], dtype=object)
Note that if the reshape dimensions don’t make sense, you’ll get an error. Like this:
two_d_arr.reshape(4, -1)
two_d_arr
--------------------------------------------------------------------
ValueError: cannot reshape array of size 6 into shape (4,newaxis)
We have six values, so we can only reshape the array into a shape that holds exactly six values.
In other words, the sizes of the dimensions must multiply to six. Remember that -1 is like a wildcard that takes whatever integer value makes the product work, if one exists.
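The product rule can be checked directly. The sketch below uses a plain six-element array as a stand-in for the hurricane data. The helper fits is a hypothetical name, and the last lines mimic the arithmetic behind -1 rather than NumPy's actual implementation.

```python
import math

import numpy as np

arr = np.arange(6)  # six values, standing in for the hurricane array


def fits(size, shape):
    """Check whether a fully specified shape holds exactly `size` values."""
    return math.prod(shape) == size


for shape in [(6,), (2, 3), (3, 2), (1, 2, 3, 1), (4, 2)]:
    print(shape, fits(arr.size, shape))

# With -1, the missing size is the total size divided by the
# product of the dimensions that were given.
known = (1, 2, 1)
inferred = arr.size // math.prod(known)
print(arr.reshape(1, 2, -1, 1).shape, inferred)
```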
Predicting
Scikit-learn expects a 2D array for most predictions.
Say you have one single sample in a list that you want to use to make a prediction. You might naively think the following code will work.
lr.predict(np.array([2012]))
It doesn’t. ☹️
ValueError: Expected 2D array, got 1D array instead:
array=[2012].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
However, we can follow the helpful error suggestion to make a two-dimensional array with reshape(1, -1).
lr.predict(np.array([2012]).reshape(1, -1))
array(['Zeta'], dtype=object)
np.array([2012]).reshape(1, -1).shape
(1, 1)
You’ve made the first dimension (the rows) 1 and the second dimension (the columns) match the number of features, which is also 1. Cool!
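With a single value, reshape(1, -1) and reshape(-1, 1) both give the shape (1, 1), so the difference between the two suggestions in the error message is easy to miss. Here is a sketch with a made-up two-value array that shows it. The second value is hypothetical and not part of the hurricane data.

```python
import numpy as np

# Hypothetical values: one observation described by two features.
one_sample = np.array([2012, 3])

as_single_sample = one_sample.reshape(1, -1)   # 1 row, 2 columns
as_single_feature = one_sample.reshape(-1, 1)  # 2 rows, 1 column

print(as_single_sample.shape)
print(as_single_feature.shape)
```

Use the first form when the values are the features of one sample, and the second when they are several samples of one feature.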
Don’t be afraid to check the shape of an object — even just to confirm it is what you think it is. 🙂
While we’re on the topic of reshaping for scikit-learn, note that the text vectorization transformers such as CountVectorizer behave differently than other scikit-learn transformers. They assume you have just one column of text, so they expect a 1D array instead of a 2D array. You might need to reshape. ⚠️
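As a rough illustration, the sketch below feeds CountVectorizer a 1D Series of made-up text, and then flattens a 2D single-column array before passing it in. The column name and the sentences are invented for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up text data, for illustration only.
df_text = pd.DataFrame(dict(
    description=['strong wind and rain', 'heavy rain', 'wind damage'],
))

# Single brackets give a 1D Series of documents.
docs = df_text['description']
counts = CountVectorizer().fit_transform(docs)

# If the text arrives as a 2D single-column array, flatten it first.
text_2d = docs.to_numpy().reshape(-1, 1)
counts_from_2d = CountVectorizer().fit_transform(text_2d.ravel())

# One row per document, one column per distinct word.
print(docs.shape, counts.shape, counts_from_2d.shape)
```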
Other ways to make a 1D array
In addition to reshaping with reshape, NumPy's flatten and ravel both return a 1D array. The differences are in whether they create a copy or a view of the original array and whether the data is stored contiguously in memory. Check out this nice Stack Overflow answer for more info.
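A short sketch of the copy-versus-view difference, using np.shares_memory to check whether the result still points at the original data. Here flatten returns a copy, while ravel returns a view because the array is contiguous.

```python
import numpy as np

arr = np.array([[2020], [1992], [1972]])

flat_copy = arr.flatten()  # always a copy
flat_view = arr.ravel()    # a view when possible, as it is here

print(flat_copy.shape, flat_view.shape)
print(np.shares_memory(arr, flat_copy))
print(np.shares_memory(arr, flat_view))

# Writing through the view changes the original; the copy is unaffected.
flat_view[0] = 2021
print(arr[0, 0], flat_copy[0])
```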
Let’s look at one other way to squeeze a 2D array into a 1D array.
Squeeze out unneeded dimensions
When you have a multi-dimensional array but one of the dimensions doesn’t hold any new information, you can squeeze out the unnecessary dimension with .squeeze(). For example, let's use the array we made earlier.
two_d_arr_from_reshape
array([[2020],
       [1992],
       [1972]])
two_d_arr_from_reshape.shape
(3, 1)
squeezed = np.squeeze(two_d_arr_from_reshape)
squeezed.shape
(3,)
Ta da!
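Two related details may be handy: squeeze accepts an axis argument if you only want to drop one particular dimension, and np.expand_dims goes the other way. A minimal sketch, rebuilding the same column array so it runs on its own:

```python
import numpy as np

col = np.array([[2020], [1992], [1972]])     # shape (3, 1)

# Squeeze only the second axis; this raises an error if that axis
# does not have length 1.
squeezed = np.squeeze(col, axis=1)           # shape (3,)

# With no axis given, every length-1 dimension is removed.
padded = col.reshape(1, 3, 1)
all_squeezed = padded.squeeze()              # shape (3,)

# np.expand_dims adds a length-1 dimension back.
restored = np.expand_dims(squeezed, axis=1)  # shape (3, 1)

print(squeezed.shape, all_squeezed.shape, restored.shape)
```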
Note that the TensorFlow and PyTorch libraries play nicely with NumPy and can handle higher dimensional arrays representing things like video data. Getting the data into the shape the input layer to your neural network requires is a frequent source of errors. You can use the tools above to reshape your data into the required dimensions. 🚀
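As a rough NumPy-only illustration, here is how a made-up batch of small grayscale images could be reshaped with the same -1 trick. The sizes are invented, and the layout a given network expects depends on the library and the model, so check its documentation.

```python
import numpy as np

# Made-up data: 4 grayscale images, each 8 x 8 pixels.
images = np.zeros((4, 8, 8))

# Add a trailing channel dimension and let -1 work out the batch size.
with_channel = images.reshape(-1, 8, 8, 1)

# Or flatten each image into one row of 64 features.
as_rows = images.reshape(len(images), -1)

print(with_channel.shape, as_rows.shape)
```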
Wrap
You’ve seen how to reshape NumPy arrays. Hopefully future code you see will make more sense and you’ll be able to quickly manipulate NumPy arrays into the shapes you need.
Happy reshaping! 🔵🔷