Having appropriate dtypes for your Series and DataFrame is very important for many reasons:
- Memory management: using the right dtype for a series can dramatically reduce its memory usage, and by extension that of your dataframes
- Interpretation: anyone else (human or computer) will make assumptions about your data based on its dtype: if a column full of integers is stored as strings, it will be treated as strings, not integers
- Clean data: choosing dtypes forces you to deal with missing or mis-recorded values, which will make the data-crunching down the road a lot easier
And there are probably many more reasons; can you name a few? If so, please write them in a comment.
In this first post of my pandas series, I want to review the basics of pandas datatypes – or dtypes.

We will first review the available dtypes pandas offers, then I’ll focus on 4 useful dtypes that will fulfill 95% of your needs, namely numerical dtypes, boolean dtype, string dtype, and categorical dtypes.
The end goal of this first post is to make you more comfortable with the various data types available in pandas and how they differ.
If you’re interested in pandas and time-series, make sure to check out my Fourier-transform for time-series posts:
- Review how the convolution relates to the Fourier transform and how fast it is:
Fourier transform for time-series: fast convolution explained with numpy
- Deepen your understanding of convolution using image examples:
Fourier-Transform for Time Series: About Image Convolution and SciPy
- Understand how the Fourier-transform can be visually understood using a vector-visual approach:
Fourier-transform for time-series : plotting complex numbers
- See how detrending techniques can greatly improve the output of the Fourier-transform:
Review of the available dtypes
Let’s take a minute to review the dtypes pandas offers. Since pandas is built on top of NumPy, they can be split into 2 categories:
- numpy-based dtypes
- pandas-specific dtypes
Under the hood, pandas stores your data in numpy arrays, hence uses numpy’s dtypes. We can consider those the ultimate low-level dtypes. But for convenience, pandas also exposes higher-level dtypes that were specifically created by the pandas team.
The idea is that from your point of view, the end-user, all dtypes (numpy-based or pandas-specific) are equally valid dtypes for your series and dataframes.
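For instance, here is a minimal sketch (the column names are made up for the example) of both kinds of dtypes coexisting in the same dataframe:
import pandas as pd
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),    # numpy-based dtype
    "b": pd.Series([1, 2, None], dtype="Int64"), # pandas-specific nullable dtype
})
df.dtypes  # a: int64, b: Int64; both are perfectly valid column dtypes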
The numpy-based dtypes are the following:
- **float**: to store floating-point numbers (0.0245)
- **int**: to store integer numbers (1, -6)
- **bool**: to store booleans (True/False)
- **datetime64[ns]**: to store an instant of our timeline (date and time of day)
- **timedelta64[ns]**: to store a relative duration (this complements the datetime dtype)
- **object**: can store literally any python object
The actual complete list of numpy dtypes is bigger than that, but we’ll stick to those above.
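To make this list more concrete, here is a small sketch showing how each of these numpy-based dtypes shows up when creating simple series (the values are just examples):
import pandas as pd
pd.Series([0.0245, 1.5]).dtype                   # float64
pd.Series([1, -6]).dtype                         # int64
pd.Series([True, False]).dtype                   # bool
pd.Series(pd.to_datetime(["2023-01-01"])).dtype  # datetime64[ns]
pd.Series(pd.to_timedelta(["1 day"])).dtype      # timedelta64[ns]
pd.Series([{"any": "python"}, [1, 2]]).dtype     # object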
Then pandas comes into play and exposes many new dtypes, including:
- **string**: another way to store strings
- **Nullable-int** and **Nullable-float**: better at handling missing values than numpy's **int** and **float**
- **boolean**: a boolean dtype that is better at handling missing values than numpy's **bool**
- **categorical**: a dtype appropriate for data that can only take specific values, and that can handle missing entries
Again, other dtypes are defined in pandas, but I want to focus on these 4 because I find they are the most useful.
How dtypes are set
In practice, the dtype of your data can have 2 origins:
- Either you didn’t specify the dtype, and pandas made an assumption when the series/dataframe was created (whether from loading a csv or from creating an object like s = pd.Series([1, 2, 3])). This will work well about 50% to 80% of the time, depending on how well the input data is formatted.
- Or you specify the dtype, explicitly telling pandas what dtype to use for each column/series.
For the first case:
- PROS: simpler and faster
- CONS: you don’t know what happened unless you review each dtype afterward, you might not have the most appropriate dtype (like `object` because pandas could not figure out what the column was about), and even worse, some columns might end up with a completely inappropriate dtype
For the second case:
- PROS: you know exactly what happened to your data, what dtypes the columns have, and how you treated missing and ill-conditioned values. In other words, your data is ready for actual processing
- CONS: it takes more time, more code
In the end, I’d recommend the following: first, let pandas infer dtypes and do its best. Then review each column and manually set the dtypes you think should be changed.
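As a minimal sketch of that workflow (the file name and column names here are hypothetical):
import pandas as pd
df = pd.read_csv("data.csv")  # hypothetical file; pandas infers the dtypes
print(df.dtypes)              # review what pandas came up with
# then override the columns you disagree with (hypothetical column names)
df = df.astype({"age": "Int64", "city": "category", "comment": "string"})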
We’ll go through all of this in detail in the next post; for now, I want to review the available dtypes and why they exist.
Since most of the time I end up using only a subset of all the dtypes available, I’ll focus on those:
- numerical dtypes, mostly float and int
- boolean dtypes
- string-like (including string and object) dtypes
- categorical dtype
Numerical-like dtypes
Regarding numerical dtypes, numpy provides a solid base: integer and floating-point dtypes, with np.nan available on the float side. Note also that by default, the numerical dtypes used will be int64 and float64, which might not be ideal for your data in terms of memory usage and of the values a given Series is allowed to hold:
pd.Series([1, 2, 3])                     # dtype: int64
pd.Series([1, 2, 3], dtype="int")        # dtype: int64
pd.Series([1, 2, 3], dtype="float")      # dtype: float64
pd.Series([1, 2, np.nan])                # dtype: float64
pd.Series([1, 2, np.nan], dtype="int")   # dtype: float64 (recent pandas versions raise instead, since NaN cannot be cast to int)
pd.Series([1, 2, np.nan], dtype="float") # dtype: float64
pd.Series([1, 2, pd.NA])                 # dtype: object
# fails since pd.NA cannot be converted to an int
# pd.Series([1, 2, pd.NA], dtype="int")
# fails since pd.NA cannot be converted to a float
# pd.Series([1, 2, pd.NA], dtype="float")
Note that as soon as _**np.nan**_ is used, the dtype is converted to float. For this reason, among others, pandas also exposes dtypes that can handle missing values using the pandas-specific _**pd.NA**_ while keeping an explicit underlying numerical dtype, like _**Int64Dtype**_ (rationale here).
pd.Series([1, 2, 3, np.nan], dtype="Int64") # dtype: Int64
pd.Series([1, 2, 3, pd.NA], dtype="Int64") # dtype: Int64
pd.Series([1, 2, 3, np.nan], dtype="Float32") # dtype: Float32
pd.Series([1, 2, 3, pd.NA], dtype="Float32") # dtype: Float32
This way, we get the best of both worlds: an explicit underlying numerical dtype (int64, float32, etc.), and missing values handled with _**pd.NA**_.
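As a side note on memory, here is a small sketch (exact byte counts depend on your platform and pandas version) showing that choosing a smaller integer dtype shrinks a series, and that the nullable Int8 keeps most of that benefit while still accepting pd.NA:
import pandas as pd
s = pd.Series([1, 2, 3] * 1000)           # dtype: int64, 8 bytes per value
s.memory_usage(deep=True)                 # ~24000 bytes, plus a small index overhead
s.astype("int8").memory_usage(deep=True)  # ~3000 bytes, plus index overhead
s.astype("Int8").memory_usage(deep=True)  # ~6000 bytes (values + NA mask), plus index overhead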
Boolean-like dtypes
There are basically 2 dtypes that act like booleans, namely `bool` and `boolean`.
`bool` corresponds to the standard numpy-based boolean, and hence cannot contain NA, since only True and False can be stored in a numpy boolean array.
To handle 'not-available' entries (NA), pandas exposes the `boolean` dtype, which can contain pd.NA. See the example below:
pd.Series([True, False], dtype="bool")            # numpy-boolean
pd.Series([True, False, pd.NA], dtype="boolean")  # pandas-boolean
pd.Series([True, False, np.nan], dtype="boolean") # pandas-boolean
# this cannot work, since numpy does not handle pd.NA
# pd.Series([True, False, pd.NA], dtype="bool")
# this works, but np.nan is silently converted to True...
# pd.Series([True, False, np.nan], dtype="bool")  # --> dtype: bool, with np.nan turned into True
# if no dtype is passed, pandas tries to infer an appropriate one
pd.Series([True, False])                          # --> dtype: bool
pd.Series([True, False, pd.NA])                   # --> dtype: object, not boolean...
pd.Series([True, False, np.nan])                  # --> dtype: object, not boolean...
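One practical consequence, shown in this small sketch, is that the pandas `boolean` dtype keeps the missing entry as pd.NA instead of silently coercing it, so you can decide later how to treat it:
import pandas as pd
mask = pd.Series([True, False, pd.NA], dtype="boolean")
mask.isna()         # [False, False, True]: the missing entry is preserved
mask.fillna(False)  # [True, False, False], still dtype boolean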
String-like dtypes
Natively, strings can be stored in numpy arrays since we can use the `object` dtype, which exists to handle "any python object".
Strings differ from most other dtypes in that one cannot know in advance the length of a string, and hence how much memory to allocate. That is also the case for any custom python object, which can be very simple or very complex and require varying amounts of memory.
So we can use the `object` dtype in pandas since it is available in numpy:
pd.Series(['toto', 'titi', 'tata']) # --> dtype: object
pd.Series(['toto', 'titi', 'tata'], dtype="object") # --> dtype: object
pd.Series(['toto', 'titi', 'tata'], dtype="str") # --> dtype: object
pd.Series(['toto', 'titi', np.nan], dtype="str") # --> dtype: object
pd.Series(['toto', 'titi', pd.NA], dtype="str") # --> dtype: object
Additionally, pandas created a dtype that makes it explicit that the data is a string: the _**StringDtype**_, which can be specified as _**string**_. This is better since "explicit is better than implicit", and it interfaces better with the rest of pandas’ ecosystem (rationale here):
pd.Series(['toto', 'titi', 'tata'], dtype="string") # --> dtype: string
pd.Series(['toto', 'titi', np.nan], dtype="string") # --> dtype: string
pd.Series(['toto', 'titi', pd.NA], dtype="string") # --> dtype: string
Note that most of the time, you’ll need to challenge yourself by asking: do I really need this series to be stored as strings? Are those strings just representations of other data, like numerical or categorical data? If so, you’ll want to convert those series to more appropriate dtypes. Subscribe to see my next post on how to do that!
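As a quick teaser (a minimal sketch with made-up values), converting such "strings in disguise" typically boils down to pd.to_numeric or an astype call:
import pandas as pd
s = pd.Series(["1", "2", "3"], dtype="string")
pd.to_numeric(s)                                               # numeric dtype (int64 or Int64, depending on your pandas version)
pd.Series(["M", "F", "M"], dtype="string").astype("category")  # dtype: category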
The joker dtype: Categorical
I’d suggest trying this: open up one of your datasets in pandas, review the columns one by one, and ask yourself: could this feature be stored as a Categorical dtype?
I’m willing to bet that you’ll say yes to that question more than you expected.
This dtype is especially suitable for things you’d usually store as integer-like numbers (0, 1, 2, etc.) and/or strings (‘Male’/’Female’, ‘Dog’/’Cat’/’Other’).
This dtype can greatly improve speed and memory usage in some cases. It also, again, makes it explicit to others (humans and computers) that this particular data represents a category and should be treated as such.
pd.Series([1, 2, 3, 2], dtype="category") # dtype: category, with 3 int64 possible values [1, 2, 3]
pd.Series([1, 2, 'a', 1], dtype="category") # dtype: category, with 3 object-like possible values [1, 2, 'a']
pd.Series(["a", "b", "c", "a"], dtype="category") # dtype: category, with 3 object-like possible values ['a', 'b', 'c']
pd.Series(["M", "F", "M", "F"], dtype="category") # dtype: category, with 2 object-like possible values ['M', 'F']
Wrap up
Other dtypes are implemented in pandas, but I found that they are not used as much as the ones above.
So remember:
- Good dtypes are critical to your processing: reviewing and setting the right ones as early as possible will make your work down the road a lot easier for everyone. It can also save a lot of memory and processing complexity
- Before using the `object` or `string` dtype, consider using the `categorical` dtype
- If dealing with missing values or NaN, consider using pandas dtypes like `boolean` as opposed to numpy’s `bool`
- Only use the `object` dtype if your data is complex and/or does not fit any other dtype
In the next post, we’ll see how to inspect the dtypes of existing Series/DataFrame, and how to change them to convert to other dtypes.
If you’re considering joining Medium to get unlimited access to all of my and other writers’ posts, use this link to quickly subscribe and become one of my referred members:
and then subscribe to get notified of future posts:
Finally, check out some of my other posts:
300-times faster resolution of Finite-Difference Method using numpy
PCA/LDA/ICA : a components analysis algorithms comparison