
1. Reading multiple files

Let’s say we would like to read multiple files from a folder into a pandas DataFrame. Python’s built-in pathlib can list all the files in a folder, but I find glob easier to use. glob supports Unix shell-style wildcard patterns, which gives us granular control over which files we would like to read. Note that there is no need to extract zipped files, as pandas can decompress them automatically while reading.
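As a quick aside, here is a minimal sketch of that last point: pd.read_csv can read one of the gzipped files from this dataset directly, with no manual extraction needed (the variable name is just for illustration).
import pandas as pd

# compression is inferred from the .gz extension, so the file is decompressed on the fly
single_day = pd.read_csv('data/20191209.csv.gz')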
To specify which files we would like to read, we pass the *.gz pattern to glob.
import glob

filepaths = glob.glob('data/*.gz')
# Output
['data/20191210.csv.gz', 'data/20191209.csv.gz', 'data/20191211.csv.gz']
glob detects the files, but they are not ordered: the file for 2019-12-10 comes before 2019-12-09 in the list. This is an issue when working with time-series data, as in this example.
To get the correct order of time-series data, we have two options (one better than the other):
- sort files by names and read them in sorted order (bad),
- sort data by a column after all data is read in a DataFrame (preferred).
1.1 Sorting by filenames
Let’s sort the files by name to show why it is a bad idea.
filepaths = sorted(filepaths)
# Output
['data/20191209.csv.gz', 'data/20191210.csv.gz', 'data/20191211.csv.gz']
In this example, sorting the filenames happens to work because the dates are zero-padded, but we need to be aware that sorting strings is not the same as sorting integers, as shown in the example below. This happens because strings are compared character by character. It is a bad idea to rely on filenames for the correct order of the data.
In the example below, we sort strings and integers, respectively.
sorted(['100', '200', '1000', '20000', '9'])
# Output
['100', '1000', '200', '20000', '9']
sorted([100, 200, 1000, 20000, 9])
# Output
[9, 100, 200, 1000, 20000]
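If we really had to order files by their numeric names, a safer option (just a sketch, not used in the rest of this walkthrough) is to sort with a numeric key:
# convert each string to an integer before comparing
sorted(['100', '200', '1000', '20000', '9'], key=int)
# Output
['9', '100', '200', '1000', '20000']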
1.2 Sorting whole DataFrame by a column
As mentioned above, sorting a DataFrame by a column is a better option, but we need to be careful with the column data type.
Let’s read the data.
df_list = []
for filepath in filepaths:
    df_ = pd.read_csv(filepath)
    df_list.append(df_)
df = pd.concat(df_list)
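As a side note, pd.concat keeps each file's original row index here (visible later in the output). If a fresh 0..n-1 index is preferred, a sketch of the alternative:
# ignore_index=True builds a new RangeIndex instead of keeping each file's index
df = pd.concat(df_list, ignore_index=True)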
The DataFrame has multiple columns, and we would like to sort it by the timestamp column.
df.head()

Not so fast! Let’s look at the data type of the timestamp column.
df.timestamp
# Output
0 2019-12-09D00:00:18.265072000
1 2019-12-09D00:00:42.503391000
2 2019-12-09D00:01:50.513620000
3 2019-12-09D00:02:35.189775000
4 2019-12-09D00:03:49.934950000
...
389905 2019-12-11D23:35:26.637330000
389906 2019-12-11D23:35:26.637330000
389907 2019-12-11D23:37:25.166632000
389908 2019-12-11D23:40:09.701271000
389909 2019-12-11D23:55:46.793512000
Name: timestamp, Length: 1578327, dtype: object
The timestamp column has an object type. What happens when we sort an object type?
pd.Series(['100', '200', '1000', '20000', '9']).sort_values()
# Output
0 100
2 1000
1 200
3 20000
4 9
dtype: object
A pandas object type is used for text or mixed numeric and non-numeric values. To get the correct order of the timestamp column we need to change it to the datetime64 type.
df.timestamp = pd.to_datetime(df.timestamp.str.replace("D", "T"))
df.timestamp
# Output
0 2019-12-09 00:00:18.265072
1 2019-12-09 00:00:42.503391
2 2019-12-09 00:01:50.513620
3 2019-12-09 00:02:35.189775
4 2019-12-09 00:03:49.934950
...
389905 2019-12-11 23:35:26.637330
389906 2019-12-11 23:35:26.637330
389907 2019-12-11 23:37:25.166632
389908 2019-12-11 23:40:09.701271
389909 2019-12-11 23:55:46.793512
Name: timestamp, Length: 1578327, dtype: datetime64[ns]
The timestamp column now has the datetime64 type, so we can proceed with sorting.
df = df.sort_values("timestamp")
df.head()
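Along with eyeballing the head, a quick sanity check (just a sketch, not part of the original walkthrough) confirms that the column is now in chronological order:
# should return True after sorting by timestamp
df.timestamp.is_monotonic_increasing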

2. Descriptive statistics

describe is the first command I usually run when I’m not familiar with a dataset. I briefly review descriptive statistics like the count, min, and max values to see whether the data was read correctly.
df.describe()

What I didn’t notice for quite some time was that, by default, describe outputs descriptive statistics only for numeric columns. This makes sense because statistics like the mean cannot be calculated for other data types, such as strings.
When we list the columns, we see that columns like timestamp, symbol, etc. are not included above.
df.columns
# Output
Index(['timestamp', 'symbol', 'side', 'size', 'price', 'tickDirection', 'trdMatchID', 'grossValue', 'homeNotional', 'foreignNotional'], dtype='object')
Luckily, describe takes an include argument with which we can request all columns. The include argument also accepts a list of data types, so we can restrict the summary to a subset of dtypes (a sketch follows after the output below).
df.describe(include='all')
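For instance, a small sketch of the dtype-based variant (output not shown here):
import numpy as np

# summarize only the object (string) columns, e.g. symbol, side, trdMatchID
df.describe(include=['object'])

# or only the numeric columns, which is the default behavior
df.describe(include=[np.number])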

3. Visualizing outliers

The pandas plot function enables us to quickly visualize the distribution of a column.
To visualize outliers of a certain column we can use a box plot.
df["size"].plot(kind="box")

On the plot above, we can see that size has many outliers, which makes the box plot uninformative.
My first solution was to clip the values at the bottom and top 1% quantiles.
df['size_clip'] = df['size'].clip(df['size'].quantile(0.01), df['size'].quantile(0.99))
df.size_clip.plot(kind='box')

This makes the box plot more informative, but we need to experiment with different quantile values, and we also add a new column to the DataFrame, which raises the question of which column we should use later on. Is there a better way?
The plot function takes a ylim parameter with which we can set the boundaries of the y-axis. This makes the box plot more informative without adding a new column to the DataFrame.
df["size"].plot(kind="box", ylim=(0, 5000))

4. Binning data

Let’s say we would like to transform a numeric column to a discrete column.
4.1 Bin values into discrete intervals
The most straightforward way is to specify the number of intervals and bin the data accordingly. pandas has a cut function that does just that.
Let’s bin the price column into 5 discrete bins.
df['price_discrete_bins'] = pd.cut(df.price, 5)
# Output
0 (7462.2, 7575.6]
1 (7462.2, 7575.6]
2 (7462.2, 7575.6]
3 (7462.2, 7575.6]
4 (7462.2, 7575.6]
...
1309666 (7121.433, 7235.4]
1309667 (7121.433, 7235.4]
1309668 (7121.433, 7235.4]
1309669 (7121.433, 7235.4]
1309670 (7121.433, 7235.4]
Name: price_discrete_bins, Length: 1309671, dtype: category
Categories (5, interval[float64]): [(7121.433, 7235.4] < (7235.4, 7348.8] < (7348.8, 7462.2] < (7462.2, 7575.6] < (7575.6, 7689.0]]
As we would expect, the price_discrete_bins column has 5 distinct values, one for each bin.
df['price_discrete_bins'].value_counts()
# Output
(7121.433, 7235.4] 419820
(7235.4, 7348.8] 367043
(7462.2, 7575.6] 301654
(7348.8, 7462.2] 200231
(7575.6, 7689.0] 20923
Name: price_discrete_bins, dtype: int64
The cut function also enables us to specify bin labels. See also the retbins argument, which additionally returns the bin edges (a short sketch follows after the output below).
df['price_discrete_labels'] = pd.cut(df.price, 5, labels=['very low', 'low', 'mid', 'high', 'very high'])
# Output
0 high
1 high
2 high
3 high
4 high
...
1309666 very low
1309667 very low
1309668 very low
1309669 very low
1309670 very low
Name: price_discrete_labels, Length: 1309671, dtype: category
Categories (5, object): [very low < low < mid < high < very high]
df.price_discrete_labels.value_counts(normalize=True)
# Output
very low 0.320554
low 0.280256
high 0.230328
mid 0.152886
very high 0.015976
Name: price_discrete_labels, dtype: float64
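Here is the minimal retbins sketch mentioned above; it returns the binned column together with an array of bin edges, which should roughly match the intervals shown earlier.
# retbins=True returns a tuple: (binned values, array of bin edges)
price_bins, bin_edges = pd.cut(df.price, 5, retbins=True)
bin_edges
# Output (approximately)
array([7121.433, 7235.4, 7348.8, 7462.2, 7575.6, 7689.0])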
4.2 Bin values into equal-sized buckets
The most straightforward way is not always the best way when doing data analytics (or anything else, I suppose 😊). What I didn’t point out above is that the very high value (about 1.6% of rows) is significantly less represented than the other values (all above 15%). This could pose a problem if we intend to train a predictive model on this dataset:
- an algorithm could overfit on the "very high" value, because with so few samples, patterns would occur by chance,
- an algorithm could dismiss the occurrences of the "very high" value as noise and skip it altogether.
Is there a way to avoid this problem?
There is. Let’s use qcut to bin the price into equal-frequency bins, so that each bin contains (approximately) the same number of values.
df['price_discrete_equal_bins'] = pd.qcut(df.price, 5)
# Output
0 (7477.0, 7689.0]
1 (7477.0, 7689.0]
2 (7477.0, 7689.0]
3 (7477.0, 7689.0]
4 (7477.0, 7689.0]
...
1309666 (7121.999, 7212.5]
1309667 (7121.999, 7212.5]
1309668 (7121.999, 7212.5]
1309669 (7121.999, 7212.5]
1309670 (7121.999, 7212.5]
Name: price_discrete_equal_bins, Length: 1309671, dtype: category
Categories (5, interval[float64]): [(7121.999, 7212.5] < (7212.5, 7292.5] < (7292.5, 7348.5] < (7348.5, 7477.0] < (7477.0, 7689.0]]
df.price_discrete_equal_bins.value_counts()
# Output
(7121.999, 7212.5] 264009
(7292.5, 7348.5] 262610
(7348.5, 7477.0] 262396
(7477.0, 7689.0] 260412
(7212.5, 7292.5] 260244
Name: price_discrete_equal_bins, dtype: int64
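qcut accepts labels too, so here is a sketch combining equal-frequency bins with readable names (the column name is just for illustration):
# equal-frequency bins with human-readable labels
df['price_discrete_equal_labels'] = pd.qcut(
    df.price, 5, labels=['very low', 'low', 'mid', 'high', 'very high'])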
5. Working with missing data

In pandas, multiple values can denote a missing value (see the short sketch after this list):
- NaN for numeric/object missing values,
- NaT for datetime missing values,
- None, which comes from Python.
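A tiny sketch (with made-up values, not from the trading dataset) showing how the different markers end up in a DataFrame and how isna detects all of them:
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'value': [1.0, np.nan],                      # NaN marks the missing number
    'when': [pd.Timestamp('2019-12-09'), None],  # None becomes NaT in a datetime column
})
demo.isna()
# Output
#    value   when
# 0  False  False
# 1   True   True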
The reason for this mess is (from the pandas documentation): "The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas."
What surprised me is that None equals None in Python, but nan doesn’t equal nan in numpy.
import numpy as np

None == None
True
np.nan == np.nan
False
This is important to know so that we don’t try to filter missing values with == None:
df[df.some_column == None]
# instead use:
df[df.some_column.isnull()]
Why is NumPy’s nan not equal to itself? Let’s dig deeper. Some functions return nan when the result is not computable.
np.log(-1)
nan
np.arccos(2)
nan
Does this mean that the results of these two functions should be considered equal? Clearly not.
np.log(-1) == np.arccos(2)
False
A few important details on missing-value behavior, taken from the pandas documentation (illustrated in the short sketch after this list):
- NA groups in GroupBy are automatically excluded.
- When summing data, NA (missing) values will be treated as zero.
- If the data are all NA, the result will be 0.
- Cumulative methods like cumsum() and cumprod() ignore NA values by default but preserve them in the resulting arrays.
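A minimal sketch (on a made-up Series, not the trading data) of the summing and cumulative behavior:
s = pd.Series([1.0, np.nan, 2.0])
s.sum()     # 3.0 -> the NaN is treated as zero
s.cumsum()  # 1.0, NaN, 3.0 -> the NaN is skipped but preserved in the result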
Conclusion

These were a few pandas gotchas to be careful about when doing data analysis. Hopefully, you learned something new.
To run the examples, download this Jupyter notebook.