The Data Science Trilogy
NumPy, Pandas and Matplotlib basics

So you are new to Python. Or perhaps you are already familiar with these libraries, but wanted to get a quick refresher. Whatever the case may be, Python has become without a doubt one of the most popular programming languages today, as shown by the following graph from Stack Overflow Trends:

Among the most popular Python packages used in data science and machine learning are NumPy, Pandas and Matplotlib. In this article, I’ll briefly provide a zero-to-hero (pun intended, wink wink 😉) introduction to all the basics you need to get started with Python for data science. Let’s get started!
Set-up
To begin with, you will need to install the packages. If you are completely new to Python, I recommend following this tutorial, or any other that you find suitable. You can use the following pip commands:
pip install numpy -U
pip install pandas -U
pip install matplotlib -U
Once that’s done, we will now have the following imports:
Here, we used the common alias given to each of the packages.
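The import block itself is not shown above. Assuming the conventional aliases (np, pd and plt), it would presumably look like this:

```python
# Conventional aliases used throughout the data science ecosystem
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```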
Numpy

NumPy is one of the main libraries used in machine learning and data science: it is used for a variety of mathematical computations, and its core is written in optimized C code.
Numpy Arrays
Arrays are simply collections of objects. A 1-rank array is a list. A 2-rank array is a matrix, or a list of lists. A 3-rank array is a list of lists of lists, and so on.
We can create a NumPy array with the np.array() constructor, passing a regular Python list as its argument:
One of the most commonly used properties of a NumPy array is its shape, which gives the size of the array along each dimension:
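As a small sketch (the values are made up for illustration), creating a flat array and inspecting its shape could look like this:

```python
import numpy as np

# Build a flat (1-D) array from a regular Python list
a = np.array([1, 2, 3, 4])

print(a.shape)  # (4,)  -> one dimension holding 4 elements
print(a.ndim)   # 1     -> the rank (number of dimensions)
```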
We get a tuple with one entry per dimension; the length of this tuple is the rank (number of dimensions) of the array. In this case, the array above is one-dimensional, also called a flat array. We can also use a list of lists to obtain a clearer matrix-like shape:
In this case, this looks more like a row vector. Similarly, we can initiate a column vector as follows:
We can also initialize full matrices the same way:
Of course, this is just a tuple:
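A sketch of the three cases described above. The numbers inside M1 are hypothetical, since the original matrix is not shown:

```python
import numpy as np

row = np.array([[1, 2, 3]])       # a list containing one list   -> shape (1, 3)
col = np.array([[1], [2], [3]])   # three one-element lists      -> shape (3, 1)
M1 = np.array([[1, 2, 3],
               [4, 5, 6]])        # hypothetical values          -> shape (2, 3)

print(row.shape, col.shape, M1.shape)
print(type(M1.shape))  # <class 'tuple'>
```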
Reshaping an array
We can use the np.reshape function to change the dimensions of an array, as long as the new shape contains the same number of elements. E.g., reshaping the M1 matrix into a row vector:
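For instance, assuming M1 is a 2x3 matrix (hypothetical values), the reshape could be sketched as:

```python
import numpy as np

M1 = np.array([[1, 2, 3],
               [4, 5, 6]])  # hypothetical 2x3 matrix

# 2 * 3 = 6 elements, so (1, 6) is a valid target shape
row = np.reshape(M1, (1, 6))
print(row)  # [[1 2 3 4 5 6]]

# -1 asks NumPy to infer that dimension from the number of elements
auto = M1.reshape(1, -1)
```

Asking for a shape with a different number of elements, such as (1, 5), raises a ValueError.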
Flat array to vector
We can change a flat array into a 2D array (vector) as follows:
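One possible way to do this, using reshape with -1 so that NumPy infers the missing dimension:

```python
import numpy as np

flat = np.array([1, 2, 3])   # shape (3,)

col = flat.reshape(-1, 1)    # column vector, shape (3, 1)
row = flat.reshape(1, -1)    # row vector, shape (1, 3)

print(flat.shape, col.shape, row.shape)
```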
Numpy arrays are not lists
Note that there is a crucial difference between lists and NumPy arrays!
One thing we can see straight away is the printing style. We also have very different behaviour:
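A short sketch of both differences, with made-up values:

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array([1, 2, 3])

# Printing style: lists show commas, arrays do not
print(py_list)  # [1, 2, 3]
print(arr)      # [1 2 3]

# "+" concatenates lists but adds arrays element by element
list_sum = py_list + py_list   # [1, 2, 3, 1, 2, 3]
arr_sum = arr + arr            # array([2, 4, 6])

# "*" repeats a list but scales an array
list_times = py_list * 2       # [1, 2, 3, 1, 2, 3]
arr_times = arr * 2            # array([2, 4, 6])
```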
Dot Product
We can "multiply" arrays whose neighbouring dimensions match with the np.dot(a1, a2) function, where a1.shape[1] == a2.shape[0] for 2D arrays. For example, the mathematical dot product of two n-dimensional vectors a and b (often used in machine learning) is given by

$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$

With a loop, this would look like this:
Better way to do it:
The difference in speed becomes noticeable with a lot of data, as can be seen in the example below:
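Here is a minimal sketch of the comparison: an explicit loop, the np.dot call, and a rough timing of both. The helper name dot_loop and the array size are my own choices, and the exact timings will vary from machine to machine:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)

def dot_loop(x, y):
    """Dot product with an explicit Python loop (hypothetical helper)."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total

t0 = time.perf_counter()
loop_result = dot_loop(a, b)
t1 = time.perf_counter()
np_result = np.dot(a, b)
t2 = time.perf_counter()

print(f"loop:   {t1 - t0:.5f} s")
print(f"np.dot: {t2 - t1:.5f} s")

# np.dot also handles 2-D arrays when the inner dimensions match
A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)
C = np.dot(A, B)  # shape (2, 2), since A.shape[1] == B.shape[0]
```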
This is because np.dot runs in optimized, compiled code instead of a Python-level loop, and can take advantage of vectorization and parallelization.
Generating random arrays
One way to do this is with the np.random.randn() function:
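A small sketch; the seed and the 3x2 shape are arbitrary choices for illustration:

```python
import numpy as np

np.random.seed(0)  # optional: makes the draw reproducible

# randn takes the dimensions as separate integers, not as a tuple
X = np.random.randn(3, 2)  # samples from a standard normal distribution

print(X.shape)  # (3, 2)
```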
Pandas

Pandas is a library for data manipulation. It is highly compatible with Numpy, although it has some subtleties.
Initializing tables from NumPy arrays
You can pass a rank-2 NumPy array to the pd.DataFrame() constructor to initialize a pandas data-frame object.
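For example, with a made-up 4x3 array:

```python
import numpy as np
import pandas as pd

data = np.arange(12).reshape(4, 3)  # hypothetical rank-2 array
df = pd.DataFrame(data)

print(df.shape)          # (4, 3)
print(list(df.columns))  # [0, 1, 2] -> default integer column names
```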

Renaming Columns
We can access the column names of the table using the .columns attribute:
We can rename the columns all at once by specifying a list with corresponding names:

Now our data-frame has the input column names.
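A sketch of both steps, using placeholder column names of my own choosing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))
print(list(df.columns))  # [0, 1, 2]

# Assign one name per column, in order
df.columns = ["a", "b", "c"]
print(list(df.columns))  # ['a', 'b', 'c']
```

The list must have exactly as many names as there are columns, otherwise pandas raises a ValueError.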
Column selection
We can select columns from a pandas dataframe using the df['colname'] syntax. The result is a Pandas Series object, which is similar to a flat array, but integrates a number of properties and methods.
There are a number of built-in methods (see documentation), but as an illustration, we can obtain summary statistics by using the describe() method:
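A quick sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [10, 20, 30, 40]})  # made-up values

col = df["a"]                # selecting one column returns a Series
print(type(col).__name__)    # Series

stats = col.describe()       # count, mean, std, min, quartiles, max
print(stats["mean"])         # 2.5
```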
Subsetting multiple columns
Subsetting multiple columns with a list of defined columns outputs a pandas dataframe object.

Note that the columns don’t have to be in any particular order; this is useful to re-order the table as we like:
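For instance, with placeholder column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A list of names inside the brackets returns a DataFrame, not a Series
subset = df[["c", "a"]]

print(type(subset).__name__)  # DataFrame
print(list(subset.columns))   # ['c', 'a'] -> order follows the list
```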

Assigning columns
Suppose we would like to obtain all the rows where a certain condition is satisfied. How can we do this? Take the following dummy data as an example:

How do we add an extra column to this data? We can simply assign an array, list or series of a compatible size (number of rows) with the df['new_col'] = new_col syntax:
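The original dummy table is not shown, so the sketch below uses a hypothetical stand-in. The column names Service, Consumption and Impact and all of the values are assumptions:

```python
import pandas as pd

# Hypothetical dummy data (not the article's original table)
df = pd.DataFrame({
    "Service": ["service1", "service2", "service1", "service3"],
    "Consumption": [-0.5, 1.2, 0.3, -1.1],
    "Impact": [0.4, 1.5, 2.0, 0.1],
})

# The new column needs one value per existing row
df["Request"] = ["yes", "no", "no", "yes"]

print(df)
```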

Now suppose that we would like to fetch all records which satisfy a certain condition. This kind of selection can be done using the following syntax:
df[condition]
where condition is an array or Series of booleans. Thanks to NumPy broadcasting, comparing a NumPy array or pandas Series to a single value produces an array of the same kind in which every element has been compared to that value.
For example, comparing the Request column to the value no gives:
This indicates all the places where the condition is true. We can then use this array to select only the rows that are True, using the syntax mentioned above:
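A sketch of both steps on hypothetical data (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Impact": [0.4, 1.5, 2.0, 0.1],
    "Request": ["yes", "no", "no", "yes"],
})

# Comparing a column to a single value yields one boolean per row
mask = df["Request"] == "no"
print(mask.tolist())  # [False, True, True, False]

# Indexing with the mask keeps only the rows where it is True
selected = df[mask]
print(selected)
```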

We can use logic to produce more complex queries. For example: fetch all the rows which have a negative consumption index or an impact score greater than 1, and which have a service type of service1:

Here, we can code all three conditions required to extract the information.
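With hypothetical column names (Service, Consumption, Impact) and made-up values, the query could be sketched as follows. Note that pandas uses | and & for element-wise "or" and "and", and each comparison needs its own parentheses:

```python
import pandas as pd

# Hypothetical data standing in for the article's table
df = pd.DataFrame({
    "Service": ["service1", "service2", "service1", "service3"],
    "Consumption": [-0.5, 1.2, 0.3, -1.1],
    "Impact": [0.4, 1.5, 2.0, 0.1],
})

condition = (
    ((df["Consumption"] < 0) | (df["Impact"] > 1))  # negative consumption OR impact > 1
    & (df["Service"] == "service1")                 # AND service type is service1
)

result = df[condition]
print(result)
```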

Pandas apply
Following the previous example, imagine that we would like to apply a certain function to every value of a specific column, and obtain a series in which each element is the output of the applied function on the corresponding element. For these cases, we can use the apply() method of Pandas Series objects. In a data-frame, it would have the following syntax:
df['column'].apply(func)
where func can be any function which returns a single output.
Example: Map each of the services to a certain bin, and assign the resulting column to the dataframe.
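A sketch of this example. The bins, the mapping and the helper name to_bin are all hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Service": ["service1", "service2", "service1", "service3"]})

# Hypothetical mapping from service name to bin label
service_to_bin = {"service1": "A", "service2": "B", "service3": "B"}

def to_bin(service):
    """Return the bin for one service; unknown services fall into 'other'."""
    return service_to_bin.get(service, "other")

# apply() calls to_bin once per element and collects the results in a Series
df["Bin"] = df["Service"].apply(to_bin)
print(df)
```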

Pandas Loc
Now imagine you would like to change all rows in a certain column when those rows satisfy a specified condition. The syntax for this is
df.loc[condition, 'column'] = new_val
Example: Set the Request value to "no" if Factor is of type 1:
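With made-up values for the Factor and Request columns, this could look like:

```python
import pandas as pd

df = pd.DataFrame({
    "Factor": [1, 2, 1, 3],
    "Request": ["yes", "yes", "no", "yes"],
})

# Rows where Factor == 1 get their Request value overwritten
df.loc[df["Factor"] == 1, "Request"] = "no"

print(df["Request"].tolist())  # ['no', 'yes', 'no', 'yes']
```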

Sorting a dataframe
To sort a dataframe in a certain order, we can use the sort_values() method. Use the inplace=True argument to change the original dataframe. You can sort by more than one column by specifying them in a list.
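A short sketch with hypothetical data, showing a multi-column sort and an in-place sort:

```python
import pandas as pd

df = pd.DataFrame({"Service": ["b", "a", "b", "a"],
                   "Impact": [2, 3, 1, 4]})

# Sort by Service first, then by Impact within each service (returns a copy)
sorted_df = df.sort_values(["Service", "Impact"])
print(sorted_df.index.tolist())  # [1, 3, 2, 0]

# Sort the original dataframe itself, largest Impact first
df.sort_values("Impact", ascending=False, inplace=True)
print(df.index.tolist())  # [3, 1, 0, 2]
```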

We see that the values have been updated accordingly.
Resetting the index
When sorting values, the rows keep their original index labels, so the index is no longer in order. We can reset it so that we obtain a sorted index once more:
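For example, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Impact": [2, 3, 1]}).sort_values("Impact")
print(df.index.tolist())      # [2, 0, 1] -> labels travel with their rows

reset = df.reset_index()
print(reset.index.tolist())   # [0, 1, 2]
print(list(reset.columns))    # ['index', 'Impact'] -> old index kept as a column
```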

Dropping columns
When resetting the index, the old index is kept as a column. We can drop any columns with the drop() method.
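A sketch of dropping the leftover index column, again with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Impact": [2, 3, 1]}).sort_values("Impact")
reset = df.reset_index()               # adds the old index as an "index" column

cleaned = reset.drop(columns=["index"])
print(list(cleaned.columns))           # ['Impact']

# Alternatively, skip the extra column altogether
direct = df.reset_index(drop=True)
```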

Aggregations
Another useful thing to do with Pandas is aggregation. Suppose we would like to aggregate Impact counts for each of the EDX groups. The syntax is as follows:
df.groupby("column").agg_function().reset_index()
The agg_function() is a pandas function such as sum() or mean() (see this article).
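With hypothetical EDX groups and Impact values, a sum aggregation could be sketched as:

```python
import pandas as pd

df = pd.DataFrame({"EDX": ["g1", "g2", "g1", "g2"],
                   "Impact": [1, 2, 3, 4]})  # made-up values

# Sum Impact within each EDX group, then turn the group labels back into a column
agg = df.groupby("EDX").sum().reset_index()
print(agg)
```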

Matplotlib

As its name implies, this is Python’s main plotting library. A number of other libraries such as seaborn rely on this.
- Documentation: https://matplotlib.org/
Scatter plot
A scatter plot is a 2D plot made by specifying two arrays or lists of identical length. For instance, we could compare the consumption index to its impact.
Here are some common Pyplot functions:
- plt.figure(): specifies the reference canvas. Passing figsize=(a,b) changes the plotting canvas dimensions.
- plt.scatter(x,y,color="b", marker=""): produces the scatter plot. marker specifies the shape of the plotted points.
Code output using Pandas (Image by author)
Of course, it would make more sense with more data. Let’s generate this artificially:
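A minimal sketch with artificially generated data. The linear relationship between the two variables, the axis labels and the "o" marker are assumptions for illustration only:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
consumption = rng.standard_normal(100)               # artificial consumption index
impact = 2 * consumption + rng.standard_normal(100)  # artificial, noisy impact

fig = plt.figure(figsize=(6, 4))
plt.scatter(consumption, impact, color="b", marker="o")
plt.xlabel("Consumption index")
plt.ylabel("Impact")
plt.title("Consumption vs. impact")
# plt.show() would display the figure in an interactive session
```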

Plot
If we would like to see the relationship as a line, it is sometimes preferable to use the plot function instead. The syntax is similar.

Caution: Matplotlib’s plot method will plot the given points in the order they are given.
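One way to avoid a zig-zagging line is to sort the points by their x values before plotting. A sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.array([3, 1, 2])   # deliberately unsorted
y = x ** 2

order = np.argsort(x)     # indices that sort x in ascending order

plt.figure()
line, = plt.plot(x[order], y[order], color="b")  # points are now joined left to right
```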
Overlaid plots
We can overlay two plots which share the same x-axis values just by calling the plot function twice, once with each set of y values:
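For example, using sine and cosine curves as stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)  # shared x-axis values

plt.figure()
plt.plot(x, np.sin(x), label="sin(x)")  # first line
plt.plot(x, np.cos(x), label="cos(x)")  # second line, drawn on the same axes
plt.legend()
ax = plt.gca()
```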

Barplot
One of the advantages of Pandas is that it integrates a number of common Matplotlib plots. For instance, we can plot a barplot with the plot() method. This can be combined with other matplotlib functions such as title(), xlabel(), etc.

For instance, we can use the df.plot.bar() method to create a barplot directly from the dataframe:

The figsize argument is the same as in figure(figsize=()), and rot indicates the rotation of the labels. Of course, you can guess what x and y are used for here.
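A sketch of such a call on hypothetical aggregated data (the EDX groups and Impact values are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"EDX": ["g1", "g2", "g3"],
                   "Impact": [4, 6, 2]})

# x: column used for the bar labels, y: column used for the bar heights
ax = df.plot.bar(x="EDX", y="Impact", figsize=(6, 4), rot=0)
ax.set_title("Impact per EDX group")
ax.set_ylabel("Impact")
```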
Grouped barplots can also be overlaid.

Other examples can be found in the documentation.
Piechart
Using the same dataframe design, we can easily plot pie charts as follows:


Enabling subplots=True allows for plotting different pie charts for different count columns.
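As a sketch, with two hypothetical count columns named CountA and CountB:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts; the index provides the slice labels
df = pd.DataFrame({"CountA": [3, 5, 2],
                   "CountB": [1, 4, 5]},
                  index=["g1", "g2", "g3"])

# One pie chart per column
axes = df.plot.pie(subplots=True, figsize=(8, 4), legend=False)
```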
Bonus: Seaborn!
At times, there are some plots that are just inherently prettier or easier to produce with Seaborn, a high-level Matplotlib wrapper library.
Regression Plots
For this example, we will use the Tips dataset provided by Seaborn.


Passing the x_bins argument aggregates the continuous x data into bins and plots the mean of each bin together with a confidence interval:

Box Plots
Box plots are a quick way to visualize common distribution statistics of a vector of data, as illustrated in the following picture:


Distribution Plots
Another interesting way to obtain a full picture of the distribution of some data is by using a displot.

Pairplots
Pairplots allow us to simultaneously see the correlation between our dataframe columns and the distributions of each, represented in a variety of ways. We will illustrate this with seaborn’s preloaded penguins data.

We can simply use the pairplot function from Seaborn to produce the following plot:

Using the hue argument allows us to specify groups in the plot as well:

All the labeling and the details are done automatically. You can also change the type of distribution plot in the diagonal by using diag_kind="hist" as an argument.

In order to plot kernel density estimations, you can also specify kind="kde" as an argument.

Correlation Plots
We can produce a correlation matrix for quick analysis by using the corr() method on a pandas dataframe, which will produce a matrix of correlation values between the numeric columns. Then, we can simply pass this matrix to the sns.heatmap() function with the annot=True argument to produce the plot below:
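A sketch of the correlation step on made-up numeric data. The heatmap call is left as a comment so that the snippet runs without seaborn installed:

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls with x
df = pd.DataFrame({"x": [1, 2, 3, 4],
                   "y": [2, 4, 6, 8],
                   "z": [4, 3, 2, 1]})

corr = df.corr()  # pairwise Pearson correlation between the columns
print(corr)

# With seaborn imported as sns, the matrix could then be drawn with:
# sns.heatmap(corr, annot=True)
```

If the dataframe also contains non-numeric columns, recent pandas versions need corr(numeric_only=True).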

Lineplots
Lineplots are useful for visualizing time-series-like data; i.e. organized from one point to another. Here’s an example taken from the Seaborn documentation:
First, we load the FMRI dataset (see source).

Next, we use the lineplot function, specifying the time-related variable timepoint, along with the signal variable as the "response". Since this one is a categorical variable, we obtain multiple lines for each! Further, we also specify the hue to be the region, that is, the grouping. This results in a very nice plot with four different trends nicely grouped:

What goes next?
And that’s it for this introduction! Of course, no single article will cover absolutely everything, so here are a couple of other great articles so that you can continue your learning journey:
Sources
- Python logo
- Trends graph
- NumPy logo
- Pandas logo
- Matplotlib logo
- Boxplot & normal distribution image
- FMRI dataset