The Data Science Trilogy: NumPy, Pandas and Matplotlib basics

Img src: https://www.pexels.com/photo/business-charts-commerce-computer-265087/

So you are new to Python. Or perhaps you are already familiar with these libraries, but want a quick refresher. Whatever the case may be, Python has without a doubt become one of the most popular programming languages today, as shown by the following graph from Stack Overflow Trends:

Img src: Stack Overflow trends

Among the most popular Python packages used in data science and machine learning, we find NumPy, Pandas and Matplotlib. In this article, I’ll briefly provide a zero-to-hero (pun intended, wink wink 😉) introduction to all the basics you need to get started with Python for data science. Let’s get started!

Set-up

To begin with, you will need to install the packages. If you are completely new to Python, I recommend following this tutorial, or any other that you find suitable. You can use the following pip commands:

pip install numpy -U 
pip install pandas -U 
pip install matplotlib -U 

Once that’s done, we will now have the following imports:

Here, we used the common alias given to each of the packages.
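
A minimal sketch of these imports, assuming the aliases np, pd and plt that are conventional for the three packages:

```python
import numpy as np               # numerical arrays
import pandas as pd              # data frames
import matplotlib.pyplot as plt  # plotting interface

print(np.__name__, pd.__name__, plt.__name__)
```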

NumPy

Img src: https://github.com/numpy/numpy/blob/main/branding/logo/primary/numpylogo.svg

NumPy is one of the main libraries used in machine learning and data science: it supports a wide variety of mathematical computations, and its core is written in optimized C code.

NumPy Arrays

Arrays are simply collections of objects. A 1-rank array is a list. A 2-rank array is a matrix, or a list of lists. A 3-rank array is a list of lists of lists, and so on.

We can create a NumPy array with the np.array() constructor, passing a regular Python list as its argument:

One of the most commonly used properties of a NumPy array is its shape, which gives the size of the array along each dimension:

We get a tuple with one entry per dimension, so the length of the tuple is the rank of the array. In this case, the array above is one-dimensional, also called a flat array. We can also use a list of lists to obtain a more clearly matrix-like shape:

In this case, this looks more like a row vector. Similarly, we can initialize a column vector as follows:

We can also initialize full matrices the same way:

Of course, the shape is just a tuple:
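
The constructions described above can be sketched as follows. The values are made up for illustration; only the name M1 comes from the text.

```python
import numpy as np

flat = np.array([1, 2, 3])              # rank-1 (flat) array
row = np.array([[1, 2, 3]])             # list of lists: 1 row, 3 columns
col = np.array([[1], [2], [3]])         # 3 rows, 1 column
M1 = np.array([[1, 2, 3], [4, 5, 6]])   # full 2 x 3 matrix

print(flat.shape)      # (3,)
print(row.shape)       # (1, 3)
print(col.shape)       # (3, 1)
print(M1.shape)        # (2, 3)
print(type(M1.shape))  # <class 'tuple'>
```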

Reshaping an array

We can use the np.reshape function to change the dimensions of an array, as long as it contains the same number of elements and the dimensions make sense. E.g.: reshaping the M1 matrix into a row vector:
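
A minimal sketch of this reshape, assuming M1 is a 2 x 3 matrix with illustrative values:

```python
import numpy as np

M1 = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative values, shape (2, 3)
row = np.reshape(M1, (1, 6))           # same 6 elements, now a single row

print(row)        # [[1 2 3 4 5 6]]
print(row.shape)  # (1, 6)
```

Asking for a shape with a different number of elements, such as (1, 5), raises a ValueError.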

Flat array to vector

We can change a flat array into a 2D array (vector) as follows:
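
One possible way to do this, with made-up values, is the reshape method with -1 for the dimension that NumPy should infer:

```python
import numpy as np

flat = np.array([1, 2, 3])   # shape (3,)
col = flat.reshape(-1, 1)    # column vector: -1 lets NumPy infer the row count
row = flat.reshape(1, -1)    # row vector

print(flat.shape, col.shape, row.shape)  # (3,) (3, 1) (1, 3)
```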

NumPy arrays are not lists

Note that there is a crucial difference between lists and NumPy arrays!

One thing we can see straight away is the printing style. We also have very different behaviour:
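
A short illustration of both points, using made-up values: the + and * operators concatenate and repeat lists, but act element by element on arrays.

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array(lst)

print(lst)        # [1, 2, 3]
print(arr)        # [1 2 3]
print(lst + lst)  # [1, 2, 3, 1, 2, 3]  (concatenation)
print(arr + arr)  # [2 4 6]             (element-wise sum)
print(lst * 2)    # [1, 2, 3, 1, 2, 3]  (repetition)
print(arr * 2)    # [2 4 6]             (element-wise product)
```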

Dot Product

We can "multiply" arrays with matching adjacent dimensions using the np.dot(a1, a2) function, where a1.shape[1] == a2.shape[0] for 2D arrays. One example is the mathematical dot product of two n-dimensional vectors (often used in machine learning), which is the sum of their element-wise products, given by a · b = a_1 b_1 + a_2 b_2 + ... + a_n b_n:

(Image by the author using LaTeX)

With a loop, this would look like this:

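A better way to do it:

The difference in speed becomes noticeable with a lot of data, as can be seen in the example below:

This is because np.dot runs in optimized, compiled code (typically a BLAS routine) that can take advantage of vectorization and parallelization, instead of looping in the Python interpreter.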
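
The comparison can be sketched as follows. The helper dot_loop and the array size are assumptions chosen for illustration, and the timings will vary from machine to machine.

```python
import time
import numpy as np

def dot_loop(a, b):
    """Dot product with an explicit Python loop (hypothetical helper)."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

rng = np.random.default_rng(0)
a = rng.standard_normal(200_000)
b = rng.standard_normal(200_000)

t0 = time.perf_counter()
slow = dot_loop(a, b)
t1 = time.perf_counter()
fast = np.dot(a, b)
t2 = time.perf_counter()

print(f"loop:   {t1 - t0:.4f} s")
print(f"np.dot: {t2 - t1:.4f} s")
print(np.isclose(slow, fast))  # both approaches agree up to rounding
```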

Generating random arrays

One way to do this is with the np.random.randn() function:
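
For instance, the following draws a 3 x 2 array of samples from the standard normal distribution. The shape and the seed are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)          # optional: makes the draw reproducible
R = np.random.randn(3, 2)  # the arguments are the dimensions of the array

print(R.shape)  # (3, 2)
print(R)
```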

Pandas

Img src: https://pandas.pydata.org/about/citing.html

Pandas is a library for data manipulation. It is highly compatible with NumPy, although it has some subtleties.

Initializing tables from NumPy arrays

You can pass a rank-2 NumPy array to the pd.DataFrame() constructor to initialize a pandas data-frame object.
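
A minimal sketch with a made-up 4 x 3 array. Without further arguments, the rows and columns are labelled with integers starting at 0.

```python
import numpy as np
import pandas as pd

data = np.arange(12).reshape(4, 3)  # rank-2 array: 4 rows, 3 columns
df = pd.DataFrame(data)

print(df.shape)          # (4, 3)
print(list(df.columns))  # [0, 1, 2]
print(df)
```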

Code output (Image by author)

Renaming Columns

We can access the column names of the table using the .columns attribute:

We can rename the columns all at once by specifying a list with corresponding names:
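
As a sketch, with hypothetical column names a, b and c:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))
print(list(df.columns))  # [0, 1, 2]

# The list must contain exactly one name per existing column
df.columns = ["a", "b", "c"]
print(list(df.columns))  # ['a', 'b', 'c']
```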

Code output (Image by author)

Now our data-frame has the input column names.

Column selection

We can select columns from a pandas dataframe using the df['colname'] syntax. The result is a Pandas Series object, which is similar to a flat array, but integrates a number of properties and methods.

There are a number of built-in methods (see documentation), but as an illustration, we can obtain summary statistics by using the describe() method:
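
A small sketch of column selection followed by describe(), on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10, 20, 30, 40]})

col = df["a"]              # selecting one column returns a Series
print(type(col).__name__)  # Series

stats = col.describe()     # count, mean, std, min, quartiles, max
print(stats["count"])      # 4.0
print(stats["mean"])       # 2.5
```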

Subsetting multiple columns

Subsetting multiple columns with a list of defined columns outputs a pandas dataframe object.

Code output (Image by author)

Note that the columns don’t have to be in any particular order; this is useful to re-order the table as we like:
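
For example, with hypothetical columns a, b and c, passing a list of names selects and re-orders in one step:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

sub = df[["c", "a"]]       # note the double brackets: a list of column names
print(type(sub).__name__)  # DataFrame
print(list(sub.columns))   # ['c', 'a']
```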

Code output (Image by author)

Assigning columns

Suppose we would like to obtain all the rows where a certain condition is satisfied. How can we do this? Take the following dummy data as an example:

Code output (Image by author)

How do we integrate an extra column to this data? We can simply assign an array, list or series of a compatible size (number of rows) with the df['new_col'] = new_col syntax:
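
A minimal sketch of such an assignment. The dummy table below is made up for illustration; its column names (Service, Consumption, Impact, Request) are assumptions based on the names used in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "Service": ["service1", "service2", "service1", "service3"],
    "Consumption": [-0.5, 1.2, 0.3, -1.1],
    "Impact": [0.4, 1.5, 2.0, 0.1],
})

# The new column needs one value per row
df["Request"] = ["yes", "no", "no", "yes"]

print(df.shape)          # (4, 4)
print(list(df.columns))  # ['Service', 'Consumption', 'Impact', 'Request']
```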

Code output (Image by author)

Now suppose that we would like to fetch all records which satisfy a certain condition. This kind of selection can be done using the following syntax:

df[condition]

where condition is a NumPy array or pandas Series of booleans. Thanks to NumPy broadcasting, comparing a NumPy array or pandas Series to a single value produces an array of the same kind, in which every element is the result of comparing the corresponding value to the query. For example, comparing the Request column to the value no gives:

This indicates all the places where the condition is true. We can then use this array to select only the rows that are True, using the syntax mentioned above:
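
Both steps can be sketched on a made-up table; the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Request": ["yes", "no", "no", "yes"],
    "Impact": [0.4, 1.5, 2.0, 0.1],
})

mask = df["Request"] == "no"  # boolean Series, one entry per row
print(mask.tolist())          # [False, True, True, False]

selected = df[mask]           # keeps only the rows where mask is True
print(selected)
```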

Code output (Image by author)

We can use logic to produce more complex queries. For example: fetch all the rows which have a negative consumption index or have an impact score greater than 1, and which have a service type of service1:

Code output (Image by author)

Here, we code all three of the conditions required to extract the information.
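
One way to write such a query, assuming hypothetical column names Consumption, Impact and Service. Conditions are combined with | (or) and & (and), and each comparison needs its own parentheses because these operators bind more tightly than < and ==.

```python
import pandas as pd

df = pd.DataFrame({
    "Service": ["service1", "service2", "service1", "service3"],
    "Consumption": [-0.5, 1.2, 0.3, -1.1],
    "Impact": [0.4, 1.5, 2.0, 0.1],
})

negative = df["Consumption"] < 0
high_impact = df["Impact"] > 1
is_service1 = df["Service"] == "service1"

result = df[(negative | high_impact) & is_service1]
print(result.index.tolist())  # [0, 2]
```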

Code output (Image by author)

Pandas apply

Following the previous example, imagine that we would like to apply a certain function to every value of a specific column, and obtain a series in which each element is the output of the applied function on the corresponding element. For these cases, we can use the apply() method of Pandas Series objects. In a data-frame, it would have the following syntax:

df['column'].apply(func)

where func can be any function which returns a single output.

Example: Map each of the services to a certain bin, and assign the resulting column to the dataframe.
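
A sketch of this example. The function to_bin, the bin labels and the column names are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Service": ["service1", "service2", "service1", "service3"]})

def to_bin(service):
    """Map a service name to a bin label (made-up mapping)."""
    bins = {"service1": "A", "service2": "B"}
    return bins.get(service, "other")

# apply() calls to_bin once per element and collects the results in a Series
df["Bin"] = df["Service"].apply(to_bin)
print(df["Bin"].tolist())  # ['A', 'B', 'A', 'other']
```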

Code output (Image by author)

Pandas Loc

Now imagine you would like to change all the rows in a certain column that satisfy a specified condition. The syntax for this is

df.loc[condition, 'column'] = new_val

Example: Set the Request value to "no" if Factor is of type 1:
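
A sketch of this example on made-up data, assuming Factor holds integer type codes:

```python
import pandas as pd

df = pd.DataFrame({
    "Factor": [1, 2, 1, 3],
    "Request": ["yes", "yes", "no", "yes"],
})

# Rows where Factor == 1 get "no" in the Request column; other rows are untouched
df.loc[df["Factor"] == 1, "Request"] = "no"
print(df["Request"].tolist())  # ['no', 'yes', 'no', 'yes']
```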

Code output (Image by author)

Sorting a dataframe

To sort a dataframe in a certain order, we can use the sort_values() method. Use the inplace=True argument to change the original dataframe. You can sort by more than one column by specifying them in a list.
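
Both variants can be sketched as follows, with hypothetical columns Service and Impact:

```python
import pandas as pd

df = pd.DataFrame({"Service": ["b", "a", "b", "a"], "Impact": [2, 4, 1, 3]})

# Sort by several columns: returns a new dataframe, df itself is unchanged
by_both = df.sort_values(["Service", "Impact"])
print(by_both.index.tolist())  # [3, 1, 2, 0]

# Sort in place, in descending order
df.sort_values("Impact", ascending=False, inplace=True)
print(df["Impact"].tolist())   # [4, 3, 2, 1]
```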

Code output (Image by author)

We see that the values have been updated accordingly.

Resetting the index

When sorting values, the index labels are reordered along with the rows. We can reset the index so that it runs in order once more:

Code output (Image by author)

Dropping columns

When resetting the index, the old index is kept as a column. We can drop any columns with the drop() method.
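
A minimal sketch of resetting the index and then dropping the leftover column, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Impact": [2, 4, 1]}).sort_values("Impact")
print(df.index.tolist())  # [2, 0, 1]

df = df.reset_index()     # the old index is kept as a column named "index"
print(list(df.columns))   # ['index', 'Impact']
print(df.index.tolist())  # [0, 1, 2]

df = df.drop(columns=["index"])
print(list(df.columns))   # ['Impact']
```

Calling reset_index(drop=True) instead discards the old index in a single step.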

Code output (Image by author)

Aggregations

Another useful thing to do with Pandas is aggregation. Suppose we would like to aggregate Impact counts for each of the EDXgroups. The syntax is as follows:

df.groupby("column").agg_function().reset_index()

The agg_function() is a pandas function such as sum() or mean() (see this article).
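
As a sketch with made-up data, here assuming a column named EDXgroup and using sum() as the aggregation:

```python
import pandas as pd

df = pd.DataFrame({
    "EDXgroup": ["g1", "g2", "g1", "g2", "g1"],
    "Impact": [1, 2, 3, 4, 5],
})

# One row per group; reset_index() turns the group labels back into a column
agg = df.groupby("EDXgroup")["Impact"].sum().reset_index()
print(agg["EDXgroup"].tolist())  # ['g1', 'g2']
print(agg["Impact"].tolist())    # [9, 6]
```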

Code output (Image by author)

Matplotlib

Img src: https://matplotlib.org/stable/gallery/misc/logos2.html

As its name implies, Matplotlib is Python’s main plotting library. A number of other libraries, such as seaborn, are built on top of it.

Scatter plot

A scatter plot is a 2D plot made by specifying two arrays or lists of identical length. For instance, we could compare the consumption index to its impact.

Here are some common Pyplot functions:

  • plt.figure(): specifies reference canvas. Passing figsize=(a,b) changes the plotting canvas dimensions.
  • plt.scatter(x,y,color="b", marker=""): produces the scatter plot. marker specifies the shape of the plotted points.
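
The two functions above can be combined as in the following sketch. The consumption and impact values are made up, and the marker "o" (circles) is an arbitrary choice.

```python
import matplotlib.pyplot as plt

consumption = [-0.5, 1.2, 0.3, -1.1]  # hypothetical x values
impact = [0.4, 1.5, 2.0, 0.1]         # hypothetical y values, same length

plt.figure(figsize=(6, 4))            # canvas of 6 x 4 inches
points = plt.scatter(consumption, impact, color="b", marker="o")
plt.xlabel("Consumption index")
plt.ylabel("Impact")
# Call plt.show() to display the figure in an interactive session
```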

Code output using Pandas (Image by author)

Of course, it would make more sense with more data. Let’s generate this artificially:

Code output using Pandas (Image by author)

Plot

If we would like to see the relationship as a line, it is sometimes preferable to use the plot function instead. The syntax is similar.

Code output using Pandas (Image by author)

Caution: Matplotlib’s plot function connects the given points in the order they are given, so unsorted x values produce a line that zigzags back and forth.
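
One possible way to avoid a zigzagging line is to sort the points by their x values before plotting, sketched here with made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([3.0, 1.0, 2.0, 0.0])  # deliberately unsorted
y = x ** 2

order = np.argsort(x)               # indices that sort x in ascending order
plt.figure()
(line,) = plt.plot(x[order], y[order], color="r")
# Call plt.show() to display the figure in an interactive session
```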

Overlaid plots

We can overlay two plots which share the same x-axis values just by calling the plot function twice, once for each set of y values:
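
A minimal sketch, using sine and cosine curves as stand-in data:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)  # shared x-axis values

plt.figure()
plt.plot(x, np.sin(x), label="sin")  # first line
plt.plot(x, np.cos(x), label="cos")  # second line, drawn on the same axes
plt.legend()
# Call plt.show() to display the figure in an interactive session
```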

Code output using Pandas (Image by author)

Barplot

One of the advantages of Pandas is that it integrates a number of common Matplotlib plots. For instance, we can plot a barplot with the plot() method. This can be combined with other Matplotlib functions such as title(), xlabel(), etc.

Code output using Pandas (Image by author)

For instance, we can use the df.plot.bar() method to create a barplot directly from the dataframe:

Code output using Pandas and Matplotlib (Image by author)

The figsize argument is the same as in figure(figsize=()), and rot indicates the rotation of the labels. As you can guess, x and y name the columns used for the axes.
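
These arguments can be sketched as follows; the dataframe and its column names Service and Count are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Service": ["service1", "service2", "service3"],
    "Count": [5, 3, 8],
})

# x: column used for the bar labels, y: column used for the bar heights
ax = df.plot.bar(x="Service", y="Count", figsize=(6, 4), rot=45)
plt.title("Requests per service")
plt.ylabel("Count")
# Call plt.show() to display the figure in an interactive session
```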

Grouped barplots can also be overlaid.

Code output using Pandas and Matplotlib (Image by author)

Other examples can be found in the documentation.

Piechart

Using the same dataframe design, we can easily plot pie charts as follows:

Code output using Pandas and Matplotlib (Image by author)

Enabling subplots=True allows for plotting a separate pie chart for each count column.
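
A minimal sketch of pie charts drawn from a dataframe, assuming two hypothetical count columns indexed by service name:

```python
import pandas as pd

df = pd.DataFrame(
    {"Count2020": [5, 3, 8], "Count2021": [2, 6, 4]},
    index=["service1", "service2", "service3"],
)

# With subplots=True, each column gets its own pie chart
axes = df.plot.pie(subplots=True, figsize=(8, 4))
print(len(axes))  # 2
```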

Bonus: Seaborn!

At times, there are some plots that are just inherently prettier or easier to produce with Seaborn, a high-level Matplotlib wrapper library.

Regression Plots

For this example, we will use the tips dataset provided by Seaborn.

Preview of the tips dataset, provided by Seaborn (Image by author)
Code output (Image by author)

Passing the x_bins argument groups the continuous x variable into bins and plots the mean of each bin, with a vertical bar showing its confidence interval:

Code output (Image by author)

Box Plots

Box plots are a quick way to visualize common distribution statistics of a vector of data, as illustrated in the following picture:

Img src: https://upload.wikimedia.org/wikipedia/commons/1/1a/Boxplot_vs_PDF.svg
Code Output (Image by author)

Distribution Plots

Another way to obtain a full picture of the distribution of some data is to use a displot.

Seaborn code output (Image by the author)

Pairplots

Pairplots allow us to simultaneously see the relationships between our dataframe columns and the distribution of each, represented in a variety of ways. We will illustrate this with seaborn’s preloaded penguins data.

Code output (Image by the author)

We can simply use the pairplot function from Seaborn to produce the following plot:

Seaborn code output (Image by author)

Using the hue argument allows us to specify groups in the plot as well:

Seaborn code output (Image by author)

All the labeling and the details are done automatically. You can also change the type of distribution plot in the diagonal by using diag_kind="hist" as an argument.

Seaborn code output (Image by author)

In order to plot kernel density estimates, you can also specify kind="kde" as an argument.

Seaborn code output (Image by author)

Correlation Plots

We can produce a correlation matrix for quick analysis by using the corr() method on a pandas dataframe, which will produce a matrix of correlation values between the numeric columns. Then, we can simply pass this matrix to the sns.heatmap() function with the annot=True argument to produce the plot below:
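
A sketch of both steps on made-up data. The helper plot_corr_heatmap is hypothetical, and it imports Seaborn only when called, so the correlation part runs with pandas alone.

```python
import pandas as pd

def plot_corr_heatmap(df):
    """Draw an annotated heatmap of the correlations (hypothetical helper)."""
    import seaborn as sns  # assumption: seaborn is installed
    corr = df.select_dtypes("number").corr()
    return sns.heatmap(corr, annot=True)  # annot=True writes each value in its cell

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],      # perfectly correlated with x
    "z": [4, 3, 2, 1],      # perfectly anti-correlated with x
    "label": list("abcd"),  # non-numeric column, excluded below
})

corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Selecting the numeric columns explicitly is a design choice made here so that the text column does not interfere with corr().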

Seaborn code output (Image by Author)

Lineplots

Lineplots are useful for visualizing time-series-like data, that is, data ordered from one point to the next. Here’s an example taken from the Seaborn documentation:

First, we load the FMRI dataset (see source).

Seaborn dataset display (Image by author)

Next, we use the lineplot function, specifying the time-related variable timepoint along with the signal variable as the "response". We also specify the hue to be the region, that is, the grouping. Because the grouping variables are categorical, we obtain a separate line for each group! This results in a very nice plot with four different trends nicely grouped:

Seaborn code output (Image by author)

What comes next?

And that’s it for this introduction! Of course, no single article will cover absolutely everything, so here are a couple of other great articles so that you can continue your learning journey:

A Beginner’s Guide to Data Analysis in Python

Introduction to Data Visualization in Python

Follow me at

  1. https://www.linkedin.com/in/hair-parra-526ba19b/
  2. https://github.com/JairParra
  3. https://medium.com/@hair.parra