Analysis of an art survey using Pandas

Published in Towards Data Science, Aug 9, 2019

Pandas is an open source Python library for data science that allows us to easily work with structured data, such as csv files, SQL tables, or Excel spreadsheets. In this article, we use Pandas to analyze the results of an art survey carried out by students of statistics at the Comenius University in Bratislava. Students were asked to rate 39 well-known paintings on a scale from 1 to 5 (1 meaning “don’t like at all”). The dataset includes the ratings, along with the name of the painting, the art movement, and the artist, with 3 paintings for each art movement. The dataset can be found on Kaggle, an online community of data scientists and machine learners that hosts a wide variety of datasets.

Dataset on Kaggle: Paintings. Students rating famous paintings from different art movements (www.kaggle.com).

Exploratory data analysis and data cleaning

Exploratory data analysis consists of analyzing the main characteristics of a data set, usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before we perform further evaluations.

After downloading the csv file from Kaggle, we can load it into a Pandas dataframe using the pandas.read_csv function and visualize the first 5 rows using the pandas.DataFrame.head method.
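A minimal sketch of this loading step follows. The csv text is a made-up stand-in for the Kaggle file (the real column names and order may differ), read through io.StringIO so that the example runs without the download.

```python
import io

import pandas as pd

# Made-up stand-in for the downloaded file. The real file has 39 rows and
# rating columns S1..S48; its exact column names and order may differ.
csv_text = """S1,S2,S3,art movement,artist,painting
5,4,5,Impressionism,Artist A,Painting A
4,4,3,Impressionism,Artist B,Painting B
3,2,4,Cubism,Artist C,Painting C
"""

# With the real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (only 3 exist here).
first_rows = df.head()
print(first_rows)
```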

Since the number of columns is too large, we cannot see all of them using the head method. One option is to change the display options so that we can visualize the whole dataframe. Another option is to use the columns attribute as follows:
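Both options can be sketched on a small made-up dataframe. The column names here are assumptions modelled on the description of the dataset.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3],
    "S2": [4, 4, 2],
    "art movement": ["Impressionism", "Impressionism", "Cubism"],
    "artist": ["Artist A", "Artist B", "Artist C"],
    "painting": ["Painting A", "Painting B", "Painting C"],
})

# Option 1: remove the limit on the number of displayed columns.
pd.set_option("display.max_columns", None)

# Option 2: inspect the column labels directly.
print(df.columns)
column_names = df.columns.tolist()
```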

As we can observe, the dataframe contains 51 columns: 48 ratings, as well as the name of the painting, the art movement, and the artist. The students rated 39 paintings in total.

Inappropriate data types and missing values are the most common problems of datasets. We can easily analyze both using the pandas.DataFrame.info method. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
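As a rough illustration, calling info on a small made-up dataframe prints one line per column with its non-null count and dtype. The data below is hypothetical.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3],
    "S2": [4, 4, 2],
    "artist": ["Artist A", "Artist B", "Artist C"],
    "painting": ["Painting A", "Painting B", "Painting C"],
})

# Prints the index range, each column's non-null count and dtype,
# and the memory usage. Text columns usually appear as object; newer
# pandas releases may report a dedicated string dtype instead.
df.info()
```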

We have verified that there are no null values and that the data types are the expected ones. The columns S1-S48 contain integers and the other columns (art movement, artist, and painting) contain objects (strings in Pandas). In this particular case, we do not need to carry out any cleaning operation, but this is more the exception than the rule, since datasets are often pretty messy.

We can also evaluate whether or not the dataset contains null values as follows:
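One common way to do this, sketched here on made-up data, is to combine isnull with sum (nulls per column) or with any (a single yes/no answer).

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3],
    "S2": [4, 4, 2],
    "painting": ["Painting A", "Painting B", "Painting C"],
})

# Number of null values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Single flag: does the dataframe contain any null value at all?
has_nulls = bool(df.isnull().values.any())
print(has_nulls)
```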

Alternatively, we can evaluate the data types using the pandas.DataFrame.dtypes attribute. This returns a Series with the data type of each column.
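A short sketch on made-up data. The select_dtypes call is an extra convenience, not something the original analysis is known to use.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3],
    "S2": [4, 4, 2],
    "S3": [5, 3, 4],
    "painting": ["Painting A", "Painting B", "Painting C"],
})

# Series mapping each column name to its data type.
print(df.dtypes)

# Optional extra: keep only the names of the numeric (rating) columns.
rating_columns = df.select_dtypes(include="number").columns.tolist()
```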

Pandas provides a method called pandas.DataFrame.describe that generates descriptive statistics of a dataset (central tendency, dispersion and shape). By default, the describe method analyzes only numeric columns, but it can also analyze object columns if we pass include='all'. Since the summary returned by the describe method is a dataframe, we can easily access its elements by using the pandas.DataFrame.loc indexer.
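The sketch below applies describe to a small made-up dataframe and reads single values from the summary with loc. All ratings are hypothetical.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3, 2],
    "S2": [4, 4, 2, 1],
    "painting": ["Painting A", "Painting B", "Painting C", "Painting D"],
})

# Numeric columns only (the default behaviour).
stats = df.describe()

# Numeric and text columns together.
all_stats = df.describe(include="all")

# The summary is itself a dataframe, so loc[row label, column label] works.
mean_s1 = stats.loc["mean", "S1"]
max_s2 = stats.loc["max", "S2"]
print(mean_s1, max_s2)
```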

As the summary shows, every student rated 39 pictures in total. We can also conclude that some students were more critical than others. For instance, student 9 gave an average rating of 1.8974, around 1 point less than student 10. Most students gave a minimum rating of 1 and a maximum rating of 5, with the exception of student 9 (max = 4) and student 46 (min = 2). Although the five-number summary (min, 25%, 50%, 75%, max) provides us information about the distribution of the observations, it would be interesting to analyze the distribution of ratings by means of visualizations. And we will do it! Keep reading 😉

We can also verify the number of unique elements in the columns art movement, artist, and painting as follows:
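A possible version of this check, using made-up data and assumed column names:

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "art movement": ["Impressionism", "Impressionism", "Cubism", "Cubism"],
    "artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "painting": ["Painting A", "Painting B", "Painting C", "Painting D"],
})

# Count the distinct values in each descriptive column.
n_movements = df["art movement"].nunique()
n_artists = df["artist"].nunique()
n_paintings = df["painting"].nunique()
print(n_movements, n_artists, n_paintings)
```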

The pandas.Series.nunique method returns the number of unique elements. As shown above, the dataframe contains 39 paintings by different artists, with 3 paintings per art movement.

This dataset does not contain a large number of samples. For this reason, it does not require major cleaning operations; however, there are a few minor changes that we can make to facilitate further analysis of the dataset.

Eliminate spaces in column names → To select a single column by name using dot notation.
Calculate the average rating by painting → To evaluate preferences with regard to paintings and art movements.
Set painting column as index → To easily access information using the name of the painting instead of indexes.
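The three changes above could look roughly like this. The data is made up, and the names art_movement and average_rating are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3, 2],
    "S2": [4, 4, 2, 1],
    "S3": [5, 3, 4, 2],
    "art movement": ["Impressionism", "Impressionism", "Cubism", "Cubism"],
    "artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
    "painting": ["Painting A", "Painting B", "Painting C", "Painting D"],
})

# 1. Replace spaces in column names so that dot notation works.
df.columns = df.columns.str.replace(" ", "_")

# 2. Average the student columns row by row (one value per painting).
student_columns = [name for name in df.columns if name.startswith("S")]
df["average_rating"] = df[student_columns].mean(axis=1)

# 3. Use the painting name as the index.
df = df.set_index("painting")

print(df.art_movement)  # dot notation now works
print(df.loc["Painting C", "average_rating"])
```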

After performing the changes, we are ready to easily draw conclusions using the data. 💪 😊 Let’s get started!

Answering questions and drawing conclusions

Exploratory data analysis and data cleaning are the steps that allow us to get a feel for the dataset and to get it ready for drawing conclusions. Now we are ready to answer the following questions using the dataset.

Which are the 5 highest rated paintings?

To obtain the highest rated paintings, we can use the average rating column previously created as follows:
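A sketch of one way to do this, assuming the average rating is stored in a column called average_rating and the painting name is the index. The values are made up.

```python
import pandas as pd

# Made-up average ratings (hypothetical values and names).
df = pd.DataFrame(
    {"average_rating": [3.2, 2.1, 3.9, 2.8, 3.5, 1.9]},
    index=pd.Index(
        ["Painting A", "Painting B", "Painting C",
         "Painting D", "Painting E", "Painting F"],
        name="painting",
    ),
)

# Sort from highest to lowest and keep the first 5 entries.
top_5 = df["average_rating"].sort_values(ascending=False).head(5)
print(top_5)
```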

We can also create a bar plot to visualize the results. Bar plots are used with categorical data, where each bar represents a particular category. The height of the bars is proportional to the values that they represent.
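A bar plot of this kind might be produced as follows. The ratings are made up, and the Agg backend is selected only so that the sketch also runs on a machine without a display; in a notebook that line can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs

import pandas as pd

# Made-up top 5 average ratings (hypothetical values and names).
top_5 = pd.Series(
    {"Painting C": 3.9, "Painting E": 3.5, "Painting A": 3.2,
     "Painting D": 2.8, "Painting B": 2.1},
    name="average_rating",
)

# kind="bar" draws one vertical bar per index label.
ax = top_5.plot(kind="bar", title="Highest rated paintings")
ax.set_ylabel("Average rating")
```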

We can easily access all information about the best rated paintings as follows:
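Since the painting names form the index, loc can retrieve the full rows of the top-rated paintings. A sketch on made-up data, with assumed column names:

```python
import pandas as pd

# Made-up cleaned dataframe (hypothetical values and names).
df = pd.DataFrame(
    {
        "art_movement": ["Movement X", "Movement X", "Movement Y",
                         "Movement Y", "Movement Z", "Movement Z"],
        "artist": ["Artist A", "Artist B", "Artist C",
                   "Artist D", "Artist E", "Artist F"],
        "average_rating": [3.2, 2.1, 3.9, 2.8, 3.5, 1.9],
    },
    index=pd.Index(
        ["Painting A", "Painting B", "Painting C",
         "Painting D", "Painting E", "Painting F"],
        name="painting",
    ),
)

# Names of the 5 best rated paintings, best first.
top_5_names = df["average_rating"].sort_values(ascending=False).head(5).index

# loc with a list of labels returns every column for those rows.
best_rated = df.loc[top_5_names]
print(best_rated)
```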

The highest rated painting is Mucha’s Four Seasons with an average rating of 3.96. Alphonse Mucha was a Czech painter and Art Nouveau master, best known for his original posters of women surrounded by decorative botanical motifs. The Four Seasons depicts young women set against seasonal views of the countryside. He painted 3 series in total (1896, 1897, and 1900). Since the date is not provided, we cannot conclude which series was shown to the students. But all are amazing! 😍

The following plot depicts the other top-rated paintings. I am sure you know some of them 😍

Which are the 5 lowest rated paintings?

To analyze the lowest rated paintings, we proceed as before. First, we create a Series that contains the average rating. Then, we visualize the information using a bar plot.
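As a compact alternative to sorting and slicing, nsmallest returns the lowest values directly. This is a sketch on made-up data, not necessarily the approach used for the original plot.

```python
import pandas as pd

# Made-up average ratings (hypothetical values and names).
average_rating = pd.Series(
    {"Painting A": 3.2, "Painting B": 2.1, "Painting C": 3.9,
     "Painting D": 2.8, "Painting E": 3.5, "Painting F": 1.9},
    name="average_rating",
)

# The 5 lowest values, lowest first.
lowest_5 = average_rating.nsmallest(5)
print(lowest_5)
```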

As before, we can easily access all information about the lowest rated paintings in the following manner:

The lowest rated painting is The Nude Maja painted by Francisco Goya. Goya was a Spanish Romantic painter and one of the most important artists of the late 18th and early 19th centuries. The Nude Maja is considered to be one of the first portraits that show a woman completely nude, and it can be found at the Museo del Prado in Madrid. Although his painting is the lowest rated of the survey, I have to admit that Goya is one of my favorite painters, with an extensive body of work that ranges from court portraits to horrific scenes of war.

Which are considered the highest rated art movements?

The dataset contains paintings of 13 different art movements: (1) Renaissance, (2) Baroque, (3) Neoclassicism, (4) Romanticism, (5) Impressionism, (6) Post-Impressionism, (7) Symbolism, (8) Art Nouveau, (9) Cubism, (10) Abstract art, (11) Surrealism, (12) Op art, (13) Pop art. To evaluate the highest rated art movements, we have to compute the mean of the average rating of the three paintings that belong to each art movement. We can easily calculate that by using the pandas.DataFrame.groupby method. A groupby operation involves a combination of splitting the object, applying a function, and combining the results.

First, we group by art movement. Then, we calculate the mean of each group. As before, we can easily interpret the result by using a bar plot as follows:
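The grouping step alone (without the plot) can be sketched as follows. The ratings are made up, and the column names art_movement and average_rating are assumptions.

```python
import pandas as pd

# Made-up cleaned dataframe: 3 paintings per movement (hypothetical values).
df = pd.DataFrame({
    "art_movement": ["Impressionism"] * 3 + ["Cubism"] * 3 + ["Baroque"] * 3,
    "average_rating": [3.9, 3.5, 3.4, 2.0, 2.5, 3.0, 3.0, 3.2, 2.8],
})

# Split by movement, average each group, then sort from best to worst.
movement_rating = (
    df.groupby("art_movement")["average_rating"]
    .mean()
    .sort_values(ascending=False)
)
print(movement_rating)
```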

As shown in the image above, the highest rated art movement is Impressionism. We can obtain the Impressionist paintings used in this survey in the following manner:
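A boolean mask is one straightforward way to select them. The sketch uses made-up rows, and the exact label stored in the real dataset (for example its capitalization) is an assumption.

```python
import pandas as pd

# Made-up cleaned dataframe (hypothetical values and names).
df = pd.DataFrame(
    {
        "art_movement": ["Impressionism", "Cubism", "Impressionism", "Baroque"],
        "artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
    },
    index=pd.Index(
        ["Painting A", "Painting B", "Painting C", "Painting D"],
        name="painting",
    ),
)

# Keep only the rows whose movement matches the label.
impressionist = df[df["art_movement"] == "Impressionism"]
print(impressionist)
```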

Which student gave the highest marks? Which student gave the lowest marks?

Previously, we have calculated the average rating of each painting. Now, we have to compute the average rating provided by each student.
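A sketch of this computation on made-up data. The name students_average matches the one used later in the article, while the column selection is an assumption.

```python
import pandas as pd

# Made-up miniature of the survey data (hypothetical values and names).
df = pd.DataFrame({
    "S1": [5, 4, 3, 2],
    "S2": [4, 4, 2, 1],
    "S3": [5, 5, 4, 4],
    "artist": ["Artist A", "Artist B", "Artist C", "Artist D"],
})

# Select the student columns, then average each column (one value per student).
student_columns = ["S1", "S2", "S3"]
students_average = df[student_columns].mean()

# First entries of the resulting Series.
print(students_average.head(2))
```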

The previous code returns a Series with the mean of the values of each column. We can observe the first values of the Series using the head method in the following fashion:

We can obtain the students who provided the highest and the lowest marks using the following methods:

pandas.Series.idxmin → Return the row label of the minimum value.
pandas.Series.idxmax → Return the row label of the maximum value.
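Applied to a made-up students_average Series, the two methods behave as follows:

```python
import pandas as pd

# Made-up average rating per student (hypothetical values).
students_average = pd.Series({"S1": 3.5, "S2": 2.75, "S3": 4.5})

# Label of the student with the lowest average.
lowest_student = students_average.idxmin()

# Label of the student with the highest average.
highest_student = students_average.idxmax()

print(lowest_student, students_average[lowest_student])
print(highest_student, students_average[highest_student])
```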

As shown above, student 48 provided the highest marks with an average of 4.05. On the contrary, student 21 provided the lowest marks with an average of 1.84.

We can visualize the individual marks provided by both students using bar plots.

How did students distribute their marks?

We can visualize the average rating provided by the students by plotting the previously created Series (students_average).

What position does Picasso’s Guernica occupy in the ranking?

Guernica is a large oil painting on canvas by the Spanish artist Pablo Picasso, and is considered one of Picasso’s most famous paintings. Guernica shows the tragedies of war and was created in response to the bombing of Guernica, a small town located in northern Spain, during the Spanish Civil War (1936–1939). The painting was exhibited in different cities across Europe and America. After the end of Franco’s dictatorship, Spain became a democracy and the painting returned to Spain in 1981. Nowadays Guernica is considered one of the most important icons of modern art and one of the most powerful anti-war paintings.

We can obtain the ranking position of Guernica using the pandas.Index.get_loc method. This method returns the integer location of a particular index label. First, we have to sort the average rating column in descending order. Then, we apply the get_loc method to the index of the Series.
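The two steps can be sketched on made-up ratings as shown below. The index label "Guernica" is an assumption about how the painting is named in the dataset, and the values are hypothetical.

```python
import pandas as pd

# Made-up average ratings indexed by painting name (hypothetical values).
average_rating = pd.Series(
    {"Painting A": 3.9, "Guernica": 3.1, "Painting B": 3.5, "Painting C": 2.2},
    name="average_rating",
)

# Step 1: sort from highest to lowest rating.
ranking = average_rating.sort_values(ascending=False)

# Step 2: zero-based integer location of the label in the sorted index.
position = ranking.index.get_loc("Guernica")

# Positions start at 0, so the rank is the position plus 1.
rank = position + 1
print(position, rank)
```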

Since indexes in Python and pandas start at 0, the returned location is one less than the rank, so the ranking position of Guernica is 16.

Key Takeaways

Exploratory data analysis consists of analyzing the main characteristics of a data set, usually by means of visualization methods and summary statistics. The pandas methods .head(), .info(), .describe(), and .nunique(), along with the .shape attribute, are useful tools for exploratory data analysis.
We can access a group of rows and columns by label using the .loc indexer, which takes square brackets (for example, df.loc[row, column]).
We can use pandas.DataFrame.plot method to make a plot of a dataframe using matplotlib. The type of plot is specified in the argument kind.
If the x axis of a bar plot is not specified, the indexes of the dataframe are used.
The .value_counts() method returns a Series containing counts of unique values.
The .sort_values() method is used to sort values in a dataframe. The argument ascending=False is used to sort in descending order.

Thanks for reading !!! 😍