
DonorsChoose: Extensive Exploratory Data Analysis (EDA)

An analytical approach to revealing insights into the DonorsChoose platform and answering the business question: 'What makes a project exciting?'

Photo by Octavian Dan on Unsplash

About the project

DonorsChoose.org is an online charity platform where thousands of teachers propose and promote project requests for materials and specific equipment, helping to give every student an equal educational opportunity.

The project is based on a Kaggle Competition for Predicting Excitement at DonorsChoose.org, to identify projects that are remarkable and make business sense.

Objectives

The purpose of the following Exploratory Data Analysis (EDA) is to reveal trends and insights that, from a business perspective, identify and highlight the projects most attractive for financial donations.

The main goal is to analyze the data in order to answer the following business question: what makes a project exciting?

Set-up

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
%matplotlib inline
import pingouin as pg
color = sns.color_palette()
from pandas.plotting import table
import scipy
from scipy.stats import pearsonr, spearmanr
from wordcloud import WordCloud, STOPWORDS
import datetime as dt
import plotly
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
# Load data
don = pd.read_csv("donations.csv")
ess = pd.read_csv("essays.csv")
out = pd.read_csv("outcomes.csv")
res = pd.read_csv("resources.csv")
pro = pd.read_csv("projects.csv")

Dataset information

The data – download here – is divided into 5 independent datasets: 1- Donations, 2- Essays, 3- Outcomes, 4- Resources, 5- Projects.

Image by author

There are 3097989 donations and 619326 outcomes for 664098 projects. There are also 3667217 resources and 664098 essays.
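These figures can be verified quickly from the loaded dataframes (a minimal sanity check, assuming the CSVs loaded in the Set-up section):

# Quick sanity check of the reported record counts
for name, d in [("donations", don), ("essays", ess), ("outcomes", out), ("resources", res), ("projects", pro)]:
    print(f"{name}: {d.shape[0]} rows x {d.shape[1]} columns")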

Functions

Let’s create a function that prints the shape of the dataframe → number of observations, features, duplicates, missing values.

# SHAPE
def shape_df(df):
    print(f"Number of observations: {df.shape[0]}")
    print(f"Number of variables: {df.shape[1]}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print(f"Are there any missing values {df.isnull().values.any()}")
    print("-----")
    print(df.dtypes.sort_values(ascending=True))
    print("------")
    print("Datatypes' proportion:")
    print(df.dtypes.value_counts(ascending=True))

A function to detect whether there are any missing values → if so, it returns both the absolute and percentage amounts of missing values.

# MISSING VALUES
def null_val(df):
    detect_null_val = df.isnull().values.any()
    if detect_null_val:
        null_abs = df.isnull().sum()
        null_pc = df.isnull().sum() / df.isnull().shape[0] *100
        null_concat = pd.concat([null_abs,null_pc], axis=1).round(2)
        null_concat.columns = ['Absolute', 'Percent']
        return null_concat.sort_values(by="Absolute", ascending=False)
    else:
        print("There are no missing values.")

Function to display correlations through a heatmap → Pearson method.

# CORRELATIONS
def corrs(x):
    mask = np.triu(x.corr(), 1)
    plt.figure(figsize=(19, 9))
    return sns.heatmap(x.corr(), annot=True, vmax=1, vmin=-1, square=True, cmap='BrBG', mask=mask);

Exploratory Data Analysis

Let's find out what types of variables we're dealing with (and how many of each), check for missing values and correlations, and obtain the statistical description of the numerical and categorical variables.

Dataset: Projects

Basic info

How many features are available?
How many observations are in the dataset?
Are there any duplicated records?
Are there any missing values?
What type of variables are there?
How many variables of each type?
# Observations, features, duplicates, datatypes and proportions
shape_df(pro)
>> output:
Number of observations: 664098
Number of variables:    35
Number of duplicates:   0
Are there any missing values? True
-----
fulfillment_labor_materials                 float64
students_reached                           float64
school_ncesid                              float64
school_latitude                            float64
school_longitude                           float64
total_price_including_optional_support     float64
total_price_excluding_optional_support     float64
school_zip                                 float64
primary_focus_subject                      object
primary_focus_area                         object
secondary_focus_subject                    object
secondary_focus_area                       object
projectid                                  object
poverty_level                              object
grade_level                                object
teacher_ny_teaching_fellow                 object
eligible_double_your_impact_match          object
resource_type                              object
teacher_teach_for_america                  object
school_charter_ready_promise               object
eligible_almost_home_match                 object
school_kipp                                object
school_nlns                                object
school_year_round                          object
school_magnet                              object
school_charter                             object
school_county                              object
school_district                            object
school_metro                               object
school_state                               object
school_city                                object
schoolid                                   object
teacher_acctid                             object
teacher_prefix                              object
date_posted                                object
dtype: object
------
Datatypes' proportion:
float64     8
object     27
dtype: int64
# Describe missing values
null_val(pro)
null_val() function. Image by author.
# Stats description (num)
pro.describe().round(2)
Numeric stats description. Image by author.

From the 664098 records, the average price of ‘total_price_excluding_optional_support’ is 542, considerably higher than the median, which indicates the presence of outliers. The same happens with ‘total_price_including_optional_support’. Looking at the 1st and 3rd quartiles, the data seems fine except for the maximum values.
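To back this up, a short sketch comparing mean, median, and skewness for both price columns (a mean far above the median, together with strong positive skew, signals outliers on the right tail):

# Comparing mean, median and skewness of the price features
price_cols = ["total_price_excluding_optional_support", "total_price_including_optional_support"]
print(pro[price_cols].agg(["mean", "median", "skew"]).round(2))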

# Stats description (cat)
pro.describe(include="object")
Categorical stats description (cropped image). Image by author.

California is the state with the most submitted projects, but Chicago is the city with the highest number of observations (9007).

There are 3 classes of school metropolitan types with urban as being the most frequent.

Amongst the 5 teacher prefixes, ‘Mrs.’ is by far the most frequent, and female teachers take the lead concerning gender.

From the 7 areas of study, Literacy & Language is the most requested one with Supplies being the most solicited resource type.

Among the students identified as the most vulnerable regarding poverty level, the most frequent group is the youngest, ‘Grades PreK-2’.

On the last day of September of 2012, 1490 project requests were posted online.
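That figure can be confirmed with a one-liner returning the busiest posting day (a minimal check; at this point ‘date_posted’ is still a string column):

# Single day with the most posted projects
print(pro["date_posted"].value_counts().head(1))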

# Checking correlations
corrs(pro)
Corrs() function. Image by author.

From the previous heatmap, there are no insightful correlations.

1. What is the teacher’s profile and how is it distributed amongst the requested projects?

# Checking feature "teacher_prefix" distribution
df_teacher = pro[["teacher_acctid","teacher_prefix"]].copy()
# Replacing np.nan values by "Unknown"
df_teacher.teacher_prefix.replace(np.nan, "Unknown", inplace=True)
# Bar plot
ax = df_teacher["teacher_prefix"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("teacher_prefix", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Teacher Prefix Distribution", fontsize=10, loc="right")
# side table
sidetb = df_teacher.teacher_prefix.value_counts().sort_values(ascending=False).to_frame()
sidetb["percent"] = (sidetb.apply(lambda x: x/x.sum())*100).round(2)
sidetb['cumsum'] = sidetb["percent"].cumsum()
display(sidetb.T)
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

From the analysis, it is clear that ‘Mrs.’ and ‘Ms.’ are by far the most frequent titles/prefixes, together corresponding to around 87% of the records.

TimeDate

Converting the date_posted into "year", "month", "day" and "weekday" variables.

# Convert "date_posted" into "year","month","day","weekday" features
pro["date_posted"] = pd.to_datetime(pro["date_posted"], format="%Y-%m-%d")
pro["year_posted"] = pro["date_posted"].dt.year
pro["month_posted"] = pro["date_posted"].dt.month.map({1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun", 7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dez"})
pro["weekday_posted"] = pro["date_posted"].dt.dayofweek.map({0:"Mon", 1:"Tue", 2:"Wed", 3:"Thu", 4:"Fri", 5:"Sat", 6:"Sun"})
pro[["date_posted","year_posted","month_posted","weekday_posted"]]
Image by author.

2. What are the most common periods in which project requests have been submitted?

2.1. On which weekdays?

# Checking distribution of submitted projects regarding the weekdays
df_timedate = pro[["teacher_acctid","teacher_prefix","year_posted","month_posted","weekday_posted","date_posted"]].copy()
# Bar plot
ax = df_timedate["weekday_posted"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Weekdays", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Submited projects regarding the days of the week", fontsize=10, loc="right")
# side table
daysofweek = df_timedate.weekday_posted.value_counts().sort_values(ascending=False).to_frame()
daysofweek["percent"] = (daysofweek.apply(lambda x: x/x.sum())*100).round(2)
daysofweek['cumsum'] = daysofweek["percent"].cumsum()
display(daysofweek.T)
Image by author.

Sunday is clearly the day of the week when teachers submit the most projects. The trend is that the closer to the weekend, the greater the concentration. A possible explanation is that on Sundays teachers have more spare time to perform this task.

2.2. In which months?

# Checking distribution of submitted projects along the academic year
# Bar plot
ax = df_timedate["month_posted"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
plt.xlabel("Months", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Submited projects along the year", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

The analysis shows that the months with the highest incidence of submissions on the platform are September and October. It makes sense to have a greater volume of submissions at the beginning of the academic year. There is a greater concentration in the first semester.

2.3. What is the overall evolution throughout the years?

# Excluding 2014 - not complete:
df_timedate[df_timedate["year_posted"]==2014].month_posted.unique()
>> output: array(['May', 'Apr', 'Mar', 'Feb', 'Jan'], dtype=object)
---
# Checking evolution from 2002 to 2013
# Line plot
ax = df_timedate.groupby("year_posted")['teacher_acctid'].count()[:-1].plot(kind="line", figsize=(9,6), fontsize=12, linewidth=2, grid=True, table=True)
plt.xlabel("Years", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Evolution of submited projects", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

The analysis of the annual project post distribution reveals a clear and strong growth trend. Since 2002, the year the platform was born, the growth rate has been exponential.
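One quick way to see this is the year-over-year growth rate (a hedged sketch reusing the yearly counts from the plot above, still excluding the incomplete 2014):

# Year-over-year growth (%) of posted projects
yearly = df_timedate.groupby("year_posted")["teacher_acctid"].count()[:-1]
print(yearly.pct_change().mul(100).round(1))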

3. Which metropolitan areas get more online submissions?

schools_ = pro[["school_metro", "school_state", "school_city", "resource_type", "grade_level", "poverty_level", "year_posted"]].copy()
schools_.head()
Image by author.
# Checking feature "school_metro" regarding school located areas
# Replacing np.nan values by "Unknown"
schools_.school_metro.replace(np.nan, "Unknown", inplace=True)
---
# Pie plot
ax = schools_.school_metro.value_counts().plot(kind="pie", figsize=(9,6), autopct='%1.2f%%', explode= (0.005, 0.005, 0.005, 0.005), startangle=80, fontsize=12, table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("", fontsize=10)
plt.title("Metropolitan school areas", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

To which metropolitan areas do the schools belong? The highest concentration is found in urban areas, followed by suburban and rural; 12% are unknown.

4. In which states is the highest concentration of submitted projects?

# Checking feature "school_state" regarding higher concentration of project openings
high_states = schools_.school_state.value_counts()
# Bar plot
ax = schools_.school_state.value_counts(ascending=False)[:5].plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("States with higher schools' concentration", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

The selected ‘top 5’ of School States represents almost 50% of all openings. California takes the lead followed by New York and North Carolina with a percentage close to Illinois and Texas.
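The ‘almost 50%’ claim can be checked directly (a minimal sketch using normalized value counts):

# Share of all submissions covered by the top-5 states
top5_share = schools_.school_state.value_counts(normalize=True)[:5].sum() * 100
print(f"The top-5 states account for {top5_share:.1f}% of submissions")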

5. Which cities’ schools have more impact?

# Checking feature "school_city" regarding project openings
# Bar plot
ax = schools_.school_city.value_counts(ascending=False)[:10].plot(kind="bar", figsize=(12,8), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Cities with higher schools' concentration", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

While among the states California takes the lead, followed by New York and North Carolina with Illinois and Texas close behind, among the cities Chicago, Los Angeles, Brooklyn, Bronx, and New York are the ones that best represent the distribution.

6. Which type of resource is most in demand?

# Checking feature "resource_type" distribution
# Bar plot
ax = schools_.resource_type.value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Resources", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Resource types Distribution", fontsize=10, loc="right")
Image by author.

Concerning quantities, Supplies and Technology are the most demanded resource types, followed by Books, Other, Trips, and Visitors.

7. What is the profile of students regarding their grades?

# Checking feature "grade_level" distribution
# Bar plot
ax = schools_.grade_level.value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Grade levels distribution", fontsize=10,loc="right")
Image by author.

The trend is clear: the younger the students, the greater the number of online requests.

8. Is there any relation between the grade levels and the type of resources requested?

# Understanding how resources are distributed over the grade levels
plt.figure(figsize=(16,8))
ax = sns.countplot(x="grade_level", hue="resource_type", data=schools_, dodge=True)
plt.legend()
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Resources' distribution by the grade levels", fontsize=10, loc="right");
Image by author.

Understanding how resources are distributed over the grade levels: no matter the grade level, the ranking of demanded resources stays the same: Supplies, Technology, Books, Other.
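To see this ‘same ranking per grade’ claim in table form, a row-normalized cross-tabulation works well (a sketch; percentages sum to 100 within each grade level):

# Cross-tabulating grade level vs. resource type, as percentages per grade
ct = pd.crosstab(schools_["grade_level"], schools_["resource_type"], normalize="index")
print((ct * 100).round(1))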

9. Does the poverty level influence the trend for requested funds?

# Checking feature "poverty_level" distribution
# Pie plot
ax = schools_.poverty_level.value_counts(dropna=False).plot(kind="pie", figsize=(12,8), autopct='%1.2f%%', explode= (0.005, 0.005, 0.005, 0.005), startangle=80, fontsize=12, table=True)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("", fontsize=10)
plt.title("Poverty classes distribution regarding requested projects", fontsize=10, loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,2)
Image by author.

From the analysis of the distribution of the poverty classes, it is clear that the higher the vulnerability, the greater the number of projects submitted for funding.

10. Is there a trend in the distribution of requested projects regarding the metropolitan placement and poverty level?

# Checking "poverty levels" regarding metropolitan school areas
h_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[:4]
hpov = pd.DataFrame(h_pov)
hst_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[4:8]
hstpov = pd.DataFrame(hst_pov)
l_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[8:12]
lpov = pd.DataFrame(l_pov)
m_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[12:]
mpov = pd.DataFrame(m_pov)

merge_1 = pd.merge(hstpov, hpov, on=["school_metro"])
merge_1.rename(columns={"year_posted_x":"Highest Poverty", "year_posted_y":"High Poverty"}, inplace=True)
merge_2 = pd.merge(merge_1,mpov,on=["school_metro"])
merge_3 = pd.merge(merge_2,lpov,on=["school_metro"])
merge_3.rename(columns={"year_posted_x":"Moderate Poverty", "year_posted_y":"Low Poverty"},inplace=True)
---
# Bar plot
ax = merge_3.sort_values(by=["High Poverty"], ascending=True).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Poverty level distribution regarding metropolitan schools' areas", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

From the analysis of poverty levels regarding metropolitan school areas, schools in urban areas associated with high poverty levels are the most frequent. Urban areas are also far more densely populated, which is consistent with the relation between both classes.

11. Which areas of study get the most project submissions?

# Checking "primary focus areas" regarding requested projects
subject = pro[["resource_type", "grade_level", "primary_focus_area", "total_price_excluding_optional_support", "total_price_including_optional_support", "poverty_level", "year_posted"]].copy()
---
# Bar plot
ax = subject.primary_focus_area.value_counts(ascending=False).plot(kind= "bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Primary Focus Areas", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary Focus Areas distribution", fontsize=10, loc="right")
Image by author.

Regarding the areas of interest/study, subjects related to Literacy & Language are the most submitted, followed by Math & Science with roughly half as many requests. Music & The Arts, Applied Learning, Special Needs, History & Civics, and Health & Sports are the least requested focus areas.

12. Does the grade level influence the areas of interest/ studies?

# Understanding how primary focus areas are distributed through the grade levels
plt.figure(figsize=(16,8))
ax = sns.countplot(x="grade_level", hue="primary_focus_area", data=subject, dodge=True)
plt.legend()
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary focus areas distribution by grade levels", fontsize=10, loc="right");
Image by author.

As expected, regardless of the grade level, the focus areas of study don’t change significantly.

Dataset: Essays

ess.shape
>> (664098, 6)
ess.head()
Image by author.
# Describe missing values
null_val(ess)
Image by author.

Word Cloud

A word cloud is a type of visualization that displays the most common and frequent words present in a text, in which a word’s size is proportional to its frequency.

13. Which are the most frequent words present on the Essays dataset?

Let’s define a function to create a Wordcloud highlighting and plotting the most common words on each Series of the Essays’ dataset.

def plot_wordcloud(wc, otherWords=""):
    # Join all non-null entries of the chosen column into a single string
    text = " ".join(ess[wc][~pd.isnull(ess[wc])])
    # Build the stopword set (set.update returns None, so use union instead)
    stopwords = set(STOPWORDS).union(otherWords)
    wordcloud = WordCloud(width=3000, height=2000, random_state=1, background_color="black", colormap="Set2", collocations=False, stopwords=stopwords).generate(text)
    fig = plt.figure(figsize=(14,14))
    plt.imshow(wordcloud)
    plt.title("Wordcloud with %s content" % wc, fontsize=12, loc="right")
    plt.axis("off")
    plt.show()

Title

plot_wordcloud("title", ["need","Help","Classroom"])
WordCloud: title content. Image by author.

The most frequent words present in the title are Book, Reading, Technology, Read, Science, Math, Library, Grade, Literacy, School, iPad.

Short description

plot_wordcloud("short_description", ["will", "class", "allow", "able", "one", "year", "many", "time", "remember", "use", "day", "way"])
WordCloud: short description. Image by author.

The most frequent words present in the short description are Student, School, Book, Reading, Technology, Learning, Skill, Learn, Teach, Children, Material, Math, Computer.

Need Statement

plot_wordcloud("need_statement",["need", "student", "students", "fulfillment", "including", "help", "donorschoose", "org", "shipping,", "set", "new", "will"])
WordCloud: need statement. Image by author.

The most frequent words present in the need statement are Book, Reading, Math, Cost, Skill, Project, Center, Material, HTML, Science, Game, Camera, iPad.

Dataset: Resources

Basic info

How many features are available?
How many observations are in the dataset?
Are there any duplicated records?
Are there any missing values?
What type of variables are there?
How many variables of each type?
# Observations, features, duplicates, datatypes and proportions
shape_df(res)
>> output:
Number of observations: 3667217
Number of variables:    9
Number of duplicates:   0
Are there any missing values? True
-----
vendorid                 float64
item_unit_price          float64
item_quantity            float64
resourceid                object
projectid                 object
vendor_name               object
project_resource_type     object
item_name                 object
item_number               object
dtype: object
------
Datatypes' proportion:
float64    3
object     6
dtype: int64
# Describe missing values
null_val(res)
Image by author.
res.head()
Image by author.
res.describe(include="all").round(1)
Image by author.

More than one and a half million books were requested, making books the most frequent resource type.

The average item unit price is $180 and the median is $14 → this indicates the presence of some heavy outliers. Also, the negative values present in the dataset are awkward, to say the least.

The average quantity is 3.3 units with a median of 1. The Q1 and Q3 indicate a good distribution, but those outliers need to be dealt with.
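Before treating them, a peek at the upper quantiles helps size up how extreme the price outliers are (a minimal sketch; NaNs are skipped by quantile):

# Upper quantiles of item_unit_price
print(res["item_unit_price"].quantile([0.5, 0.9, 0.99, 0.999, 1.0]).round(2))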

14. How much did each corresponding type of resource receive in total donations?

To answer this question, we need to combine the item quantity and the item price for each type of available resource.

Item quantity

res["item_quantity"].value_counts(dropna=False)
1.0        2304849
2.0         279693
3.0         108075
5.0          81142
4.0          80076
            ...   
315.0            1
10500.0          1
313.0            1
309.0            1
121.0            1
Name: item_quantity, Length: 313, dtype: int64
---
res.item_quantity.value_counts(normalize=True,bins=15).round(6)
(-993.1089999999999, 66207.2]    0.997734
(926900.8, 993108.0]             0.000000
(860693.6, 926900.8]             0.000000
(794486.4, 860693.6]             0.000000
(728279.2, 794486.4]             0.000000
(662072.0, 728279.2]             0.000000
(595864.8, 662072.0]             0.000000
(529657.6, 595864.8]             0.000000
(463450.4, 529657.6]             0.000000
(397243.2, 463450.4]             0.000000
(331036.0, 397243.2]             0.000000
(264828.8, 331036.0]             0.000000
(198621.6, 264828.8]             0.000000
(132414.4, 198621.6]             0.000000
(66207.2, 132414.4]              0.000000
Name: item_quantity, dtype: float64

From the aggregation of the items into 15 buckets, we see that nearly all of the data (99.8%) is concentrated in the first bucket, between -993 and 66207.

Next, let’s plot the Empirical Cumulative Distribution Function (ECDF) of the "item_quantity" feature to detect the outliers in the variable. For each value on the x-axis, the ECDF shows the fraction of observations at or below it, which makes the few extreme points stand out clearly.

# Plotting ECDF
x = np.sort(res["item_quantity"].dropna())
y = np.arange(1, len(x)+1) / len(x)
plt.plot(x,y,marker = ".", linestyle = "none")
plt.title("ECDF Item Quantity",loc="right");
Image by author.

Let’s search for those distant outliers.

value = res.item_quantity.quantile([.9999995])
p = value.iloc[0]
res[res["item_quantity"] > p]
Image by author.

I have decided to drop the item ‘SWATCH BOOK-SCHOOL SMART CONST 09506’ (it also has no price) and keep all the others.

# Removing outlier
res = res[res["item_quantity"] < 993108].copy()
res
Image by author.
res["item_quantity"].describe().round(2)
count    3658906.00
mean           3.03
std           11.75
min            0.00
25%            1.00
50%            1.00
75%            2.00
max        10500.00
Name: item_quantity, dtype: float64
---
res["item_quantity"].value_counts(dropna=False)
1.00        2731320
2.00         317590
3.00         120462
4.00          87816
5.00          85564
             ...   
2.99              1
274.00            1
10500.00          1
268.00            1
290.00            1
Name: item_quantity, Length: 316, dtype: int64
---
# Next step is to deal with np.nan, 0.00 and 2.99 values
res[res["item_quantity"] == 2.99]
Image by author.
# Checking for items with no quantity records
(res[res["item_quantity"] == 0]).head()
Image by author.

The approach for both issues is to correct the mistyped 2.99 to 3 and to replace the zeros and missing values with the median.

# Correcting the mistyped quantity and flagging zeros as missing
res["item_quantity"] = res["item_quantity"].replace(2.99,3)
res["item_quantity"] = res["item_quantity"].replace(0,np.nan)
# Replacing the missing values by the median
res["item_quantity"] = res.item_quantity.fillna(res.item_quantity.median())
---
res.item_quantity.describe().round(1)
count    3658906.0
mean           3.0
std           11.8
min            1.0
25%            1.0
50%            1.0
75%            2.0
max        10500.0
Name: item_quantity, dtype: float64

Item Price

# Plotting ECDF
x = np.sort(res.item_unit_price.dropna())
y = np.arange(1, len(x)+1) / len(x)
plt.plot(x,y,marker = ".", linestyle = "none")
plt.title("ECDF Item Price",loc="right");
Image by author.
re = res.copy() # Make a copy of the dataset
# Dealing with the outliers
Q1 = re["item_unit_price"].quantile(0.25)
Q3 = re["item_unit_price"].quantile(0.75)
IQR = Q3 - Q1
re["item_unit_price_"] = re.item_unit_price[~((re.item_unit_price < (Q1 - 1.5 * IQR)) |(re.item_unit_price > (Q3 + 1.5 * IQR)))]
# Displaying data dispersion without the outliers
print(re.item_unit_price.describe().round(1))
print("")
print(int(re.item_unit_price.describe()[0]) - int(re.item_unit_price_.describe()[0]),"outliers were identified with this operation")
count      3653295.0
mean           179.8
std         234667.6
min            -99.6
25%              6.3
50%             14.0
75%             36.0
max      448421569.0
Name: item_unit_price, dtype: float64
>> 488446 outliers were identified with this operation
re.item_unit_price_.value_counts(dropna=False)
NaN       494057
29.950     56527
6.290      49729
19.950     43492
3.590      42383
           ...  
3.321          1
11.331         1
7.551          1
62.150         1
78.540         1
Name: item_unit_price_, Length: 8199, dtype: int64
---
# Imputing the outliers (now NaN) by the median
re["item_unit_price_"] = re["item_unit_price_"].fillna(re.item_unit_price_.median())
# Removing non-positive prices and restoring the original column
res = re[re["item_unit_price_"] > 0].copy()
res["item_unit_price"] = res["item_unit_price_"]
res.drop(["item_unit_price_"],axis=1,inplace=True)
res.describe().round(2)
Image by author.
# Displaying results without the outliers
plt.hist(res.item_unit_price, bins=50)
plt.title("Numeric variable 'Item unit price'", fontsize=12, loc="right")
plt.ylabel('Absolute Frequency')
plt.show()
Image by author.

project_resource_type

r = res[["projectid","project_resource_type","item_unit_price","item_quantity"]].copy()
r["amount"] = r.apply(lambda x: x["item_unit_price"] * x["item_quantity"], axis=1)
r.head()
Image by author.
# Scatter plot: relationship between Item Unit Price and Item Quantity
fig = plt.figure(figsize=(9,6))
ax = fig.add_subplot(111)
ax.scatter(r["item_quantity"],r["item_unit_price"])
ax.set_xlabel("Item Quantity")
ax.set_ylabel("Item Unit Price")
ax.set_title("Relationship of Item Unit Price vs. Item Quantity", fontsize=10,loc="right")
ax.set_xlim(0,650);
Image by author.

The scatter plot shows that the more expensive the items are, the less quantities tend to be requested.
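To quantify that visual impression, a rank-based coefficient is a reasonable, outlier-resistant choice (a sketch using the spearmanr already imported in the Set-up):

# Association between quantity and unit price (rank-based)
rho, _ = spearmanr(r["item_quantity"], r["item_unit_price"], nan_policy="omit")
print(f"Spearman rho: {rho:.3f}")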

# Bar plot
ax = r.groupby(["project_resource_type"])["amount"].count().sort_values(ascending=False).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("project resource type", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Distribution of project resources by amounts", fontsize=10, loc="right")
# side table
amount_ = r.groupby(["project_resource_type"])["amount"].count().sort_values(ascending=False).to_frame()
amount_["percent"] = (amount_.apply(lambda x: x/x.sum())*100).round(2)
display(amount_.T)
Image by author.

The totals by each type of resource indicate that Books were the top resource (cumulatively), followed by Supplies, with Technology in third position.
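Note that the plot above counts resource line items per type; if instead we want the total dollar amount (price x quantity) per resource type, a hedged alternative:

# Total amount (item_unit_price * item_quantity) per resource type
totals = r.groupby("project_resource_type")["amount"].sum().sort_values(ascending=False)
print(totals.round(2))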

Datasets: Projects + Outcomes + Donations

Merging dataframes

# Merging dataframes Projects and Outcomes, left join
df_m0 = pd.merge(pro,out,how="left",on="projectid")
df_m0.dropna(inplace=True)
---
df_m0.drop(columns={"school_ncesid","school_latitude","school_longitude","school_zip","school_county","school_charter","school_magnet","school_year_round","school_nlns","school_kipp","school_charter_ready_promise","teacher_teach_for_america","teacher_ny_teaching_fellow","eligible_double_your_impact_match","eligible_almost_home_match","at_least_1_green_donation","great_chat"}, inplace=True)
---
# Merging the above merged dataframe with Donations, left join
df = pd.merge(df_m0,don,how="left",on="projectid")
df.dropna(inplace=True)
---
display(df.shape)
>> (467107, 52)

15. On average, in which states is there a higher concentration of donations measured by individual donors?

Let’s go back and recall in which states there was a higher concentration of submitted projects. The selected ‘top 5’ School States represent almost 50% of all project openings.

# Checking feature "school_state" regarding higher concentration of project openings
# Bar plot
ax = schools_.school_state.value_counts(ascending=False)[:5].plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("States with higher schools' concentration", fontsize=10,loc="right")
ss = schools_.school_state.value_counts().to_frame()
display(ss.head(5).T)
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)
Image by author.

Checking feature "school_state" regarding concentration of project donations above average.

state_schools = df.groupby("school_state")["donation_total"].mean()
# Bar plot
ax = df.groupby("school_state")["donation_total"].mean().sort_values(ascending=False).head().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.xlabel("States with higher schools' concentration", fontsize=10)
plt.title("States with schools receiving donations above avegrage", fontsize=10,loc="right")
plt.axhline(df.groupby("school_state")["donation_total"].mean().mean(), linewidth=2, color ="r")
plt.tight_layout()
ss_don = df.groupby("school_state")["donation_total"].mean().sort_values(ascending=False).to_frame()
display(ss_don.head(5).T)
Image by author.

What is the average of projects’ donations by state?

df.groupby("school_state")["donation_total"].mean().mean()
>> 111.75994027601705
state_schools.plot.line();
Image by author.
# Describe
state_schools = df.groupby("school_state")["donation_total"].mean()
display(state_schools.describe().to_frame()) # Describe
# Plot "total_price_excluding_optional_support"
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})
mean=state_schools.mean()
median=state_schools.median()
sns.boxplot(state_schools, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
sns.distplot(state_schools, ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--')
ax_hist.axvline(median, color='g', linestyle='-')
plt.legend({'Mean':mean,'Median':median})
ax_box.set(xlabel='')
plt.show()

16. Is there any correlation between the number of requested projects by state and the number of donors by state?

We group the donations by donor state, count the individual donors, and sort the values in descending order.

high_donors = df.groupby("donor_state")["donor_acctid"].count().sort_values(ascending=False).to_frame()
high_donors.head()
Image by author.

Let’s take from the previous calculations the stats regarding the highest submitted project requests.

high_states.head().to_frame()
Image by author.

Concatenating both datasets.

zaq = pd.concat([high_states, high_donors], axis = 1)
zaq.rename(columns={"school_state":"Requests", "donor_acctid":"Donations"}, inplace=True)
zaq.head()
Image by author.

Display correlations between submitted projects by state and donations made by state.

# Drop nans
zaq.dropna(inplace=True)
---
# Correlations
zaq.plot(x="Requests", y="Donations", kind="scatter", figsize=(9,6), fontsize=12)
plt.title("Projects requested by donations made", fontsize=10,loc="right")
plt.xticks(rotation=0,horizontalalignment="center")
plt.ylabel("Donations made",fontsize=10)
plt.xlabel("Requested projects",fontsize=10)

It seems there is a positive correlation between the amount of requested projects by state and donors by state.

# Calculating corr. coef. between Requests and Donations
zaq["Requests"].corr(zaq["Donations"])
>> 0.9689520473560519
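Since both variables contain a few very large states, a rank-based coefficient is a useful robustness check (a sketch with the spearmanr imported in the Set-up):

# Rank-based robustness check of the Requests/Donations relation
rho, pval = spearmanr(zaq["Requests"], zaq["Donations"])
print(f"Spearman rho: {rho:.3f} (p = {pval:.3g})")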

17. Which areas were targeted for submitted projects?

# Checking "primary focus areas" regarding requested projects
# Bar plot
ax = df.primary_focus_area.value_counts(ascending=False).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Primary Focus Areas", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary Focus Areas distribution", fontsize=10, loc="right")

Literacy & Language, Math & Science, Music & The Arts, Special Needs, Applied Learning, History & Civics, and Health & Sports.

18. Which areas of study gathered more donations?

From a business perspective, which primary focus areas were more attractive to donors?

prisubject_donor = df.groupby("primary_focus_area")["donation_to_project"].sum().sort_values(ascending=False).to_frame().copy()
prisubject_donor.round(2)
prisubject = df.primary_focus_area.value_counts().sort_values(ascending=False).to_frame()
prisubject

19. Is there any sort of correlation between the donation to projects and the primary focus area?

Let’s concatenate both datasets and display the correlation between primary focus area of the projects and its donations.

cde = pd.concat([prisubject, prisubject_donor], axis=1) #Concatenate
cde.round(2).head()
cde.dropna(inplace=True) #Drop nans
#Correlations
cde.plot(x="primary_focus_area", y="donation_to_project", kind="scatter", figsize=(9,6), fontsize=12)
plt.title("Primary focus area projects requested and its donations", fontsize=10, loc="right")
plt.xticks(rotation=0, horizontalalignment="center");

There is a very strong and positive correlation (99%) → donors are more willing to support areas like Literacy & Language, as well as Math & Science and Music & The Arts, than Special Needs or Applied Learning.

# Correlation
cde.primary_focus_area.corr(cde.donation_to_project)
>> 0.9917272663581518
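For completeness, the same coefficient with a p-value via scipy’s pearsonr (a minimal sketch; with only 7 focus areas, the sample is small):

# Pearson correlation with p-value
r_, p_ = pearsonr(cde["primary_focus_area"], cde["donation_to_project"])
print(f"Pearson r: {r_:.3f} (p = {p_:.3g})")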

20. On average, how long does it take for a project to be fully funded?

df[["projectid","date_posted","donation_timestamp"]].head()
Image by author.
# Filtering fully funded project
dft = (df[df["fully_funded"] == "t"]).copy()
---
# Converting column (object --> datetime)
dft["donation_timestamp"] = pd.to_datetime(dft["donation_timestamp"], format="%Y-%m-%d")
---
# displaying date only
dft["donation_timestamp"] = dft["donation_timestamp"].dt.date
dft["donation_timestamp"] = pd.to_datetime(dft["donation_timestamp"], format="%Y-%m-%d")
---
# Calculating durations (in days) and converting into int64
dft["funding_span"] = dft.donation_timestamp - dft.date_posted
dft["funding_span"] = dft.funding_span.dt.days
---
# Counting the days until fully funded
states_funding_proj = dft.groupby(["school_state","projectid"])["funding_span"].mean()
states_funding_proj
school_state  projectid                       
AK            02059528bc746a40214a5fd9cb5e9d6b     66.000000
              0213fbd06fb04ca6a797da6ae0040cb1      0.000000
              02158d40acc0744af7d7584531092f07     99.066667
              0257042b0165752bc3e686cc90665632     13.000000
              02ad156dee38b083f364c8164f7f9b2f      0.000000
                                                     ...    
WY            fb0296ec6586036457750d0b1b38a49d    101.000000
              fc9c54832601b2ed0ba54237b80de254     27.500000
              fce5bf5cd0ae40947a1152bd1b834604     41.500000
              fd69421691ce38787a90ce9dac747477     24.500000
              fe450a764037db85093d0511315ba944      1.000000
Name: funding_span, Length: 173512, dtype: float64
---
states_funding_avg = states_funding_proj.groupby(["school_state"]).mean()
avg_days = states_funding_avg.mean()
print(f"Fully funding average days:", states_funding_avg.mean())
>> Fully funding average days: 40.45127526109651
---
# Average days until fully funded projects by state
# Bar plot
ax = states_funding_proj.groupby(["school_state"]).mean().sort_values().plot(kind="barh", figsize=(12,9), fontsize=12, color=sns.color_palette("rocket"), table=False)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("States", fontsize=10)
plt.xlabel("Days until fully funded projects", fontsize=10)
plt.title("Average days until fully funded projects by state", fontsize=10, loc="right")
plt.axvline(avg_days, linewidth=2, color ="r")
plt.tight_layout()
Image by author.

On average, it takes 41 days for a project to get fully funded.

The chart above differentiates the states whose projects get fully funded faster than the average from those that get funded slower.

Next, states below average:

states_funding_avg[states_funding_avg < avg_days].sort_values(ascending=False)
school_state
WA    40.141406
ID    40.066648
NJ    39.942280
GA    39.741997
TX    39.658309
NV    39.577678
NY    39.497603
AR    39.418404
AZ    39.395866
NM    39.259313
NH    39.224946
CT    39.036108
IL    39.022429
PA    38.834201
MA    38.724716
MN    38.491491
OR    38.392444
FL    38.354962
MT    37.189795
CO    36.780902
KY    35.623806
WV    35.029220
MO    34.963707
CA    34.799367
TN    32.973356
DC    32.016388
HI    31.547774
DE    31.091836
AK    30.834884
Name: funding_span, dtype: float64

States above average days:

states_funding_avg[states_funding_avg >= avg_days].sort_values(ascending=False)
school_state
MS    53.589771
AL    50.309404
SC    48.911828
RI    48.640019
WY    48.480016
NC    46.838167
MI    46.655316
IA    46.202405
VA    45.981566
LA    44.471631
ME    43.272701
ND    42.832920
VT    42.401578
OH    42.291703
MD    42.054127
NE    41.993862
IN    41.962964
UT    41.597421
WI    41.588098
OK    41.471277
KS    40.946715
SD    40.889714
Name: funding_span, dtype: float64
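As a quick summary of the two lists above, a sketch counting the states on each side of the overall average:

# States on each side of the 40.45-day average
print((states_funding_avg < avg_days).sum(), "states below the average")
print((states_funding_avg >= avg_days).sum(), "states at or above the average")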

Wrap-up

1- What is the teacher profile? Female 87% {Mrs: 47%, Ms: 40%}, Male 13%.

2- What are the most common periods for submissions? 2.1- On which weekdays? Sun, Sat, Tue, Mon, Fri, Wed, Thu (sorted). 2.2- In which months? Sep, Oct, Jan, Aug, Feb, Dec, Nov, Mar, Apr, Jul, May, Jun. 2.3- What is the overall evolution throughout the years? Growth has been exponential.

3- Which metropolitan areas get more submissions? Urban 53%, Suburban 23%, Rural 12%, unknown 12%.

4- In which states is the highest concentration of projects? The followed ‘top 5’ represents almost 50%: California, New York, North Carolina, Illinois, and Texas.

5- Which cities’ schools have more impact? There is a city-state relation: Chicago, Los Angeles, Brooklyn, Bronx, and New York.

6- Which type of resource is most in demand? Supplies, Technology, Books, Other, Trips, Visitors (sorted).

7- What is the profile of students regarding their grades? The trend is clear: the younger the students, the greater the number of online requests.

8- Is there any relation between the grade levels and the type of resources requested? No matter the grade level, the trend stands the same: Supplies, Technology, Books, Other, Trips, Visitors (sorted).

9- Does the poverty level influence the trend for requested funds? The higher the vulnerability, the greater the number of projects submitted for funding.

10- Is there a trend in the distribution of requested projects regarding the metropolitan placement and poverty level? The most frequent are the schools that belong to urban areas associated with high poverty levels.

11- Which areas of study get the most project submissions? Literacy & Language, Math & Science, Music & The Arts, Applied Learning, Special Needs, History & Civics, Health & Sports.

12- Does the grade level influence the areas of interest/ studies? As expected, regardless of the grade level, the focus areas of studies don’t change significantly.

13- Which are the most frequent words present on the Essays dataset? a) Title: Book, Reading, Technology, Read, Science, Math, Library, Grade, Literacy, School, iPad. b) Short description: Student, School, Book, Reading, Technology, Learning, Skill, Material, Math, Computer. c) Need statement: Book, Reading, Math, Cost, Skill, Project, Center, Material, HTML, Science, Game, Camera, iPad.

14- How much did each corresponding type of resource receive in total donations? Books (511957), Supplies (1199005), Technology (243412), Other (206560), Trips (520), Visitors (181).

15- On average, in which states is there a higher concentration of donations measured by individual donors? AR, MT, RI, OK, NV.

16- Is there any correlation between the number of requests and donors? There is a positive correlation.

17- Which areas were targeted for submitted projects? Literacy & Language, Math & Science, Music & The Arts, Special Needs, Applied Learning, History & Civics, and Health & Sports.

18- Which areas of study gathered more donations? Literacy & Language, Math & Science, Music & The Arts, Applied Learning, Special Needs, History & Civics, Health & Sports (sorted).

19- Is there any sort of correlation between the donation to projects and the primary focus area? There is a very strong and positive correlation: 99%.

20- On average, how long does it take for a project to be fully funded? 41 days.

EDA concluded.

Check out other articles you might also like:

Machine Learning: costs prediction of a Marketing Campaign (Exploratory Data Analysis – Part I)

15 Business Questions about Mobile Marketing Campaigns: ROAS (Return On Ad Spend)

Descriptive Statistics: Expectations vs. Reality (Exploratory Data Analysis – EDA)

Thanks for reading.

