
About the project
DonorsChoose.org is an online charity platform where thousands of teachers propose and promote project requests for materials and specific equipment, with the aim of giving every student equal educational opportunities.
The project is based on the Kaggle competition Predicting Excitement at DonorsChoose.org, which seeks to identify projects that are remarkable and make business sense.
Objectives
The purpose of the following Exploratory Data Analysis (EDA) is to reveal trends and insights that, from a business perspective, identify and highlight the projects most attractive for financial donations.
The main goal is to analyze the data in order to answer the following business question: what makes a project exciting?
Set-up
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
%matplotlib inline
import pingouin as pg
color = sns.color_palette()
from pandas.plotting import table
import scipy
from scipy.stats import pearsonr, spearmanr
from wordcloud import WordCloud, STOPWORDS
import datetime as dt
import plotly
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
# Load data
don = pd.read_csv("donations.csv")
ess = pd.read_csv("essays.csv")
out = pd.read_csv("outcomes.csv")
res = pd.read_csv("resources.csv")
pro = pd.read_csv("projects.csv")
Dataset information
The data – download here – is divided into 5 independent datasets: 1- Donations, 2- Essays, 3- Outcomes, 4- Resources, 5- Projects.

There are 3,097,989 donations and 619,326 outcomes for 664,098 projects, along with 3,667,217 resources and 664,098 essays.
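These counts can be reproduced with a quick loop over the loaded dataframes (a minimal sketch; the labels are just chosen here for printing):
# Print the shape (rows, columns) of each loaded dataframe
for name, d in [("donations", don), ("essays", ess), ("outcomes", out), ("resources", res), ("projects", pro)]:
    print(f"{name}: {d.shape[0]} rows, {d.shape[1]} columns")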
Functions
Let’s create a function that prints the shape of the dataframe → number of observations, features, duplicates, missing values.
# SHAPE
def shape_df(df):
    print(f"Number of observations: {df.shape[0]}")
    print(f"Number of variables: {df.shape[1]}")
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print(f"Are there any missing values? {df.isnull().values.any()}")
    print("-----")
    print(df.dtypes.sort_values(ascending=True))
    print("------")
    print("Datatypes' proportion:")
    print(df.dtypes.value_counts(ascending=True))
A function to detect if there are any missing values → if so, return both the absolute and percentage amounts of missing values.
# MISSING VALUES
def null_val(df):
    detect_null_val = df.isnull().values.any()
    if detect_null_val:
        null_abs = df.isnull().sum()
        null_pc = df.isnull().sum() / df.shape[0] * 100
        null_concat = pd.concat([null_abs, null_pc], axis=1).round(2)
        null_concat.columns = ['Absolute', 'Percent']
        return null_concat.sort_values(by="Absolute", ascending=False)
    else:
        print("There are no missing values.")
Function to display correlations through a heatmap → Pearson method.
# CORRELATIONS
def corrs(x):
    mask = np.triu(x.corr(), 1)
    plt.figure(figsize=(19, 9))
    return sns.heatmap(x.corr(), annot=True, vmax=1, vmin=-1, square=True, cmap='BrBG', mask=mask)
Exploratory Data Analysis
Let's find out what types of variables we're dealing with (and how many of each), check for missing values and correlations, and look at the statistical description of the numerical and categorical variables.
Dataset: Projects
Basic info
How many features are available?
How many observations are in the dataset?
Are there any duplicated records?
Are there any missing values?
What type of variables are there?
How many variables of each type?
# Observations, features, duplicates, datatypes and proportions
shape_df(pro)
>> output:
Number of observations: 664098
Number of variables: 35
Number of duplicates: 0
Are there any missing values? True
-----
fulfillment_labor_materials float64
students_reached float64
school_ncesid float64
school_latitude float64
school_longitude float64
total_price_including_optional_support float64
total_price_excluding_optional_support float64
school_zip float64
primary_focus_subject object
primary_focus_area object
secondary_focus_subject object
secondary_focus_area object
projectid object
poverty_level object
grade_level object
teacher_ny_teaching_fellow object
eligible_double_your_impact_match object
resource_type object
teacher_teach_for_america object
school_charter_ready_promise object
eligible_almost_home_match object
school_kipp object
school_nlns object
school_year_round object
school_magnet object
school_charter object
school_county object
school_district object
school_metro object
school_state object
school_city object
schoolid object
teacher_acctid object
teacher_prefix object
date_posted object
dtype: object
------
Datatypes' proportion:
float64 8
object 27
dtype: int64
# Describe missing values
null_val(pro)

# Stats description (num)
pro.describe().round(2)

From the 664,098 records, the average ‘total_price_excluding_optional_support’ is 542, considerably higher than the median, which indicates the presence of outliers. The same happens with ‘total_price_including_optional_support’. Judging by the 1st and 3rd quartiles, the data seems fine except for the maximum values.
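A quick way to confirm this right-skew is to compare the mean and the median directly (a small sketch over the same two columns):
# A mean far above the median signals right-skew, i.e. high-value outliers
for col in ["total_price_excluding_optional_support", "total_price_including_optional_support"]:
    print(col, "| mean:", round(pro[col].mean(), 2), "| median:", pro[col].median())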
# Stats description (cat)
pro.describe(include="object")

California is the state with the most submitted projects, but Chicago is the city with the highest number of observations (9007).
There are 3 classes of school metropolitan types with urban as being the most frequent.
Amongst the 5 teacher prefixes, ‘Mrs.’ is by far the most frequent, and female teachers take the lead concerning gender.
From the 7 areas of study, Literacy & Language is the most requested one with Supplies being the most solicited resource type.
Among the students identified as the most vulnerable regarding poverty level, the most frequent grade level is the youngest, ‘Grades PreK-2’.
On the last day of September of 2012, 1490 project requests were posted online.
# Checking correlations
corrs(pro)

From the previous heatmap, there are no insightful correlations.
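To make sure nothing was missed visually, the pairwise coefficients can also be scanned numerically (a sketch; it simply ranks the absolute correlations between the numeric columns):
# Rank the strongest absolute pairwise correlations, excluding self-pairs
corr_pairs = pro.corr().abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[corr_pairs < 1].drop_duplicates()
print(corr_pairs.head(10))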
1. What is the teacher’s profile and how is it distributed amongst the requested projects?
# Checking feature "teacher_prefix" distribution
df_teacher = pro[["teacher_acctid","teacher_prefix"]].copy()
# Replacing np.nan values with "Unknown"
df_teacher.teacher_prefix.replace(np.nan, "Unknown", inplace=True)
# Bar plot
ax = df_teacher["teacher_prefix"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("teacher_prefix", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Teacher Prefix Distribution", fontsize=10, loc="right")
# side table
sidetb = df_teacher.teacher_prefix.value_counts().sort_values(ascending=False).to_frame()
sidetb["percent"] = (sidetb.apply(lambda x: x/x.sum())*100).round(2)
sidetb['cumsum'] = sidetb["percent"].cumsum()
display(sidetb.T)
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

From the analysis, it is clear that ‘Mrs.’ and ‘Ms.’ are by far the most frequent titles/prefixes, together corresponding to around 87% of the records.
TimeDate
Converting date_posted into "year", "month" and "weekday" variables.
# Convert "date_posted" into "year","month","day","weekday" features
pro["date_posted"] = pd.to_datetime(pro["date_posted"], format="%Y-%m-%d")
pro["year_posted"] = pro["date_posted"].dt.year
pro["month_posted"] = pro["date_posted"].dt.month.map({1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun", 7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dez"})
pro["weekday_posted"] = pro["date_posted"].dt.dayofweek.map({0:"Mon", 1:"Tue", 2:"Wed", 3:"Thu", 4:"Fri", 5:"Sat", 6:"Sun"})
pro[["date_posted","year_posted","month_posted","weekday_posted"]]

2. What are the most common periods in which project requests have been submitted?
2.1. On which weekdays?
# Checking the distribution of submitted projects across weekdays
df_timedate = pro[["teacher_acctid","teacher_prefix","year_posted","month_posted","weekday_posted","date_posted"]].copy()
# Bar plot
ax = df_timedate["weekday_posted"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Weekdays", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Submited projects regarding the days of the week", fontsize=10, loc="right")
# side table
daysofweek = df_timedate.weekday_posted.value_counts().sort_values(ascending=False).to_frame()
daysofweek["percent"] = (daysofweek.apply(lambda x: x/x.sum())*100).round(2)
daysofweek['cumsum'] = daysofweek["percent"].cumsum()
display(daysofweek.T)

Sunday is clearly the day of the week when teachers submit the most projects. The trend is that the closer to the weekend, the greater the concentration. A possible explanation is that on Sundays teachers have more spare time for this task.
2.2. In which months?
# Checking the distribution of submitted projects along the academic year
# Bar plot
ax = df_timedate["month_posted"].value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
plt.xlabel("Months", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Submited projects along the year", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

The analysis shows that the months with the highest incidence of submissions on the platform are September and October. It makes sense to have a greater volume of submissions at the beginning of the academic year. There is a greater concentration in the first semester.
2.3. What is the overall evolution throughout the years?
# Excluding 2014 - not complete:
df_timedate[df_timedate["year_posted"]==2014].month_posted.unique()
>> output: array(['May', 'Apr', 'Mar', 'Feb', 'Jan'], dtype=object)
---
# Checking evolution from 2002 to 2013
# Line plot
ax = df_timedate.groupby("year_posted")['teacher_acctid'].count()[:-1].plot(kind="line", figsize=(9,6), fontsize=12, linewidth=2, grid=True, table=True)
plt.xlabel("Years", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Evolution of submited projects", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

The analysis of the annual project post distribution reveals a clear and strong growth trend. Since 2002, the year the platform was born, the growth rate has been exponential.
3. Which areas are more responsible for having online submissions?
schools_ = pro[["school_metro", "school_state", "school_city", "resource_type", "grade_level", "poverty_level", "year_posted"]].copy()
schools_.head()

# Checking feature "school_metro" regarding school located areas
# Replacing np.nan values with "Unknown"
schools_.school_metro.replace(np.nan, "Unknown", inplace=True)
---
# Pie plot
ax = schools_.school_metro.value_counts().plot(kind="pie", figsize=(9,6), autopct='%1.2f%%', explode= (0.005, 0.005, 0.005, 0.005), startangle=80, fontsize=12, table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("", fontsize=10)
plt.title("Metropolitan school areas", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

To which metropolitan areas do the schools belong? The highest concentration is found in urban areas, followed by suburban and rural; 12% are unknown.
4. In which states is the highest concentration of submitted projects?
# Checking feature "school_state" regarding higher concentration of project openings
high_states = schools_.school_state.value_counts()
# Bar plot
ax = schools_.school_state.value_counts(ascending=False)[:5].plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("States with higher schools' concentration", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

The selected ‘top 5’ School States together represent almost 50% of all openings. California takes the lead, followed by New York and North Carolina, with percentages close to those of Illinois and Texas.
5. Schools from which cities have the most impact?
# Checking feature "school_city" regarding project openings
# Bar plot
ax = schools_.school_city.value_counts(ascending=False)[:10].plot(kind="bar", figsize=(12,8), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Cities with higher schools' concentration", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

While at the state level California is followed by New York and North Carolina, with percentages close to those of Illinois and Texas, at the city level Chicago, Los Angeles, Brooklyn, Bronx and New York are the ones that best represent the distribution.
6. Which type of resource is most in demand?
# Checking feature "resource_type" distribution
# Bar plot
ax = schools_.resource_type.value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Resources", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Resource types Distribution", fontsize=10, loc="right")

Regarding the type of resource requested, Supplies and Technology are the most in demand in terms of quantities, followed by Books, Other, Trips and Visitors.
7. What is the profile of students regarding their grades?
# Checking feature "grade_level" distribution
# Bar plot
ax = schools_.grade_level.value_counts().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Grade levels distribution", fontsize=10,loc="right")

The trend is clear: the younger the students, the greater the number of online requests.
8. Is there any relation between the grade levels and the type of resources requested?
# Understanding how resources are distributed over the grade levels
plt.figure(figsize=(16,8))
ax = sns.countplot(x="grade_level", hue="resource_type", data=schools_, dodge=True)
plt.legend()
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Resources' distribution by the grade levels", fontsize=10, loc="right");

No matter the grade level, the demand trend for resources stays the same: Supplies, Technology, Books, Other.
9. Does the poverty level influence the trend for requested funds?
# Checking feature "poverty_level" distribution
# Pie plot
ax = schools_.poverty_level.value_counts(dropna=False).plot(kind="pie", figsize=(12,8), autopct='%1.2f%%', explode= (0.005, 0.005, 0.005, 0.005), startangle=80, fontsize=12, table=True)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("", fontsize=10)
plt.title("Poverty classes distribution regarding requested projects", fontsize=10, loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,2)

From the analysis of the distribution of the poverty classes, it is clear that the higher the vulnerability, the greater the number of projects submitted for funding.
10. Is there a trend in the distribution of requested projects regarding the metropolitan placement and poverty level?
# Checking "poverty levels" regarding metropolitan school areas
h_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[:4]
hpov = pd.DataFrame(h_pov)
hst_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[4:8]
hstpov = pd.DataFrame(hst_pov)
l_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[8:12]
lpov = pd.DataFrame(l_pov)
m_pov = schools_.groupby(["poverty_level","school_metro"]).year_posted.count().iloc[12:]
mpov = pd.DataFrame(m_pov)
merge_1 = pd.merge(hstpov, hpov, on=["school_metro"])
merge_1.rename(columns={"year_posted_x":"Highest Poverty", "year_posted_y":"High Poverty"}, inplace=True)
merge_2 = pd.merge(merge_1,mpov,on=["school_metro"])
merge_3 = pd.merge(merge_2,lpov,on=["school_metro"])
merge_3.rename(columns={"year_posted_x":"Moderate Poverty", "year_posted_y":"Low Poverty"},inplace=True)
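For reference, pd.crosstab builds the same poverty-level by metro-area table in a single step (a sketch of an equivalent, more direct approach; pov_metro is just an illustrative name):
# One-step contingency table: metro areas (rows) x poverty levels (columns)
pov_metro = pd.crosstab(schools_["school_metro"], schools_["poverty_level"])
print(pov_metro)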
---
# Bar plot
ax = merge_3.sort_values(by=["High Poverty"], ascending=True).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Poverty level distribution regarding metropolitan schools' areas", fontsize=10,loc="right")
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

From the analysis of poverty levels across metropolitan school areas, schools in urban areas associated with high poverty levels are the most frequent. Urban areas are also much more densely populated, which is consistent with the distribution and the relation between both classes.
11. Which areas of study attract the most submissions?
# Checking "primary focus areas" regarding requested projects
subject = pro[["resource_type", "grade_level", "primary_focus_area", "total_price_excluding_optional_support", "total_price_including_optional_support", "poverty_level", "year_posted"]].copy()
---
# Bar plot
ax = subject.primary_focus_area.value_counts(ascending=False).plot(kind= "bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Primary Focus Areas", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary Focus Areas distribution", fontsize=10, loc="right")

Regarding the areas of interest/study, subjects related to Literacy & Language are the most submitted, with Math & Science requests at almost half that number. Music & The Arts, Applied Learning, Special Needs, History & Civics and Health & Sports are the least requested focus areas.
12. Does the grade level influence the areas of interest/ studies?
# Understanding how primary focus areas are distributed through the grade levels
plt.figure(figsize=(16,8))
ax = sns.countplot(x="grade_level", hue="primary_focus_area", data=subject, dodge=True)
plt.legend()
plt.xlabel("Grade levels", fontsize=10)
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary focus areas distribution by grade levels", fontsize=10, loc="right");

As expected, regardless of the grade level, the focus areas of study don't change significantly.
Dataset: Essays
ess.shape
>> (664098, 6)
ess.head()

# Describe missing values
null_val(ess)

Word Cloud
A word cloud is a type of visualization that displays the most common and frequent words in a text, in which each word's size is proportional to its frequency.
13. Which are the most frequent words present on the Essays dataset?
Let’s define a function to create a Wordcloud highlighting and plotting the most common words on each Series of the Essays’ dataset.
def plot_wordcloud(wc, otherWords=""):
ess[wc].dropna(inplace=True)
text = " ".join(ess[wc][~pd.isnull(ess[wc])])
wordcloud = WordCloud(width=3000, height=2000, random_state=1, background_color="black", colormap="Set2", collocations=False, stopwords=STOPWORDS.update(otherWords)).generate(text)
fig = plt.figure(figsize=(14,14))
plt.imshow(wordcloud)
plt.title("Wordcloud with %s content" % wc, fontsize=12, loc="right")
plt.axis("off")
plt.show()
Title
plot_wordcloud("title", ["need","Help","Classroom"])

The most frequent words present in the title are Book, Reading, Technology, Read, Science, Math, Library, Grade, Literacy, School, iPad.
Short description
plot_wordcloud("short_description", ["will", "class", "allow", "able", "one", "year", "many", "time", "remember", "use", "day", "way"])

The most frequent words present in the short description are Student, School, Book, Reading, Technology, Learning, Skill, Learn, Teach, Children, Material, Math, Computer.
Need Statement
plot_wordcloud("need_statement",["need", "student", "students", "fulfillment", "including", "help", "donorschoose", "org", "shipping,", "set", "new", "will"])

The most frequent words present in the need statement are Book, Reading, Math, Cost, Skill, Project, Center, Material, HTML, Science, Game, Camera, iPad.
Dataset: Resources
Basic info
How many features are available?
How many observations are in the dataset?
Are there any duplicated records?
Are there any missing values?
What type of variables are there?
How many variables of each type?
# Observations, features, duplicates, datatypes and proportions
shape_df(res)
>> output:
Number of observations: 3667217
Number of variables: 9
Number of duplicates: 0
Are there any missing values? True
-----
vendorid float64
item_unit_price float64
item_quantity float64
resourceid object
projectid object
vendor_name object
project_resource_type object
item_name object
item_number object
dtype: object
------
Datatypes' proportion:
float64 3
object 6
dtype: int64
# Describe missing values
null_val(res)

res.head()

res.describe(include="all").round(1)

More than one and a half million books were requested, making books the most frequent resource type.
The average item unit price is $180 while the median is $14 → this indicates the presence of some heavy outliers. Also, the negative values present in the dataset are awkward, to say the least.
The average quantity is 3.3 units with a median of 1. Q1 and Q3 indicate a good distribution, but those outliers need to be dealt with.
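Both observations are easy to verify directly (a minimal sketch):
# Skewness of the unit price and count of suspicious negative prices
print("skew:", round(res["item_unit_price"].skew(), 1))
print("negative unit prices:", (res["item_unit_price"] < 0).sum())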
14. How much did each corresponding type of resource receive in total donations?
To answer this question, we need to combine the item quantity with the item unit price for each type of available resource.
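In other words, the target figure is sum(quantity × unit price) per resource type; here is a sketch of the aggregation we are working towards (the cleaning steps below come first; totals is just an illustrative name):
# Total requested amount per resource type (quantity x unit price)
totals = (res.assign(amount=res["item_quantity"] * res["item_unit_price"])
             .groupby("project_resource_type")["amount"]
             .sum()
             .sort_values(ascending=False))
print(totals.round(2))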
Item quantity
res["item_quantity"].value_counts(dropna=False)
1.0 2304849
2.0 279693
3.0 108075
5.0 81142
4.0 80076
...
315.0 1
10500.0 1
313.0 1
309.0 1
121.0 1
Name: item_quantity, Length: 313, dtype: int64
---
res.item_quantity.value_counts(normalize=True,bins=15).round(6)
(-993.1089999999999, 66207.2] 0.997734
(926900.8, 993108.0] 0.000000
(860693.6, 926900.8] 0.000000
(794486.4, 860693.6] 0.000000
(728279.2, 794486.4] 0.000000
(662072.0, 728279.2] 0.000000
(595864.8, 662072.0] 0.000000
(529657.6, 595864.8] 0.000000
(463450.4, 529657.6] 0.000000
(397243.2, 463450.4] 0.000000
(331036.0, 397243.2] 0.000000
(264828.8, 331036.0] 0.000000
(198621.6, 264828.8] 0.000000
(132414.4, 198621.6] 0.000000
(66207.2, 132414.4] 0.000000
Name: item_quantity, dtype: float64
Aggregating the items into 15 buckets, we see that virtually all the data (99.8%) is concentrated between the values -993 and 66,207.
Let's plot the Empirical Cumulative Distribution Function (ECDF) of the "item_quantity" feature to detect outliers. For each point on the x-axis, it shows the fraction of the data at or below that value, which makes the extreme values stand out clearly.
# Plotting ECDF
x = np.sort(res["item_quantity"].dropna())
y = np.arange(1, len(x)+1) / len(x)
plt.plot(x,y,marker = ".", linestyle = "none")
plt.title("ECDF Item Quantity",loc="right");

Let’s search for those distant outliers.
value = res.item_quantity.quantile([.9999995])
p = value.iloc[0]
res[res["item_quantity"] > p]

I have decided to drop the item ‘SWATCH BOOK-SCHOOL SMART CONST 09506’ (which also has no price) and keep all the others.
# Removing outlier
res = res[res["item_quantity"] < 993108].copy()
res

res["item_quantity"].describe().round(2)
count 3658906.00
mean 3.03
std 11.75
min 0.00
25% 1.00
50% 1.00
75% 2.00
max 10500.00
Name: item_quantity, dtype: float64
---
res["item_quantity"].value_counts(dropna=False)
1.00 2731320
2.00 317590
3.00 120462
4.00 87816
5.00 85564
...
2.99 1
274.00 1
10500.00 1
268.00 1
290.00 1
Name: item_quantity, Length: 316, dtype: int64
---
# Next step is to deal with np.nan, 0.00 and 2.99 values
res[res["item_quantity"] == 2.99]

# Checking for items with no quantity records
(res[res["item_quantity"] == 0]).head()

The approach for both issues: round the 2.99 records to 3, and replace the zero values (and the remaining missing values) with the median.
# Rounding the 2.99 records to 3
res["item_quantity"] = res["item_quantity"].replace(2.99, 3)
# Replacing zero quantities with np.nan
res["item_quantity"] = res["item_quantity"].replace(0, np.nan)
# Replacing the missing values with the median
res["item_quantity"] = res.item_quantity.fillna(res.item_quantity.median())
---
res.item_quantity.describe().round(1)
count 3658906.0
mean 3.0
std 11.8
min 1.0
25% 1.0
50% 1.0
75% 2.0
max 10500.0
Name: item_quantity, dtype: float64
Item Price
# Plotting ECDF
x = np.sort(res.item_unit_price.dropna())
y = np.arange(1, len(x)+1) / len(x)
plt.plot(x,y,marker = ".", linestyle = "none")
plt.title("ECDF Item Price",loc="right");

re = res.copy() # Make a copy of the dataset
# Dealing with the outliers
Q1 = re["item_unit_price"].quantile(0.25)
Q3 = re["item_unit_price"].quantile(0.75)
IQR = Q3 - Q1
re["item_unit_price_"] = re.item_unit_price[~((re.item_unit_price < (Q1 - 1.5 * IQR)) |(re.item_unit_price > (Q3 + 1.5 * IQR)))]
# Displaying data dispersion without the outliers
print(re.item_unit_price.describe().round(1))
print("")
print(int(re.item_unit_price.describe()[0]) - int(re.item_unit_price_.describe()[0]),"outliers were identified with this operation")
count 3653295.0
mean 179.8
std 234667.6
min -99.6
25% 6.3
50% 14.0
75% 36.0
max 448421569.0
Name: item_unit_price, dtype: float64
>> 488446 outliers were identified with this operation
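The filter above applies Tukey's fences: anything beyond 1.5 × IQR from the quartiles is flagged. As a reusable sketch (iqr_outliers is a hypothetical helper name):
# Sketch: boolean mask of IQR-based outliers for any numeric Series
def iqr_outliers(s, k=1.5):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)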
re.item_unit_price_.value_counts(dropna=False)
NaN 494057
29.950 56527
6.290 49729
19.950 43492
3.590 42383
...
3.321 1
11.331 1
7.551 1
62.150 1
78.540 1
Name: item_unit_price_, Length: 8199, dtype: int64
---
# Imputing the outlier prices (set to NaN above) with the median
re["item_unit_price"] = re["item_unit_price_"].fillna(re.item_unit_price.median())
# Removing negative and zero values
res = re[re["item_unit_price"] > 0].copy()
res.drop(["item_unit_price_"], axis=1, inplace=True)
res.describe().round(2)

# Displaying results without the outliers
plt.hist(res.item_unit_price, bins=50)
plt.title("Numeric variable 'Item unit price'", fontsize=12, loc="right")
plt.ylabel('Absolute Frequency')
plt.show()

project_resource_type
r = res[["projectid","project_resource_type","item_unit_price","item_quantity"]].copy()
r["amount"] = r.apply(lambda x: x["item_unit_price"] * x["item_quantity"], axis=1)
r.head()

# Scatter plot: relationship between Item Unit Price and Item Quantity
fig = plt.figure(figsize=(9,6))
ax = fig.add_subplot(111)
ax.scatter(r["item_quantity"],r["item_unit_price"])
ax.set_xlabel("Iten Quantity")
ax.set_ylabel("Item Unit Price")
ax.set_title("Relationship of Item Unit Price vs. Item Quantity", fontsize=10,loc="right")
ax.set_xlim(0,650);

The scatter plot shows that the more expensive the items are, the fewer units tend to be requested.
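The visual impression can be quantified with the pearsonr imported in the set-up (a sketch; expect a weak negative coefficient; sub is just an illustrative name):
# Linear correlation between quantity and unit price
sub = r[["item_quantity", "item_unit_price"]].dropna()
corr, _ = pearsonr(sub["item_quantity"], sub["item_unit_price"])
print(round(corr, 3))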
# Bar plot
ax = r.groupby(["project_resource_type"])["amount"].count().sort_values(ascending=False).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("project resource type", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Distribution of project resources by amounts", fontsize=10, loc="right")
# side table
amount_ = r.groupby(["project_resource_type"])["amount"].count().sort_values(ascending=False).to_frame()
amount_["percent"] = (amount_.apply(lambda x: x/x.sum())*100).round(2)
display(amount_.T)

The counts per resource type show that Supplies were by far the most requested resource, followed by Books, with Technology in the third position.
Datasets: Projects + Outcomes + Donations
Merging dataframes
# Merging dataframes Projects and Outcomes, left join
df_m0 = pd.merge(pro,out,how="left",on="projectid")
df_m0.dropna(inplace=True)
---
df_m0.drop(columns={"school_ncesid","school_latitude","school_longitude","school_zip","school_county","school_charter","school_magnet","school_year_round","school_nlns","school_kipp","school_charter_ready_promise","teacher_teach_for_america","teacher_ny_teaching_fellow","eligible_double_your_impact_match","eligible_almost_home_match","at_least_1_green_donation","great_chat"}, inplace=True)
---
# Merging the above merged dataframe with Donations, left join
df = pd.merge(df_m0,don,how="left",on="projectid")
df.dropna(inplace=True)
---
display(df.shape)
>> (467107, 52)
15. On average, in which states is there a higher concentration of donations, measured by individual donors?
Let's go back and recall in which states there was a higher concentration of submitted projects. The selected ‘top 5’ School States represent almost 50% of all project openings.
# Checking feature "school_state" regarding higher concentration of project openings
# Bar plot
ax = schools_.school_state.value_counts(ascending=False)[:5].plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=True)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.title("States with higher schools' concentration", fontsize=10,loc="right")
display(high_states.head(5).to_frame().T)
ax.axes.get_xaxis().set_visible(False)
table = ax.tables[0]
table.set_fontsize(10)
table.scale(1,1.7)

Checking feature "school_state" regarding concentration of project donations above average.
state_schools = df.groupby("school_state")["donation_total"].mean()
# Bar plot
ax = df.groupby("school_state")["donation_total"].mean().sort_values(ascending=False).head().plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("Absolute values", fontsize=10)
plt.xlabel("States with higher schools' concentration", fontsize=10)
plt.title("States with schools receiving donations above avegrage", fontsize=10,loc="right")
plt.axhline(df.groupby("school_state")["donation_total"].mean().mean(), linewidth=2, color ="r")
plt.tight_layout()
display(state_schools.sort_values(ascending=False).head(5).to_frame().T)

What is the average project donation by state?
df.groupby("school_state")["donation_total"].mean().mean()
>> 111.75994027601705
state_schools.plot.line();

# Describe
state_schools = df.groupby("school_state")["donation_total"].mean()
display(state_schools.describe().to_frame()) # Describe
# Plot "total_price_excluding_optional_support"
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {"height_ratios": (0.2, 1)})
mean=state_schools.mean()
median=state_schools.median()
sns.boxplot(state_schools, ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
sns.distplot(state_schools, ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--')
ax_hist.axvline(median, color='g', linestyle='-')
plt.legend({'Mean':mean,'Median':median})
ax_box.set(xlabel='')
plt.show()


16. Is there any correlation between the number of requested projects by state and donors by the state?
We simply count the donation records by donor state and sort the values in descending order.
high_donors = df.groupby("donor_state")["donor_acctid"].count().sort_values(ascending=False).to_frame()
high_donors.head()

Let's take the counts of submitted project requests by state from the previous calculations.
high_states.head().to_frame()

Concatenating both datasets.
zaq = pd.concat([high_states, high_donors], axis = 1)
zaq.rename(columns={"school_state":"Requests", "donor_acctid":"Donations"}, inplace=True)
zaq.head()

Display correlations between submitted projects by state and donations made by state.
# Drop nans
zaq.dropna(inplace=True)
---
# Correlations
zaq.plot(x="Requests", y="Donations", kind="scatter", figsize=(9,6), fontsize=12)
plt.title("Projects requested by donations made", fontsize=10,loc="right")
plt.xticks(rotation=0,horizontalalignment="center")
plt.ylabel("Donations made",fontsize=10)
plt.xlabel("Requested projects",fontsize=10)

It seems there is a positive correlation between the number of projects requested by state and the number of donors by state.
# Calculating corr. coef. between Requests and Donations
zaq["Requests"].corr(zaq["Donations"])
>> 0.9689520473560519
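Since a few very large states could inflate Pearson's r, a rank-based check is a sensible complement (a sketch using the spearmanr imported in the set-up):
# Spearman's rank correlation is less sensitive to the extreme states
rho, pval = spearmanr(zaq["Requests"], zaq["Donations"])
print(round(rho, 3), round(pval, 4))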
17. Which areas were targeted for submitted projects?
# Checking "primary focus areas" regarding requested projects
# Bar plot
ax = df.primary_focus_area.value_counts(ascending=False).plot(kind="bar", figsize=(9,6), fontsize=12, color=sns.color_palette("rocket"), table=False)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 7), textcoords='offset points')
plt.xlabel("Primary Focus Areas", fontsize=10)
plt.xticks(rotation=30, horizontalalignment="right")
plt.ylabel("Absolute values", fontsize=10)
plt.title("Primary Focus Areas distribution", fontsize=10, loc="right")

Literacy & Language, Math & Science, Music & The Arts, Special Needs, Applied Learning, History & Civics, and Health & Sports.
18. Which areas of study gathered more donations?
From a business perspective, which primary focus areas were more attractive to donors?
prisubject_donor = df.groupby("primary_focus_area")["donation_to_project"].sum().sort_values(ascending=False).to_frame().copy()
prisubject_donor.round(2)

prisubject = df.primary_focus_area.value_counts().sort_values(ascending=False).to_frame()
prisubject

19. Is there any sort of correlation between the donation to projects and the primary focus area?
Let's concatenate both datasets and display the correlation between the projects' primary focus area counts and their donations.
cde = pd.concat([prisubject, prisubject_donor], axis=1) # Concatenate the counts and the donation sums
cde.round(2).head()

cde.dropna(inplace=True) #Drop nans
#Correlations
cde.plot(x="primary_focus_area", y="donation_to_project", kind="scatter", figsize=(9,6), fontsize=12)
plt.title("Primary focus area projects requested and its donations", fontsize=10, loc="right")
plt.xticks(rotation=0, horizontalalignment="center");

There is a very strong and positive correlation (99%) → donors are more willing to support areas like Literacy & Language, as well as Math & Science and Music & The Arts, than Special Needs or Applied Learning.
# Correlation
cde.primary_focus_area.corr(cde.donation_to_project)
>> 0.9917272663581518
20. On average, how long does it take for a project to be fully funded?
df[["projectid","date_posted","donation_timestamp"]].head()

# Filtering fully funded projects
dft = (df[df["fully_funded"] == "t"]).copy()
---
# Converting column (object --> datetime)
dft["donation_timestamp"] = pd.to_datetime(dft["donation_timestamp"], format="%Y-%m-%d")
---
# Truncating the timestamp to the date (midnight), keeping the datetime dtype
dft["donation_timestamp"] = dft["donation_timestamp"].dt.normalize()
---
# Calculating durations (in days) and converting to int64
dft["funding_span"] = dft.donation_timestamp - dft.date_posted
dft["funding_span"] = dft.funding_span.dt.days
---
# Counting the days until fully funded
states_funding_proj = dft.groupby(["school_state","projectid"])["funding_span"].mean()
states_funding_proj
school_state projectid
AK 02059528bc746a40214a5fd9cb5e9d6b 66.000000
0213fbd06fb04ca6a797da6ae0040cb1 0.000000
02158d40acc0744af7d7584531092f07 99.066667
0257042b0165752bc3e686cc90665632 13.000000
02ad156dee38b083f364c8164f7f9b2f 0.000000
...
WY fb0296ec6586036457750d0b1b38a49d 101.000000
fc9c54832601b2ed0ba54237b80de254 27.500000
fce5bf5cd0ae40947a1152bd1b834604 41.500000
fd69421691ce38787a90ce9dac747477 24.500000
fe450a764037db85093d0511315ba944 1.000000
Name: funding_span, Length: 173512, dtype: float64
---
states_funding_avg = states_funding_proj.groupby(["school_state"]).mean()
avg_days = states_funding_avg.mean()
print(f"Fully funding average days:", states_funding_avg.mean())
>> Fully funding average days: 40.45127526109651
---
# Average days until fully funded projects by state
# Bar plot
ax = states_funding_proj.groupby(["school_state"]).mean().sort_values().plot(kind="barh", figsize=(12,9), fontsize=12, color=sns.color_palette("rocket"), table=False)
# Params
plt.xticks(rotation=0, horizontalalignment="center")
plt.ylabel("States", fontsize=10)
plt.xlabel("Days until fully funded projects", fontsize=10)
plt.title("Average days until fully funded projects by state", fontsize=10, loc="right")
plt.axvline(avg_days, linewidth=2, color ="r")
plt.tight_layout()

On average, projects take about 40 days to become fully funded.
The chart above differentiates the states that get fully funded sooner than the average from those that take longer.
Next, states below average:
states_funding_avg[states_funding_avg < avg_days].sort_values(ascending=False)
school_state
WA 40.141406
ID 40.066648
NJ 39.942280
GA 39.741997
TX 39.658309
NV 39.577678
NY 39.497603
AR 39.418404
AZ 39.395866
NM 39.259313
NH 39.224946
CT 39.036108
IL 39.022429
PA 38.834201
MA 38.724716
MN 38.491491
OR 38.392444
FL 38.354962
MT 37.189795
CO 36.780902
KY 35.623806
WV 35.029220
MO 34.963707
CA 34.799367
TN 32.973356
DC 32.016388
HI 31.547774
DE 31.091836
AK 30.834884
Name: funding_span, dtype: float64
States above average days:
states_funding_avg[states_funding_avg >= avg_days].sort_values(ascending=False)
school_state
MS 53.589771
AL 50.309404
SC 48.911828
RI 48.640019
WY 48.480016
NC 46.838167
MI 46.655316
IA 46.202405
VA 45.981566
LA 44.471631
ME 43.272701
ND 42.832920
VT 42.401578
OH 42.291703
MD 42.054127
NE 41.993862
IN 41.962964
UT 41.597421
WI 41.588098
OK 41.471277
KS 40.946715
SD 40.889714
Name: funding_span, dtype: float64
Wrap-up
1- What is the teacher profile? Female 87% {Mrs: 47%, Ms: 40%}, Male 13%.
2- What are the most common periods for submissions? 2.1- On which weekdays? Sun, Sat, Tue, Mon, Fri, Wed, Thu (sorted). 2.2- In which months? Sep, Oct, Jan, Aug, Feb, Dec, Nov, Mar, Apr, Jul, May, Jun. 2.3- What is the overall evolution throughout the years? Growth has been exponential.
3- Which metropolitan areas get more submissions? Urban 53%, Suburban 23%, Rural 12%, unknown 12%.
4- In which states is the highest concentration of projects? The followed ‘top 5’ represents almost 50%: California, New York, North Carolina, Illinois, and Texas.
5- Which school of which cities have more impact? There is a city-state relation: Chicago, Los Angeles, Brooklyn, Bronx, and New York.
6- Which type of resource is most in demand? Supplies, Technology, Books, Other, Trips, Visitors (sorted).
7- What is the profile of students regarding their grades? The trend is clear: the younger the students, the greater the number of online requests.
8- Is there any relation between the grade levels and the type of resources requested? No matter the grade level, the trend stands the same: Supplies, Technology, Books, Other, Trips, Visitors (sorted).
9- Does the poverty level influence the trend for requested funds? The higher the vulnerability, the greater the number of projects submitted for funding.
10- Is there a trend in the distribution of requested projects regarding the metropolitan placement and poverty level? The most frequent are the schools that belong to urban areas associated with high poverty levels.
11- Which areas of study attract the most submissions? Literacy & Language, Math & Science, Music & The Arts, Applied Learning, Special Needs, History & Civics, Health & Sports.
12- Does the grade level influence the areas of interest/ studies? As expected, regardless of the grade level, the focus areas of studies don’t change significantly.
13- Which are the most frequent words present on the Essays dataset? a) Title: Book, Reading, Technology, Read, Science, Math, Library, Grade, Literacy, School, iPad. b) Short description: Student, School, Book, Reading, Technology, Learning, Skill, Material, Math, Computer. c) Need statement: Book, Reading, Math, Cost, Skill, Project, Center, Material, HTML, Science, Game, Camera, iPad.
14- How much did each corresponding type of resource receive in total donations? Books (511957), Supplies (1199005), Technology (243412), Other (206560), Trips (520), Visitors (181).
15- On average, in which states is there a higher concentration of donations measured by individual donors? AR, MT, RI, OK, NV.
16- Is there any correlation between the number of requests and donors? There is a positive correlation.
17- Which areas were targeted for submitted projects? Literacy & Language, Math & Science, Music & The Arts, Special Needs, Applied Learning, History & Civics, and Health & Sports.
18- Which areas of the study did gather more donations? Literacy & Language, Math & Science, Music & The Arts, Applied Learning, Special Needs, History & Civics, Health & Sports (sorted).
19- Is there any sort of correlation between the donation to projects and the primary focus area? There is a very strong and positive correlation: 99%.
20- On average, how long does it take for a project to be fully funded? About 40 days.
EDA concluded.
- You can find the entire code here.
- The datasets can be downloaded here.
- Kaggle Competition for Predicting Excitement at DonorsChoose.org
Check out other articles you might also like:
Machine Learning: costs prediction of a Marketing Campaign (Exploratory Data Analysis – Part I)
15 Business Questions about Mobile Marketing Campaigns: ROAS (Return On Ad Spend)
Descriptive Statistics: Expectations vs. Reality (Exploratory Data Analysis – EDA)
Thanks for reading.