Introduction
As parents of two young children, my wife and I have researched the quality of schools in several US cities. This research has led us to a few popular websites that present a variety of metrics on the quality of individual schools; two of the most popular are GreatSchools.org and Niche.com. As a data professional, I wondered where these sites obtain their data. Both use data provided by the US Department of Education’s National Center for Education Statistics (NCES) (GreatSchools.org also uses quite a bit of data from state education departments, which is more detailed than the federal data).
In this series of articles, I will share my process for downloading and restructuring a number of these datasets. First, I will use Python to restructure the data and load it into a MySQL database; then I will take on some ad hoc research projects using some of Python’s popular data science libraries.
The main goal of this first article is to cover how I successfully used Pandas to load 10 years of school-level test score results obtained from NCES into a MySQL database. By loading this data into a relational database format, I aim to make a wide array of analysis tasks more efficient as I take a deeper dive into the data in future articles. Hopefully, this will be helpful to those who are interested in exploring Pandas’ SQL capabilities.
Tools
Work in this series will be performed using several popular Python libraries and MySQL, with other tools mixed in where they save me time.
For this article, I will use four Python libraries: Pandas for data processing, the os and glob modules for accessing directories and files, and SQLAlchemy for connecting to the MySQL database where I will load the data.
Data
NCES makes a set of school- and district-level performance files available each year. Each school-level performance file presents the full results for each participating US school for the assessment year. There are 265 columns in the 2017–2018 file. The first 9 columns hold information about the school and the year the data was collected. The remaining columns hold the number of students assessed and the percentage deemed proficient (percent proficient), for the whole school and broken down by a wide array of demographic and racial groups. Detailed information on the structure of the files can be found on the US Department of Education’s EDFacts Data Files site. For this project, I have saved all school-level performance files from the 2009–2010 through the 2018–2019 school years.
Building a Data Pipeline
To start with, I will import the libraries to be used and establish a connection to my MySQL database.
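A minimal version of that setup cell might look like this:

```python
# Libraries used throughout this article
import os
import glob

import pandas as pd
from sqlalchemy import create_engine
```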
Next, I establish a database connection using the SQLAlchemy library, which we will later use to load the restructured school-level performance data. One important recommendation for the SQLAlchemy database connection, which I learned the hard way, is to set the echo parameter to False when you establish the connection. If you follow my steps here, you will be writing millions of rows to a database in chunks of 10,000, and you will not want the log written to your notebook for each successfully executed write command. I made this mistake in a Jupyter notebook and ran out of memory after writing data for several hours. For additional tasks that required writing to my database, I downloaded the notebook as a .py file and ran it from the command line.
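A connection along those lines, with echo turned off, might look like the following; the connection string is a placeholder (and assumes the PyMySQL driver), so substitute your own user, password, host, and database name:

```python
# Create the SQLAlchemy engine; echo=False suppresses the per-statement log,
# which matters when writing millions of rows in 10,000-row chunks
engine = create_engine(
    "mysql+pymysql://user:password@localhost/schools",  # placeholder credentials
    echo=False,
)
```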
Finally, we come to the main focus of this article: the Pandas tools used to load this data into a database. All of the work to reformat the data and map string values to numerics was done using a class. While classes are often not critical to making a Python program work, I have found that writing them can help with code readability and with organizing different functions so that they work well together.
In this case, the Python class is going to complete 5 main tasks to prepare the data for my database: 1) read a csv file, 2) find the columns where the test score data is located, 3) select the columns that contain school identity information (e.g., name, state, agency), 4) process participation and score information and stack the data into a vertical format, and 5) convert string values contained in the test score data to numeric values to allow for analysis once stored in the database.
After these steps have been completed, the transformed data will be stored in my database using an entity attribute value (EAV) data model. The EAV model will enable us to eliminate null values from our data and store the data efficiently for later use. For our EAV model, each entity will be a school and attributes will be test results for different groupings of the school’s students.
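As a toy illustration of the target shape (the column names and values below are made-up, simplified stand-ins for the real NCES headers):

```python
# One wide row per school becomes one long row per (school, attribute) pair
wide = pd.DataFrame({
    "school_id": ["0100005"],
    "ALL_MTH_pctprof": ["45"],
    "ALL_RLA_pctprof": ["52"],
})
eav = wide.melt(id_vars="school_id", var_name="attribute", value_name="value")
#   school_id        attribute value
# 0   0100005  ALL_MTH_pctprof    45
# 1   0100005  ALL_RLA_pctprof    52
```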
The steps taken to process one annual performance file are as follows:
I instantiate the class with 2 arguments. The ‘file’ argument points to the file we will be processing, and the ‘str_replmts’ argument is a dictionary that maps string values contained in the original version of the performance file to numeric replacements.
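A sketch of that constructor is shown below; the class name and the choice to read every column as a string are my own assumptions, not necessarily what the original notebook does:

```python
class PerformanceFile:  # placeholder name for the class described in this article
    def __init__(self, file, str_replmts):
        # 'file' is the path to one annual performance csv;
        # 'str_replmts' maps string codes (e.g. 'LE10') to numeric replacements
        self.file = file
        self.str_replmts = str_replmts
        # Read the csv file, keeping everything as strings for now
        self.data = pd.read_csv(self.file, dtype=str)
```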
Next, I have created a function that identifies and builds a list of the columns within the file which store test score participation and results.
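A sketch of that column finder; the substring patterns are assumptions about the EDFacts header naming and should be adjusted to match the actual files:

```python
# Method of the PerformanceFile class sketched above
def get_score_columns(self):
    # Collect the columns holding participation counts and proficiency
    # results (assumed to contain 'NUMVALID' or 'PCTPROF' in their names)
    self.score_cols = [
        col for col in self.data.columns
        if "NUMVALID" in col or "PCTPROF" in col
    ]
    return self.score_cols
```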
Next, there are three different functions used to select columns that contain metadata regarding the school. Different functions are necessary to account for slight variations in column names between different years.
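One of those selectors might look roughly like this; the identity column names shown are illustrative, and they are exactly what varies slightly from year to year:

```python
# Method of the PerformanceFile class sketched above (one of several variants)
def get_school_index(self):
    # Keep the columns that identify the school itself
    id_cols = ["STNAM", "FIPST", "LEAID", "LEANM", "NCESSCH", "SCHNAM"]  # illustrative
    return self.data[[c for c in id_cols if c in self.data.columns]]
```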
The next two functions transform the test score data into the EAV format. The first is a conversion function, called inside the stacking function, that converts string values to numerics.
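A sketch of the conversion function; how it treats values that are already numeric (or missing) is my own guess:

```python
# Method of the PerformanceFile class sketched above
def convert_scores(self, value):
    # Map string codes (e.g. 'LE10') to numeric values using the dictionary
    # passed to the constructor; pass numeric strings through as floats
    if value in self.str_replmts:
        return self.str_replmts[value]
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
```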
This function leverages the dictionary that maps string values to numerics. Having data for test scores as numbers will be important to facilitate analysis once we have the data loaded into our database. This step is necessary because the NCES files store proficiency scores in ranges for certain student demographic groups depending on the number of registered participants. This measure is taken to protect student privacy. For example, to make this data usable for analysis purposes, the string value ‘LE10’, which stands for "less than or equal to 10 percent proficient", must be converted into a numeric value. To complete this process for all string values in the dataset, I assembled a list of all string values across all data files and performed the mapping manually in a csv file. The full process of developing this mapping file is beyond the scope of this article. The mapping file looks like this:
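For illustration only, a few hypothetical rows of such a file are shown below; the column names and the numeric replacements (here, rough midpoints of the reported ranges) are stand-ins, not necessarily the values actually used:

```
string_value,numeric_value
LE10,5
GE90,95
40-59,50
```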
This file is converted to a dictionary so that it can be used to map strings to numerics.
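Converting the mapping file into a dictionary is a short Pandas operation; the file and column names below are placeholders matching the illustration above:

```python
# Read the manually built mapping file and turn it into a {string: numeric} dict
str_replmts = (
    pd.read_csv("score_string_mappings.csv")       # placeholder file name
      .set_index("string_value")["numeric_value"]  # placeholder column names
      .to_dict()
)
```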
The function with the most functionality packed into it is the final one, get_scores_data. This function performs a series of tasks to pivot and transform the test score data into the EAV format that I will use to store the data in my MySQL database.
This function accomplishes 4 primary tasks: 1) it selects the school id and the columns which contain test score data, 2) it stacks the data vertically, 3) it parses metadata contained in the column header (e.g., demographic group, subject) into separate columns to facilitate easier filtering once the data has been loaded into the database, and 4) it replaces non-numeric values with mapped values from the ‘str_replmts’ dictionary above.
Task 1: Select the school id and the columns with test score data
Task 2: Stack the data vertically
Task 3: Parse the metadata for each column into separate fields
Task 4: Replace non-numerics with mapped values
The final action executed by this function, before the file is ready to be written to the database, is a filter-and-join operation that creates the final dataset, with the number of participants and the proficiency percentage presented side by side for each unique year/demographic group.
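Putting those pieces together, a condensed sketch of get_scores_data might look like the following. The school id column (‘NCESSCH’) and the assumed header pattern (demographic group, subject and grade, measure, and year separated by underscores, e.g. ALL_MTH00PCTPROF_1718) are illustrative; the real parsing logic depends on the exact header format of each file:

```python
# Method of the PerformanceFile class sketched above
def get_scores_data(self):
    # Task 1: keep the school id plus the test score columns
    score_cols = self.get_score_columns()
    scores = self.data[["NCESSCH"] + score_cols]

    # Task 2: stack the data vertically (wide -> long, EAV shape)
    long = scores.melt(id_vars="NCESSCH", var_name="header", value_name="value")

    # Task 3: parse the header metadata into separate columns
    parts = long["header"].str.split("_", expand=True)
    long["group"] = parts[0]             # demographic group, e.g. ALL, ECD, LEP
    long["subject"] = parts[1].str[:5]   # subject + grade, e.g. MTH00, RLA00
    long["measure"] = parts[1].str[5:]   # NUMVALID or PCTPROF
    long["year"] = parts[2]

    # Task 4: replace string codes with their numeric mappings
    long["value"] = long["value"].apply(self.convert_scores)

    # Final step: put participation counts and proficiency percentages side
    # by side for each unique school / group / subject / year combination
    counts = long[long["measure"] == "NUMVALID"]
    pct = long[long["measure"] == "PCTPROF"]
    return counts.merge(
        pct,
        on=["NCESSCH", "group", "subject", "year"],
        suffixes=("_numvalid", "_pctprof"),
    )
```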
To execute these operations on the NCES files, I use a for loop to process each file and load it into the MySQL database.
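A sketch of that loop; the directory, class, and table names are placeholders:

```python
# Process every downloaded performance file and append it to the database
for file in glob.glob(os.path.join("school_performance_files", "*.csv")):
    data = PerformanceFile(file, str_replmts).get_scores_data()
    data.to_sql(
        "school_scores",    # placeholder table name
        con=engine,
        if_exists="append",
        index=False,
        chunksize=10000,    # limit each write statement to 10k rows
    )
    del data                # free memory before processing the next file
```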
Note that I am using the ‘chunksize’ parameter in Pandas’ to_sql function. In the background, SQLAlchemy is executing write statements to the MySQL database. The size of these statements needs to be limited (the why of this is beyond the scope of this article), and limiting each write command to 10k records seems to keep the database from shutting down the connection. Another important note about this for loop: the ‘del data’ statement deletes the dataframe that was just loaded before the next file in the loop is brought into memory for processing. This can be helpful if you are working on a machine with limited memory.
Running this for loop on all of the files took about 8 hours on a Windows PC with 4 cores and 16GB of memory.
Here is a sample of what our final database table looks like after loading the 10 years of performance data.
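For readers following along, a quick way to pull such a sample back out of MySQL (using the placeholder table name from the loop above):

```python
# Preview a handful of rows from the loaded scores table
sample = pd.read_sql("SELECT * FROM school_scores LIMIT 5", con=engine)
print(sample)
```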
In addition to loading the school performance data in the EAV model, I also need to store additional metadata about each school to enable analysis between and within different schools.
I completed this last task by making calls to different versions of the get_school_index function, created to account for slight variations in file formats from year to year, concatenating the results into a single dataframe, and then loading that dataframe into a single table in my MySQL database.
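A sketch of that last step is below; for brevity it calls a single get_school_index, whereas in practice one of the year-specific variants would be chosen per file, and the table name is a placeholder:

```python
# Build one school-metadata dataframe across all years and load it into its own table
school_frames = []
for file in glob.glob(os.path.join("school_performance_files", "*.csv")):
    school_frames.append(PerformanceFile(file, str_replmts).get_school_index())

school_index = pd.concat(school_frames, ignore_index=True)
school_index.to_sql(
    "school_index",     # placeholder table name
    con=engine,
    if_exists="replace",
    index=False,
    chunksize=10000,
)
```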
Once loaded into the database, the data looks like this.
Recap
To recap, we just loaded a large amount of historical data related to test scores from public schools across the US into a MySQL database using Pandas and a few other libraries. Hopefully, this article helps you understand this dataset and gives you a few ideas about how to transform and load data using contemporary data processing tools and techniques. Thanks for reading!