Recently I completed a project to predict communities at risk of food insecurity in America. You can find information about that project here, including code and a PowerPoint presentation. For that project, I used government datasets from the USDA called SNAP QC data. These are massive datasets of over 40k records and 800+ features, accompanied by a technical document that explains the features. In this post I'll walk through a technical analysis of how I narrowed that dataset.
What are QC datasets?
The SNAP program from the USDA uses "QC" data. These are quality control datasets, meaning the records have been screened in some way before inclusion in the final set. Unfortunately, those screening standards change every year, though never drastically. Because government data for these programs is so massive, statisticians implement "features" intended to represent external influences on those who participate in the program. In the case of SNAP participants, massively incomplete applications are excluded, which makes these datasets more representative of those who receive the benefit than of everyone who applied for the program. The datasets are also weighted, with weights determined by economic influences on states. For example, if a state declares an emergency, the weights of its participants are lowered to reduce the outlier effect on the dataset as a whole.
Narrowing the data: GIS
A spatial analysis of 2007–2008 data, done as part of an ESRI Spatial Data Science MOOC, surfaced counties that were outliers compared to their neighbors. First, a 2D model was built with ESRI's "Time-Series Analysis 2D" tool to visualize how each county changed from one year to the next. Next, ESRI's "Emerging Hot Spot Analysis" tool was used to flag counties that were statistically significant outliers relative to their surrounding neighbors from 2007–2008.
It does this by "using the Conceptualization of Spatial Relationships values you provide to calculate the Getis-Ord Gi* statistic (Hot Spot Analysis) for each bin. Once the space-time hot spot analysis completes, each bin in the input NetCDF cube has an associated z-score, p-value, and hot spot bin classification added to it. Next, these hot and cold spot trends are evaluated using the Mann-Kendall trend test. With the resultant trend z-score and p-value for each location with data, and with the hot spot z-score and p-value for each bin, the Emerging Hot Spot Analysis tool categorizes each study area location." – taken from the ArcGIS Pro Documentation on the tool.
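Outside of ArcGIS, the Mann-Kendall step of that pipeline is simple enough to sketch. Below is a minimal numpy/scipy version of the trend test (ignoring tie corrections), applied to a hypothetical series of yearly Gi* z-scores for one location; the function name and example values are my own, not part of the ESRI tooling.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall_z(series):
    """Mann-Kendall trend test (no tie correction) for a 1-D time series.

    Returns the z-score and two-sided p-value. Positive z = upward trend.
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    # S counts concordant minus discordant pairs over all i < j
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance of S, ignoring ties
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p = 2 * (1 - norm.cdf(abs(z)))             # two-sided p-value
    return z, p

# e.g. yearly Gi* z-scores for one county's bins in the space-time cube (made-up values)
z, p = mann_kendall_z([0.4, 0.9, 1.1, 1.6, 2.1, 2.4])
print(f"trend z = {z:.2f}, p = {p:.3f}")
```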
The result highlighted a hot spot of increased usage in San Juan County, New Mexico, and an emerging cold spot in Cherry County, Nebraska, where participation decreased. Since the SNAP QC data came in at the state level, with no way to filter by county, I chose to compare Nebraska to New Mexico to highlight the extreme cases. I also wanted to do a 10-year gap analysis to reflect the impact of the 2008 crash on food-insecure communities at risk. So I ended up with 4 datasets: New Mexico and Nebraska for 2007, and New Mexico and Nebraska for 2017.

Narrowing the data: High Nullity
To narrow the data further, I first removed columns that were 100% null, meaning features that contained no data at all. These were presumably features added for nationwide use that simply had no applicable cases among applicants in Nebraska or New Mexico. The all-null columns differed between Nebraska and New Mexico, and they also differed from 2007 to 2017. Simply put, any column in any of the datasets with all null values was removed.
Next, I found a paper entitled "The proportion of missing data should not be used to guide decisions on multiple imputation," which discusses how to think about high nullity. The paper concluded that the integrity of the information gleaned from a column is at risk when too little data is present to represent that feature truthfully, and that the value of the data matters more than the sheer amount of missing information. With that in mind, I defined high nullity in these datasets as columns with more than 50% of records marked null, on the grounds that each dataset had enough records that a feature was fairly represented only if at least half of its values were original. I then removed those high-nullity columns.
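In pandas, both removal passes reduce to a few lines. This is a minimal sketch assuming each QC file has been loaded into a DataFrame; the file and variable names are illustrative, not the project's actual ones.

```python
import pandas as pd

df = pd.read_csv("qc_2007_nm.csv")     # illustrative file name

# Pass 1: drop features that contain no data at all
df = df.dropna(axis=1, how="all")

# Pass 2: drop "high nullity" features, defined here as more than 50% missing
null_share = df.isna().mean()          # fraction of nulls per column
df = df.loc[:, null_share <= 0.50]
```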
Lastly, I had to account for the remaining null values before I could use a predictive model, so I used SimpleImputer from sklearn to fill them with the column mean. The mean was used because, once the high-nullity features were removed, it was a fair representation of the original data.
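A minimal sketch of that imputation step, assuming `df` now holds only the retained numeric features:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace the remaining nulls with each column's mean
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```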

Narrowing the data: Correlations
I chose to run a supervised model with "CAT_ELIG" as my target variable. This field marks each application yes or no as eligible to receive SNAP benefits. That worked for 2007, but in 2017 all applications in Nebraska and New Mexico were accepted, so correlation to a target variable was only valid for the 2007 datasets. To add another dimension to the correlations, I could run the 2017 data through PCA (Principal Component Analysis), which is well suited to finding structure when there is no meaningful target variable (every 2017 application was marked "yes" for "CAT_ELIG"). My intent, though, was to find statistically significant correlations in 2007, the year of the initial GIS hot spot analysis, and see how they changed in 2017.
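A hedged sketch of both views, assuming the 2007 frame has a numeric 0/1 CAT_ELIG column and the 2017 frame has already been imputed and is all numeric; the frame names (`df_2007`, `df_2017`) and the component count are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 2007: supervised view -- rank features by absolute correlation with the target
correlations = (
    df_2007.corrwith(df_2007["CAT_ELIG"])
    .drop("CAT_ELIG")
    .abs()
    .sort_values(ascending=False)
)
print(correlations.head(10))

# 2017: unsupervised view -- PCA on the standardized features
X_2017 = StandardScaler().fit_transform(df_2017)
pca = PCA(n_components=5)
pca.fit(X_2017)
print(pca.explained_variance_ratio_)
```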
The technical document groups its variables into 6 sections. My final set of columns consisted of the top 5 correlated features from each section. The datasets had very similar top 5 features, with only a few differences between New Mexico and Nebraska. The end result was 32 features plus the target column.
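The per-section selection is mechanical once each feature is tagged with the section it belongs to in the technical document. The mapping below is a purely hypothetical placeholder (the real one comes from the codebook); only the selection logic is the point:

```python
# Hypothetical feature-to-section mapping; the real entries come from the QC codebook
section_map = {
    "FEATURE_A": "household",
    "FEATURE_B": "income",
    "FEATURE_C": "income",
    "FEATURE_D": "deductions",
    # ... one entry per feature, for all 6 sections
}

# Absolute correlation with the target, as in the previous sketch
correlations = df_2007.corrwith(df_2007["CAT_ELIG"]).drop("CAT_ELIG").abs()

selected = []
for section in set(section_map.values()):
    features = [f for f, s in section_map.items()
                if s == section and f in correlations.index]
    # keep the 5 most strongly correlated features in this section
    selected.extend(correlations[features].nlargest(5).index.tolist())

final_columns = selected + ["CAT_ELIG"]
df_final = df_2007[final_columns]
```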

Narrowing the data: the model
For this analysis, I wanted an interpretable model in order to paint a more accurate picture of what was influencing food insecurity. Because of the intrinsic risk of high multicollinearity, I needed a model that would reduce that risk. A paper entitled "Support Vector Machine vs. Random Forest for Remote Sensing Image Classification: A Meta-analysis and systematic review" notes that SVM has been around since the '70s and was a true advancement in machine learning, but it is highly affected by correlated features, whereas Random Forest has outperformed SVMs in several situations. Because Random Forest uses bootstrap sampling of rows and random feature sampling at each split, it addresses the dataset at the row and column levels simultaneously, which greatly reduces the effect of multicollinearity on results. Ultimately, I chose Random Forest to:
- Buffer the effect of features that were related to other features.
- Get a high degree of accuracy with less hyperparameter tuning, while keeping its feature importances interpretable rather than treating it as a "black box" model.
A good practice for ultimately finding the right model is to run several initial tests on the data and compare the scores; this tells you which models are working well. Cross-validated scores were compared on the train and test sets to gauge how much variance was in the data. For this project, it was important to err on the side of positive predictions, whether they were correct or not, because it is more dangerous to under-represent areas in need than to tag areas as needing increased support when they do not. Therefore recall, which shows how many of the truly eligible cases were captured, was weighed against precision, which shows how many positive predictions were correct. The closer the two numbers, the more balanced the prediction model.
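A minimal sketch of that comparison, assuming `X` and `y` hold the selected features and a 0/1 CAT_ELIG target for a 2007 dataset; the candidate list and scorers mirror the reasoning above rather than the exact project code:

```python
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              BaggingClassifier)
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

candidates = {
    # bootstraps rows and samples features at each split, blunting multicollinearity
    "random_forest": RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                            random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
    "svm": SVC(),
}

for name, model in candidates.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "recall", "precision"],
                            return_train_score=True)
    print(name,
          "test recall %.3f" % scores["test_recall"].mean(),
          "test precision %.3f" % scores["test_precision"].mean(),
          "train/test accuracy %.3f / %.3f" % (scores["train_accuracy"].mean(),
                                               scores["test_accuracy"].mean()))
```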

I used a Voting Classifier combining Random Forest, Gradient Boosting, and a Bagging Classifier, which gave a cross-validated score of 95% (a minimal sketch follows the list below).
- The random forest for the reasons given above.
- A gradient boosting classifier to increase accuracy, particularly on the positive class.
- A bagging classifier to further even out the recall and precision scores.
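Here is a minimal sketch of that ensemble, again assuming `X` and `y` as above; the hyperparameters shown are mostly defaults rather than the tuned values from the project.

```python
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              BaggingClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("bag", BaggingClassifier(random_state=42)),
    ],
    voting="soft",   # average predicted probabilities across the three models
)

scores = cross_val_score(ensemble, X, y, cv=5)
print("mean cross-validated accuracy: %.3f" % scores.mean())
```

Soft voting averages the three models' predicted probabilities; hard (majority) voting is the other option and is a reasonable alternative here.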
Final thoughts
Every project is unique. Choosing an interpretable model in this case reduced the number of hyperparameters that could be tuned, which leaves feature selection as one of the few remaining ways to improve accuracy. The data scientist is the final arbiter of what story the data will tell. There are many ways to fine-tune the steps I used to narrow the data, but reducing the number of features is vital, since high variance is very detrimental to both accuracy and interpretation. Ultimately, a data scientist must make decisions based on what best represents the data and the project at hand.