Predicting overdose mortality per US county

Joyce Lee
Towards Data Science
13 min read · Aug 28, 2018


Background and Motivation

The opioid epidemic has turned into one of the major public health catastrophes for this generation of Americans: much as tobacco/smoking and HIV/AIDS were for earlier generations, it appears to be this era’s defining public health crisis. I wanted to see if it was possible to build a model to predict opioid-related mortality on a county-by-county basis, since this type of model might give insights into where and how to target interventions.

I had a suspicion that a decent amount of the variance in which areas of the country have been most affected by the crisis could probably be explained by demographic or economic factors, but I was actually more curious about whether other, more modifiable predictors would turn out to be significant. For example, the common narrative is that the opioid crisis started when physicians began prescribing opioid painkillers much more liberally in the 1990s, in part because pharmaceutical companies reassured physicians that these painkillers had a low incidence of addiction and few side effects, neither of which was true. But is the opioid prescribing rate still the main driver of opioid overdose deaths? Or are there other variables that are stronger predictors by now? For example, perhaps the volume of illegal opioids such as heroin or fentanyl is a stronger predictor.

Design

The goal of this project was to predict a quantitative target variable using a multiple linear regression model. The design of the project was fairly straightforward: identify an appropriate target variable, then find publicly available datasets that would provide information on what I thought would be useful and appropriate predictors. I then planned to use these independent variables/predictors to build a multiple linear regression model, and then perform feature selection in order to see which predictors were the most important. Finally, I tested the model on a holdout test data set to see how well the model performed.
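As a rough illustration of that workflow, here is a minimal sketch in Python, assuming the cleaned data has already been assembled into a single DataFrame; the file name and the 'crude_mortality_rate' column are hypothetical placeholders, not the project's actual schema.

```python
# Minimal sketch of the overall workflow: split off a holdout test set,
# fit a multiple linear regression, and evaluate on the holdout data.
# File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv("county_data.csv")              # assumed cleaned dataset
X = df.drop(columns=["crude_mortality_rate"])    # predictors
y = df["crude_mortality_rate"]                   # target: deaths per 100,000

# Hold out a test set up front; feature selection happens on the training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```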

Data

After exploring the CDC’s descriptions of its public databases, I decided to use the CDC WONDER database to scrape my target variable, which was the crude mortality rate due to drug overdose per county in 2016. A few comments on this choice of target variable: I initially thought about predicting the mortality rate per state, but that would only give me 50 records. The CDC’s Multiple Cause of Death (MCD) database does allow you to query for deaths specifically due to opioid overdose (and you can be even more specific, e.g. querying for the number of deaths due to heroin vs. opioid analgesics vs. other types of opioids), but the counts for individual categories became so small that it was easier to get usable numbers for the overall mortality rate due to drug overdose, without specifying the type of drug. This is because when the total number of deaths per county is between 10 and 20, the CDC will give you the count but won’t calculate a crude or age-adjusted mortality rate, on the grounds that such a small number produces ‘Unreliable’ statistics. If the total number of deaths is between 0 and 9, the CDC won’t give a number at all and instead marks the result as ‘Suppressed’, because with so few deaths in a small county, individual patients could potentially be identified.

Therefore, it was easiest to get the crude mortality rate per county due to drug overdose, and the most recent dataset available was 2016. Statistically speaking, it would have been more accurate to compare age-adjusted mortality rates; however, the CDC did not provide age-adjusted rates for any counties with an ‘Unreliable’ result (total deaths between 10 and 20), and without a breakdown of how many of those deaths fell in each age bracket, I couldn’t calculate the age-adjusted mortality rate directly. There do appear to be ways to estimate it indirectly, but I decided to go ahead with the crude mortality rate and use median age per county as a predictor, as a rough way to account for this. This meant that even though there are about 3,000 counties in the US, I ended up with 1,000 records (exactly!), the counties for which it was actually possible to calculate a crude mortality rate.

Overdose-related crude mortality rate per 100,000 in 2016
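As an illustration of that filtering step, here is a minimal sketch of how a CDC WONDER export could be cleaned; the file name is a placeholder, and the 'Crude Rate' column name should be checked against the actual export.

```python
# Sketch: keep only counties where WONDER reports a numeric crude rate.
# Rows marked 'Suppressed' (0-9 deaths) or 'Unreliable' (10-20 deaths) have no usable rate.
import pandas as pd

wonder = pd.read_csv("wonder_2016_drug_overdose.txt", sep="\t")  # WONDER exports are tab-delimited

wonder["Crude Rate"] = pd.to_numeric(wonder["Crude Rate"], errors="coerce")  # text flags become NaN
usable = wonder.dropna(subset=["Crude Rate"])
print(f"{len(usable)} counties with a reported crude mortality rate")
```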

For my independent variables, broadly speaking they fell into 4 categories:

Demographic

  • % Caucasian
  • % High school graduate or more (meaning the percent of people who at least graduated high school)
  • Median age

Economic

  • Median household income
  • % Poverty
  • % Unemployment

Medical

  • Age of PDMP (prescription drug monitoring program) in the state
  • Opioid prescribing rate per 100 people

Geographic

  • US Census regional division

The demographic and economic data were all obtained by using the US Census API to query the ACS (American Community Survey) 5-year 2016 dataset. There’s definitely a learning curve in figuring out which US Census dataset to query, and then how to structure the query. The US Census offers so much information that it’s possible to make extremely granular queries for the exact, specific records you want, but the downside is that figuring out how to pull even simple data out is pretty difficult at first! What I realized after looking at the pairplot of my predictors is that the economic variables are relatively highly correlated with each other: % poverty and % unemployment, for example, aren’t measuring exactly the same thing, but they clearly correlate. I thought this might cause a big problem with collinearity, but in retrospect I think it was actually okay to keep all of these variables. When I get to the results section, I’ll discuss why engineering a variable that combines economic information might be even more useful for building a better model.
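For reference, here is a hedged sketch of what such a query can look like; B19013_001E is the ACS code for median household income, but the variable codes and the API key placeholder should be checked against the Census documentation.

```python
# Sketch: pull one county-level ACS 5-year variable via the Census API.
import requests
import pandas as pd

url = "https://api.census.gov/data/2016/acs/acs5"
params = {
    "get": "NAME,B19013_001E",      # county name + median household income
    "for": "county:*",              # all counties
    "key": "YOUR_CENSUS_API_KEY",   # placeholder
}
rows = requests.get(url, params=params).json()   # first row is the header
acs = pd.DataFrame(rows[1:], columns=rows[0])
acs["fips"] = acs["state"] + acs["county"]       # 5-digit county FIPS for joining
```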

The opioid prescribing rate per 100 people was pulled from the CDC, which provides tables with the annual opioid prescribing rate per 100 people per county going back more than a decade. Some of the numbers were pretty eye-popping; in Norton County, VA, there were 563 opioid prescriptions per 100 people in 2014. That’s right, more than 5 opioid prescriptions per person for that county! Rates were not provided for many counties, so I ended up imputing the state average to fill in these missing values. I loaded the prescription rate per county for 2014, 2015, and 2016; looking at the correlation matrix, the 2014 rates had the strongest raw correlation with the mortality rate, so I used the 2014 opioid prescription rate as my predictor rather than the other years. There wasn’t much value in including all three years, since the prescribing rate was very similar year to year.

Opioid prescription rate per 100 people in 2014
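The imputation step is simple with pandas; this sketch assumes a table with hypothetical 'state' and 'rx_rate_2014' columns.

```python
# Sketch: fill missing county prescribing rates with the state average.
import pandas as pd

rx = pd.read_csv("cdc_opioid_rx_2014.csv")                          # hypothetical file
state_mean = rx.groupby("state")["rx_rate_2014"].transform("mean")  # per-state average, aligned by row
rx["rx_rate_2014"] = rx["rx_rate_2014"].fillna(state_mean)
```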

A prescription drug monitoring program is a state-level program (except for Missouri!) that collects and monitors information on prescriptions of certain medications, specifically controlled substances such as opioids and benzodiazepines. Different states have implemented PDMPs in different years, and in the 2000s many states started implementing electronic PDMPs so that doctors, pharmacists, and other agencies can access this data. From the prescribing provider’s and pharmacist’s point of view, one of the main goals of the PDMP is to prevent patients from ‘doctor-shopping’, i.e. going to multiple providers to collect lots of prescriptions for opioids or benzos. If PDMPs really had a strong effect on decreasing opioid over-prescription, then hopefully that would translate into a lower opioid overdose mortality rate, so one could hypothesize that the overdose mortality rate would be lower overall in states with older PDMPs.

Finally, I thought that geographic region could also be a useful predictor of the crude mortality rate, since some parts of the country have been hit harder than others. One could argue about how granular to get, e.g. time zones, regions, divisions, or even state by state (although states vary enormously in size). I ended up using the US Census regional divisions, which seemed like a good compromise between macro and micro, as there are nine divisions. These were coded as dummy variables, and I removed the ‘New England’ dummy variable in order to avoid the dummy variable trap.
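The encoding itself is a one-liner with pandas; the 'census_division' column name and file are hypothetical.

```python
# Sketch: one-hot encode the Census division and drop 'New England' as the
# reference category to avoid the dummy variable trap.
import pandas as pd

df = pd.read_csv("county_data.csv")                      # assumed assembled dataset
dummies = pd.get_dummies(df["census_division"], prefix="div")
dummies = dummies.drop(columns=["div_New England"])      # reference category
df = pd.concat([df.drop(columns=["census_division"]), dummies], axis=1)
```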

Algorithm

The model that we were assigned to use was a multiple linear regression model. This was straightforward to implement using either sklearn or statsmodels, although statsmodels provides a lot more information about the fitted model. With all of the predictors included, I think R-squared was around 0.3. Damien then helped me transform the predictors using sklearn’s PolynomialFeatures in order to generate quadratic combinations, which increased the number of predictors from 16 (8 quantitative predictors, 8 dummy variables) to 53, including all of the interaction terms for the quantitative predictors. This increased R-squared on the full dataset to around 0.43, which clearly shows that the interaction terms contribute information.
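Here is a rough sketch of that expansion, applying PolynomialFeatures only to the quantitative predictors and leaving the division dummies untouched; the column names are hypothetical and the exact feature count will depend on the settings used.

```python
# Sketch: quadratic/interaction expansion of the quantitative predictors.
# df is the assembled DataFrame from the earlier sketches.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

quant_cols = ["median_age", "pct_caucasian", "pct_hs_grad", "median_income",
              "pct_poverty", "pct_unemployment", "pdmp_age", "rx_rate_2014"]
dummy_cols = [c for c in df.columns if c.startswith("div_")]

poly = PolynomialFeatures(degree=2, include_bias=False)
X_quant = poly.fit_transform(df[quant_cols])
quant_names = poly.get_feature_names_out(quant_cols)   # get_feature_names in older sklearn

X = np.hstack([X_quant, df[dummy_cols].values])        # expanded quantitative terms + dummies
feature_names = list(quant_names) + dummy_cols
```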

The next step was to perform feature selection. I tried two methods. The first was backwards elimination with cross-validation: in each round, each cross-validation fold voted for the variable whose coefficient had the highest p-value, and then I removed the variable with the most votes. The final model I selected was the one with the lowest mean validation error. The second strategy was using sklearn’s LassoCV to perform feature selection; again, I selected the Lasso model with the lowest validation error.
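The following is an illustrative reconstruction of the backwards-elimination-with-voting idea, not the exact code used in the project; it assumes X is a NumPy array, y a NumPy vector, and feature_names a list of column names matching X.

```python
# Sketch: backwards elimination where each CV fold votes for the feature whose
# coefficient has the highest p-value; the subset with the lowest mean validation
# MSE across all rounds is kept.
import numpy as np
import statsmodels.api as sm
from collections import Counter
from sklearn.model_selection import KFold

def backwards_elimination(X, y, feature_names, min_features=5):
    features = list(feature_names)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    best_mse, best_features = np.inf, list(features)
    while len(features) >= min_features:
        cols = [feature_names.index(f) for f in features]
        votes, fold_mse = Counter(), []
        for train_idx, val_idx in kf.split(X):
            X_tr = sm.add_constant(X[train_idx][:, cols], has_constant="add")
            X_val = sm.add_constant(X[val_idx][:, cols], has_constant="add")
            fit = sm.OLS(y[train_idx], X_tr).fit()
            fold_mse.append(np.mean((y[val_idx] - fit.predict(X_val)) ** 2))
            worst = int(np.argmax(fit.pvalues[1:]))    # skip the intercept
            votes[features[worst]] += 1
        if np.mean(fold_mse) < best_mse:
            best_mse, best_features = np.mean(fold_mse), list(features)
        features.remove(votes.most_common(1)[0][0])    # drop the most-voted feature
    return best_features, best_mse
```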

Interestingly, LassoCV and backwards elimination both picked ‘age’ as the most important predictor, but after that the next four most important predictors differed. Furthermore, backwards elimination selected several of the dummy variables as fairly important, whereas lasso kept them in the final model but ranked them as less important. Both gave similar R² values, about 0.41 to 0.42, and about the same validation MSE, 0.17 to 0.18. The test MSE on the final holdout test data set was also about the same for both models: 0.181 for the backwards elimination model and 0.184 for Lasso.

Tools

For obtaining the predictor and target data, I used Selenium and BeautifulSoup for the sources that did not provide an API, and the US Census API for the Census data. Cleaning the data was done with pandas, and for analysis I used sklearn and statsmodels.

Results

I looked at the top 5 positive coefficients with the largest values for the two models. They were quite similar, as you can see below. Backwards elimination included the interaction term ‘median household income x opioid rx rate’, whereas lasso included ‘% poverty’, but otherwise they shared 4 coefficients in common. Unsurprisingly, the opioid rx rate was one of the top five positive predictors, although since I didn’t include any data on illegal drug use (e.g. heroin, fentanyl), it’s unclear to me whether illegal drugs or prescription drugs are the stronger predictor of overdose mortality. I did think the interaction predictors were interesting: median household income multiplied by percent unemployed was a strong positive predictor, stronger than either feature individually. That seems a little odd at first, because you would think that if median household income went up, percent unemployment would go down and the two would cancel each other out. But that’s probably not a safe assumption: in a county with a high degree of income inequality, median household income can rise (the rich getting richer among the households that are earning income) while percent unemployment also rises, as people on the opposite end of the socioeconomic spectrum lose their jobs at the same time.
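Extracting those coefficients is straightforward once a model is fit; this snippet assumes a fitted sklearn linear model and a matching feature_names list, and note that comparing coefficient magnitudes across features is only meaningful if the predictors were standardized to a common scale.

```python
# Sketch: list the five largest positive coefficients with their feature names.
import pandas as pd

coefs = pd.Series(model.coef_, index=feature_names)   # model: fitted LinearRegression/LassoCV
print(coefs.sort_values(ascending=False).head(5))
```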

I also refit the backwards elimination model on the full data set and then mapped the residuals to a choropleth map. I wanted to see whether there was a geographic pattern to where the model overestimated or underestimated the crude mortality rate. The choropleth maps above were created using GeoPandas, a library built to handle spatial and geographic data, which comes with plotting methods for mapping. There is also a library called Folium which allows you to create interactive maps and save them as HTML pages, so I used Folium to create a residual map. Unfortunately, I can’t embed the interactive map in a Medium post, but the link is here if you’re interested. Pink means the model overestimated the overdose mortality rate for that county, whereas green means it underestimated the mortality rate; the darker the color, the larger the residual error.
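For the static version, the general pattern looks something like the sketch below; the shapefile name, the FIPS construction, and residuals_df (one row per county with 'fips' and 'residual' columns, residual = actual minus predicted) are assumptions rather than the project's actual files.

```python
# Sketch: join residuals to county geometries and plot a static choropleth.
import geopandas as gpd
import matplotlib.pyplot as plt

counties = gpd.read_file("cb_2016_us_county_500k.shp")           # Census cartographic boundary file
counties["fips"] = counties["STATEFP"] + counties["COUNTYFP"]    # 5-digit county FIPS

merged = counties.merge(residuals_df, on="fips", how="inner")

fig, ax = plt.subplots(figsize=(12, 6))
merged.plot(column="residual", cmap="PiYG", legend=True, ax=ax)  # green = underestimate, pink = overestimate
ax.set_title("Model residuals by county")
plt.show()
```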

After looking at the residual dataframe and sorting by the size of the residual, I could see that the county with the largest positive residual (meaning the model had the worst underestimate of the true crude mortality rate) was Baltimore city in Maryland. The county with the largest negative residual (meaning the model had the worst overestimate) was Saratoga County, New York.

As you can see, the model underestimated the true overdose mortality rate in Baltimore by 64 people per 100K! In Saratoga County, it overestimated by a smaller magnitude, but I think it’s interesting to look at the values of each feature for these two outliers, since it provides some insight into why the model did not work as well for these counties. For one, I did not realize that the US Census treats Baltimore city as its own county, but it apparently does, and Baltimore has a very low % Caucasian population compared to the rest of the country; remember that % Caucasian was either the 2nd or 3rd most important predictor in the backwards elimination and lasso linear regression models. It also had a poverty rate of 18.3% in 2016, versus a national poverty rate of 12.7%. Finally, the median age of Baltimore is younger than the national median age of 38, and since age was the most important predictor in both models, this may also have contributed to the models underestimating the overdose mortality rate.

For Saratoga County, the percentage of high school graduates or higher is clearly quite high at 94%. Similarly, the % unemployed and the % poverty rate were both below the national average; for the poverty rate, you can see it is quite a bit lower than the national rate of 12.7%.

I think that looking at these outliers tells me that my model is missing some sort of overall economic wellbeing/prosperity or quality of life measure, which probably isn’t being sufficiently captured by the interaction terms in the quadratic polynomial regression. Baltimore is infamous for its high crime rate, and I think it’s certainly possible that illegal opioids are a driver of overdose mortality that the model can’t use, because I didn’t include any data sources for that. On the other end of the spectrum, after looking up some information about Saratoga County, it is apparently at the heart of upstate New York’s ‘Tech Valley’, and a large number of international high-tech and electronics companies have campuses there, which would also explain its high education rate and low poverty and unemployment rates. It’s possible that creating interaction terms involving more of the economic variables (e.g. education rate x poverty rate x unemployment rate) would have lent even more predictive power to the model.

The other thing I noticed from looking at the residual map is that the model tends to underestimate the mortality rate in urban counties (or at least counties containing major cities), whereas it overestimates the mortality rate in non-urban counties. I did think about including the US Census’s 2013 Urban/Rural classification in my model, but the data provided by the Census is not a one-to-one mapping of which counties are ‘urban’ and which are ‘rural’, and by the time I thought of it, it was close to the project deadline and it would have taken too long to figure out how to encode the information.

Conclusions

My overall conclusion from this analysis is that it is possible to explain some of the variance in the overdose mortality rate using the predictors that I picked, but the models I produced certainly do not explain all of the variance. The learning curves for the models indicate that they are in a high-bias regime: I am still underfitting the data and need more features. Again, I think that finding data sources on the flow of illegal opioids through different counties would be useful, as would feature engineering a more holistic variable that captures overall quality of life or economic prosperity. Also, there does seem to be a difference between how the model performs in urban vs. rural counties, so adding that as a feature would probably also increase the predictive power of the model.
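For completeness, this is roughly how such a learning-curve check can be run with sklearn; X_train and y_train are the training data from the earlier sketches, and a training error that stays high and close to the validation error is the signature of the high-bias regime.

```python
# Sketch: learning curves for the linear model (MSE vs. training set size).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), label="training MSE")
plt.plot(sizes, -val_scores.mean(axis=1), label="validation MSE")
plt.xlabel("Training set size")
plt.ylabel("MSE")
plt.legend()
plt.show()
```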

The majority of my predictors were economic or demographic, but if I were going to expand the model, I think it would be really interesting to use different public health interventions as predictor variables for the following year’s overdose mortality rate, to see whether the presence, absence, or age of these interventions was a strong negative predictor. Lots of states and counties are trying different interventions, e.g. issuing standing naloxone orders, using ‘Vivitrol courts’, opening methadone clinics, etc. I tried to do this by using the age of the PDMP in my model, but it was not a strong predictor. I think one issue was that the PDMP website I got my information from gives the year that any PDMP was started, but electronic PDMPs that providers and pharmacists could query in real time didn’t really appear until the 2000s, so PDMPs likely weren’t able to influence the actual number of opioids released to the public until at least then. It would probably be necessary to do more research to find a better PDMP feature than just the age of the program, e.g. the year that an electronic PDMP was first implemented, or even some measure of how much each PDMP is actually used.

Finally, the other thing that would be interesting to look at is how the opioid crisis has evolved over time. I think it’s entirely possible that different features have become more or less important over time, with different drivers taking over. To do this analysis, it would be necessary to build separate models for each year and to figure out the appropriate time lag for some predictors; for example, it’s possible that the presence of a standing naloxone order has a strong negative correlation with a county’s mortality rate 1 or 2 years later, but not necessarily in the year the order was issued. There could certainly be a lot more work done on this problem!

