Healthcare prices in the United States are notoriously confusing and increasingly…too damn high (in homage to Jimmy McMillan). In this analysis, I use publicly available datasets from the Centers for Medicare & Medicaid Services (CMS) to answer a number of questions, including:
- How do healthcare prices differ by state? Can these differences be explained by differences in median income?
- How have prices changed over time?
- If more doctors offer a medical service, does it lower the price for that service (i.e. is competition a good thing in this "market")?
In case you missed it, check out this introductory piece for an overview of why I am using this dataset, what its limitations are, and to see whether I include a gif to try to make Medicare claims seem fun. (Spoiler alert: I do! And they kind of are!)
To download the Jupyter Notebook used in this analysis, click here. Snippets of the code are included in the article below.
Please note that throughout this article whenever I use the term "claim" I am referring to the "submitted charge amount". See the introductory piece for more information on that data point.

How do healthcare prices differ by state? Can these differences be explained by differences in median income?
The first step in our analysis is to import the 2017 Medicare Provider Utilization and Payment Dataset and necessary packages.
We then need to clean up the dataset, dropping unnecessary columns, US Territories and military bases, along with a few Provider Types (doctors) that do not have billable services. Once cleaned up, we are going to limit the DataFrame down to the fifty most common Provider Types and each Provider Type’s five most common Services. These services are denoted by HCPCS codes, which are standardized medical billing codes. By reducing the data, we are able to rule out uncommon Provider Types that may only exist in one or two states (I don’t know, maybe crystal healers in NY and CA?), which could otherwise skew the states’ claims data. Moral of the story, we are left with the Top 5 Services for the Top 50 Provider Types in the 50 States + D.C.
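As a rough sketch of that "Top 5 Services for the Top 50 Provider Types" filter, something like the following would do the job in pandas. The column names (`provider_type`, `hcpcs_code`) are placeholders rather than the real CMS headers, and I am assuming "most common" means "most rows in the dataset", which may differ from the notebook's exact definition.

```python
import pandas as pd


def top_services(claims: pd.DataFrame, n_types: int = 50, n_services: int = 5) -> pd.DataFrame:
    """Keep each of the n_types most common provider types' n_services most common HCPCS codes."""
    # Most common provider types, measured by number of rows (an assumption)
    top_types = claims["provider_type"].value_counts().head(n_types).index
    subset = claims[claims["provider_type"].isin(top_types)]

    # Count rows per (provider type, HCPCS code), most common first within each type
    counts = (
        subset.groupby(["provider_type", "hcpcs_code"])
        .size()
        .reset_index(name="n_rows")
        .sort_values(["provider_type", "n_rows"], ascending=[True, False])
    )
    top_codes = counts.groupby("provider_type").head(n_services)

    # The inner merge keeps only rows whose (type, code) pair made the cut
    return subset.merge(
        top_codes[["provider_type", "hcpcs_code"]],
        on=["provider_type", "hcpcs_code"],
    )
```

Calling `top_services(claims)` with the defaults gives the 50-by-5 reduction described above; smaller values are handy for testing on a toy DataFrame.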
Separately, to find out if interstate price discrepancies can be explained by differences in median income, we need to import 2017 median income data obtained from the U.S. Census. With the US national median income of $62,868 as a base, we can create multipliers for each state to "normalize" their claims data. For example, Alabama’s median income is $52,359, so we multiply Alabama’s claims by $62,868 / $52,359 = 1.20. For the rest of this article, I will mainly discuss "adjusted" claims because I think it is unfair to compare price differentials without taking income differentials into account.
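Here is a minimal sketch of the multiplier idea, using only the Alabama figure quoted above. The column names and the two claim amounts are made up for illustration; a real income table would have one row per state.

```python
import pandas as pd

NATIONAL_MEDIAN_INCOME = 62_868  # 2017 U.S. median income quoted above

# Only Alabama's income comes from the article
income = pd.DataFrame({"state": ["AL"], "median_income": [52_359]})
income["multiplier"] = NATIONAL_MEDIAN_INCOME / income["median_income"]

# Toy claims rows (charge amounts are invented)
claims = pd.DataFrame({"state": ["AL", "AL"], "submitted_charge": [100.0, 250.0]})

# Attach each state's multiplier, then scale the submitted charge
claims = claims.merge(income[["state", "multiplier"]], on="state", how="left")
claims["adjusted_charge"] = claims["submitted_charge"] * claims["multiplier"]
```

With these numbers, a $100 submitted charge in Alabama becomes roughly $120 after adjustment.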
Back to our 2017 claims DataFrame: now that we have reduced it down to the population of Medicare claims we really care about, we can calculate a number of helpful means and medians by using a whole bunch of "groupby" and pd.merge with pandas. We will create three major DataFrames:
- DataFrame #1: National Mean/Median Claim per HCPCS Code (e.g. a Heart Transplant performed by a Cardiologist costs $XX,XXX on average, nationwide)
- DataFrame #2: Mean/Median Claim per State per HCPCS Code (e.g. a Heart Transplant performed by a Cardiologist in New York costs $YY,YYY on average)
- DataFrame #3: Mean/Median Claim per State (e.g. the New York mean procedure cost is $ZZZ; calculated by grouping by state and taking the mean of DataFrame #2)
The same three calculations are performed for: unadjusted mean claims, adjusted mean claims, unadjusted median claims, and adjusted median claims. The results are added to the corresponding DataFrames above.
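To make the three DataFrames concrete, here is a toy version of the groupby chain for unadjusted claims. The states, codes, charges, and column names are all invented, and the notebook's actual code may be organized differently.

```python
import pandas as pd

# Toy claims data (every value here is made up)
claims = pd.DataFrame({
    "state": ["NY", "NY", "NY", "VT", "VT"],
    "hcpcs_code": ["code_a", "code_a", "code_b", "code_a", "code_b"],
    "submitted_charge": [150.0, 250.0, 300.0, 100.0, 200.0],
})

# DataFrame #1: national mean/median claim per HCPCS code
national = (
    claims.groupby("hcpcs_code")["submitted_charge"]
    .agg(national_mean="mean", national_median="median")
    .reset_index()
)

# DataFrame #2: mean/median claim per state per HCPCS code
by_state_code = (
    claims.groupby(["state", "hcpcs_code"])["submitted_charge"]
    .agg(state_mean="mean", state_median="median")
    .reset_index()
)

# DataFrame #3: mean claim per state, taken as the mean of DataFrame #2's rows
by_state = by_state_code.groupby("state")["state_mean"].mean().reset_index()
```

The adjusted versions would follow the same pattern, just starting from an income-adjusted charge column instead of the raw one.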
Once these three DataFrames have been created, we can merge the State/HCPCS Code DataFrame (DataFrame #2) with the National HCPCS Code DataFrame (DataFrame #1). This allows us to calculate the difference in submitted charge for each State/HCPCS Code vs. the national mean for that HCPCS Code. For example, the average submitted charge for a Heart Transplant performed by a Cardiologist in New York might be $15,000 vs. a national mean of $10,000. In this case, the New York submitted charge is 50% higher than the national mean. After performing this calculation for all HCPCS codes, we can then take the average of these differences to calculate the state’s overall average difference from the national mean (or median).
Like before, we perform this calculation for: unadjusted mean claims, adjusted mean claims, unadjusted median claims, and adjusted median claims. When dealing with mean claims, we compare to the national mean. When dealing with median claims, we compare to the national median. Finally, the average differences per state are added to DataFrame #3.
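The "difference from the national mean" step can be sketched like this, reusing the hypothetical $15,000 vs. $10,000 Heart Transplant example from above. The second service and all column names are invented for illustration.

```python
import pandas as pd

# Stand-in for DataFrame #2 (state mean per HCPCS code); values are hypothetical
by_state_code = pd.DataFrame({
    "state": ["NY", "NY"],
    "hcpcs_code": ["heart_transplant", "office_visit"],
    "state_mean": [15_000.0, 90.0],
})

# Stand-in for DataFrame #1 (national mean per HCPCS code)
national = pd.DataFrame({
    "hcpcs_code": ["heart_transplant", "office_visit"],
    "national_mean": [10_000.0, 100.0],
})

# Bring the national figure alongside each state/code row
merged = by_state_code.merge(national, on="hcpcs_code", how="left")

# Relative difference: 0.50 means the state is 50% above the national mean
merged["diff_vs_national"] = merged["state_mean"] / merged["national_mean"] - 1

# Average the per-code differences to get one number per state
state_diff = merged.groupby("state")["diff_vs_national"].mean()
```

In this toy case New York is +50% on one service and -10% on the other, so its average difference comes out to +20%.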
Below are the finished products (we’ll cover the final three columns of DataFrame #2 later on in this article).
DataFrame #1: Mean Claim per HCPCS Code

DataFrame #2: Mean Claim per State per HCPCS Code

DataFrame #3: Mean Claim per State

To visualize how the 50 States + D.C. stack up against one another, let’s first turn to our (or maybe just my… any finance gals on here?) Excel roots: a conditionally formatted table. We can create a rank table of DataFrame #3, add a color gradient to the table, and sort from low to high on adjusted median claim. We’ll focus on the adjusted median claim in this section, since it’s very possible that an outlier (i.e. super expensive single procedure) could skew mean claims across states. I’ve already made the case for why I want to use adjusted instead of unadjusted claims, above.
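For the curious, the sorting and ranking part of that table might look something like the sketch below. The three dollar figures are adjusted median claims quoted elsewhere in this article; the column names are my own placeholders.

```python
import pandas as pd

# Three adjusted median claims quoted in this article
by_state = pd.DataFrame({
    "state": ["AK", "VT", "WI"],
    "adjusted_median": [253.0, 130.0, 236.0],
})

# Sort low to high, then rank (1 = least expensive)
ranked = by_state.sort_values("adjusted_median").reset_index(drop=True)
ranked["rank"] = ranked["adjusted_median"].rank(method="min").astype(int)
```

One way to get the color gradient in a notebook is `ranked.style.background_gradient(subset=["adjusted_median"])`, which relies on matplotlib being installed; I am not claiming that is exactly how the table here was styled.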
50 States + D.C. Medicare Claims Heatmap (Sorted by Adjusted Median Submitted Charge per State)

While this heatmap is nice, it is a pretty high level overview. To dig a little bit deeper into the adjusted median claims, plus each state’s average difference vs. the national median, we stack two bar charts on top of one another and add data labels for good measure.
50 States + D.C. Median (Adjusted) Medicare Claim Bar Chart (Including Average Difference from National Median, per State)

In analyzing this pair of charts, we can see that Wisconsin had a median submitted charge of $236 (when adjusting for median income) in 2017, and an average difference from the national median of 49% across all HCPCS codes. Nearby Michigan’s stats were $163 and -10%, by comparison.
Another way we can get a quick glance at median claim differentials across states is with a choropleth map. We will use the Python package folium to create this map, which we already imported.
50 States + D.C. Median (Adjusted) Medicare Claim Choropleth Map

We can now see fairly clearly, in three different views of the data, that the median submitted Medicare charge definitely varies by state. Importantly, this difference is not eliminated by adjusting for differences in median income per state. In short, I’d much rather pay for medical care in Vermont ($130) than Alaska ($253)!
How have prices changed over time?
Now it’s time to import the same data for 2012, so we can take a look at what changed over the course of five years! The 2012 dataset can be found here.
Again, we import and clean the data. We calculate means and medians and merge with our 2017 DataFrames to more easily compute the five-year changes. We don’t calculate median income "adjusted" means and medians, because in this portion of the analysis we will be comparing intrastate instead of interstate. Note that in order to perform an accurate five-year comparison, we have to make sure that our Provider Types and HCPCS codes exist in both years’ datasets. A couple of Provider Types in 2017 do not have a direct match in 2012, and a handful of HCPCS codes did not exist in 2012. These are dropped from the DataFrame so we can perform an apples-to-apples comparison.
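Here is a small sketch of the merge-and-compare step. An inner merge is one simple way to keep only the state/code pairs that exist in both years; the data and column names below are invented, and the notebook may drop the unmatched rows differently.

```python
import pandas as pd

# Toy per-state, per-code means for each year (all values made up)
means_2012 = pd.DataFrame({
    "state": ["VT", "WY", "WY"],
    "hcpcs_code": ["code_a", "code_a", "retired_code"],
    "mean_2012": [100.0, 100.0, 50.0],
})
means_2017 = pd.DataFrame({
    "state": ["VT", "WY", "WY"],
    "hcpcs_code": ["code_a", "code_a", "new_code"],
    "mean_2017": [105.0, 140.0, 70.0],
})

# The inner merge drops codes that appear in only one of the two years
both_years = means_2012.merge(means_2017, on=["state", "hcpcs_code"], how="inner")

# Five-year relative change per state/code pair
both_years["five_yr_change"] = both_years["mean_2017"] / both_years["mean_2012"] - 1

# One number per state: the average change across its services
state_change = both_years.groupby("state")["five_yr_change"].mean()
```

Here `retired_code` and `new_code` fall out of the comparison, leaving a +5% change for the toy Vermont and +40% for the toy Wyoming.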
We will start off with a color coded table looking at the data at a state level. We’ll look at the state mean instead of the median, because we are now looking intrastate and are less worried about outliers skewing our results. This was more of a concern when we were performing interstate comparisons. We can probably assume that an outlier within a state in 2012 would likely still be an outlier in 2017.
50 States + D.C. Five Year Change in Mean Medicare Claims (5 Yr Change is the change in percentage points)

Ahhhhhhh. I would not want to be a patient in Wyoming! Again though… Vermont seems to be doing something right. Let’s visualize these changes on a choropleth map again just for kicks.
50 States + D.C. 5 Yr Change in Mean Medicare Claims

So we can see that some states had much larger changes than others. Here’s lookin’ at you, Alaska! What about the bigger picture? Taking an average of the five year price changes across the 50 states + D.C. (i.e. "5 Yr Change" in the table above) yields a pretty stunning result of +14.45%. We can also look at changes nationally, which yield a fairly comparable result. If we take the average of the five year change in submitted charge per HCPCS code, nationally, we get an average change of +16.01%. So on average, outpatient procedures/services across the U.S. saw price hikes of 16% between 2012 and 2017.
That seems a bit high, but how does it compare to a consumer’s change in purchasing power over the same five years? To find out, we must look at the change in the Consumer Price Index (a useful tool to gauge inflation), which we can get from the Federal Reserve’s website. Between 1/1/2012 and 1/1/2017, we can note that the change in the Consumer Price Index is (243.717 / 227.842) – 1 = 7.0%. That means the average change in outpatient procedures/services was more than DOUBLE inflation over the same timeframe.
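That arithmetic is easy to double-check in Python. The two CPI readings and the +16.01% figure are the ones quoted in this article; the variable names are just mine.

```python
# CPI readings quoted above for 1/1/2012 and 1/1/2017
cpi_2012 = 227.842
cpi_2017 = 243.717

# Cumulative inflation over the five years
inflation = cpi_2017 / cpi_2012 - 1

# National average five-year change in submitted charge per HCPCS code
avg_price_change = 0.1601

# How many times larger the price change is than inflation
ratio = avg_price_change / inflation

print(f"inflation: {inflation:.1%}, price change / inflation: {ratio:.2f}x")
```

The ratio lands a bit above 2, which is where the "more than double" claim comes from.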

If more doctors offer a medical service, does it lower the price?
I am going to come clean: I do not think the healthcare market can healthily function according to a standard supply and demand model. When the demand is life or death, there isn’t too much room for elasticity! Therefore, the question of whether more doctors offering a service reduces the price of that service strikes me as a bit naïve. However, I wanted to give that notion a chance and explore it using some simple linear regressions.
To investigate this question, let’s turn back to our 2017 Medicare claims dataset. We first need to import state population data. It does not matter how many providers are offering a service in a state. We care about how many people there are per provider offering a service in a state. We can get this population data from the U.S. Census, here.
Great. To calculate People per Provider (PPP), let’s first calculate the number of providers offering a given service, per state. We can do this by pulling the Provider ID, State, Provider Type, and HCPCS Code from our full 2017 DataFrame and removing dupes. We can then use "groupby" and "count" to get the number of providers per service per state and add this to DataFrame #2 (Mean Claim per State per HCPCS Code).
Step one in calculating PPP is now complete. For step two, we merge our newly imported population data into DataFrame #2. We then perform the necessary calculation (state population / number of providers offering service in that state) and add it as a new column to the DataFrame.
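Both steps together might look roughly like this. The provider IDs, populations, and column names are invented for the example, so treat it as a sketch of the approach rather than the notebook's exact code.

```python
import pandas as pd

# Toy claims rows: provider 1 appears twice for the same service (a dupe)
claims = pd.DataFrame({
    "provider_id": [1, 1, 2, 3],
    "state": ["NY", "NY", "NY", "VT"],
    "provider_type": ["Physical Therapist"] * 4,
    "hcpcs_code": ["code_a"] * 4,
})

# Made-up state populations
population = pd.DataFrame({"state": ["NY", "VT"], "population": [1000, 300]})

# Step one: unique providers per state, provider type, and service
providers = (
    claims[["provider_id", "state", "provider_type", "hcpcs_code"]]
    .drop_duplicates()
    .groupby(["state", "provider_type", "hcpcs_code"])["provider_id"]
    .count()
    .reset_index(name="n_providers")
)

# Step two: merge in population and divide
ppp = providers.merge(population, on="state", how="left")
ppp["people_per_provider"] = ppp["population"] / ppp["n_providers"]
```

In the toy data, New York ends up with 500 people per provider for this service and Vermont with 300.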
We can now check for a relationship between PPP per service per state (what a mouthful!) and the difference in that state’s adjusted price per service vs. the national mean.
State Price, Difference from National Mean vs. People per Provider

Okay. Not particularly pretty or helpful (#relatablecontent?). This graph made me realize that we should look at the data a bit differently. We should REALLY be looking at the difference in PPP per service vs. the national average PPP per service, instead of just looking at purely PPP per service. That way we avoid the issue that there may just be some services that ALWAYS have fewer providers than others – i.e. fewer Occupational Therapists than Physical Therapists.
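One way to express that "PPP vs. the national average PPP" idea is with a groupby transform, as sketched below. I am assuming the national average for a service is the simple mean of the state-level PPP values, which is only one possible definition; the numbers and column names are made up.

```python
import pandas as pd

# Toy PPP values per state per service (all invented)
ppp = pd.DataFrame({
    "state": ["NY", "VT", "NY", "VT"],
    "hcpcs_code": ["pt_service", "pt_service", "ot_service", "ot_service"],
    "people_per_provider": [500.0, 300.0, 2000.0, 1000.0],
})

# National average PPP for each service, broadcast back onto every row
ppp["national_avg_ppp"] = (
    ppp.groupby("hcpcs_code")["people_per_provider"].transform("mean")
)

# Relative difference: 0.25 means 25% more people per provider than average
ppp["ppp_diff"] = ppp["people_per_provider"] / ppp["national_avg_ppp"] - 1
```

Because each service is compared only with itself, a service that always has few providers no longer drags the whole scatter plot to one side.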
State Price, Difference from National Mean vs. People per Provider, Difference from National Mean

This looks better, though also kind of a blob. Still, let’s run a regression to see if we can find a useful relationship in here…

The coefficient we get is 0.01137808, with an intercept of -0.02305326. Directionally, this very small >0 coefficient would seem to indicate that as PPP (vs. national mean) increases, the price increases (vs. national mean). That would seem to support the notion that more doctors means lower prices (albeit a small impact), and fewer doctors means higher prices. HOWEVER, let’s check to see how well our linear regression model does with our test data.
The result is an R2 score of -1219.81, when I run it. I previously did not even realize an R2 score could be that large in magnitude, or even negative. (It turns out R2 goes negative whenever a model predicts the test data worse than simply guessing the mean of the test data every time.) The model is that bad. Clearly, this simple regression model does not explain the data well. Maybe this is because there are too many other factors influencing the differences in prices between states. Instead of looking at the 2017 interstate data, why don’t we look at 2012–2017 intrastate data? In our second attempt, for each state we will look at the five year change in PPP per service vs. the change in price per service in that state.
The idea is that, if all of a sudden between 2012 and 2017 there was a flood of new Physical Therapists in New York, would this also result in Physical Therapy services falling in price?
To compute this, we’ll need to perform the same calculations described above, except for the 2012 claims data instead of 2017. This means we need to import the 2012 state population data, and calculate number of providers per service, in order to calculate the 2012 PPP per service per state.
By State: 5 Yr Change in People per Provider per Service vs. Change in Price per Service

Again, not gorgeous but let’s see if a simple linear regression can help describe a trend in the data.

The coefficient we get is -0.04030966 and the intercept is 0.15177793. In this case, the coefficient is directionally inconsistent with our previous result: as PPP increases, the change in price actually decreases. In sum, the fewer doctors there are, the lower the prices for the services they offer. That would make no sense in a normal market!
Before we jump to any conclusions, we should first check to see if this simple linear regression model is any good.
With an R2 Score of -213.50, we still have a very bad model.
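Since both regressions produced negative R2 scores, here is a tiny, deliberately contrived sketch of how that can happen. It uses NumPy's `polyfit` and a hand-written R2 function rather than whatever library call the notebook used, and the data points are invented so that the test set trends the opposite way from the training set.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of "always guess the mean"
    return 1.0 - ss_res / ss_tot


# Training data follows y = x exactly (made up)
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Test data trends the opposite way (also made up)
x_test = np.array([0.0, 1.0, 2.0, 3.0])
y_test = np.array([3.0, 2.0, 1.0, 0.0])

y_pred = slope * x_test + intercept
score = r_squared(y_test, y_pred)  # negative: worse than guessing the mean
```

Whenever the model's squared error on the test set is larger than the squared error of just predicting the test mean, the ratio exceeds 1 and the score drops below zero. A score in the negative hundreds or thousands means the fitted line is wildly off for the held-out points.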
After two attempts, while we didn’t exactly prove my theory that competition does NOT lower prices in the medical procedures/services "market", we definitely did not find compelling evidence that competition DOES lower prices. In fact, when attempting to reduce noise in the model, we actually got a coefficient that was directionally opposed to the idea that competition lowers prices.
This will require much further analysis and scrutiny in order to form any strong opinions, but it was good practice in performing simple linear regressions using Python, so I am satisfied for the time being.
Conclusion
How do healthcare prices differ by state? Can these differences be explained by differences in median income?
They differ pretty greatly, and adjusting for median income does not neutralize state differences. Alaska, Wisconsin, New York, and Nevada have more expensive submitted charges no matter how you slice the data. On the other hand, South Dakota, Idaho, Utah, and Vermont seem to have some of the least expensive. The range can be pretty wide, too, with income-adjusted median claims spanning from $130 (VT) to $253 (AK).
How have prices changed over time?
They’ve changed pretty greatly between 2012 and 2017! On a state by state basis, you wouldn’t want to live in Wyoming by this measure. But Vermont looks pretty tame in terms of price hikes! Nationally, the average change in price per service of +16% compares to a change in the CPI of 7% over the same timeframe.
If more doctors offer a medical service, does it lower the price for that service (i.e. is competition a good thing in this "market")?
While we cannot say for sure, increasing "competition" does not seem to obviously result in lower prices for medical services. However, these were just two simple linear regressions, so further analysis is required.
Thanks for reading! In Part 2, we’ll take a deep dive into NYC submitted charges to see what we can learn across the five boroughs.
Until next time – stay healthy!