3 Tools For Using Healthcare Claims Data For Predictive Analytics

Published in Towards Data Science · 7 min read · May 16, 2019

A high-school English teacher taught me that starting a paper with “the dictionary defines XYZ as:” generally makes for an uninsightful introduction, and I worry that starting this article with “healthcare has important untapped data opportunities” might fall equally flat. However, it’s tough to overestimate how many questions in this space remain unanswered and unasked, given the vast amount of data available.

In addition to increasingly well-formulated sets of health status monitoring and electronic health record data, billions of rows of healthcare claims data are available in public and private datasets, often of very high quality. This article briefly introduces how healthcare claims data works (its structure, uses, and difficulties) and then presents 3 common frameworks for using the data.

Information Available On Claims Forms

Healthcare claims come via 3 form types: physician, facility, and retail pharmacy. The forms share many characteristics, including member identification (name, date of birth, insurance card number, etc.), provider information (national provider ID, tax ID number, etc.), and service dates. Physician and facility claims also contain multiple ICD-10 diagnosis codes describing the condition or symptoms. Facility claims allow more than 20 diagnosis codes, but in practice the first 3 diagnosis codes capture much of the information available for both claim types. Physician and facility claims also contain a CMS place of service code describing the type of facility where a service was performed (e.g. emergency room, urgent care facility, physician’s office).

The unique characteristics of the 3 claim types reflect how each type of provider gets paid (at least generally speaking, and historically; like all things healthcare, there are nuances, recent changes, and future plans. Keep this in mind throughout, as it would be excessively tedious to repeat it as often as I could).

Physicians are paid per service: each procedure (examination, blood draw, surgical procedure, etc.) has a dollar amount attached to it, and the final payment is the sum of those procedures’ dollar amounts. Physician claim forms use CPT codes to list each service. CPT codes are 5-character alphanumeric codes that describe each service a physician can perform; for common procedures, similar services of varying severity or complexity each get their own code.
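As a rough illustration of the per-service arithmetic described above, here is a minimal sketch. The CPT codes are real, but the `FEE_SCHEDULE` dollar amounts and the `physician_payment` helper are made up for illustration; actual fee schedules vary by payer, region, and contract.

```python
# Hypothetical fee schedule: CPT code -> allowed amount in dollars.
# The dollar amounts are invented for this example.
FEE_SCHEDULE = {
    "99213": 95.0,  # office visit, established patient
    "36415": 10.0,  # routine venipuncture (blood draw)
    "93000": 25.0,  # electrocardiogram with interpretation
}


def physician_payment(cpt_codes):
    """Sum the fee-schedule amount of every CPT line on a claim."""
    return sum(FEE_SCHEDULE[code] for code in cpt_codes)


# One office visit plus one blood draw billed on the same claim.
payment = physician_payment(["99213", "36415"])
```

Here `payment` is simply the office-visit amount plus the blood-draw amount; a code billed twice on the claim would be counted twice.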

Facilities (hospitals, free-standing labs, ambulatory surgical centers, etc.), in contrast, are paid using a higher-level view. You can imagine that tabulating every time a nurse stops in your room during an inpatient stay would get tedious, so instead the hospital charges for “Room & Board”, which covers everything involved in lying in a hospital bed. Revenue codes, which are 4-digit numbers (often with a leading zero), capture each high-level service included in a hospital stay, such as operating room procedures, physical therapy, or labor room/delivery. Additionally, specific significant procedures such as transplants or arterial bypass are captured using ICD-10 procedure codes, while more common procedures use the same CPT code set used on physician claims. The DRG code is a third field that summarizes an inpatient stay in a single code.

Pharmacy claims are simple (but often contain tremendous amounts of predictive information): they list the drug prescribed (using an NDC number), the quantity, and the days supplied.

In summary, the key fields are: member identification, provider information, and service dates on all claims; ICD-10 diagnosis codes and a place of service code on physician and facility claims; CPT codes on physician claims; revenue codes, ICD-10 procedure codes, CPT codes, and a DRG code on facility claims; and the NDC number, quantity, and days supplied on pharmacy claims. In practice there are at least 3x more fields in play, but for purposes of an introductory discussion these are the big ones.

While specifics vary, there is broad similarity in how databases structure claims data. Often two fact tables are used: a “header” table stores fields that have a single value per claim, such as member, provider, dates of service, and all diagnosis codes, and a “detail” table stores fields that can have several values per claim, such as CPT, revenue, and NDC codes. Additional tables in the database may provide descriptions of the codes.

And while claims data is often relatively clean, this structure, together with the clinical complexity of the events and patient characteristics the data describes, necessitates significant pre-processing work. A variety of approaches have been built to balance the trade-off between data that is easy to analyze with minimal work and data that preserves clinical complexity. Broadly speaking there are 3 main tools: hierarchical condition category (HCC) coding, episode groupers, and clinical-based feature building. We’ll discuss the strengths and weaknesses of each below.
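To make the header/detail layout concrete, here is a small sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are assumptions chosen for illustration; real claims warehouses use their own schemas and many more fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one header row per claim, many detail rows per claim.
conn.executescript("""
CREATE TABLE claim_header (
    claim_id     TEXT PRIMARY KEY,
    member_id    TEXT,
    provider_npi TEXT,
    service_from TEXT,
    service_to   TEXT,
    dx1 TEXT, dx2 TEXT, dx3 TEXT      -- diagnosis codes live on the header
);
CREATE TABLE claim_detail (
    claim_id    TEXT REFERENCES claim_header(claim_id),
    line_number INTEGER,
    code_type   TEXT,                 -- 'CPT', 'REV', or 'NDC'
    code        TEXT
);
""")

# Invented sample data: one physician claim with two service lines.
conn.execute(
    "INSERT INTO claim_header VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("C1", "M1", "1234567890", "2019-01-05", "2019-01-05", "I10", "E11.9", None),
)
conn.executemany(
    "INSERT INTO claim_detail VALUES (?, ?, ?, ?)",
    [("C1", 1, "CPT", "99213"), ("C1", 2, "CPT", "36415")],
)

# Joining the two tables repeats the header fields on every detail line.
rows = conn.execute("""
    SELECT h.claim_id, h.member_id, h.dx1, d.line_number, d.code
    FROM claim_header AS h
    JOIN claim_detail AS d ON d.claim_id = h.claim_id
    ORDER BY h.claim_id, d.line_number
""").fetchall()
```

The join yields one row per detail line, which is a convenient analysis shape but also a reminder that header-level values (such as diagnoses) get duplicated and should not be double counted.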

Tool 1: HCC Models

HCC coding is a broadly used technique, especially in risk-scoring algorithms. Risk-scoring models assign a single number to an individual describing their “risk”, which often means predicted claims costs, but can also signify opportunity for clinical management or other characteristics. The Medicare Advantage system, ACA Individual Exchanges, and many state Managed Medicaid programs use HCC-based risk-adjustment models to produce risk-scores — while specifics differ significantly, the general idea is that by quantifying the relative morbidity of individuals enrolled in a health plan, corresponding revenue transfers can ensure an even playing field for all insurers.

HCC models generally work by enumerating condition categories that individuals are assigned to based on the presence of ICD-10 diagnosis codes and/or pharmacy prescriptions. Each category can be assigned a weight, and an individual’s total score is a combination of the category-level weights. These models rely heavily on individual diagnosis codes to quantify a patient’s condition, and they summarize the available information into 20–50 buckets. They capture a breadth of information in a form that is quick and easy to analyze, but they also potentially leave a lot of information out. For example, medical conditions such as hypertension or type 2 diabetes may pose no significant increase in risk if they are appropriately managed, but may cause significant increases in risk when they are not; simply quantifying the presence of these conditions ignores this reality.

A more general style of analysis treats the occurrence of other types of codes (not just the diagnosis and drug codes commonly used in risk score models) as dummy variables, potentially including frequency or time-based variables as well. This can help in feature engineering by quickly generating many thousands of combinations and identifying a smaller set of codes that correlate with the outcome being analyzed. Much care is needed, though: data can become very sparse when many fields are combined, and higher-level correlations can be obscured. For example, an HIV/AIDS diagnosis might be significant regardless of place of service or CPT code, but splitting this diagnosis into several distinct buckets may hide that.
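The category-and-weight idea can be sketched in a few lines. Everything specific here is an assumption for illustration: the `DX_TO_CATEGORY` mapping, the category names, the weights, and the choice to combine weights by simple addition. Published HCC models have their own mappings, hierarchies, and demographic terms.

```python
# Hypothetical mapping from ICD-10 diagnosis code to condition category.
DX_TO_CATEGORY = {
    "E11.9": "diabetes",       # type 2 diabetes without complications
    "E11.65": "diabetes",      # type 2 diabetes with hyperglycemia
    "I10": "hypertension",     # essential hypertension
    "I50.9": "heart_failure",  # heart failure, unspecified
}

# Hypothetical category weights (not taken from any published model).
CATEGORY_WEIGHT = {
    "diabetes": 0.25,
    "hypertension": 0.125,
    "heart_failure": 0.5,
}


def risk_score(diagnosis_codes):
    """Assign categories from diagnosis codes, then add up category weights.

    Each category counts once, no matter how many codes map to it, and
    codes that are not in the mapping are ignored.
    """
    categories = {
        DX_TO_CATEGORY[dx] for dx in diagnosis_codes if dx in DX_TO_CATEGORY
    }
    return sum(CATEGORY_WEIGHT[c] for c in categories)


# Two diabetes codes collapse into one category; heart failure adds its own.
score = risk_score(["E11.9", "E11.65", "I50.9", "Z00.00"])
```

Note how the two diabetes codes contribute a single weight: the model records that the condition is present, not how well it is managed, which is exactly the information loss described above.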
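A minimal sketch of this dummy-variable approach follows, using only the standard library. The `claim_lines` data, the `TYPE:CODE` labeling convention, and the `build_features` helper are assumptions made for the example.

```python
from collections import Counter

# Invented sample: (member_id, code) pairs, with the code type as a prefix.
claim_lines = [
    ("M1", "DX:I10"),
    ("M1", "CPT:99213"),
    ("M1", "CPT:99213"),
    ("M2", "DX:E11.9"),
    ("M2", "CPT:99213"),
]


def build_features(lines):
    """Return the code vocabulary plus per-member dummy and count vectors."""
    counts = {}
    for member, code in lines:
        counts.setdefault(member, Counter())[code] += 1

    vocab = sorted({code for _, code in lines})
    features = {}
    for member, member_counts in counts.items():
        features[member] = {
            # 1 if the code ever appears for the member, else 0
            "has": [1 if member_counts[code] else 0 for code in vocab],
            # how many times the code appears (a simple frequency feature)
            "count": [member_counts[code] for code in vocab],
        }
    return vocab, features


vocab, features = build_features(claim_lines)
```

With thousands of distinct codes the vectors become long and mostly zero, which is the sparsity problem mentioned above; crossing codes with other fields (such as place of service) makes it worse.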

Tool 2: Event Groupers

Event groupers attempt to capture nuances not present in HCC models by summarizing many lines and fields of data into a single event line described by customized fields. This flattened format can make correlations across fields and across separate claim lines easier to access.

A 35x19 table of claims data I reviewed could be summarized in the following story: “The patient visited his physician’s office, where a cardiac implant device was evaluated. Likely referred during this visit, the patient went to the emergency room the same day. During an 8-day hospital stay the patient experienced shortness of breath and received a chest X-ray and ‘respiratory services’ in response. Diagnosis codes tell us quite specifically the breadth of the patient’s heart-related conditions, as well as indicating malnutrition.”

Ideas for a first pass at this problem are not hard to come up with, but constructing these algorithms at scale is tricky. Scale is tricky because medical care is tricky: the same procedure can be performed to achieve different outcomes, the same diagnosis can be low or high risk depending on management and co-morbidities, a diagnosis may present differently in the short and long term, and so on.

Due to these complexities, there is an array of commercially available algorithms called “episode groupers” that perform this summarizing. There is significant diversity in how these algorithms work because providers, insurers, public policy researchers, and other users may all be interested in different flavors of story line.

Even with a well-formulated episode grouper, your data does not readily describe the whole patient. Capturing frequency or timeline information might rely on techniques similar to those used in HCC modeling, but these are still limited. Clinical-based models attempt to solve this problem.
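To give a feel for the simplest possible version of this idea, here is a toy sketch that merges one member's claim lines into events whenever consecutive service dates are close together. The `group_events` function, the `max_gap_days` parameter, the field names, and the sample lines are all assumptions; commercial episode groupers use far richer clinical logic than a date gap.

```python
from datetime import date


def group_events(lines, max_gap_days=1):
    """Merge claim lines into events based on gaps between service dates.

    A line joins the current event if its date is within `max_gap_days`
    of the event's latest date; otherwise it starts a new event.
    """
    events = []
    for line in sorted(lines, key=lambda l: l["date"]):
        if events and (line["date"] - events[-1]["end"]).days <= max_gap_days:
            event = events[-1]
            event["end"] = max(event["end"], line["date"])
            event["settings"].add(line["setting"])
            event["codes"].append(line["code"])
        else:
            events.append({
                "start": line["date"],
                "end": line["date"],
                "settings": {line["setting"]},
                "codes": [line["code"]],
            })
    return events


# Invented lines for one member: office visit, ER, inpatient, later follow-up.
lines = [
    {"date": date(2019, 3, 1), "setting": "office", "code": "99213"},
    {"date": date(2019, 3, 1), "setting": "emergency_room", "code": "99283"},
    {"date": date(2019, 3, 2), "setting": "inpatient", "code": "71046"},
    {"date": date(2019, 4, 15), "setting": "office", "code": "99213"},
]
events = group_events(lines)
```

The first three lines collapse into one event spanning the office visit, the emergency room, and the inpatient stay, while the follow-up visit six weeks later becomes its own event. A date gap alone cannot tell whether two nearby claims are clinically related, which is one reason real groupers are so much more involved.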

Tool 3: Clinical-Based Models

A significant amount of domain knowledge is necessary to make full use of claims data (like all data), but here the domain knowledge is usually gained through an MD education and years of clinical experience — not the kind of thing you can teach yourself on a weekend. This reality is especially apparent in clinical-based models, which use clinically-defined algorithms to identify features. For example, a rule to identify patients with heart failure that could potentially lead to inpatient admission (because not all heart failure is chronic in this way) might have rules based on filling combinations of prescriptions, inpatient admissions with varying diagnoses or DRG codes, multiple outpatient events with particular procedures or diagnoses, and combinations of each of these.
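The shape of such a rule can be sketched as follows. This is purely illustrative and not clinically validated: the `flag_heart_failure` function, the drug class names, the visit threshold, and the input layout are all assumptions. The only real-world fact relied on is that ICD-10 codes beginning with `I50` denote heart failure.

```python
HF_DX_PREFIX = "I50"  # ICD-10 category for heart failure
HF_DRUG_COMBO = {"loop_diuretic", "beta_blocker"}  # hypothetical combination


def flag_heart_failure(member):
    """Flag a member using a hypothetical combination of claims signals.

    `member` is a dict with:
      - "inpatient_dx": diagnosis codes from inpatient admissions
      - "outpatient_dx": one diagnosis code per outpatient event
      - "rx_classes": set of drug classes the member has filled
    """
    has_hf_admission = any(
        dx.startswith(HF_DX_PREFIX) for dx in member["inpatient_dx"]
    )
    hf_outpatient_events = sum(
        dx.startswith(HF_DX_PREFIX) for dx in member["outpatient_dx"]
    )
    fills_drug_combo = HF_DRUG_COMBO <= member["rx_classes"]

    # Either an inpatient admission with a heart failure diagnosis, or
    # repeated outpatient events combined with the prescription pattern.
    return has_hf_admission or (hf_outpatient_events >= 2 and fills_drug_combo)


admitted = {"inpatient_dx": ["I50.9"], "outpatient_dx": [], "rx_classes": set()}
managed = {
    "inpatient_dx": [],
    "outpatient_dx": ["I50.9", "I50.22"],
    "rx_classes": {"loop_diuretic", "beta_blocker", "statin"},
}
single_visit = {
    "inpatient_dx": [],
    "outpatient_dx": ["I50.9"],
    "rx_classes": {"loop_diuretic", "beta_blocker"},
}
flags = [flag_heart_failure(m) for m in (admitted, managed, single_visit)]
```

The value of a real clinical rule lies in the choices this sketch hard-codes arbitrarily: which drug combinations, which diagnoses, and how many events are enough. Those choices are where the clinical expertise comes in.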

This logic can be extremely helpful for extracting features that would be difficult to identify comprehensively with a purely algorithmic approach. The downside is that this logic is often highly specialized and does not easily summarize general characteristics of patients (such as “risk” in the HCC models).

Conclusion

My hope in this introductory discussion is to encourage broader use of medical claims data in data science applications. CMS provides many publicly available Medicare-based samples, which likely hide many untapped insights. May the fun commence!