Patient Electronic Medical Record Adoption in the US

An Analysis of NCI HINTS Survey Data

Erik Sorensen
Towards Data Science


Project Definition

Project Overview

This study aims to understand which patient demographic, health-related, and internet/electronic-device-related factors drive access to and use of electronic medical records (EMRs) in the US, using publicly available data from the National Cancer Institute’s Health Information National Trends Survey (HINTS).

Problem Statement

The Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 established requirements and incentives for US healthcare providers (HCPs) to adopt and to promote patient use of EMRs [1]. However, patient use of EMRs appears to be low (< 50%), including in chronic disease states [2, 3].

One factor governing patients’ use of EMRs is whether an EMR is made available to them by an HCP or another source, e.g. a health insurance provider. Understanding which factors drive EMR availability will help raise HCPs’ and health program administrators’ awareness of possible barriers to patients’ access to EMRs. It will also elucidate whether EMRs are being offered appropriately to those who most need them (e.g. those with multiple chronic medical conditions or unhealthy lifestyles) or to those who may not need them as much, but are more likely to use them (e.g. younger, healthier, better-educated patients who may not need to monitor their health as closely).

Understanding the factors associated with actual use of EMRs will also indicate whether patients who most need access to or easy portability of their medical records (again, patients with multiple chronic conditions or poorer health, who may need to see multiple HCPs or may have more frequent hospital or physician visits) are actually using them.

For both of these outcomes, whether availability and adoption are changing over time and whether there was a discernible change after the onset of the COVID-19 pandemic will be assessed as part of the models.

The strategy for solving this problem will be to apply multivariate logistic regression models to predict the two outcomes. The expected outcome is that these models will yield easily interpreted coefficients for each predictor variable, allowing calculation of odds ratios to determine the strength of influence of each. These coefficients can also be used to predict the likelihood of each outcome, allowing a determination of which patients are more and less likely to be offered and to use EMRs.

Finally, the variables with the largest difference in prevalence between respondents with ≥ 80% versus ≤ 20% predicted probability of EMR access and use will be determined.

Metrics

The optimal model will be the one that maximizes precision and recall. This will tend to minimize false positives and false negatives, while avoiding the pitfalls of using accuracy, which can be misleading in cases of outcome class imbalance.

These metrics are defined as follows:

Precision = True_Positives/(True_Positives + False_Positives)

Recall = True_Positives/(True_Positives + False_Negatives)

The scikit-learn implementation of automatic recursive feature elimination with cross-validation (RFECV) used in this project employs a combination of these two metrics, the F1 score:

F1_Score = 2 * Precision * Recall/(Precision + Recall)

The odds ratio describes the odds of an outcome when the variable is present vs. absent, in the case of a binary variable. For a continuous variable, it is the odds of the outcome for a unit change in the predictor. In logistic regression, the odds ratio for a predictor with coefficient beta is:

Odds_Ratio = exp(beta)
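To make these definitions concrete, here is a minimal sketch of how the metrics and odds ratios can be computed with scikit-learn; the names X_train, X_test, y_train, and y_test are placeholders for the prepared HINTS feature matrices and binary outcome vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

# Placeholder data: X_train/X_test are feature matrices, y_train/y_test
# the corresponding binary outcome vectors.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))

# Odds ratio for each predictor = exp(coefficient)
odds_ratios = np.exp(model.coef_[0])
```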

Analysis

For this analysis, we required data on EMR availability and use, as well as on patient characteristics that may influence both, such as demographics, health status, medical history, and internet/electronic device (e-device) access.

Most publicly available datasets in this domain are curated by the US government. The source used for this study was the Health Information National Trends Survey (HINTS), an annual survey conducted by the National Cancer Institute with the aim of:

(collecting) nationally representative data about the American public’s knowledge of, attitudes toward, and use of cancer- and health-related information. HINTS data are used to monitor changes in the rapidly evolving fields of health communication and health information technology and to create more effective health communication strategies across different populations.

Data Exploration & Visualization

The full exploratory analysis and the dataset derived from it can be found in the Jupyter notebook “DSND_Final_Explore.ipynb” in the GitHub repository.

Methods

Much of the HINTS data are cancer-specific; however, there are several demographic variables (e.g., age, race, gender, geographic area, income). There are also fields related to internet and e-device use, to health status and access to healthcare resources, and to the availability and use of EMRs. The survey is designed to be nationally representative.

Data are available for multiple years up to 2020. Each year’s survey is designated by a cycle number, and is conducted between January and April. The 2020 survey (Cycle 4) partially overlaps with the WHO declaration of the COVID-19 pandemic (March 11, 2020), and has a field denoting whether a response was received before or after this date.

To assess whether adoption of EMRs is evolving over time and whether there was a discernible change in the (admittedly limited) post-pandemic time period, the two prior years’ data (Cycles 2 and 3) were also retrieved.

The datasets contain from 438 to 731 columns (derived from survey questions) and from 3504 to 5438 rows (each a unique survey response). Cycle 3 contains more responses because it included an experiment wherein additional respondents were given the option of completing the survey on the Web, rather than on paper. As discussed in HINTS’ “Web Pilot Results Report” (available in the GitHub repository), those randomized to the Web survey differed significantly from those who completed it on paper on several demographics (gender, age, health status, education). For that reason, the web-response data for Cycle 3 were dropped. This left 4573 responses.

The survey responses have been pre-screened by HINTS staff before being compiled in electronic form, and only those at least 50% complete are included. In addition to codes for each allowed response, each field can also contain a code describing the reason the data are missing (e.g., inappropriate response, question answered in error, answer omitted). The codes are:

  • -1: “Valid” missing. The field should not be filled in because a preceding field has been marked with an entry that makes this field not applicable for this respondent.
  • -2: Inappropriately filled in. The field should be empty based on a previous response, but the respondent gave an answer.
  • -4: Illegible or non-conforming. The response couldn’t be read, or was extremely out of the expected range for the questions (e.g. 11 feet for height, age > 105 years).
  • -5: More responses selected than appropriate for the question.
  • -6: Missing values in follow-ups to a missing “filter” question. The superseding question that should trigger the respondent to answer this question was not answered, nor was this question.
  • -9: Missing/not ascertained. The question should have been answered but wasn’t.

These missing-data codes were handled on a field-specific basis, as described below.
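As a sketch of the general pattern (the helper function here is hypothetical, not part of the HINTS tooling):

```python
import pandas as pd

# HINTS missing-data codes, per the list above
MISSING_CODES = [-1, -2, -4, -5, -6, -9]

def drop_missing(df: pd.DataFrame, column: str, codes=MISSING_CODES) -> pd.DataFrame:
    """Drop rows whose value in `column` is one of the given missing-data codes."""
    return df[~df[column].isin(codes)]

# Example: drop surveys that omitted the EMR access question (code -9 only)
# df = drop_missing(df, "offeredaccesseither", codes=[-9])
```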

While most of the survey questions are common across cycles, not all are. The HINTS datasets include a codebook which describes each survey question and lists the possible responses and their frequencies. Using the codebooks, the available variables were pre-screened for relevance and amount of data available. Additionally, variables whose association with access to and use of an EMR could not be distinguished as cause- versus effect-related were dropped. Relevant variables were then reconciled and those common to all three cycles (or which could be reconfigured to match across cycles) were kept. The codebooks and methodology reports are available in the GitHub repository.

Results

Preliminary screening using the HINTS codebooks identified 59 variables that were deemed relevant and present in all three years’ data. A list of these variables can be found in the repository “data” folder, in the file “HINTS-variables.ods”. The merged dataset contained 11942 records.

The merged data were split 70%/30% into training and test sets. Exploratory visualizations and univariate statistical analyses were then conducted on the training set to identify potential features for a multivariate machine learning model. Univariate analysis was conducted using a Kruskal-Wallis test for continuous variables and a chi-squared test for an n by m contingency table for categorical variables. Variables with a two-sided p-value < 0.05 were considered for inclusion in the multivariate model.
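A minimal sketch of this workflow, assuming df holds the merged records and using illustrative column names:

```python
import pandas as pd
from scipy.stats import chi2_contingency, kruskal
from sklearn.model_selection import train_test_split

# 70%/30% split of the merged records
train, test = train_test_split(df, test_size=0.3, random_state=42)

# Kruskal-Wallis test for a continuous predictor vs. a categorical outcome
groups = [g["bmi"].values for _, g in train.groupby("accessonlinerecord")]
h_stat, p_kw = kruskal(*groups)

# Chi-squared test on an n x m contingency table for a categorical predictor
table = pd.crosstab(train["educa"], train["offeredaccesseither"])
chi2, p_chi2, dof, expected = chi2_contingency(table)
```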

Univariate visualizations

Due to the large number of variables, representative univariate plots are shown. The remainder can be viewed in the notebook “DSND_Final_Explore.ipynb”.

Distribution of representative demographic predictors: Age, gender, race, education:

Figure 1. Distribution of respondent age
Figure 2. Distribution of respondent gender
Figure 3. Distribution of respondent race
Figure 4. Distribution of respondent education

Age follows a relatively normal distribution, with the peak at 50–64. Most respondents identify as white and female, and the most common level of education is a college degree.

Outcome variables:

Figure 5. Distribution of EMR access outcome variable
Figure 6. Distribution of EMR use outcome variable

Most respondents have been offered EMR access by an HCP or insurer. However, most have not used an EMR in the past 12 months. Among those who have, the most common frequency is 1–2 times.

One variable, “phq4”, was modified based on this analysis. This variable represents the PHQ-4 psychological distress score, which ranges from 0 to 12. Its distribution on this scale is shown in Figure 7.

Figure 7. Distribution of PHQ-4 score based on all possible values

Most of the categories are sparse. Also, it is typically scored based on ranges, as follows [4]:

  • 0–2 points: no distress
  • 3–5 points: mild distress
  • 6–8 points: moderate distress
  • 9–12 points: severe distress

This variable was reconfigured with four categories, representing the ranges above. The revised variable is shown in Figure 8. This is still sparse, but slightly less so and more medically relevant.

Figure 8. Distribution of PHQ-4 category
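One way to implement this recoding with pandas (a sketch; the bin edges are offset by 0.5 so each integer score falls cleanly into its range):

```python
import pandas as pd

# Bin the 0-12 PHQ-4 score into the four standard severity ranges
bins = [-0.5, 2.5, 5.5, 8.5, 12.5]
labels = ["none", "mild", "moderate", "severe"]
df["phq4_cat"] = pd.cut(df["phq4"], bins=bins, labels=labels)
```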

Multivariate visualizations and univariate statistics

Missing entries in the outcome variables

Before examining the individual relationships between the potential predictors and the outcome variables, a strategy for handling the omitted responses (missing data code -9) in the outcome variables needs to be determined. The preliminary suspicion was that lack of a response may indicate disinterest or disengagement.

For the EMR access outcome, characteristics of patients answering “don’t know” and those skipping the question may be similar. If so, the missing entries can be combined with the “don’t know” entries. If they aren’t similar, the missing entries need to be retained as a separate category, or dropped. To assess similarity, key demographic characteristics were compared between those answering “don’t know” and those omitting an answer.

The demographic variables analyzed were:

  • “stratum” (minority status of census tract)
  • “highspanli” (prevalence of less proficient English speakers)
  • “useinternet” (internet usage)
  • “healthinsurance” (any form of insurance)
  • “selfgender” (gender)
  • “agegrpb” (age groups)
  • “educa” (education-level groups)
  • “raceethn5” (race/ethnicity groups)
  • “hhinc” (household income groups)

A chi-squared contingency table analysis was used to assess the relationship between each demographic variable and the outcome variables. The null hypothesis is that respondents who selected “don’t know” (code 3) for the outcome variables have the same demographic characteristics as respondents who failed to answer (code -9). The alternative hypothesis is that demographic characteristics are different between these groups. The null hypothesis will be rejected for tests with a p-value < 0.05.

Results of the analysis revealed p-values < 0.05 for all variables except “stratum” (p = 0.09) and “highspanli” (p = 0.80). This indicates a lack of demographic similarity between the “don’t know” and omitted-answer respondents. Based on these criteria, the two responses can’t be combined. The differences appear to be almost completely driven by a higher frequency of omitted answers (code -9) to the demographic questions among those who also omitted an answer to the EMR access (“offeredaccesseither”) outcome question. This could indicate a general lack of interest in the survey, or general concerns about providing information. Because these surveys are likely to have a large number of fields with missing data, requiring further assumptions to handle, the missing (-9) entries for this variable were dropped.

For the EMR use variable, categories relate to frequency of use and there isn’t a “don’t know” category. A missing answer could indicate that the respondent doesn’t use an EMR, doesn’t remember whether they have used it, or doesn’t want to respond. Thus, it would be difficult to determine which category to merge the missing-answer category with. Also, since the categories denote increasing frequency, keeping the missing code as a separate category creates a disruption in that order. For these reasons, and since the missing data are relatively infrequent, surveys with missing responses to this question will also be dropped.

Dropping these entries reduces the dataset to 11578 records.

Univariate relationship of continuous predictors to outcomes

There are three continuous variables: body mass index (BMI), average minutes of weekly exercise, and average weekly alcoholic drinks. Graphical exploration was done using box & whisker plots. Since these variables displayed skewed univariate distributions (see full results in “DSND_Final_Explore.ipynb”), the relationship to outcome variables was tested with the nonparametric Kruskal-Wallis test.

Figures 9 and 10 show the box and whisker plots for each continuous variable against each outcome variable.

Figure 9. Relationship between continuous variables and EMR access variable
Figure 9. Relationship between continuous variables and EMR access variable
Figure 10. Relationship between continuous variables and EMR use variable
Figure 10. Relationship between continuous variables and EMR use variable

All relationships were significant at a p < 0.05 level; in fact, all except BMI vs. EMR use (p = 0.043) had p-values < 0.001. However, as the plots show, all three predictor variables have many outliers.

These fields should exclude obviously non-conforming data, as those would have been flagged with code -4, per the methodology.

The values for BMI, for instance, cover an extreme range but are physiologically possible. The values for average exercise are also not implausible (i.e., not more than the number of minutes in a week). The values for average drinks also run to large values, but not impossible, e.g. 120 drinks/week would be ~17/day.

Thus, it’s possible these values simply represent extreme behavior or physiology. For this reason, outliers were not discarded. Instead, their influence was diminished by transforming these variables into categoricals.

Before binning, the missing entries (all negative codes) were deleted, as these can’t be placed into quantiles. This left 9803 entries; most of the loss was due to 940 omitted responses to the “average weekly drinks” question.

BMI was divided into quartiles, with cutoffs as follows:

  • 1st quartile: 21.6 kg/m²
  • 2nd quartile: 25.6 kg/m²
  • 3rd quartile: 29.4 kg/m²
  • 4th quartile: 37.8 kg/m²
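A sketch of the quartile binning with pandas; pd.qcut derives the cutoffs from the data itself:

```python
import pandas as pd

# Split BMI into quartile-based categories and return the empirical edges
df["bmi_cat"], bmi_edges = pd.qcut(df["bmi"], q=4,
                                   labels=[1, 2, 3, 4], retbins=True)
print(bmi_edges)  # the data-derived quartile boundaries
```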

The other two variables have large numbers of zero-valued entries. Quantiles do not work well for these, since the bin edges are non-unique. For these variables, physiologic cutoffs were used instead.

For “weeklyminutesmoderateexercise” (minutes/week of moderate exercise), the Centers for Disease Control (CDC) recommendation for adults is at least 150 minutes/week [5]. Categories based on this recommendation are:

  • 0 minutes/week (the most frequently observed value)
  • > 0 to < 50% recommended (> 0 to < 75 mins)
  • ≥ 50% to < 100% recommended (75 to < 150 mins)
  • ≥ 100% to < 150% recommended (150–224 mins)
  • ≥ 150% recommended (≥ 225 mins)

For “avgdrinksperweek” (average number of weekly alcoholic drinks), CDC recommendations were again used. Those recommendations define heavy drinking as ≥ 8 drinks/week for women and ≥ 15 drinks/week for men [6]. For respondents who didn’t specify a gender, the mean of the two thresholds (11.5 drinks/week) was used. Cutoffs similar to those used for exercise were chosen:

  • 0 drinks/week (the most frequent category)
  • > 0 to < 50% heavy drinking (M: 1–7; F: 1–3; not specified: 1–5 drinks)
  • ≥ 50% to < 100% heavy drinking (M: 8–14; F: 4–7; not specified: 6–11 drinks)
  • ≥ 100% to < 150% heavy drinking (M: 15–22; F: 8–12; not specified: 12–17 drinks)
  • ≥ 150% heavy drinking (M: ≥ 23; F: ≥ 13; not specified: ≥ 18 drinks)
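A sketch of this recoding, assuming gender codes 1 = male and 2 = female; the category numbering and exact boundary handling are illustrative:

```python
import pandas as pd

def drinks_category(drinks: float, heavy_threshold: float) -> int:
    """Map average weekly drinks to the five categories defined above."""
    if drinks == 0:
        return 0                      # no drinks
    frac = drinks / heavy_threshold
    if frac < 0.5:
        return 1                      # > 0 to < 50% of heavy drinking
    if frac < 1.0:
        return 2                      # >= 50% to < 100%
    if frac < 1.5:
        return 3                      # >= 100% to < 150%
    return 4                          # >= 150%

# Heavy-drinking thresholds: 15 drinks/week (men), 8 (women),
# 11.5 (gender not specified)
thresholds = df["selfgender"].map({1: 15.0, 2: 8.0}).fillna(11.5)
df["avgdrinks_cat"] = [drinks_category(d, t)
                       for d, t in zip(df["avgdrinksperweek"], thresholds)]
```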

Univariate relationship of categorical predictors to outcomes

Graphical relationships were explored with bar plots. Statistical analysis was performed using chi-squared contingency table methods.

The initial analysis included the -9 “omitted answer” code.

For both outcome variables, the only predictor with a non-significant relationship (chi-squared p-value > 0.05) was “eciguse”, denoting active electronic cigarette consumption. This variable was dropped. Representative plots, including for “eciguse”, are shown below. The entire set is available in the “DSND_Final_Explore.ipynb” notebook.

Figure 11. Relationship between age and outcome variables. Chi-squared p-values < 0.001 for both outcomes.
Figure 12. Relationship between gender and outcome variables. Chi-squared p-values < 0.001 for both outcomes.
Figure 13. Relationship between race and outcome variables. Chi-squared p-values < 0.001 for both outcomes.
Figure 14. Relationship between education and outcome variables. Chi-squared p-values < 0.001 for both outcomes.
Figure 15. Relationship between e-cigarette use and outcome variables. Chi-squared p-value 0.214 for EMR access and 0.882 for EMR use.

Generally, the -9 code was rare, except in gender, household income, and race. For these variables, the -9 entries were retained as separate categories, since they may indicate non-identification with the given choices (for race & gender), or unwillingness to disclose personal information.

For the remaining variables, an assessment of this code’s influence was performed by examining its frequency relative to other response categories and by re-running the analysis with it removed. This revealed that the significant p-values of some categories were likely due to the presence of the -9 flag. In some fields, this was because other responses were rare. In others it was due to small frequencies in the -9 field leading to relatively large discrepancies from the expected values. To avoid having these non-answers influence the final multivariate model, rows containing them were deleted.

With the missing-data codes removed, the dataset shrank to 7818 entries (5490 in the training and 2328 in the test set). On this training set, the following predictors no longer have a significant relationship to the outcome (chi-squared p > 0.05):

EMR Access Outcome:

  • “healthins_tricare” (Tricare or other military insurance)
  • “healthins_va” (VA insurance)
  • “healthins_ihs” (Indian Health Service benefits only)
  • “healthins_other” (other insurance not specified above)
  • “medconditions_highbp” (ever diagnosed with hypertension)
  • “medconditions_heartcondition” (ever diagnosed with heart condition)
  • “medconditions_lungdisease” (ever diagnosed with lung disease)

EMR Use Outcome:

  • “healthins_tricare” (Tricare or other military insurance)
  • “healthins_va” (VA insurance)
  • “healthins_ihs” (Indian Health Service benefits only)
  • “medconditions_lungdisease” (ever diagnosed with lung disease)

The last four predictors, which are non-significant for both outcomes, were removed.

Additionally, there were several variables related to modality and location of internet access wherein respondents could make multiple choices (e.g., access via cell, wifi, broadband, dial-up; using internet at home, at work, in public places). There were also multiple choices available for type of health insurance. To avoid redundancy and the potential for overfitting by using multiple related variables, these were examined more closely and some consolidations were performed.

For internet access modality, there was a large amount of overlap, with some respondents even reporting using both broadband and dial-up. All questions are yes/no, so there’s nothing to indicate which modality is used most frequently by those who selected multiple options. Given this, only the broadband category (along with the base internet access vs. no access variable) was retained. Those who chose this option are acknowledging having home-based, relatively high-speed internet access, and are computer-literate enough to know this is what they have. Thus, this category is likely to differentiate between somewhat more and less internet-savvy respondents.

For internet access location, there was again heavy overlap between categories, even for those locations where respondents report daily use. However, respondents who primarily use public places may have fewer other opportunities to access the internet, which may limit their ability to use EMRs. For this reason, one category of access location, “whruseinet_pubvother”, denoting “daily” or “sometimes” public internet use compared to “never” or “N/A”, was created, and the other categories were removed.

Finally, for insurance, several possible responses were already eliminated due to non-significant relationships to the outcomes (see above). Additionally, “healthins_other” was deleted due to its non-significant relationship to EMR use, and because it had more missing responses than “Yes” responses. This imbalance would likely make it an ineffective predictor.

Insurance categories showed some overlap, since a patient may have primary and secondary insurance. Relatively few respondents have only Medicaid, which is considered “safety net” insurance for the poor or disabled. However, it is of interest to compare those with publicly-funded insurance (Medicare or Medicaid) to those with private insurance. These and the remaining insurance categories were consolidated to one variable “healthins_pubpriv”, denoting private/employer-provided insurance without Medicare/Medicaid, vs. Medicare/Medicaid without private or employer-issued insurance, vs. none or other.
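A sketch of this consolidation with NumPy; the individual yes/no insurance column names here are placeholders, not the actual HINTS field names:

```python
import numpy as np

# Placeholder yes/no columns, assumed coded 1 = yes
private = (df["healthins_private"] == 1) | (df["healthins_employer"] == 1)
public = (df["healthins_medicare"] == 1) | (df["healthins_medicaid"] == 1)

df["healthins_pubpriv"] = np.select(
    [private & ~public,    # 1 = private/employer without Medicare/Medicaid
     public & ~private],   # 2 = Medicare/Medicaid without private/employer
    [1, 2],
    default=3,             # 3 = none or other
)
```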

After these maneuvers, the dataset contained 41 total predictor variables (38 with a relationship to EMR access; 41 to EMR use). This dataset was used for the development of the multivariate logistic regression models.

Methodology

Data Preprocessing

The data were read into pandas dataframes. The HINTS data files are available in SAS, SPSS, and STATA formats. Pandas’ SAS import doesn’t allow limiting the set of columns to be read in, while the SPSS file for Cycle 2 seemed to be corrupted and wouldn’t load. For those reasons, STATA files were used.

Fifty-five of the 59 variables are categorical. They were converted from floating-point to integer.
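A minimal sketch of the import step (the file name and column list are illustrative):

```python
import pandas as pd

# pd.read_stata, unlike pd.read_sas, accepts a `columns` argument,
# so only the pre-screened variables need to be loaded
keep = ["accessonlinerecord", "agegrpb", "educa", "selfgender"]  # etc.
df4 = pd.read_stata("hints5_cycle4_public.dta", columns=keep,
                    convert_categoricals=False)

# Categorical fields arrive as floats; cast them to integer codes
df4[keep] = df4[keep].astype(int)
```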

As mentioned above, Cycle 3 included an experiment wherein a subset of respondents could complete the survey on the web. These responses were dropped, as discussed previously, since the demographics of those respondents differed significantly from those of paper-based respondents.

From the fields in these datasets, two outcome variables were created:

  1. EMR Availability (variable name “offeredaccesseither”): Whether a patient was offered EMR access by either their HCP or their insurer (coded as “Yes”, “No”, or “Don’t Know”). This was a single variable for Cycle 2, but had to be created by merging two variables (“offeredaccesshcp2”: was respondent offered access by an HCP?; and “offeredaccessinsurer2”: was respondent offered access by an insurer?) for Cycles 3 and 4.
  2. EMR Use (variable name “accessonlinerecord”): How often in the past 12 months a patient accessed their EMR (divided into 5 categories ranging from “None” to “≥ 10”). This variable was available in the same form in all three datasets.

Both variables are multi-class, with some sparse response categories. Preliminary analyses were conducted with them left as-is, with the plan to binarize them if the multi-class classifications were poor.

Additionally, a variable named “survey_cycle” was added to account for the effect of time and the post-pandemic period. This was coded with 2 and 3 representing those cycles, 4 representing Cycle 4 pre-pandemic, and 5 representing Cycle 4 post-pandemic.

The datasets for the three cycles were then merged and the frequency of missing data was assessed. Codes -1 and -2 were ignored for this analysis since they represent questions the respondent shouldn’t have answered. Most variables were missing ≤ 2% of data; only “avgdrinksperweek” exceeded 10%, at 12.1%. Based on this, no variable was deleted due to excessive missingness.

Missing data were generally handled by deletion. The exceptions are discussed above.

Implementation

The project was implemented in Python. Data processing, cleaning, and preliminary analysis were performed in the Jupyter notebook “DSND_Final_Explore.ipynb”.

Kruskal-Wallis and chi-squared contingency table analyses were performed using the SciPy implementations of these tests.

Multivariate logistic regression models were created using scikit-learn’s LogisticRegression classifier. Balanced class weights were used to account for imbalances across outcome response categories. Categorical variables were one-hot encoded and a reference category was held out from each model. Each model was fit to the training dataset, and evaluated on the test set.
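A minimal sketch of the encoding and model fit; `predictors` and the binary outcome name are placeholders:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encode the categorical predictors, dropping the first level of
# each variable as the held-out reference category
X_train = pd.get_dummies(train[predictors], columns=predictors, drop_first=True)
X_test = pd.get_dummies(test[predictors], columns=predictors, drop_first=True)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

y_train = train["emr_access_binary"]  # hypothetical binarized outcome
y_test = test["emr_access_binary"]

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```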

Multi-class models using all of each outcome variable’s possible responses were attempted first. If these proved inadequate due to class imbalance, the responses were binarized and the models re-fit. The machine-learning analysis was performed in the notebook “DSND_Final_Analysis.ipynb”.

Feature selection and reduction were implemented with scikit-learn’s Recursive Feature Elimination algorithms (RFE and RFECV). Model tuning was performed using scikit-learn’s grid search implementation (GridSearchCV).
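As a sketch, RFECV can be nested inside GridSearchCV so that a single search covers both the logistic regression’s regularization parameter C and RFECV’s own step size (the grid values here mirror those described under Refinement):

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# RFECV prunes features, scored by F1; the grid tunes C and the step size
selector = RFECV(LogisticRegression(class_weight="balanced", max_iter=1000),
                 scoring="f1", cv=5)
grid = GridSearchCV(selector,
                    param_grid={"estimator__C": [0.01, 0.1, 1, 10, 100],
                                "step": [1, 3, 5]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_estimator_.n_features_)
```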

Finally, precision, recall, and the confusion matrix were calculated using scikit-learn’s classification metrics implementations for each.

Refinement

Starting with all the features identified in preliminary screening, feature selection and reduction were accomplished using Recursive Feature Elimination (RFE). First, automated selection was performed using RFE with Cross-Validation (RFECV), with model assessment via the F1 score.

Model tuning was attempted using a grid search method with cross-validation. Parameters used in the grid search were the logistic regression regularization parameter C, balanced vs. no class weighting, and the feature-reduction step size used in RFECV. Subsequently, more parsimonious models were sought by using manual RFE with a range of prescribed numbers of features.

Overfitting was assessed by comparing precision and recall between the training and test sets, checking for large decreases in test-set performance. The optimal model for each outcome was the most parsimonious one which maximized precision and recall.

Intermediate and final model solutions are discussed below.

Results

The full process and code used for the machine-learning model development outlined below is contained in the notebook “DSND_Final_Analysis.ipynb”.

Model Evaluation & Validation

Model for EMR Access

One-hot encoding the 38 categorical variables and holding out a reference category for each yielded 106 potential predictive features. Fitting a multi-class logistic regression model with automatic RFECV pruning for all three outcomes (“Yes”, “No”, and “Don’t Know”) resulted in 64 features, with relatively poor precision (0.622) and recall (0.555) on the test set. Training-set values were similar (0.666 and 0.598), indicating no obvious overfitting.

This model was tuned using a grid search over the following parameters:

  • Logistic regression class weights: Balanced, None
  • Logistic regression regularization parameter C: 0.01, 0.1, 1, 10, 100
  • Number of features removed at each RFECV iteration: 1, 3, 5

Tuning resulted in a model with 81 features, and minimal improvements in precision (0.630) and recall (0.567). The optimal parameters were C = 0.01, balanced class weights, and removal of one feature per RFECV iteration. Again, overfitting was not evident (training-set precision 0.667, recall 0.602).

To potentially improve fit, the sparse “Don’t Know” category was merged with “No”, creating a binary outcome with better class balance. The binary model with default parameters and RFECV pruning had 52 features, and improved precision (0.702) and recall (0.696). Training-set values were 0.728 and 0.719, so overfitting was not suspected.

Grid-search tuning with the same parameter space described above yielded a 93-feature model with essentially the same precision (0.706) and recall (0.700). Training set values were similar (0.724 and 0.714). Optimal parameters were C = 0.01, balanced class weights, and elimination of one feature per RFECV iteration.

Due to the increase in features and minimal improvement in fit, the grid-search-tuned model was discarded, and the initial model was taken as the starting point for manual feature reduction with RFE. From this 52-feature model, reduced models selected by RFE with between 5 and 50 features (in 5-feature increments) were fit. Figure 16 shows the results of manual RFE tuning. Best precision (0.705) and recall (0.699) were obtained at 30 features, and this model was selected as the final one. Training-set precision (0.723) and recall (0.715) were again similar, indicating no obvious overfitting.

Figure 16. Precision and recall as function of number of features in logistic regression model for EMR access (as a binary outcome).
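The manual sweep can be sketched as a simple loop over the prescribed feature counts:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Fit reduced models with 5, 10, ..., 50 features and compare test-set scores
for n in range(5, 55, 5):
    rfe = RFE(LogisticRegression(class_weight="balanced", max_iter=1000),
              n_features_to_select=n)
    rfe.fit(X_train, y_train)
    y_pred = rfe.predict(X_test)
    print(n, precision_score(y_test, y_pred), recall_score(y_test, y_pred))
```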

Features included in the final model are given below.

Features associated with a higher likelihood of being offered EMR access:

Demographic & Temporal:

  • “educa_4” : College or higher education (vs. all other levels)
  • “selfgender_2” : Female (vs. male or no answer)
  • “survey_cycle_3” : 2019 (vs. 2018, 2020 pre- & post-pandemic)
  • “survey_cycle_4” : 2020 pre-pandemic (vs. 2018, 2019, 2020 post-pandemic)
  • “survey_cycle_5” : 2020 post-pandemic (vs. 2018, 2019, 2020 pre-pandemic)
  • “agegrpb_4” : Age 65–74 (vs. all other age strata; highest is ≥ 75)

Health-Related:

  • “regularprovider” : Have regular HCP (vs. do not)
  • “healthinsurance” : Have some form of health insurance (vs. do not)
  • “everhadcancer” : Ever diagnosed with cancer (vs. never)
  • “qualitycare_1” : Rate quality of HCP’s care “excellent” (vs. don’t go, very good, good, fair, poor)
  • “qualitycare_2” : Rate quality of HCP’s care “very good” (vs. don’t go, excellent, good, fair, poor)
  • “freqgoprovider_2” : See HCP 2 times yearly (vs. 0, 1, 3, 4, 5–9, and ≥ 10)
  • “freqgoprovider_3” : See HCP 3 times yearly (vs. 0, 1, 2, 4, 5–9, and ≥ 10)
  • “freqgoprovider_4” : See HCP 4 times yearly (vs. 0, 1, 2, 3, 5–9, and ≥ 10)
  • “freqgoprovider_5” : See HCP 5–9 times yearly (vs. 0, 1, 2, 3, 4, and ≥ 10)
  • “freqgoprovider_6” : See HCP ≥ 10 times yearly (vs. 0, 1, 2, 3, 4, and 5–9)

Electronic Device & Internet-Related:

  • “useinternet” : Use internet for web browsing/email (vs. do not)
  • “electronic_selfhealthinfo” : Have used electronic means to search for health-related info in last 12 mos (vs. haven’t)
  • “whruseinet_pubvother_1” : Use internet in public place (e.g. library) “often” or “sometimes” (vs. never or don’t use internet)
  • “whruseinet_pubvother_2” : Do not use internet in public place (e.g. library) (vs. often/sometimes or don’t use internet)
  • “tablethealthwellnessapps_1” : Have health/wellness apps on a tablet (vs. no or don’t own tablet)
  • “tablet_discussionshcp_1” : Use tablet as aid for discussion with HCP (vs. no or don’t own tablet)

Features associated with a lower likelihood of being offered EMR access:

Demographic & Temporal:

  • “highspanli” : Linguistically isolated (high prevalence less proficient English speakers)
  • “raceethn5_4” : Non-Hispanic Asian (vs. all other racial groupings)
  • “censdiv_6” : East South Central census division (KY, TN, MS, AL; vs. all other divisions)
  • “hhinc_1” : Household income in lowest category (< $20k/yr; vs. all higher categories & not reported)
  • “maritalstatus_6” : Single (vs. all other categories)

Health-Related:

  • “healthins_pubpriv_2” : Public insurance (Medicare/Medicaid) without employer-provided insurance (vs. private/employer-provided or other/none)
  • “avgdrinks_cat_5” : ≥ 150% of number of drinks CDC classifies as heavy drinking (M ≥ 23, F ≥ 13; other: ≥ 18; this is highest category; vs. all lower categories)
  • “ownabilitytakecarehealth_5” : “Not at all” confident in own ability to take care of health (vs. completely, very, somewhat, or a little confident)

The strength of each variable, as measured by its odds ratio, is shown in Figure 17. In the figure, the dividing line at 1.0 demarcates features associated with increased likelihood of EMR access (in green; odds ratio > 1.0) from those associated with decreased likelihood (in red; odds ratio < 1.0).

Figure 17. Odds ratios for features associated with EMR access (as a binary variable).

Having insurance of any type is most strongly associated with being offered EMR access, followed by care rated “excellent”. Female gender, higher educational attainment, and older age are also heavily weighted. The only chronic condition with a significant effect is a history of cancer, although more frequent visits to an HCP are associated with higher likelihood of being offered access. Using the internet, as well as using it and e-devices for health-related purposes, are also predictors. Finally, the 2019–2020 (versus 2018) survey cycles are associated with increased EMR access, with the highest weight for 2020 pre-pandemic, followed by 2020 post-pandemic, then 2019, indicating a time effect, although perhaps not a linear one.

By contrast, being “not at all confident” in one’s ability to take care of one’s health is most associated with not being offered EMR access, followed by Non-Hispanic Asian racial identity. Being in the lowest income stratum, being single, having only Medicare and/or Medicaid, and residing in the East South Central census division or in a linguistically isolated area are also associated with reduced access. No chronic condition appears, but very heavy drinking also predicts reduced access.

Figure 18 depicts the 10 features with the biggest difference in prevalence between predicted probabilities of ≥ 80% and ≤ 20% of being offered access to an EMR. Red bars indicate features more prevalent in patients with predicted probability ≤ 20% of being offered EMR access, while green bars indicate those more prevalent in patients with ≥ 80% predicted probability.

Figure 18. Features with largest difference in prevalence between patients with ≤ 20% and ≥ 80% probability of being offered EMR access.

The only variable of the top 10 which is more prevalent in the low-probability group is being in the lowest household income stratum (< $20,000/yr).

In the high-probability group, patients are more likely to be female and have at least a college degree. Medically, they are more likely to have a regular HCP and to give the highest rating (excellent) to the quality of the HCP’s care. The rest of the variables relate to internet access and use: they are more likely to use the internet, but less likely to do so via public access (e.g. a library). They are more likely to use the internet and a device like a tablet to look for health information, monitor their health, and have discussions with their HCP.
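A sketch of how the two probability groups can be formed and compared, continuing from the fitted clf above (for one-hot columns, the column mean is the feature’s prevalence):

```python
# Predicted probability of the positive class for each test-set record
proba = clf.predict_proba(X_test)[:, 1]

high = X_test[proba >= 0.8]   # predicted probability >= 80%
low = X_test[proba <= 0.2]    # predicted probability <= 20%

# Rank features by the prevalence difference between the two groups
diff = (high.mean() - low.mean()).sort_values()
print(diff.tail(10))   # most over-represented in the high-probability group
```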

Model for EMR Use

Expanding the 41 categorical variables with one-hot encoding and leaving out a reference category resulted in 109 potential features for predicting the likelihood of having used an EMR in the past 12 months. The categories are “None” (which includes those who don’t have access to an EMR), 1–2 times, 3–5 times, 6–9 times, and ≥ 10 times.

Again, a multi-class logistic regression model was used to predict all five possible outcomes. The initial RFECV + logistic regression model for this outcome had 82 features and fair precision (0.607) with poor recall (0.471). Training-set values were similar (0.627 and 0.512).

Grid-search tuning was again performed. Because the classes for this outcome are clearly imbalanced, class weights = None was not attempted. The grid was:

  • Logistic regression regularization parameter C: 1x10^-5, 1x10^-4, 1x10^-3, 0.01, 0.1, 1, 10, 100
  • Number of features removed at each RFECV iteration: 1, 3, 5

This grid search gave a 99-parameter model with decreased precision (0.551) and slightly improved, but still poor, recall (0.517). Optimal parameters were C = 1x10^-5 and one feature removed per RFECV iteration. Overfitting was not evident: training set precision was 0.554 and recall 0.521.

As with the EMR access outcome, the culprit was believed to be the less-frequent categories causing imbalanced predictions. A binary outcome variable comparing “None” to “Any” use of the EMR was therefore created.

The RFECV + logistic regression model for this outcome showed greatly improved precision (0.741) and recall (0.724) using 62 features. Training-set values (0.752 and 0.740) did not indicate overfitting.

Grid-search tuning using the same parameter grid as above decreased the feature space to 54 but with lower precision (0.709) and recall (0.689). Optimal parameters were the same as for the multi-class model. Training-set performance (precision 0.714, recall 0.703) did not indicate overfitting.

Based on these results, the 62-feature model was used as the starting point for manual RFE tuning. Models containing from 5 to 60 features (again in 5-feature increments) were created and their scores compared.

Figure 19 shows the results of manual RFE tuning. Optimal precision (0.742) and recall (0.724) were obtained with both 45 and 50 features; the model with 45 features was selected as most parsimonious. As before, minimal differences in precision (0.749) and recall (0.736) were observed for the training set, making overfitting unlikely.

Figure 19. Precision and recall as function of number of features in logistic regression model for EMR use (as a binary outcome).

The features selected for this model are listed below.

Features associated with a higher likelihood of EMR use:

Demographic & Temporal:

  • “educa_2” : High school education (vs. all other levels; lowest/reference is < high school)
  • “educa_3” : Some college education (vs. all other levels)
  • “educa_4” : College or higher education (vs. all other levels)
  • “selfgender_2” : Female (vs. male or no answer)
  • “censdiv_9” : Pacific census division (CA, OR, WA, AK, HI; vs. all other divisions)
  • “survey_cycle_3” : 2019 (vs. 2018, 2020 pre- & post-pandemic)
  • “survey_cycle_4” : 2020 pre-pandemic (vs. 2018, 2019, 2020 post-pandemic)
  • “survey_cycle_5” : 2020 post-pandemic (vs. 2018, 2019, 2020 pre-pandemic)

Health-Related:

  • “regularprovider” : Have regular HCP (vs. do not)
  • “healthinsurance” : Have some form of health insurance (vs. do not)
  • “medconditions_diabetes” : Ever diagnosed with diabetes (vs. never)
  • “everhadcancer” : Ever diagnosed with cancer (vs. never)
  • “qualitycare_1” : Rate quality of HCP’s care “excellent” (vs. don’t go, very good, good, fair, poor)
  • “qualitycare_2” : Rate quality of HCP’s care “very good” (vs. don’t go, excellent, good, fair, poor)
  • “qualitycare_3” : Rate quality of HCP’s care “good” (vs. don’t go, excellent, very good, fair, poor)
  • “qualitycare_4” : Rate quality of HCP’s care “fair” (vs. don’t go, excellent, very good, good, poor)
  • “qualitycare_5” : Rate quality of HCP’s care “poor” (vs. don’t go, excellent, very good, good, fair)
  • “freqgoprovider_3” : See HCP 3 times yearly (vs. 0, 1, 2, 4, 5–9, and ≥ 10)
  • “freqgoprovider_4” : See HCP 4 times yearly (vs. 0, 1, 2, 3, 5–9, and ≥ 10)
  • “freqgoprovider_5” : See HCP 5–9 times yearly (vs. 0, 1, 2, 3, 4, and ≥ 10)
  • “freqgoprovider_6” : See HCP ≥ 10 times yearly (vs. 0, 1, 2, 3, 4, and 5–9)
  • “smokestat_2” : Former smoker (vs. current, never)
  • “smokestat_3” : Never smoker (vs. current, former)

Electronic Device & Internet-Related:

  • “useinternet” : Use internet for web browsing/email (vs. do not)
  • “electronic_selfhealthinfo” : Have used electronic means to search for health-related info in last 12 mos (vs. haven’t)
  • “intrsn_visitedsocnet” : Used internet to visit social network (vs. no or don’t browse)
  • “whruseinet_pubvother_1” : Use internet in public place (e.g. library) “often” or “sometimes” (vs. never or don’t use internet)
  • “whruseinet_pubvother_2” : Do not use internet in public place (e.g. library) (vs. often/sometimes or don’t use internet)
  • “tablethealthwellnessapps_1” : Have health/wellness apps on a tablet (vs. no or don’t own tablet)
  • “tablet_discussionshcp_1” : Use tablet as aid for discussion with HCP (vs. no or don’t own tablet)
  • “havedevice_cat_5” : Have multiple electronic devices (cell phone, regular phone, tablet; vs. none or one of these)
  • “internet_broadbnd_1” : Access the internet through a broadband connection (vs. don’t or no internet)

Features associated with a lower likelihood of EMR use:

Demographic & Temporal:

  • “highspanli” : Linguistically isolated (high prevalence less proficient English speakers)
  • “raceethn5_3” : Hispanic (vs. all other racial groupings)
  • “censdiv_2”: Middle Atlantic census division (NJ, NY, PA; vs. all other divisions)
  • “censdiv_6” : East South Central census division (KY, TN, MS, AL; vs. all other divisions)
  • “censdiv_8” : Mountain census division (AZ, CO, ID, NM, MT, UT, NV, WY; vs. all other divisions)
  • “nchsurcode2013_4” : Metropolitan: small metro urban vs rural classification (4th smallest of 6; vs. all other classifications)
  • “nchsurcode2013_5” : Non-metropolitan: micropolitan urban vs rural classification (5th smallest of 6; vs. all other classifications)
  • “hhinc_1” : Household income in lowest category (< $20k/yr; vs. all higher categories & not reported)
  • “hhinc_2” : Household income in second-lowest category ($20–34.99k/yr; vs. all other categories & not reported)
  • “maritalstatus_5” : Separated (vs. all other categories)

Health-Related:

  • “phq4_cat_4” : Severe psychological distress based on PHQ-4 score (vs. none, mild, or moderate)
  • “avgdrinks_cat_4” : ≥ 100% to < 150% of number of drinks CDC classifies as heavy drinking (M: 15–22, F: 8–12; missing: 12–17; this is the second-highest category; vs. other categories)
  • “ownabilitytakecarehealth_5” : “Not at all” confident in own ability to take care of health (vs. completely, very, somewhat, or a little confident)

Odds ratios for each variable are shown in Figure 20, where again green bars (odds ratio > 1.0) denote features associated with increased likelihood of EMR use and red bars (odds ratio < 1.0) denote decreased likelihood.

Figure 20. Odds ratios for features associated with EMR use (as a binary variable).

Here, rating care “excellent” is most associated with having used an EMR, followed by having attained a college degree or higher. The HCP-care rating categories are associated with EMR use to varying degrees (versus the default category of no rating). Being insured and female, having a regular HCP and higher HCP visit frequencies all appear again, as does cancer history. An additional chronic condition, diabetes, is a predictor as well, while being a current non-smoker is also included. The survey cycles appear in the same order as they do for EMR access. Lower levels of education are also present with less influence (compared to having not achieved a high school diploma). E-device and internet-related factors are similar to those related to EMR access. Finally, residing in the Pacific census division predicts higher likelihood of EMR use.

Again similar to the model for EMR access, residing in the East South Central census division and in a linguistically isolated area are most associated with not having used an EMR. Low household income and poor rating of one’s self-care ability appear, along with heavier drinking. Additional census divisions (Mountain and Middle Atlantic) that are not present in the EMR access model predict lower likelihood of use, as do residing in more rural areas, being separated, and having a PHQ-4 score consistent with severe psychological distress.

Figure 21 illustrates the 10 features with the biggest difference in prevalence between predicted probabilities of ≥ 80% and ≤ 20% of using an EMR. Red bars again indicate higher prevalence in those with a predicted probability ≤ 20% and green bars those with ≥ 80% predicted probability of EMR use.

Figure 21. Features with largest difference in prevalence between patients with ≤ 20% and ≥ 80% probability of EMR use.

All of the top 10 features are more prevalent in the high-probability group. Unlike the model for EMR access, there is no gender difference. Similarly to that model, these patients are more likely to have at least a college degree, and they are again more likely to have a regular HCP. The remaining variables relate to internet access and use: these patients are more likely to use the internet and to have broadband internet access, and less likely to access the internet via public resources (e.g. a library). They are more likely to use the internet to access social networking sites, and they tend to have multiple portable electronic devices, to look for health information on the web, to monitor their health with tablet apps, and to use a tablet in discussions with their HCP.

Justification

The machine-learning models were able to identify features associated with EMR access and use from publicly available US government survey data.

Logistic regression was selected over other machine-learning models because the goal of the study was to elucidate the influence of each predictor variable. Logistic regression provides easily interpretable coefficients and odds ratios for each variable, as well as a final model that can be easily deployed, e.g. in a spreadsheet.

The initial multi-class models for both outcomes demonstrated poor precision and recall, not improved by tuning with a parameter grid search. This was due to class imbalance, with some outcome categories having few responses, resulting in poor fitting of these categories. The multiple outcomes were therefore consolidated to “Yes”/”No” binary variables, reducing class imbalance.

Dichotomizing both outcomes greatly improved precision and recall. While this results in some loss of granularity in the outcomes (e.g. the inability to differentiate characteristics of those who use an EMR more versus less frequently), the poor fit of the multi-class models renders them much less useful for making predictions.

Grid-search tuning did not improve the fit of the binary models. In this logistic regression setup, the regularization parameter C is the main tunable model parameter, and the default value of 1.0 provided as good or better fit than the lower values obtained via grid search. This is likely because the number of survey responses (5490 in the training set and 2328 in the test set) was sufficiently larger than the number of features (106 and 109), so that overfitting was not a major concern. The absence of overfitting was confirmed by the minimal observed differences in precision and recall between the training and test sets for all models.

Additionally, balanced class weights were used for the initial model, and performed better than no class weights. This likely indicates that balanced weights helped compensate for any outcome class imbalance.

Finally, the default number of features removed by RFECV was adjusted. Removing one feature per iteration fared better than removing three or five, which indicates the model tolerated smaller changes better than larger ones.

Due to the large number of predictive variables (106 for EMR access and 109 for EMR use), recursive feature elimination was used to prune the feature space. Automatic feature pruning using the F1 score was able to reduce the number of features for the binary models to 52 and 62 features for EMR access and use. However, by starting with these models and using manual RFE, the feature space could be reduced to 30 and 45 features, respectively, with no loss in precision and recall. These more parsimonious models were selected as the final ones for determination of features most associated with EMR access and use.

Survey respondents more likely to be offered access to an EMR by their HCP or insurer tended to be better-educated, female, and to more heavily use electronic resources for health information. These characteristics could indicate some bias toward offering EMR access to those most likely to use them. This possibility should be studied further to enhance equality of access to EMRs.

They also tended to visit their HCPs more frequently and to be of advanced age, possibly indicating more complex medical needs. In terms of chronic conditions, only having had cancer predicted higher likelihood of being offered access. Whether there is a need to more aggressively offer access to patients with other chronic conditions should be further explored.

Finally, there was a time effect, with access increasing with survey cycle. However, the post-pandemic 2020 variable had less influence than the pre-pandemic 2020 variable, so there was not an obvious bump in EMR availability in the (admittedly limited) post-pandemic period for which data are available.

Conversely, those less likely to be offered EMR access were in the lowest education and income tiers, more likely to live in the East South Central census division or in linguistically isolated areas, more likely to be single and on public health insurance, and more likely to be heavy drinkers with very low confidence in their ability to manage their healthcare. Most of these variables indicate patients generally less likely to have access to resources, including healthcare. These patients may have less interaction with their HCPs, or may be pre-judged by their HCPs to be poor candidates for EMR access. Linguistic isolation may indicate that they speak English poorly and cannot communicate well with HCPs. These patients could potentially benefit from educational initiatives, preferably in their native language, promoting the value of EMRs and offering instruction on their use. HCPs might also be unaware of unconscious biases regarding these patients, and could benefit from incentives to expand EMR access to underserved patients.

Similarly, those more likely to use an EMR also tended to be female, more educated, more frequent visitors to their HCPs, and to be more frequent users of electronic devices and consumers of health-related information. In terms of chronic conditions, both diabetes and cancer histories predicted increased EMR use.

The time effect was the same as for EMR access, with use increasing with each survey cycle, but with the post-pandemic period less influential than pre-pandemic 2020.

Again, it appears more electronically- and health-literate patients tend to be more likely to use EMRs. Appropriately, those with higher HCP visit frequencies and chronic conditions are also more likely to use EMRs. These patients may have complex medical histories and multiple HCPs, and should benefit from having their information be more portable in electronic form, as well as from having access to their data for discussions with HCPs or family members.

Also similar to those less likely to be offered EMR access, those predicted less likely to use an EMR tend to be in the lowest income category, to reside in the East South Central census division or in linguistically isolated areas, to rate their ability to manage their own care poorly, and to be heavier drinkers. These similarities may be due to the fact that those not offered EMR access will also not have used an EMR.

Additionally, these patients tend to come from more rural areas, to be in the second-lowest income bracket, to identify as Hispanic, and to meet PHQ-4 criteria for severe psychological distress. These indicators again identify a high-risk group of patients who may have less access to resources, particularly healthcare. Severe psychological distress and low confidence in health-related abilities may indicate those who need more support/assistance to manage their health affairs. As above, these patients may benefit from language-appropriate, targeted outreach and education to emphasize the availability and benefits of an EMR.

Conclusion

Reflection

In this analysis, three years of data from the National Cancer Institute’s HINTS survey were used to analyze patient attributes associated with access to and use of EMRs.

After preliminary screening and analysis, multivariate logistic regression models were created for the two outcome variables. Due to the sparsity of the multiple outcome categories in both variables, multi-class models displayed poor predictive performance, as measured by precision and recall. Dichotomizing the outcomes resulted in better class balance and improved precision and recall.

The final model predicting EMR access had 30 features, and yielded precision and recall of 0.705 and 0.699 on the test set.

Demographic features associated with increased predicted likelihood of EMR access were female gender, higher education, and moderately older age (65–74). There was also an effect of year (survey cycle), and the COVID-19 pandemic. Demographics associated with decreased likelihood of access were being single, low income, non-Hispanic Asian race, linguistic isolation, and residing in the East South Central census division.

Health-related features associated with increased predicted EMR access were having a regular HCP whose care was more highly rated, seeing that HCP two or more times yearly, having health insurance, and a history of cancer. Features in this category associated with less likelihood of access were public insurance (Medicare and/or Medicaid) alone, very heavy drinking, and low confidence in one’s ability to manage health affairs.

E-device and internet-related factors predicting increased likelihood of access were internet use, increased use of e-devices and the web for health-related purposes, visiting social-networking sites, and degree of use of public internet access. No variable in this category predicted reduced access.

The strength of each feature was examined via its odds ratio. The features most strongly associated with increased access were having health insurance and highly rating one’s care. Those most strongly associated with decreased likelihood of access were low confidence in one’s ability to manage health affairs, and non-Hispanic Asian race.
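
Continuing the same hypothetical snippet, the odds ratios can be read directly off the fitted model’s coefficients as exp(beta):

```python
import numpy as np
import pandas as pd

# exp(beta) is the odds ratio for each predictor: values above 1 indicate
# increased odds of the outcome, values below 1 decreased odds.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.sort_values(ascending=False).head(10))  # strongest positive associations
print(odds_ratios.sort_values().head(10))                 # strongest negative associations
```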

Finally, the features with the largest difference in prevalence between those with a predicted likelihood ≥ 80% versus ≤ 20% of being offered EMR access were assessed. Of the top ten, lower income was more prevalent in the low-likelihood group, while female gender, higher education level, having a regular HCP, a higher HCP rating, and several variables relating to increased internet/device use and use for health-related purposes were more prevalent in the high-likelihood group.
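
One way such a prevalence comparison can be implemented, again assuming the hypothetical one-hot encoded design matrix from the snippets above:

```python
# Predicted probability of the positive class for every respondent.
probs = model.predict_proba(X)[:, 1]

high = X[probs >= 0.80]  # predicted very likely to be offered EMR access
low = X[probs <= 0.20]   # predicted very unlikely

# With 0/1 one-hot features, a column's mean is that feature's prevalence,
# so the difference in means is the difference in prevalence between groups.
prevalence_diff = (high.mean() - low.mean()).abs().sort_values(ascending=False)
print(prevalence_diff.head(10))  # ten largest prevalence differences
```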

The final model predicting EMR use had 45 features, precision of 0.742, and recall of 0.724.

Demographic features associated with increased predicted likelihood of EMR use were female gender, any education ≥ high school, and living in the Pacific (CA, OR, WA, AK, HI) census division. There were again effects of year (survey cycle) and of the COVID-19 pandemic. Demographics associated with decreased likelihood of use were being separated, lower income, Hispanic ethnicity, linguistic isolation, residing in the Middle Atlantic, East South Central, or Mountain census divisions, and residing in a more rural area.

Health-related features associated with increased predicted EMR use were having a regular HCP, seeing an HCP ≥ 3 times yearly, any care rating (aside from none/don’t visit HCP), having health insurance, a history of diabetes or cancer, and being a current non-smoker. Features in this category associated with decreased likelihood of use were heavy drinking, severe psychological distress (highest PHQ-4 score), and low confidence in one’s ability to manage health affairs.

E-device and internet-related factors predicting increased likelihood of EMR use were internet use, broadband access, having multiple e-devices, increased use of those devices and the web for health-related purposes, and degree of use of public internet access. As with EMR access, no variable in this category predicted reduced use.

Examining odds ratios, an “excellent” HCP-care rating and ≥ college degree were most strongly associated with increased likelihood of EMR use. Those most strongly associated with decreased likelihood of use were residing in the East South Central census division and in a linguistically-isolated area.

Finally, prevalence differences between those with a predicted likelihood ≥ 80% versus ≤ 20% of having used an EMR were assessed. The top ten were all more common in those predicted more likely to use an EMR. With the exception of education ≥ college degree and having a regular HCP, all were related to increased internet and e-device access/use for both general and health-related purposes.

An aspect of this project that I found interesting was that such a large, feature-rich dataset was publicly available, and that these data are being collected on an annual basis. I had never heard of the HINTS survey before starting to search for appropriate datasets to use for this project. For future projects, I will definitely explore other government-curated data sources such as this. I am also curious about whether these data are used in health policy decisions. The HINTS page states that the purpose of collecting the data is:

Survey researchers are using the data to understand how adults 18 years and older use different communication channels, including the Internet, to obtain vital health information for themselves and their loved ones. Program planners are using the data to overcome barriers to health information usage across populations, and obtaining the data they need to create more effective communication strategies. Finally, social scientists are using the data to refine their theories of health communication in the information age and to offer new and better recommendations for reducing the burden of cancer throughout the population.

However, they don’t cite specific initiatives or policies that have been based on the survey results. It would be interesting to see more detail on that aspect. The most recent users’ meeting mentioned on the site took place in 2014. It would be unfortunate if these data were being collected merely for the sake of having them, rather than being used to further the aim of increasing healthcare access.

What I found most challenging for this project was (I suppose, the universal data scientist complaint) wrangling the data and pruning the features effectively. Even with the codebooks available, it was difficult to determine the relationships between some of the related fields. It was also surprising to me that so many of the features had a significant univariate relationship to the outcomes, which mostly negated my aim of trying to prune the variables prior to creating the multivariate models. My conclusion was that these are the difficulties of using a dataset one didn’t collect oneself, and which isn’t purpose-built for the question being asked.

Limitations

The dataset was limited to fields available from HINTS, which is not specifically designed as a survey of EMR use.

While the HINTS survey is designed to be representative of the US population, the fields of interest for this study had a fair number of missing entries, and deleting those records may have made the sample less representative.

While data were available on patients’ history of several chronic conditions, no data were available on the complexity of patients’ overall medical condition or their comorbidity burden. These factors may increase the need for multiple HCP visits as well as visits to multiple different HCPs, which would increase the need for access to a usable, portable EMR.

Similarly, the survey does not elucidate how many different HCPs a patient sees. It also does not have data on urgent-care, emergency, or inpatient medical visits. The need for many of these may indicate patients who need closer follow-up, who may access multiple sources of healthcare, and who might benefit from having access to their many healthcare records.

Finally, since the survey is conducted between January and April of each year, the post-pandemic data for 2020 cover only a limited time period.

Improvement

Further studies could assess in depth the barriers HCPs face to educating patients on and disseminating access to EMRs. Is there a lack of resources in HCPs’ offices to discuss and educate patients on EMRs, potentially creating a bias toward offering them primarily to “likely adopters” who need less assistance? Similarly, could there be a lack of interest in managing the patient-facing aspects of the EMR and its possibly increased burden on HCP staff?

Another area for additional study could include examining the reasons patients with EMR access do not use them. Questions related to this are part of HINTS, but that additional analysis was beyond the scope of this project. Perhaps some patients prefer direct in-person interaction when discussing complex health matters, or perhaps they feel the data in the EMR are not useful outside the context of their visit. There could also be privacy concerns about accessing such sensitive data electronically.

Additionally, it would be informative to assess patient and HCP perceptions of the pros and cons of the various EMRs, especially in terms of their usability. From the patient end, it would be useful to know whether EMR data come with any patient-friendly interpretation, as raw diagnostic test or lab results would be mostly uninterpretable by those with no medical background.

Although interoperability is an explicit goal of the HITECH Act [1], it would also be worthwhile to study how interoperable the various EMR systems actually are, especially from the perspective of the end user. This would be particularly relevant to patients with multiple providers who might use different EMR platforms.

Finally, there was not a clear increase in access to or use of EMRs in the post-pandemic period, compared to pre-pandemic 2020. However, the post-pandemic period studied was relatively short and further follow-up may better elucidate whether or not this unique circumstance affected EMR access and use.

References

  1. Centers for Disease Control and Prevention, National Program of Cancer Registries: Meaningful Use of Electronic Health Records. https://www.cdc.gov/cancer/npcr/meaningful_use.htm. Accessed August 5, 2021.
  2. Lafata JE, Miller CA, Shires DA, Dyer K, Ratliff SM, Schreiber M. Patients’ adoption of and feature access within electronic patient portals. Am J Manag Care 2018;24(11):e352–e357.
  3. Jhamb M, Cavanaugh KL, Bian A, Chen G, Ikizler TA, Unruh ML, Abdel-Kader K. Disparities in electronic health record patient portal use in nephrology clinics. Clin J Am Soc Nephrol 2015;10(11):2013–22.
  4. Measurement Instrument Database for the Social Sciences: The Patient Health Questionnaire-4 (PHQ-4). https://www.midss.org/content/patient-health-questionnaire-4-phq-4. Accessed August 7, 2021.
  5. Centers for Disease Control and Prevention: Physical Activity: How much physical activity do adults need? https://www.cdc.gov/physicalactivity/basics/adults/index.htm. Accessed August 10, 2021.
  6. Centers for Disease Control and Prevention: Alcohol and Public Health: Alcohol Use and Your Health. https://www.cdc.gov/alcohol/fact-sheets/alcohol-use.htm. Accessed August 10, 2021.
