Electronic Medical Record/Electronic Health Record (EMR/EHR) systems have fallen far from the peak of the hype cycle.[1] For decades, EMR systems have largely failed to fulfill promises to efficiently deliver real-world evidence for patient-centered medicine, primarily due to inconsistent, incorrect, and missing data in source EMRs. Data quality (DQ) and harmonization are critical to delivering value from EMR data.

Traditional and AI-based methods can be applied to extract, harmonize, integrate, and realize useful outcomes from EMR data. A customer case study describes the creation of a new research-ready data resource that combines information from different EMR systems into a searchable, harmonized, and integrated database. To do this, traditional and advanced methods were applied, including pipelining/API calls, rules- and lexicon-based data identification and normalization, and a form of AI, specifically machine reasoning (MR).[2]
Multiple Real-World EMR Systems
A scenario featuring the Parkinson's Institute and Clinical Center (PICC, Sunnyvale, CA) and its efforts to extract, curate, normalize, integrate, and realize useful outcomes from EMR data demonstrates the impact of combining machine reasoning (a form of AI) with traditional data quality methods. PICC's technical goals were to create a new data resource, combining information from two EMR databases into a single, searchable database with content suited for pharma research.
These two separate EMR systems, from different vendors, contained data from different time periods gathered over more than a decade for a common patient population. The EMR data structure was diverse and included tables, CSV, encrypted XML requiring decryption, and over 70,000 unstructured clinical notes from more than 5,000 patients. In some surprising cases, medical evidence was mixed with application information (for example, code describing colors and margins in the software application). EMR system 1 was in use from September 2004 to February 2011 and contained unstructured notes and some structured data fields. EMR system 2 was in use from February 9, 2011 to 2017 and offered additional data structure. Because content was incompletely ported from EMR 1 into EMR 2, both source EMR systems were used for this data quality application. Data was extracted in compliance with HIPAA from the two EMR "data dumps" and processed on a secure server maintained behind the PICC firewall.
Methods applied to harmonize and integrate these data sources included traditional ETL (e.g., conversion of different sources to a common data serialization; lexical matching and normalization) and SPARQL-based machine reasoning (e.g., ontology-enabled inference and entailment). Key goals included researching and publishing discoveries from the clinical data and building remunerative partnerships based on the new data resource.
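As a minimal sketch of the first of those ETL steps, the example below converts a CSV extract into a common RDF serialization using Python and the rdflib library. All file, column, and namespace names are hypothetical stand-ins for the project's actual source-specific mappings.

```python
# Minimal ETL sketch: convert a CSV extract to a common RDF serialization.
# All file, column, and namespace names here are hypothetical.
import csv
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/emr/")  # placeholder namespace

g = Graph()
with open("emr1_extract.csv", newline="") as f:
    for row in csv.DictReader(f):
        patient = EX[f"patient/{row['PatientID']}"]
        g.add((patient, RDF.type, EX.Patient))  # type the record
        g.add((patient, EX.drugName, Literal(row["Drug"])))
        g.add((patient, EX.visitDate, Literal(row["VisitDate"])))

g.serialize(destination="emr1_extract.ttl", format="turtle")
```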
Challenges and Technical Goals
The team faced a number of challenges. For example, the source data was rich but showed data entry errors and diversity in nomenclature and representation. In one sample, over 190 name and spelling variations existed for a single drug (e.g., "Stalevo" or carbidopa-levodopa-entacapone). Data also came in multiple forms and serializations, including XML, unstructured text, tables, encrypted data requiring post-access decryption, and mixed application and clinical data. Due to EMR migration issues, common identifiers (e.g., EMRID, ControlID, PatientID) were not consistently and comprehensively applied to source datasets. Therefore, common patient identification required the combined use of protected personal and health information (PHI).
The data structure and content had to be expandable and changeable to meet PICC's growing requirements and resources. Support for collaboration and long-term interoperability, via the ability to align with multiple standards, was also desired. Different projects initiated around the same dataset could require removal or addition of specific data fields from existing reports or from new sources, as well as other structural and nomenclature transformations. As a primary goal, the resulting resource needed to be research-ready. The data and system also had to support analytics, including advanced searches, pattern recognition, and reasoning.
It was determined that richly connected "hyper-normalized" semantic database technology, with relational database tables for mirroring and outward-facing access and research, would be ideally suited for PICC’s master data management (MDM) operations.
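A hedged sketch of that mirroring pattern follows: rows for an outward-facing relational table are populated from the semantic store via a SPARQL SELECT. The Turtle file, namespace, and table schema are illustrative assumptions, not PICC's actual design.

```python
# Mirror semantic (RDF) master data into a research-facing relational table.
# The file, namespace, and schema names are illustrative assumptions.
import sqlite3
from rdflib import Graph

g = Graph().parse("emr_integrated.ttl")  # hypothetical integrated dataset

rows = g.query("""
    PREFIX ex: <http://example.org/emr/>
    SELECT ?patient ?drug WHERE { ?patient ex:drugName ?drug }
""")

con = sqlite3.connect("research_mirror.db")
con.execute("CREATE TABLE IF NOT EXISTS patient_drug (patient TEXT, drug TEXT)")
con.executemany("INSERT INTO patient_drug VALUES (?, ?)",
                [(str(r.patient), str(r.drug)) for r in rows])
con.commit()
con.close()
```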
Methods and Results: Realizing Data Quality for Electronic Medical Records
The data team applied a combination of previously developed lexicons, workflows, rules-based and semantic data quality products, and methods to solve challenging EMR data harmonization and integration tasks. Clinical data integration across the different EMR systems was achieved by building on traditional methods, including lexicon-based data identification and normalization, and extending to formal ontology-enabled machine reasoning (AI) for data classification. Data integration modeling was applied to test the application of business rules, ontologies, and reasoning, as well as to iteratively query and test the emerging dataset. Automated QA (e.g., via workflow) and QC following a documented set of post-ETL tests and reports ensured data integrity and quality.
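As one illustration of a post-ETL QC test of the kind described, the sketch below flags records whose visit dates fall outside a source system's known active window (EMR 1 ran from September 2004 to February 2011). The field name and ISO date format are assumptions, and the documented test suite was certainly broader.

```python
# Post-ETL QC sketch: flag visit dates outside EMR 1's active window.
# The field name and ISO date format are assumptions for illustration.
from datetime import date

EMR1_WINDOW = (date(2004, 9, 1), date(2011, 2, 28))

def qc_visit_dates(records: list[dict]) -> list[str]:
    flags = []
    for i, rec in enumerate(records):
        visit = date.fromisoformat(rec["VisitDate"])
        if not EMR1_WINDOW[0] <= visit <= EMR1_WINDOW[1]:
            flags.append(f"record {i}: visit {visit} outside EMR 1 window")
    return flags

print(qc_visit_dates([{"VisitDate": "2012-05-01"}]))  # -> flagged for review
```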
To meet the specified technical goals, the semantic system made it possible to rapidly expand and modify content and structure for research and collaboration. The initial dataset was aligned with standardized common nomenclature (FDA, UMLS) to support normalization and subsequent harmonization with other standardized data as needed. The semantic data management system supported rapid re-alignment with changing requests, standards, and specifications for compliance with "open world" design requirements and longer-term goals for semantic interoperability.
Traditional methods applied to improve data quality included rules-based assessment and transformation, and normalization to standardized lists ("lexicons") via scored string matching and statistical analyses. These rules and lexicons were implemented as transformation rules created in a visual data modeling environment.
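A minimal sketch of scored string matching against a lexicon, using the Stalevo example above; the one-entry lexicon and the 0.85 match threshold are illustrative only, not the project's actual values.

```python
# Lexicon normalization sketch: score raw strings against canonical terms.
# The one-entry lexicon and the 0.85 threshold are illustrative only.
from difflib import SequenceMatcher

LEXICON = {"stalevo": "carbidopa-levodopa-entacapone"}  # term -> canonical form

def normalize_drug(raw: str, threshold: float = 0.85) -> str | None:
    """Return the canonical name of the best-scoring lexicon term, if any."""
    raw_l = raw.strip().lower()
    best, best_score = None, 0.0
    for term, canonical in LEXICON.items():
        score = SequenceMatcher(None, raw_l, term).ratio()
        if score > best_score:
            best, best_score = canonical, score
    return best if best_score >= threshold else None  # None -> human review

print(normalize_drug("Stalvo"))  # misspelling still resolves to the canonical name
```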
Rules-based testing and normalization of drug and disease names and IDs included the application of context information. This reduced time and improved confidence in, for example, the flagging and audited correction of drug, dosage, and route-of-administration information. These targeted functions were deployed as API microservices.
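An illustrative context rule of this kind is sketched below: a record is flagged for audit when its route of administration is implausible for the normalized drug. The rule table is hypothetical; real rules would come from clinical review.

```python
# Context-rule sketch: flag implausible drug/route combinations for audit.
# The rule table is hypothetical, not the project's clinical rule set.
VALID_ROUTES = {"carbidopa-levodopa-entacapone": {"oral"}}

def audit_record(drug: str, route: str) -> list[str]:
    flags = []
    allowed = VALID_ROUTES.get(drug)
    if allowed is not None and route.lower() not in allowed:
        flags.append(f"route '{route}' unexpected for {drug}")
    return flags

print(audit_record("carbidopa-levodopa-entacapone", "intravenous"))  # flagged
```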
Finally, de-identification and randomized identifiers were applied post-normalization to support HIPAA compliance.
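A minimal de-identification sketch, assuming hypothetical field names: direct identifiers are dropped, and each patient receives a stable random surrogate ID.

```python
# De-identification sketch: drop direct PHI and assign random surrogate IDs.
# Field names are hypothetical; a real pipeline covers all HIPAA identifiers.
import uuid

surrogates: dict[str, str] = {}  # real ID -> surrogate (kept in a secured map)

def deidentify(record: dict) -> dict:
    real_id = record.pop("PatientID")
    surrogate = surrogates.setdefault(real_id, uuid.uuid4().hex)
    record.pop("Name", None)  # drop a direct identifier
    return {"StudyID": surrogate, **record}

print(deidentify({"PatientID": "P001", "Name": "Jane Doe", "Drug": "stalevo"}))
```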
AI-enabled methods included semantic ontology-enabled machine reasoning and analysis-driven machine learning. For example, machine reasoning was deployed to accurately move data from diverse serializations into a common target structure, reducing normalization time.
To do this, SPARQL reasoning and entailment were used following World Wide Web Consortium (W3C) standards, with data quality tools generating and applying the SPARQL. By combining traditional data quality processes with AI, the system was able to effectively address the complex data quality and integration challenges found in both EMR systems.
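A hedged sketch of SPARQL-based entailment in this W3C style: a subclass axiom from an illustrative ontology lets a CONSTRUCT query materialize classifications never stated in the source data. The namespaces and class names are assumptions, not PICC's ontology.

```python
# SPARQL entailment sketch: materialize rdfs:subClassOf inferences.
# The ontology axiom and class names are illustrative assumptions.
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix ex: <http://example.org/emr/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:ParkinsonianDisorder rdfs:subClassOf ex:MovementDisorder .
    ex:diagnosis42 a ex:ParkinsonianDisorder .
""", format="turtle")

inferred = g.query("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    CONSTRUCT { ?x a ?super }
    WHERE { ?x a ?cls . ?cls rdfs:subClassOf ?super }
""")
for triple in inferred:
    g.add(triple)  # diagnosis42 is now also typed as ex:MovementDisorder

print(g.serialize(format="turtle"))
```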
Combined DQ Methods Make It Possible to Realize Goals for EMR Data
By blending traditional data quality processes with AI, computer science is solving challenges that have plagued the pharma industry for years. Critical data quality requirements that have commonly blocked business value promised by EMR systems can be addressed by applying traditional workflow-based data quality processing in combination with more advanced machine reasoning methods.
Key technical outcomes featured repeated delivery of curated, structured, integrated, and searchable EMR data, ideally suited to specific research and collaboration requirements. Resulting business goals included delivery of regularly updated content to multiple pharma partnerships. Revenue-bearing business goals achieved in this case included pharmaceutical research partnerships and highly cited publications based on the new data resource.[3] Several targeted data subsets from this resource have been acquired by major pharmaceutical companies, and the full dataset has been acquired by the Neuroscience Research Institute (NRI, www.thenri.org). The NRI's next goals include delivering new patient- and market-facing applications from the data: powerful long-term advantages developed directly from smarter and more connected data.
References:
1) Willyard, C. "Can AI Fix Electronic Medical Records?" Scientific American, February 1, 2020. https://www.scientificamerican.com/article/can-ai-fix-electronic-medical-records/
2) Stanley, R. "Machine Reasoning to Reduce Time and Cost for AI Success." Database Trends and Applications, Fall 2020 Quarterly. Berkeley, CA, August 2020.
3) Langston, J. W. et al. "Multisystem Lewy body disease and the other parkinsonian disorders." Nature Genetics 47, 1378–1384 (2015).