Reading patient data manually takes a huge amount of time. To address this, U.S. scientists have developed a new, automated, AI-based algorithm that can learn to read patient data from Electronic Health Records (EHRs). In a side-by-side comparison, the scientists showed that their method identified patients with certain diseases as accurately as the traditional, “gold-standard” method, which requires much more manual labour to develop and perform.
There continues to be an explosion in the amount and types of data electronically stored in a patient’s medical record. Extracting and analysing this complex web of data can be highly inefficient, thus slowing advancements in clinical research.
In this study, we created a new method for mining data from electronic health records with machine learning that is faster and less labour intensive than the industry standard. We hope that this will be a valuable tool that will facilitate further, and less biased, research in clinical informatics.
– Assistant Professor of Genetics and Genomic Sciences
Currently, to mine medical records for new information, scientists rely on a set of established computer programmes or algorithms. A system called the Phenotype Knowledgebase (PheKB) manages the development and storage of these algorithms. While the system is highly effective at correctly identifying a patient diagnosis, the process of developing an algorithm can be very time-consuming and inflexible.
For instance, when researchers want to study a disease, they first have to scour through all the medical records to look for relevant information, such as certain lab tests or prescriptions, which are uniquely associated with the disease.
They then programme the algorithm that guides the computer to search for patients who have those disease-specific pieces of data, which constitute a “phenotype”. In turn, the list of patients identified by the computer needs to be manually double-checked by researchers. Each time researchers want to study a new disease, they have to restart the process from scratch. In this study, the researchers tried a different approach in which the computer learns on its own how to spot disease phenotypes, thus saving researchers time and effort.
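To make the traditional workflow concrete, the following is a minimal sketch of what a hand-written phenotype rule might look like. It is not taken from PheKB or from the study: the disease, the codes, the drug name, the lab threshold and all names such as `PatientRecord` and `matches_diabetes_rule` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Simplified stand-in for one patient's EHR (hypothetical structure)."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    lab_results: dict = field(default_factory=dict)  # lab name -> latest value


def matches_diabetes_rule(record):
    """Apply an illustrative, made-up rule for a diabetes-like phenotype."""
    has_code = "E11" in record.diagnosis_codes
    on_drug = "metformin" in record.medications
    high_hba1c = record.lab_results.get("hba1c", 0.0) >= 6.5
    # Require a diagnosis code plus at least one supporting piece of evidence.
    return has_code and (on_drug or high_hba1c)


# Toy records; real studies would query an EHR database instead.
records = [
    PatientRecord("p1", {"E11"}, {"metformin"}, {"hba1c": 7.2}),
    PatientRecord("p2", {"I10"}, {"lisinopril"}, {"hba1c": 5.4}),
    PatientRecord("p3", {"E11"}, set(), {"hba1c": 5.9}),
]

cohort = [r.patient_id for r in records if matches_diabetes_rule(r)]
print(cohort)
```

Every condition in such a rule has to be chosen and checked by experts, and a different disease would need a different rule, which is the manual effort the new approach aims to reduce.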
A senior author of the study stated that the researchers had previously shown that unsupervised machine learning could be a highly efficient and effective strategy for mining EHRs. The potential advantage of their approach is that it learns representations of diseases from the data itself. Therefore, the machine does much of the work experts would normally do to define the combination of data elements from health records that best describes a particular disease.
Essentially, a computer was programmed to scour through millions of EHRs and learn how to find connections between data and diseases. This programming relied on “embedding” algorithms that had been previously developed by other researchers, such as linguists, to study word networks in various languages. One of the algorithms, called word2vec, was particularly effective. Then, the computer was programmed to use what it learned to identify the diagnoses of nearly 2 million patients whose data was stored in the health system.
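The general idea of using embeddings for phenotyping can be sketched as follows. This is not the study's implementation: the two-dimensional vectors are hand-made stand-ins for embeddings that would normally be learned from EHR data by an algorithm such as word2vec, and the concept names, the `phenotype_for` and `patient_score` functions and the scoring scheme are all assumptions made for illustration.

```python
import math

# Toy vectors standing in for learned medical-concept embeddings (assumed values).
EMBEDDINGS = {
    "dx:type2_diabetes": (0.9, 0.1),
    "rx:metformin": (0.8, 0.2),
    "lab:hba1c": (0.85, 0.15),
    "dx:hypertension": (0.1, 0.9),
    "rx:lisinopril": (0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(concepts):
    """Average the embedding vectors of a list of concepts."""
    vectors = [EMBEDDINGS[c] for c in concepts]
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def phenotype_for(seed, top_k=3):
    """Define a phenotype as the concepts closest to a seed concept."""
    seed_vector = EMBEDDINGS[seed]
    ranked = sorted(
        EMBEDDINGS,
        key=lambda concept: cosine(seed_vector, EMBEDDINGS[concept]),
        reverse=True,
    )
    return ranked[:top_k]


def patient_score(patient_concepts, phenotype):
    """Score a patient by similarity between patient and phenotype vectors."""
    return cosine(mean_vector(patient_concepts), mean_vector(phenotype))


phenotype = phenotype_for("dx:type2_diabetes")

# Toy patients described by the concepts found in their records.
patients = {
    "p1": ["rx:metformin", "lab:hba1c"],
    "p2": ["dx:hypertension", "rx:lisinopril"],
}
scores = {pid: patient_score(concepts, phenotype) for pid, concepts in patients.items()}
print(phenotype)
print(scores)
```

In this sketch no expert writes a disease-specific rule: the phenotype is read off from which concepts lie close together in the embedding space, and patients are ranked by how close their records are to it.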
Finally, the researchers compared the effectiveness of the new and old systems. For nine out of ten diseases tested, they found that the new system, Phe2vec, was as effective as, or performed slightly better than, the gold-standard phenotyping process at correctly identifying diagnoses from EHRs.
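The article does not say which measures were used for this comparison. As an illustration only, the sketch below assumes a common setup in which each method's list of patients is checked against a manually confirmed list and summarised with precision, recall and F1. The patient identifiers and the `precision_recall_f1` helper are hypothetical, and the numbers are toy values, not results from the study.

```python
def precision_recall_f1(predicted, confirmed):
    """Compare a predicted patient set against manually confirmed cases."""
    true_positives = len(predicted & confirmed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: patients confirmed by manual chart review (assumed reference set).
confirmed = {"p1", "p2", "p3", "p4"}

# Toy outputs of two hypothetical phenotyping methods.
rule_based = {"p1", "p2", "p5"}
learned = {"p1", "p2", "p3", "p6"}

rule_metrics = precision_recall_f1(rule_based, confirmed)
learned_metrics = precision_recall_f1(learned, confirmed)
print("rule-based:", rule_metrics)
print("learned:", learned_metrics)
```

Repeating such a comparison for each disease gives a per-disease picture of where one method matches or outperforms the other.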
Overall, the results are encouraging and suggest that the system is a promising technique for large-scale phenotyping of diseases in EHR data. With further testing and refinement, the researchers hope that it could be used to automate many of the initial steps of clinical informatics research, thus allowing scientists to focus their efforts on downstream analyses like predictive modelling.