BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
10th Benelux Bioinformatics Conference Poster<br />
P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES<br />
FOR PREDICTING CLINICAL CODES<br />
Elyne Scheurwegs 1,3*, Kim Luyckx 2, Léon Luyten 2, Walter Daelemans 3 & Tim Van den Bulcke 1.<br />
Advanced Database Research and Modeling (ADReM), University of Antwerp 1; Antwerp University Hospital 2; Center<br />
for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp 3; * elyne.scheurwegs@uantwerpen.be<br />
Automated clinical coding is a task in medical informatics, in which information found in patient files is translated to<br />
various types of coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both<br />
in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple<br />
days during the stay). This work studies the complementarity of information derived from these different sources to<br />
enhance clinical code prediction.<br />
INTRODUCTION<br />
The increased accessibility of healthcare data through the<br />
large-scale adoption of electronic health records stimulates<br />
the development of algorithms that monitor hospital<br />
activities, such as clinical coding applications.<br />
Clinical coding consists of translating the information<br />
found in a patient file into diagnostic and procedural codes<br />
that originate from a medical ontology.<br />
In our work, we investigate whether unstructured (textual) and<br />
structured data sources, present in electronic health<br />
records, can be combined to assign clinical diagnostic and<br />
procedural codes (specifically ICD-9-CM) to patient stays.<br />
Our main objective is to evaluate whether integrating these<br />
heterogeneous data types improves prediction strength<br />
compared to using the data types in isolation.<br />
METHODS<br />
Several datasets were collected from the clinical data<br />
warehouse of the Antwerp University Hospital (UZA).<br />
The resulting dataset consists of a randomized subset of<br />
anonymized data of patient stays, in 14 different medical<br />
specialties. Two separate data integration approaches were<br />
evaluated on each dataset from a medical specialty.<br />
With early data integration, multiple sources are combined<br />
prior to training a model. This is achieved by using a<br />
single bag of features that are given to the prediction<br />
pipeline. Feature selection is performed with tf-idf for<br />
unstructured sources, and with gain ratio and minimum<br />
redundancy, maximum relevance (mRMR) for structured<br />
sources.<br />
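The early integration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`tfidf`, `early_integration`), the `min_weight` threshold, the `text:`/`struct:` prefixes, and the toy data are all hypothetical, and the gain ratio and mRMR filtering of structured features is omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def early_integration(text_docs, structured_rows, min_weight=0.0):
    """Merge both sources into a single bag of features per patient stay."""
    merged = []
    for text_weights, row in zip(tfidf(text_docs), structured_rows):
        # Assumption: keep only text terms whose tf-idf weight exceeds a threshold.
        bag = {f"text:{t}": w for t, w in text_weights.items() if w > min_weight}
        # Structured features are added unchanged (no gain ratio / mRMR here).
        bag.update({f"struct:{name}": value for name, value in row.items()})
        merged.append(bag)
    return merged

# Hypothetical toy data: two stays, each with a tokenized note and one lab value.
notes = [["chest", "pain", "patient"], ["fracture", "patient"]]
labs = [{"troponin": 1.0}, {"troponin": 0.0}]
bags = early_integration(notes, labs)
```

In this toy input the term "patient" occurs in every note, so its tf-idf weight is zero and it is dropped before the merged bag is passed to a single model.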
The late data integration method trains a separate model<br />
on each data source, and then combines the prediction<br />
output for each code in a meta-learner. This meta-learner<br />
is mainly used to find which sources perform best for a<br />
certain code.<br />
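One simple way to picture such a meta-learner is a weighted vote in which each source is weighted by how well it predicts a given code on held-out stays. The sketch below is an assumption made for illustration: the abstract does not specify the meta-learner, and the names `fit_meta_weights` and `meta_predict`, the F-measure weighting, and the toy predictions are hypothetical.

```python
def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fit_meta_weights(source_preds, y_true):
    """Weight each source by its F-measure for one code on held-out stays."""
    return {name: f_measure(y_true, preds) for name, preds in source_preds.items()}

def meta_predict(weights, source_votes):
    """Weighted vote over the per-source binary predictions for one stay."""
    total = sum(weights.values())
    if total == 0:
        return 0
    score = sum(weights[name] * vote for name, vote in source_votes.items())
    return int(score / total >= 0.5)

# Hypothetical held-out predictions of three per-source models for one code.
held_out_truth = [1, 1, 0, 0]
held_out_preds = {
    "notes": [1, 1, 0, 0],   # informative source
    "lab": [1, 0, 1, 0],     # partly informative source
    "noise": [0, 0, 1, 1],   # source with no useful signal for this code
}
weights = fit_meta_weights(held_out_preds, held_out_truth)
decision = meta_predict(weights, {"notes": 1, "lab": 0, "noise": 0})
```

Here the uninformative source receives a weight of zero for this code, so its vote does not change the combined decision.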
The prediction task in both approaches was cast as a multiclass<br />
classification task, in which an array of binary<br />
predictions was made (one for each clinical code).<br />
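The "array of binary predictions" can be illustrated by training one binary model per code and collecting their outputs. The sketch below assumes a deliberately trivial learner (`PresenceStump`, a hypothetical name) and toy data; it only shows the decomposition into per-code binary tasks, not the classifiers used in the study.

```python
def to_binary_targets(code_sets, codes):
    """Turn each stay's set of codes into one 0/1 target list per code."""
    return {code: [int(code in assigned) for assigned in code_sets] for code in codes}

class PresenceStump:
    """Toy binary learner: predicts 1 when its best single feature is present."""

    def fit(self, bags, targets):
        features = sorted({f for bag in bags for f in bag})

        def agreement(feature):
            # Number of stays where feature presence matches the target.
            return sum(int(feature in bag) == t for bag, t in zip(bags, targets))

        self.feature = max(features, key=agreement)
        return self

    def predict(self, bag):
        return int(self.feature in bag)

def fit_per_code(bags, code_sets, codes):
    """Train one independent binary model for every clinical code."""
    targets = to_binary_targets(code_sets, codes)
    return {code: PresenceStump().fit(bags, targets[code]) for code in codes}

def predict_codes(models, bag):
    """Array of binary predictions, one per clinical code (sorted by code)."""
    return [models[code].predict(bag) for code in sorted(models)]

# Hypothetical toy data: feature bags and assigned codes for three stays.
bags = [{"chest", "pain"}, {"fracture"}, {"chest", "fracture"}]
code_sets = [{"786.50"}, {"829.0"}, {"786.50", "829.0"}]
models = fit_per_code(bags, code_sets, ["786.50", "829.0"])
prediction = predict_codes(models, {"fracture", "pain"})
```

Because each code has its own model, a stay can receive several codes at once or none at all.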
RESULTS & DISCUSSION<br />
Late data integration improves the prediction of ICD-9-CM<br />
diagnostic codes compared with the best individual<br />
prediction source (the overall F-measure<br />
increased from 30.6% to 38.3%). Early data integration<br />
does not show this trend and only performs well with a<br />
limited number of combinations of sources. ICD-9-CM<br />
procedure codes also show this trend, with the exception<br />
of the RIZIV data source, which shows a better prediction<br />
when used individually. The predictive strength of the<br />
models varies strongly between different medical<br />
specialties.<br />
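For reference, the F-measure is conventionally the harmonic mean of precision P and recall R. The abstract does not state how the overall figure is averaged across codes, so the form below is the standard definition in terms of true positives (TP), false positives (FP), and false negatives (FN), not necessarily the authors' exact computation:

```latex
F = \frac{2PR}{P + R}, \qquad P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
```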
The results show that the data sources, independent of<br />
their structured or unstructured nature, are able to provide<br />
complementary information when predicting ICD-9-CM<br />
codes, particularly when combined within the late data<br />
integration approach. This approach also allows for<br />
including as many sources as possible, as<br />
including a source that does not contain any additional<br />
information barely influences the end result. This is an<br />
advantage when the information content of a data source is<br />
not previously known. A disadvantage is the loss of<br />
information due to the strong generalisation as each data<br />
source is effectively reduced to a single feature for the<br />
meta-learner.<br />
Early data integration seems to suffer when combining<br />
sources that have features with a largely differing<br />
information content and different numbers of features. An<br />
unstructured data source typically yields 30,000<br />
different, weak features, while a structured source often<br />
contains only 500 different features.<br />
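As a rough illustration using only the typical figures quoted above, and assuming one unstructured and one structured source are merged without further selection, the structured features would make up only a small share of the single bag of features:

```latex
\frac{500}{30\,000 + 500} = \frac{500}{30\,500} \approx 0.016 \quad (\text{about } 1.6\%)
```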
CONCLUSIONS<br />
Models using multiple electronic health record data<br />
sources systematically outperform models using data<br />
sources in isolation in the task of predicting ICD-9-CM<br />
codes over a broad range of medical specialties.<br />
ACKNOWLEDGEMENT<br />
This work is supported by a doctoral research grant (nr.<br />
131137) from the Agency for Innovation by Science and<br />
Technology in Flanders (IWT). The datasets used in this<br />
research were made available by the Antwerp University<br />
Hospital (UZA) for restricted use.<br />
REFERENCES<br />
Scheurwegs, E. et al. Data integration of structured and unstructured<br />
sources for assigning clinical codes to patient stays. Journal of the<br />
American Medical Informatics Association (2015): ocv115.<br />