
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />

Abstract ID: P<br />

10th Benelux Bioinformatics Conference Poster<br />


P51. INTEGRATING STRUCTURED AND UNSTRUCTURED DATA SOURCES<br />

FOR PREDICTING CLINICAL CODES<br />

Elyne Scheurwegs 1,3* , Kim Luyckx 2 , Léon Luyten 2 , Walter Daelemans 3 & Tim Van den Bulcke 1 .<br />

Advanced Database Research and Modeling (ADReM), University of Antwerp 1 ; Antwerp University Hospital 2 ; Center<br />

for Computational Linguistics and Psycholinguistics (CLiPS), University of Antwerp 3 ; * elyne.scheurwegs@uantwerpen.be<br />

Automated clinical coding is a task in medical informatics in which information found in patient files is translated into<br />

codes from various coding systems (e.g. ICD-9-CM). The information in patient files consists of multiple data sources, both<br />

in structured (e.g. lab test results) and unstructured form (e.g. a text describing the progress of a patient over multiple<br />

days during the stay). This work studies the complementarity of information derived from these different sources to<br />

enhance clinical code prediction.<br />

INTRODUCTION<br />

The increased accessibility of healthcare data through the<br />

large-scale adoption of electronic health records stimulates<br />

the development of algorithms that monitor hospital<br />

activities, such as clinical coding applications.<br />

Clinical coding consists of the translation of information<br />

found in a patient file to diagnostic and procedural codes<br />

originating from a medical ontology.<br />

In our work, we investigate if unstructured (textual) and<br />

structured data sources, present in electronic health<br />

records, can be combined to assign clinical diagnostic and<br />

procedural codes (specifically ICD-9-CM) to patient stays.<br />

Our main objective is to evaluate if integrating these<br />

heterogeneous data types improves prediction strength<br />

compared to using the data types in isolation.<br />

METHODS<br />

Several datasets were collected from the clinical data<br />

warehouse of the Antwerp University Hospital (UZA).<br />

The resulting dataset consists of a randomized subset of<br />

anonymized data of patient stays, in 14 different medical<br />

specialties. Two separate data integration approaches were<br />

evaluated on each dataset from a medical specialty.<br />

With early data integration, multiple sources are combined<br />

prior to training a model. This is achieved by using a<br />

single bag of features that are given to the prediction<br />

pipeline. Feature selection is performed with tf-idf for<br />

unstructured sources, and with gain ratio and minimum<br />

redundancy, maximum relevance (mRMR) for structured<br />

sources.<br />
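
As a rough illustration of early data integration, the sketch below filters text tokens by tf-idf weight and then merges them with structured features into one bag per patient stay. The abstract does not say how tf-idf was applied or how features were encoded, so the function names, the `min_weight` cut-off, the `text:`/`struct:` prefixes and the toy data are all assumptions made for illustration. Gain ratio and mRMR filtering of the structured source are left out.

```python
import math
from collections import Counter


def tfidf_filter(token_docs, min_weight=0.0):
    """Return one {term: tf-idf weight} dict per document, keeping only
    terms whose weight exceeds min_weight (a hypothetical cut-off)."""
    n_docs = len(token_docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    filtered = []
    for doc in token_docs:
        kept = {}
        for term, count in Counter(doc).items():
            weight = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            if weight > min_weight:
                kept[term] = weight
        filtered.append(kept)
    return filtered


def early_integration(text_features, structured_features):
    """Merge both sources into a single bag of features per patient stay."""
    merged = []
    for text, structured in zip(text_features, structured_features):
        # Prefixes keep feature names from different sources apart.
        bag = {f"text:{name}": value for name, value in text.items()}
        bag.update({f"struct:{name}": value for name, value in structured.items()})
        merged.append(bag)
    return merged


# Hypothetical toy data: two stays, each with a tokenized note and lab flags.
notes = [["chest", "pain", "troponin"], ["knee", "pain", "fracture"]]
labs = [{"troponin_high": 1.0}, {"crp_high": 1.0}]
bags = early_integration(tfidf_filter(notes), labs)
```

In this toy example the token "pain" occurs in every note, so its tf-idf weight is zero and it is dropped before the sources are merged.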

The late data integration method trains a separate model<br />

on each data source, and then combines the prediction<br />

output for each code in a meta-learner. This meta-learner<br />

is mainly used to find which sources perform best for a<br />

certain code.<br />
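
The late integration idea can be sketched for a single clinical code as follows. The abstract does not specify the meta-learner, so this example swaps in a deliberately simple stand-in: each source is weighted by its validation accuracy for the code, and the per-source scores are combined as a weighted average. The function names, the 0.5 threshold, the source names and the scores are hypothetical.

```python
def fit_source_weights(val_scores, val_labels, threshold=0.5):
    """Weight each source by its validation accuracy for one code.

    val_scores maps a source name to its predicted scores on the
    validation stays; val_labels holds the true 0/1 labels.
    """
    weights = {}
    for source, scores in val_scores.items():
        correct = sum(
            (score >= threshold) == bool(label)
            for score, label in zip(scores, val_labels)
        )
        weights[source] = correct / len(val_labels)
    total = sum(weights.values())
    # Normalize so the weights sum to one (unless every source scored zero).
    return {s: w / total for s, w in weights.items()} if total else weights


def meta_predict(weights, source_scores, threshold=0.5):
    """Combine the per-source scores for one stay into a binary decision."""
    combined = sum(weights[s] * score for s, score in source_scores.items())
    return combined >= threshold


# Hypothetical validation output of two per-source models for one code.
val_scores = {"notes": [0.9, 0.2, 0.8, 0.1], "labs": [0.6, 0.7, 0.4, 0.3]}
val_labels = [1, 0, 1, 0]
weights = fit_source_weights(val_scores, val_labels)

assigned = meta_predict(weights, {"notes": 0.8, "labs": 0.3})
rejected = meta_predict(weights, {"notes": 0.2, "labs": 0.9})
```

Because the "notes" source is more accurate on the validation stays, it receives the larger weight and dominates the combined decision for this code.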

The prediction task in both approaches was cast as a multi-label<br />

classification task, in which an array of binary<br />

predictions was made (one for each clinical code).<br />
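
A minimal sketch of the "array of binary predictions" representation: each patient stay is encoded as one 0/1 value per clinical code, and a predicted array is decoded back into a list of codes. The helper names and the short code list are illustrative assumptions, not taken from the study.

```python
def to_binary_targets(assigned_codes, code_list):
    """Encode the codes assigned to a stay as one 0/1 value per known code."""
    return [1 if code in assigned_codes else 0 for code in code_list]


def from_binary_predictions(predictions, code_list):
    """Decode an array of binary predictions back into a list of codes."""
    return [code for code, flag in zip(code_list, predictions) if flag]


# Illustrative code strings; a real code list would be far longer.
code_list = ["250.00", "410.71", "428.0"]
targets = to_binary_targets({"428.0", "250.00"}, code_list)
decoded = from_binary_predictions([0, 1, 1], code_list)
```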

RESULTS & DISCUSSION<br />

Late data integration improves the prediction of ICD-9-CM<br />

diagnostic codes compared to the best<br />

individual prediction source (overall F-measure<br />

increased from 30.6% to 38.3%). Early data integration<br />

does not show this trend and only performs well with a<br />

limited number of combinations of sources. ICD-9-CM<br />

procedure codes also show this trend, with the exception<br />

of the RIZIV data source, which shows a better prediction<br />

when used individually. The predictive strength of the<br />

models varies strongly between different medical<br />

specialties.<br />
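
The F-measure reported above is the harmonic mean of precision and recall. The abstract does not state how the "overall" figure was averaged, so the sketch below assumes micro-averaging: true positives, false positives and false negatives are pooled over all stays and all codes. The function name and toy data are hypothetical.

```python
def micro_f_measure(true_rows, pred_rows):
    """Micro-averaged F-measure over rows of per-code binary labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(true_rows, pred_rows):
        for truth, pred in zip(true_row, pred_row):
            if pred and truth:
                tp += 1  # code predicted and actually assigned
            elif pred:
                fp += 1  # code predicted but not assigned
            elif truth:
                fn += 1  # code assigned but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: two stays, three codes each.
true_rows = [[1, 0, 1], [0, 1, 0]]
pred_rows = [[1, 1, 0], [0, 1, 0]]
score = micro_f_measure(true_rows, pred_rows)
```

Here two codes are predicted correctly, one is predicted wrongly and one is missed, so precision and recall are both 2/3 and so is the F-measure.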

The results show that the data sources, independent of<br />

their structured or unstructured nature, are able to provide<br />

complementary information when predicting ICD-9-CM<br />

codes, particularly when combined within the late data<br />

integration approach. This approach also allows for<br />

including as many sources as possible, as<br />

adding a source that does not contain any additional<br />

information barely influences the end result. This is an<br />

advantage when the information content of a data source is<br />

not previously known. A disadvantage is the loss of<br />

information due to the strong generalisation as each data<br />

source is effectively reduced to a single feature for the<br />

meta-learner.<br />

Early data integration seems to suffer when combining<br />

sources whose features differ widely in<br />

information content and in number. An<br />

unstructured data source typically yields 30,000<br />

different, weak features, while a structured source often<br />

contains only 500 different features.<br />
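
To make the imbalance concrete, the short calculation below takes the two typical feature counts quoted above and computes the share of structured features in a combined bag. It only illustrates the proportions and assumes the two sources are concatenated without any feature selection.

```python
# Typical feature counts quoted in the text.
unstructured_features = 30_000
structured_features = 500

# Share of structured features in a single combined bag of features.
combined = unstructured_features + structured_features
structured_share = structured_features / combined
structured_percent = round(structured_share * 100, 1)
```

Under this assumption the structured source contributes well under 2% of the combined features, which suggests one way its signal can be swamped by the many weak text features.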

CONCLUSIONS<br />

Models using multiple electronic health record data<br />

sources systematically outperform models using data<br />

sources in isolation in the task of predicting ICD-9-CM<br />

codes over a broad range of medical specialties.<br />

ACKNOWLEDGEMENT<br />

This work is supported by a doctoral research grant (nr.<br />

131137) from the Agency for Innovation by Science and<br />

Technology in Flanders (IWT). The datasets used in this<br />

research were made available by the Antwerp University<br />

Hospital (UZA) for restricted use.<br />

REFERENCES<br />

Scheurwegs, E et al. Data integration of structured and unstructured<br />

sources for assigning clinical codes to patient stays. Journal of the<br />

American Medical Informatics Association (2015): ocv115.<br />
