
Named Entity Recognition for Astronomy Literature

Tara Murphy and Tara McIntosh and James R. Curran

School of Information Technologies
University of Sydney
NSW 2006, Australia
{tm,tara,james}@it.usyd.edu.au

Abstract

We present a system for named entity recognition (NER) in astronomy journal articles. We have developed this system on an NE corpus comprising approximately 200,000 words of text from astronomy articles. These have been manually annotated with ∼40 entity types of interest to astronomers. We report on the challenges involved in extracting the corpus, defining entity classes and annotating scientific text. We investigate which features of an existing state-of-the-art Maximum Entropy approach perform well on astronomy text. Our system achieves an F-score of 87.8%.

1 Introduction

Named entity recognition (NER) involves assigning broad semantic categories to entity references in text. While many of these categories do in fact refer to named entities, e.g. person and location, others are not proper nouns, e.g. date and money. However, they are all syntactically and/or semantically distinct and play a key role in Information Extraction (IE). NER is also a key component of Question Answering (QA) systems (Hirschman and Gaizauskas, 2001). State-of-the-art QA systems often have custom-built NER components with finer-grained categories than existing corpora (Harabagiu et al., 2000). For IE and QA systems, generalising entity references to broad semantic categories allows shallow extraction techniques to identify entities of interest and the relationships between them.

Another recent trend is to move beyond the traditional domain of newspaper text to other corpora. In particular, there is increasing interest in extracting information from scientific documents, such as journal articles, especially in biomedicine (Hirschman et al., 2002). A key step in this process is understanding the entities of interest to scientists and building models to identify them in text. Unfortunately, existing models of language perform very badly on scientific text even for the categories which map directly between science and newswire, e.g. person. Scientific entities often have more distinctive orthographic structure which is not exploited by existing models.
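To make the idea of orthographic structure concrete, the following is a minimal sketch of word-shape features of the kind commonly used in NER taggers. It is an illustration only and is not the feature set used by the authors: the function names `word_shape` and `orthographic_features`, the feature keys, and the example tokens are all assumptions introduced here.

```python
import re


def word_shape(token: str, collapse: bool = False) -> str:
    """Map each character to a class symbol.

    Uppercase letters become 'A', lowercase letters 'a', digits '0';
    any other character (punctuation, symbols) is kept unchanged.
    With collapse=True, runs of the same symbol are reduced to one.
    """
    symbols = []
    for ch in token:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")
        else:
            symbols.append(ch)
    shape = "".join(symbols)
    if collapse:
        # Replace any run of a repeated symbol with a single copy.
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


def orthographic_features(token: str) -> dict:
    """Hypothetical feature dictionary for a single token."""
    return {
        "shape": word_shape(token),
        "short_shape": word_shape(token, collapse=True),
        "has_digit": any(ch.isdigit() for ch in token),
        "all_caps": token.isupper(),
        "has_hyphen": "-" in token,
    }


if __name__ == "__main__":
    # Illustrative tokens only; they are not drawn from the annotated corpus.
    for tok in ["NGC", "4258", "Hubble", "J0534+2200"]:
        print(tok, orthographic_features(tok))
```

Under this sketch, a coordinate-style designation and an ordinary capitalised word receive clearly different shapes, which is the kind of signal a tagger could exploit if such features were included.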

In this work we identify entities within astronomical journal articles. The astronomy domain has several advantages: firstly, it is representative of the physical sciences; secondly, the majority of papers are freely available in a format that is relatively easy to manipulate (LaTeX); thirdly, there are many interesting entity types to consider annotating; finally, there are many databases of astronomical objects that we will eventually exploit as gazetteer information.

After reviewing comparable named entity corpora, we discuss aspects of astronomy that make it challenging for NLP. We then describe the corpus collection and extraction process, define the named entity categories and present some examples of interesting cases of ambiguity that come up in astronomical text.

Finally, we describe experiments with retraining an existing Maximum Entropy tagger for astronomical named entities. Interestingly, some feature types that work well for newswire significantly degrade accuracy here. We also use the tagger to detect errors and inconsistencies in the annotated corpus. We plan to develop a much larger freely available astronomy NE corpus based on our experience described here.

Proceedings of the 2006 Australasian Language Technology Workshop (ALTW2006), pages 57–64.

