Automatic Mapping Clinical Notes to Medical - RMIT University

More documents

Recommendations

Info

$journal of latex class files, vol. 6, no. 1, january ... - RMIT University$

2 Existing annotated corpora Much of the development in ner has been driven by the corpora available for training and evaluating such systems. This is because the state-of-the-art systems rely on statistical machine learning approaches. 2.1 Message Understanding Conference The muc named entity recognition task (in muc 6/7) covered three types of entities: names person, location, organisation; temporal expressions date, time; numeric expressions money, percent. The distribution of these types in muc 6 was: names 82%, temporal 10% and numeric 8%, and in muc 7 was: names 67%, temporal 25% and numeric 6%. The raw text for the muc 6 ner corpus consisted of 30 Wall Street Journal articles, provided by the Linguistic Data Consortium (ldc). The text used for the English ner task in muc 7 was from the New York Times News Service, also from the ldc. There are detailed annotation guidelines available. 1 2.2 GENIA corpus The genia corpus (Kim et al., 2003) is a collection of 2000 abstracts from the National Library of Medicine’s medline database. The abstracts have been selected from search results for the keywords human, blood cells and transcription factors. genia is annotated with a combination of part of speech (pos) tags based on the Penn Treebank set (Marcus et al., 1994) and a set of biomedical named entities described in the genia ontology. One interesting aspect of the genia corpus is that some named entities are syntactically nested. However, most statistical ner systems are sequence taggers which cannot easily represent hierarchical tagging. 2.3 Astronomy Bootstrapping Corpus The Astronomy Bootstrapping Corpus (Becker et al., 2005; Hachey et al., 2005) is a small corpus consisting of 209 abstracts from the nasa Astronomical Data System Archive. The corpus was developed as part 1 www.cs.nyu.edu/cs/faculty/grishman/muc6.html 58 of experiments into efficient methods for developing new statistical models for ner. The abstracts were selected using the query quasar + line from articles published between 1997 and 2003. The corpus was annotated with the following named entity types: 1. instrument name (136 instances) 2. source name (111 instances) 3. source type (499 instances) 4. spectral feature (321 instances) The seed and test sets (50 and 159 abstracts) were annotated by two astronomy PhD students. The abstracts contained on average 10 sentences with an average length of 30 tokens, implying an tag density (the percentage of words tagged as a named entity) of ∼ 2%. 3 NLP for astronomy Astronomy is a broad scientific domain combining theoretical, observational and computational research, which all differ in conventions and jargon. We are interested in ner for astronomy within a larger project to improve information access for scientists. There are several comprehensive text and scientific databases for astronomy. For example, nasa Astrophysics Data System (ADS, 2005) is a bibliographic database containing over 4 million records (journal articles, books, etc) covering the areas of astronomy and astrophysics, instrumentation, physics and geophysics. ads links to various external resources such as electronic articles, data catalogues and archives. 3.1 iau naming conventions The naming of astronomical objects is specified by the International Astronomical Union’s (iau) Commission 5, so as to minimise confusing or overlapping designations in the astronomical literature. The most common format for object names is a catalogue code followed by an abbreviated position (Lortet et al., 1994). Many objects still have common or historical names (e.g. the Crab Nebula). An object that occurs in multiple catalogues will have a separate name in each catalogue (e.g. PKS 0531+21 and NGC 1952 for the Crab Nebula).
1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2 90 218 421 1909 3221 4320 5097 5869 6361 6556 7367 7732 3495 Table 1: Number of L ATEX astro-ph articles extracted for each year. 3.2 The Virtual Observatory There is a major effort in astronomy to move towards integrated databases, software and telescopes. The umbrella organisation for this is the International Virtual Observatory Alliance (Hanisch and Quinn, 2005). One of the aims is to develop a complete ontology for astronomical data which will be used for Unified Content Descriptors (Martinez et al., 2005). 4 Collecting the raw corpus The process of collecting the raw text for named entity annotation involved first obtaining astronomy text, extracting the raw text from the document formatting and splitting it into sentences and tokens. 4.1 arXiv arXiv (arXiv, 2005) is an automated distribution system for research articles, started in 1991 at the Los Alamos National Laboratory to provide physicists with access to prepublication materials. It rapidly expanded to incorporate many domains of physics and thousands of users internationally (Ginsparg, 2001). The astrophysics section (astro-ph) is used by most astronomers to distribute papers before or after publication in a recognised journal. It contains most of astrophysics publications from the last five to ten years. Table 1 shows that the number of articles submitted to astro-ph has increased rapidly since 1995. The articles are mostly typeset in L ATEX. We have downloaded 52 658 articles from astro-ph, totalling approximately 180 million words. In creating the ne corpus we limited ourselves to articles published since 2000, as earlier years had irregular L ATEX usage. 4.2 L ATEX conversion After collecting of L ATEX documents, the next step was to extract the text so that it could be processed using standard nlp tools. This normally involves ignoring most formatting and special characters in the documents. 59 However formatting and special characters play a major role in scientific documents, in the form of mathematical expressions, which are interspersed through the text. It is impossible to ignore every non-alphanumeric character or or map them back to some standard token because too much information is lost. Existing tools such as DeTeX (Trinkle, 2002) remove L ATEX markup including the mathematics, rendering scientific text nonsensical. Keeping the L ATEX markup is also problematic since the tagger’s morphological features are confused by the markup. 4.3 L ATEX to Unicode Our solution was to render as much of the L ATEX as possible in text, using Unicode (Unicode Consortium, 2005) to represent the mathematics as faithfully as possible. Unicode has excellent support for mathematical symbols and characters, including the Greek letters, operators and various accents. Mapping L ATEX back to the corresponding Unicode character is difficult. For example, \acirc, \^{a} and \hat{a} are all used to produce â, which in Unicode is 0x0174. Several systems attempt to convert L ATEX to other formats e.g. xml (Grimm, 2003). No existing system rendered the mathematics faithfully enough or with high enough coverage for our purposes. Currently our coverage of mathematics is very good but there are still some expressions that cannot be translated, e.g. complex nested expressions, rare symbols and non-Latin/Greek/Hebrew alphabetic characters. 4.4 Sentences and Tokenisation We used MXTerminator (Reynar and Ratnaparkhi, 1997) as the sentence boundary detector with an additional Python script to fix common errors, e.g. mistaken boundaries on Sect. and et al. We used the Penn Treebank (Marcus et al., 1994) sed script to tokenize the text, again with a Python script to fix common errors, e.g. splitting numbers like 1,000 on the
Page 1:
Proceeding of the Australasian Lang
Page 4 and 5:
Program Chairs: Committees Lawrence
Page 6 and 7:
Towards the Evaluation of Referring
Page 8 and 9:
The cat ate the hat that I made NP/
Page 10 and 11:
Figure 2: Cells affected by adding
Page 12 and 13:
were used because the accuracy for
Page 14 and 15: TIME ACC COVER FAIL RATE % secs % %
Page 16 and 17: model. We explore different estimat
Page 18 and 19: S.S. Source Sense Source S.S. Acc.
Page 20 and 21: acy. The difference in correctly as
Page 22 and 23: Experiments with Sentence Classific
Page 24 and 25: datasets with skewed distribution t
Page 26 and 27: words produced by the feature selec
Page 28 and 29: Sentence Class NB DT SVM Apology 1.
Page 30 and 31: Computational Semantics in the Natu
Page 32 and 33: S[sem = ] -> NP[sem=?subj] VP[sem=?
Page 34 and 35: Any string which cannot be decompos
Page 36 and 37: satisfy(Formula,model(D,F),G,pos):n
Page 38 and 39: Classifying Speech Acts using Verba
Page 40 and 41: The principles are dichotomous —
Page 42 and 43: ance. Several additional example ut
Page 44 and 45: our results. In classifying only th
Page 46 and 47: Word Relatives in Context for Word
Page 48 and 49: create a collection of training exa
Page 50 and 51: Query Sense The nave was rebuilt in
Page 52 and 53: Algorithm Avg S2LS Avg S3LS Avg S2L
Page 54 and 55: B. Snyder and M. Palmer. 2004. The
Page 56 and 57: it is even used as a black box that
Page 58 and 59: ecognition in QA, so we will not us
Page 60 and 61: BPER ILOC IPER BLOC BLOC BDATE BLOC
Page 62 and 63: of noise introduced by the addition
Page 66 and 67: Our FUSE|tel spectrum of HD|sta 738
Page 68 and 69: N Word Correct Tagged 23 OH mol non
Page 70 and 71: L ATEX typesetting information into
Page 72 and 73: in English, neither the determiner
Page 74 and 75: con 4 (Sagot et al., 2006) and Morp
Page 76 and 77: Feature type Positions/description
Page 78 and 79: Miriam Butt, Helge Dyvik, Tracy Hol
Page 80 and 81: ently mostly solved by employing hu
Page 82 and 83: Figure 1: Augmented SCT Lexicon Tok
Page 84 and 85: medical terms can be composed by ad
Page 86 and 87: References Aronson, A. R. (2001). E
Page 88 and 89: technique to most question types, i
Page 90 and 91: User Question Analyser QA System Se
Page 92 and 93: Figure 2: Precision Figure 3: Cover
Page 94 and 95: Analysis and Review. Springer-Verla
Page 96 and 97: 2 Background 2.1 Pronoun Reference
Page 98 and 99: ous accounts of the effects of cohe
Page 100 and 101: Table 1: Overall Results for Object
Page 102 and 103: References Jennifer Arnold, Janet E
Page 104 and 105: In order for appropriate documents
Page 106 and 107: y Michael West. He found that his t
Page 108 and 109: low-scoring documents are “Click
Page 110 and 111: a) b) You must complete at least 9
Page 112 and 113: uct). In this paper, we will only c
Page 114 and 115:
indicate the categories of substruc
Page 116 and 117:
Argument lowering Dependent geach
Page 118 and 119:
who [u : np] WH(np, s, s/?gq) λP.
Page 120 and 121:
algorithms, and report the performa
Page 122 and 123:
2.3 Coverage of the Human Data Out
Page 124 and 125:
and minimal subsets of the data. Of
Page 126 and 127:
output. Finally, and related to the
Page 128 and 129:
identifies, to avoid overwhelming t
Page 130 and 131:
y the student is first parsed, usin
Page 132 and 133:
a given word sequence Sorig as a pe
Page 134 and 135:
A problem arises with this scheme w
Page 136 and 137:
Example 1: Original Headline: Europ
Page 138 and 139:
|relations(sentence1) ∩ relations
Page 140 and 141:
2685 True False 1275 Bleu Dep Bleu
Page 142 and 143:
Features Acc. C1-prec. C1-recall C1
Page 144 and 145:
Given the lack of research in VSD u
Page 146 and 147:
ased features: Type 1 Original sibl
Page 148 and 149:
Preposition of the adjunct semantic
Page 150 and 151:
Algorithm 1 The algorithm of combin
Page 152 and 153:
References Timothy Baldwin and Fran
Page 154 and 155:
dependencies in German with respect
Page 156 and 157:
wij moeten dit probleem aanpakken w
Page 158 and 159:
a brevity penalty of BLEU n = 1 n =
Page 160 and 161:
plicitly matching the syntax of the
Page 162 and 163:
Probabilities improve stress-predic
Page 164 and 165:
Towards Cognitive Optimisation of a
Page 166 and 167:
Natural Language Processing and XML
Page 168 and 169:
Extracting Patient Clinical Profile
show all

Automatic Mapping Clinical Notes to Medical - RMIT University

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?