Automatic Mapping Clinical Notes to Medical - RMIT University

More documents

Recommendations

Info

$journal of latex class files, vol. 6, no. 1, january ... - RMIT University$

Our FUSE|tel spectrum of HD|sta 73882|sta is derived from time-tagged observations over the course of 8 orbits on 1999|dat Oct|dat 30|dat . Several “ burst ” events occurred during the observation ( Sahnow|per et al. 2000|dat ) . We excluded all photon|part events that occurred during the bursts , reducing effective on-target integration time from 16.8|dur ksec|dur to 16.1|dur ksec|dur . Strong interstellar extinction and lack of co-alignment of the SiC channels with the LiF channels prevented the collection of useful data shortward of 1010|wav ˚A|wav . Figure 1: An extract from our final corpus, originally from astro-ph/0005090. comma, and reattaching the L ATEX which the tokenizer split off incorrectly. 4.5 The Corpus The articles for the corpus were selected randomly from the downloaded L ATEX documents and annotated by the second author. The annotation was performed using a custom Emacs mode which provided syntax highlighting for named entity types and mapped the keys to specific named entities to make the annotation as fast as possible. The average annotation speed was 165 tokens per minute. An extract from our corpus is shown in Figure 1. There are a total of 7840 sentences in our corpus, with an average of 26.1 tokens per sentence. 5 Named Entity Categories Examples of the categories we used are listed in Table 2. We restricted ourselves to high level categories such as star and galaxy rather than detailed ones typically used by astronomers to classify objects such as red giant and elliptical galaxy. 5.1 Areas of Ambiguity There were some entities that did not clearly fit into one specific category or are used in a way that is ambiguous. This section outlines some of these cases. Temperature and Energy Due to the high temperatures in X-ray astronomy, temperatures are conventionally referred to in units of energy (eV), for example: its 1 MeV temperature, the emission from... Our annotation stays consistent to the units, so these cases are tagged as energies (egy). Angular distance Astronomers commonly refer to angular distances on the sky (in units of arc) because it is not possible to know 60 the true distance between two objects without knowing their redshift. We annotate these according to the units, i.e. angles, although they are often used in place of distances. Spectral lines and ions Absorption or emission of radiation by ions results in spectral line features in measured spectra. Common transitions have specific names (e.g. Hα) whereas others are referred to by the ion name (e.g. Si IV), introducing ambiguity. 5.2 Comparison with genia and muc The corpus has a named entity density of 5.4% of tokens. This is significantly higher than the density of the Astronomy Bootstrapping Corpus. The most frequency named entities types are: per (1477 tags), dat (1053 tags), tel (867 tags), gal (551 tags), and wav (451 tags). The token 10 has the highest degree of ambiguity since it was tagged with every unit related tag: egy, mass, etc. and also as obj. By comparison the genia corpus has a much higher density of 33.8% tokens on a sample the same size as our corpus. The highest frequency named entity types are: other (16171 tags), protein (13197 tags), dna domain (6848 tags) and protein family (6711 tags). The density of tags in muc is 11.8%, higher than our corpus but much lower than genia. The highest frequency named entities are organisation (6373 tags), followed by location (3828 tags) and date (3672 tags). Table 3 gives a statistical comparison of the three corpora. This data suggests that the astronomy data will be harder to automatically tag than muc 7 because the density is lower and there are many more classes. However, if there were more classes or finer grained distinctions in muc this would not necessarily be true. It also demonstrates how different biological text is to other scientific domains.
Class Definition Examples Comments gxy galaxy NGC 4625; Milky Way; Galaxy inc. black holes neb nebula Crab Nebula; Trapezium sta star Mira A; PSR 0329+54; Sun inc. pulsars stac star cluster M22; Palomar 13 supa supernova SN1987A; SN1998bw pnt planet Earth; Mars ; HD 11768 b; tau Boo inc. extra-solar planets frq frequency 10 Hz; 1.4 GHz dur duration 13 seconds; a few years inc. ages lum luminosity 10 46 ergs −1 ; 10 10 L⊙ inc. flux pos position 17:45.6; -18:35:31; 17 h 12 m 13 s tel telescope ATCA; Chandra X-ray observatory inc. satellites ion ion Si IV; HCO + inc. molecular ions sur survey SUMSS; 2 Micron All-Sky Survey dat date 2003; June 17; 31st of August inc. epochs (e.g. 2002.7) Corpus astro genia muc # cats 43 36 8 # entities 10 744 40 548 11 568 # tagged 16,016 69 057 19 056 # avg len 1.49 1.70 1.64 tag density 5.4% 33.8% 11.8% Table 3: Comparison with genia and muc. Condition Contextual predicate f(wi) < 5 X is prefix of wi, |X| ≤ 4 X is suffix of wi, |X| ≤ 4 wi contains a digit wi contains uppercase character wi contains a hyphen ∀wi ∀wi ∀wi wi = X wi−1 = X, wi−2 = X wi+1 = X, wi+2 = X posi = X posi−1 = X, posi−2 = X posi+1 = X, posi+2 = X nei−1 = X nei−2nei−1 = XY Table 4: Baseline contextual predicates 6 Maximum Entropy Tagger The purpose of creating this annotated corpus is to develop a named entity tagger for astronomy literature. In these experiments we adapt the C&C ne tagger (Curran and Clark, 2003) to astronomy literature by investigating which feature types improve the performance of the tagger. However, as we shall see below, the tagger can also be used to test and improve the quality of the annotation. It can also be used to speed up the annotation process by pre-annotating sentences with their most likely tag. We were also interested to see whether > 40 named-entity categories could be distinguished successfully with this quantity of data. Table 2: Example entity categories. 61 Condition Contextual predicate f(wi) < 5 wi contains period/punctuation wi is only digits wi is a number wi is {upper,lower,title,mixed} case wi is alphanumeric length of wi wi has only Roman numerals wi is an initial (x.) wi is an acronym (abc, a.b.c.) ∀wi memory ne tag for wi ∀wi ∀wi ∀wi unigram tag of wi+1, wi+2 wi, wi−1 or wi+1 in a gazetteer wi not lowercase and flc > fuc uni-, bi- and tri-grams of word type Table 5: Contextual predicates in final system The C&C ne tagger feature types are shown in Tables 4 and 5. The feature types in Table 4 are the same as used in MXPost (Ratnaparkhi, 1996) with the addition of the ne tag history features. We call this the baseline system. Note, this is not the baseline of the ne tagging task, only the baseline performance for a Maximum Entropy approach. Table 5 includes extra feature types that were tested by Curran and Clark (2003). The wi is only digits predicates apply to words consisting of all digits. Title-case applies to words with an initial uppercase letter followed by lowercase (e.g. Mr). Mixed-case applies to words with mixed lower- and uppercase (e.g. CitiBank). The length features encode the length of the word from 1 to 15 characters, with a single bin for lengths greater than 15. The next set of contextual predicates encode extra information about ne tags in the current context. The memory ne tag predicate records the ne tag that was most recently assigned to the current word. This memory is reset at
Page 1:
Proceeding of the Australasian Lang
Page 4 and 5:
Program Chairs: Committees Lawrence
Page 6 and 7:
Towards the Evaluation of Referring
Page 8 and 9:
The cat ate the hat that I made NP/
Page 10 and 11:
Figure 2: Cells affected by adding
Page 12 and 13:
were used because the accuracy for
Page 14 and 15:
TIME ACC COVER FAIL RATE % secs % %
Page 16 and 17: model. We explore different estimat
Page 18 and 19: S.S. Source Sense Source S.S. Acc.
Page 20 and 21: acy. The difference in correctly as
Page 22 and 23: Experiments with Sentence Classific
Page 24 and 25: datasets with skewed distribution t
Page 26 and 27: words produced by the feature selec
Page 28 and 29: Sentence Class NB DT SVM Apology 1.
Page 30 and 31: Computational Semantics in the Natu
Page 32 and 33: S[sem = ] -> NP[sem=?subj] VP[sem=?
Page 34 and 35: Any string which cannot be decompos
Page 36 and 37: satisfy(Formula,model(D,F),G,pos):n
Page 38 and 39: Classifying Speech Acts using Verba
Page 40 and 41: The principles are dichotomous —
Page 42 and 43: ance. Several additional example ut
Page 44 and 45: our results. In classifying only th
Page 46 and 47: Word Relatives in Context for Word
Page 48 and 49: create a collection of training exa
Page 50 and 51: Query Sense The nave was rebuilt in
Page 52 and 53: Algorithm Avg S2LS Avg S3LS Avg S2L
Page 54 and 55: B. Snyder and M. Palmer. 2004. The
Page 56 and 57: it is even used as a black box that
Page 58 and 59: ecognition in QA, so we will not us
Page 60 and 61: BPER ILOC IPER BLOC BLOC BDATE BLOC
Page 62 and 63: of noise introduced by the addition
Page 64 and 65: 2 Existing annotated corpora Much o
Page 68 and 69: N Word Correct Tagged 23 OH mol non
Page 70 and 71: L ATEX typesetting information into
Page 72 and 73: in English, neither the determiner
Page 74 and 75: con 4 (Sagot et al., 2006) and Morp
Page 76 and 77: Feature type Positions/description
Page 78 and 79: Miriam Butt, Helge Dyvik, Tracy Hol
Page 80 and 81: ently mostly solved by employing hu
Page 82 and 83: Figure 1: Augmented SCT Lexicon Tok
Page 84 and 85: medical terms can be composed by ad
Page 86 and 87: References Aronson, A. R. (2001). E
Page 88 and 89: technique to most question types, i
Page 90 and 91: User Question Analyser QA System Se
Page 92 and 93: Figure 2: Precision Figure 3: Cover
Page 94 and 95: Analysis and Review. Springer-Verla
Page 96 and 97: 2 Background 2.1 Pronoun Reference
Page 98 and 99: ous accounts of the effects of cohe
Page 100 and 101: Table 1: Overall Results for Object
Page 102 and 103: References Jennifer Arnold, Janet E
Page 104 and 105: In order for appropriate documents
Page 106 and 107: y Michael West. He found that his t
Page 108 and 109: low-scoring documents are “Click
Page 110 and 111: a) b) You must complete at least 9
Page 112 and 113: uct). In this paper, we will only c
Page 114 and 115: indicate the categories of substruc
Page 116 and 117:
Argument lowering Dependent geach
Page 118 and 119:
who [u : np] WH(np, s, s/?gq) λP.
Page 120 and 121:
algorithms, and report the performa
Page 122 and 123:
2.3 Coverage of the Human Data Out
Page 124 and 125:
and minimal subsets of the data. Of
Page 126 and 127:
output. Finally, and related to the
Page 128 and 129:
identifies, to avoid overwhelming t
Page 130 and 131:
y the student is first parsed, usin
Page 132 and 133:
a given word sequence Sorig as a pe
Page 134 and 135:
A problem arises with this scheme w
Page 136 and 137:
Example 1: Original Headline: Europ
Page 138 and 139:
|relations(sentence1) ∩ relations
Page 140 and 141:
2685 True False 1275 Bleu Dep Bleu
Page 142 and 143:
Features Acc. C1-prec. C1-recall C1
Page 144 and 145:
Given the lack of research in VSD u
Page 146 and 147:
ased features: Type 1 Original sibl
Page 148 and 149:
Preposition of the adjunct semantic
Page 150 and 151:
Algorithm 1 The algorithm of combin
Page 152 and 153:
References Timothy Baldwin and Fran
Page 154 and 155:
dependencies in German with respect
Page 156 and 157:
wij moeten dit probleem aanpakken w
Page 158 and 159:
a brevity penalty of BLEU n = 1 n =
Page 160 and 161:
plicitly matching the syntax of the
Page 162 and 163:
Probabilities improve stress-predic
Page 164 and 165:
Towards Cognitive Optimisation of a
Page 166 and 167:
Natural Language Processing and XML
Page 168 and 169:
Extracting Patient Clinical Profile
show all

Automatic Mapping Clinical Notes to Medical - RMIT University

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?