pertaining to both individual tokens and tokens in context. Features of individual tokens include capitalisation, alphanumeric composition, and so on. Contextual features are those that identify a token amongst the surrounding text, or that relate a token to the tokens around it: for example, whether a token is next to a punctuation mark or a capitalised word, or whether a token is always capitalised in a passage of text. Contextual features relating to global information have been used as described by Chieu and Ng (2002). In addition, features of previous tokens are included.
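As an illustration, the following sketch shows how such token and contextual features might be extracted. The feature names and the way the passage-level feature is computed are our assumptions, not AFNER's actual feature set.

```python
# Illustrative sketch only, not AFNER's actual feature extraction code.

def token_features(tokens, i):
    """Return a feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        # features of the individual token
        "is_capitalised": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "is_alpha": tok.isalpha(),
        "is_numeric": tok.isdigit(),
        # contextual features relating the token to its neighbours
        "prev_is_punct": i > 0 and not tokens[i - 1].isalnum(),
        "next_is_punct": i + 1 < len(tokens) and not tokens[i + 1].isalnum(),
        "prev_is_capitalised": i > 0 and tokens[i - 1][:1].isupper(),
        # a global feature in the spirit of Chieu and Ng (2002):
        # is the token capitalised everywhere it occurs in the passage?
        "always_capitalised": all(
            t[:1].isupper() for t in tokens if t.lower() == tok.lower()
        ),
    }

tokens = "Dr Smith visited Sydney in 2005 .".split()
print(token_features(tokens, 1))  # features for "Smith"
```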
The features are then passed to a maximum entropy classifier which, for every token, returns the probability of the token belonging to each category. The categories correspond to each entity type prepended with 'B' and 'I', plus a general 'OUT' category for tokens that are not in any entity. The list of entity types used is the same as in the MUC tasks (see Table 2).
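The label set and classification step might look as follows. This sketch assumes the standard MUC entity types (the actual list is in Table 2) and uses scikit-learn's LogisticRegression as a stand-in maximum entropy classifier over toy training data; it is not AFNER's actual code.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# assumed MUC-style entity types (cf. Table 2)
ENTITY_TYPES = ["PERSON", "ORGANIZATION", "LOCATION",
                "DATE", "TIME", "MONEY", "PERCENT"]
# 'B'/'I' prefixed categories plus 'OUT': 15 categories in total
LABELS = [p + "-" + t for t in ENTITY_TYPES for p in ("B", "I")] + ["OUT"]

# toy training data: (feature dict, category) pairs; in practice every
# category in LABELS would be attested in the training corpus
train = [
    ({"is_capitalised": True, "prev_is_capitalised": False}, "B-PERSON"),
    ({"is_capitalised": True, "prev_is_capitalised": True}, "I-PERSON"),
    ({"is_capitalised": False, "prev_is_capitalised": False}, "OUT"),
]
vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X, [lab for _, lab in train])

# for every token the classifier returns one probability per category
probs = clf.predict_proba(vec.transform([{"is_capitalised": True}]))[0]
for label, p in zip(clf.classes_, probs):
    print(label, round(p, 3))
```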
Preliminary experiments revealed that the top two or three entity type probabilities often have similar values. For this reason the final named entity labels are computed on the basis of the top n probabilities (provided that they meet a defined threshold), where n is a customisable limit. Currently, a maximum of 3 candidate types is allowed per token.
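A minimal sketch of this top-n selection follows; the threshold value and the function name are assumptions for illustration.

```python
def candidate_labels(probs, n=3, threshold=0.1):
    """Keep at most n categories per token, each meeting the threshold.

    probs: dict mapping category label -> probability for one token.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, p) for label, p in ranked[:n] if p >= threshold]

# the top two or three probabilities often have similar values:
print(candidate_labels({"B-PERSON": 0.31, "B-ORGANIZATION": 0.29,
                        "OUT": 0.25, "B-LOCATION": 0.15}))
# -> [('B-PERSON', 0.31), ('B-ORGANIZATION', 0.29), ('OUT', 0.25)]
```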
Classified tokens are then combined according to their classification to produce the final list of named entities. We have experimented with two methods, named single and multiple. In single type combination only one entity can be associated with a string, whereas in multiple type combination several entities can be associated with it. Multiple type combination also allows entities to overlap. It aims at increasing recall, at the expense of ambiguous labelling and a decrease in precision.
In the case of multiple type combination (see Figure 1 for an example), each label prepended with 'B' signals the beginning of a named entity of the relevant type, and each 'I' label continues a named entity if it is preceded by a 'B' or 'I' label of the same type. If an 'I' label does not appear after a 'B' classification, it is treated as a 'B' label. In addition, if a 'B' label is preceded by an 'I' label, the token is both added as a separate entity (with the previous entity ending) and appended to the previous entity.
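The following sketch implements these combination rules for a single sequence of candidate labels; AFNER carries up to three candidate labels per token, and iterating over all candidate sequences is omitted here for brevity.

```python
def decode_multiple(labels):
    """Return (start, end, type) spans; end is exclusive."""
    entities = []
    open_span = None  # (start, type) of the entity currently being built
    for i, lab in enumerate(labels):
        if lab == "OUT":
            if open_span:
                entities.append((open_span[0], i, open_span[1]))
                open_span = None
            continue
        prefix, etype = lab.split("-", 1)
        if prefix == "I" and open_span and open_span[1] == etype:
            continue  # 'I' continues an entity begun by 'B' or 'I' of the same type
        if prefix == "I":
            prefix = "B"  # an 'I' with no preceding 'B'/'I' of its type acts as 'B'
        if open_span and open_span[1] == etype:
            # a 'B' preceded by an open entity of the same type: the token is
            # appended to the previous entity and also starts a new one, so
            # the resulting spans overlap
            entities.append((open_span[0], i + 1, etype))
        elif open_span:
            entities.append((open_span[0], i, open_span[1]))
        open_span = (i, etype)
    if open_span:
        entities.append((open_span[0], len(labels), open_span[1]))
    return entities

print(decode_multiple(["B-PERSON", "I-PERSON", "B-PERSON"]))
# -> [(0, 3, 'PERSON'), (2, 3, 'PERSON')]  (overlapping spans)
```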
The single type combination (Figure 2) is implemented by filtering out all the overlapping entities from the output of the multiple type combination. This is done by selecting the longest-spanning entity and discarding all substrings or overlapping strings. If two entities are associated with exactly the same string, the one with the higher probability is chosen.
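A sketch of this filtering step is given below, with entity probabilities computed as the geometric mean of token probabilities described in the next paragraph; the data shapes are illustrative assumptions.

```python
import math

def entity_probability(token_probs):
    """Geometric mean of the individual (non-zero) token probabilities."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def single_combination(entities):
    """entities: list of (start, end, type, [token probabilities])."""
    scored = [(s, e, t, entity_probability(ps)) for s, e, t, ps in entities]
    # prefer longer spans; among entities over the same string, prefer the
    # one with the higher probability
    scored.sort(key=lambda x: (x[1] - x[0], x[3]), reverse=True)
    kept = []
    for s, e, t, p in scored:
        # discard entities that are substrings of, or overlap, a kept entity
        if all(e <= ks or s >= ke for ks, ke, _, _ in kept):
            kept.append((s, e, t, p))
    return kept

print(single_combination([(0, 3, "PERSON", [0.9, 0.8, 0.4]),
                          (0, 3, "ORGANIZATION", [0.6, 0.6, 0.6]),
                          (2, 3, "LOCATION", [0.7])]))
# -> keeps only the PERSON span: it is longest, and has a higher
#    probability than the ORGANIZATION reading of the same string
```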
The probability of a multi-token entity is computed by combining the individual token probabilities. Currently we use the geometric mean, but we are exploring other possibilities. If $P_i$ is the probability of token $i$ and $P_{1\dots n}$ is the probability of the entire sequence of tokens, the geometric mean of the probabilities is computed as:
$$P_{1\dots n} = e^{\frac{\sum_{i=1}^{n} \log P_i}{n}}$$

5 Results
To evaluate the impact of the quality of NER within the context of question answering, we ran the AnswerFinder system using each of the named entity recognisers: ANNIE, AFNERs and AFNERm. This section first explains the experimental setup we used, then shows and discusses the results.
5.1 Experimental setup
To evaluate AnswerFinder we used the data available to participants of the QA track of the 2005 TREC conference 5 . This competition provides a convenient setting in which to measure the impact of the NERs: we simply use the documents and questions provided during the TREC 2005 competition. To determine whether a document or text fragment contains the answer we use Ken Litkowski's answer patterns, which are also available at the TREC website.
The questions in TREC 2005 are grouped by topic. The competition consisted of 75 topics, with a total of 530 questions. These questions are divided into three types: factoid, list, and other. In this paper we only consider the factoid questions, that is, questions that require a single fact as an answer. A list question asks for a list of answers, and an other question is answered by giving any additional information about the topic. There are 362 factoid questions in the question set.
In the experiments, AnswerFinder uses the TREC data as follows. First, we apply docu-

5 http://trec.nist.gov