
Representation   Num Features   Measure        NB      DT      SVM
Basic            2710           micro-F1 ave   0.481   0.791   0.853
                                macro-F1 ave   0.246   0.693   0.790
Best             1622           micro-F1 ave   0.666   0.829   0.883
                                macro-F1 ave   0.435   0.803   0.866

Table 2: Classification performance using different representations.

“9834”, are grouped to represent a single feature. Grouping was also applied to email addresses, phone numbers, URLs, and serial numbers. As opposed to the other techniques mentioned above, grouping is a domain-specific preprocessing step.
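To make the grouping step concrete, the sketch below shows one possible way to implement it with regular expressions; the placeholder token names and patterns are our own illustrative assumptions, not the exact rules used in this work.

```python
import re

# Illustrative grouping rules: each regex collapses a family of surface forms
# (URLs, emails, phone numbers, serial numbers, plain numbers) into a single
# placeholder token, so they act as one feature in the bag-of-words model.
# The patterns and placeholder names are assumptions made for this example.
GROUPING_RULES = [
    (re.compile(r"\bhttps?://\S+|\bwww\.\S+", re.I), "_URL_"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "_EMAIL_"),
    (re.compile(r"\b\+?\d[\d\s()-]{6,}\d\b"), "_PHONE_"),
    (re.compile(r"\b(?=[A-Za-z0-9-]*\d)(?=[A-Za-z0-9-]*[A-Za-z])[A-Za-z0-9-]{6,}\b"), "_SERIAL_"),
    (re.compile(r"\b\d+\b"), "_NUM_"),
]

def group_tokens(text: str) -> str:
    """Replace domain-specific surface forms with placeholder tokens."""
    for pattern, placeholder in GROUPING_RULES:
        text = pattern.sub(placeholder, text)
    return text

print(group_tokens("Serial no. AB-1234-XZ, quote 9834 or mail help@example.com"))
# -> "Serial no. _SERIAL_, quote _NUM_ or mail _EMAIL_"
```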

3.1.2 Results.

We begin our investigation by looking at the most basic representation, involving a binary BoW without any further processing. The first row in Table 2 shows the results obtained with this basic setup. The second column shows the number of features resulting from this representation, the third column shows the performance measure, and the last three columns show the results for the three classifiers. We see that SVM outperforms the other classifiers on both measures. We also inspected the standard deviations of the macro averages and observed that SVM is the most consistent (0.037 compared to 0.084 and 0.117 for DT and NB, respectively); that is, its performance varies least across the different classes. These results are in line with observations reported in the literature on the superiority of SVMs in classification tasks involving text, where the dimensionality is high. We will return to this issue in the next sub-section when we discuss feature selection.
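As a rough illustration of this basic setup, the sketch below builds a binary bag-of-words representation and cross-validates the three classifiers with scikit-learn; the toy sentences, class labels, and default hyperparameters are assumptions for illustration and do not reproduce the experimental configuration behind Table 2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy stand-in data: one sentence per instance with a hypothetical class label.
sentences = [
    "install the driver",
    "please install the new driver",
    "have you installed the driver?",
    "what version do you run?",
]
labels = ["INSTRUCTION", "INSTRUCTION", "QUESTION", "QUESTION"]

classifiers = {"NB": BernoulliNB(), "DT": DecisionTreeClassifier(), "SVM": LinearSVC()}

for name, clf in classifiers.items():
    # binary=True gives the presence/absence bag-of-words of the basic representation
    pipeline = make_pipeline(CountVectorizer(binary=True), clf)
    micro = cross_val_score(pipeline, sentences, labels, cv=2, scoring="f1_micro")
    macro = cross_val_score(pipeline, sentences, labels, cv=2, scoring="f1_macro")
    print(f"{name}: micro-F1 {micro.mean():.3f}  macro-F1 {macro.mean():.3f}")
```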

We can also see from Table 2 that the micro F1 average consistently reports a better performance than the macro F1 average. This is expected due to the existence of very small classes: their performance tends to be poorer, but their influence on the micro average is proportional to their size, as opposed to the macro average, which weights all classes equally.
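The following toy computation (with made-up labels, not our data) illustrates the effect: a rare class that is always misclassified barely moves the micro average, but drags the macro average down by a full class share.

```python
from sklearn.metrics import f1_score

# Two large classes predicted well, one rare class always missed.
y_true = ["A"] * 10 + ["B"] * 10 + ["RARE"] * 2
y_pred = ["A"] * 10 + ["B"] * 10 + ["B", "A"]

print(f1_score(y_true, y_pred, average="micro"))  # ~0.91: dominated by A and B
print(f1_score(y_true, y_pred, average="macro"))  # ~0.64: RARE counts as much as A or B
```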

We have experimented with different combinations of the various representation techniques mentioned above (Anthony, 2006). The best one turned out to be one that uses tokenization and grouping, and its results are shown in the second row of Table 2. We can see that it results in a significant reduction in the number of features (approximately 40%). Further, it provides a consistent improvement in all performance measures for all classifiers, with the exception of NB, for which the standard deviation is slightly increased. We see that the more substantial improvements are in the macro F1 average, which suggests that the smaller classes particularly benefit from this representation. For example, serial numbers occur often in the SPECIFICATION class. When grouping is not used, serial numbers appear in many different variants, each distinct from the others; grouping makes them appear as a single, more predictive feature. To test this further, the SPECIFICATION class was examined with and without grouping. Its classification performance improved from 0.64 (no grouping) to 0.8 (with grouping) with SVM as the classifier. An example of the effect of tokenization can be observed for the QUESTION class, which improved largely because the question mark symbol ‘?’ is detected as a feature after the tokenization process.
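A minimal sketch of this tokenization effect, assuming a simple regex tokenizer (not necessarily the one used here): splitting punctuation into separate tokens lets ‘?’ surface as its own bag-of-words feature.

```python
import re

def tokenize(sentence: str):
    # Keep word-like tokens and individual punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

print(tokenize("Have you installed the driver?"))
# ['have', 'you', 'installed', 'the', 'driver', '?']
```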

Notice that there is a similar increase in performance for NB under both the micro and the macro average. That is, NB shows a more general preference for the second representation, and we conjecture that this is because it does not deal well with a large number of features, owing to the strong independence assumption it makes about them.

The surprising result from our investigations is that two of the most common preprocessing techniques, stop-word removal and lemmatization, proved to be harmful to performance. Lemmatization can harm classification when certain classes rely on the raw form of certain words. For example, the INSTRUCTION class often has verbs in imperative form, for example, “install the driver”, but these same verbs can appear in a different form in other classes, for example the SUGGESTION sentence “I would try installing the driver”, or the QUESTION sentence “Have you installed the driver?”.
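As an illustration of how lemmatization erases this cue, the snippet below uses NLTK's WordNet lemmatizer as a stand-in (our choice for the example, not necessarily the lemmatizer used in this work): the three inflected forms collapse to the same lemma.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # required lexical resource for the lemmatizer
lemmatizer = WordNetLemmatizer()

# The inflections that separate INSTRUCTION, SUGGESTION, and QUESTION all
# map to the same lemma, so the distinguishing feature disappears.
for form in ["install", "installing", "installed"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
# install -> install
# installing -> install
# installed -> install
```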

Stop-words can also carry crucial information about the structure of the sentence, for example, “what”, “how”, and “please”. In fact, the words in our stop-list often appeared in the top list of
