Automatic Mapping Clinical Notes to Medical - RMIT University

More documents

Recommendations

Info

$journal of latex class files, vol. 6, no. 1, january ... - RMIT University$

words produced by the feature selection methods. We conclude that unlike text classification tasks, where each item to be classified is rich with textual information, sentence classification involves small textual units that contain valuable cues that are often lost when techniques such as lemmatization and stop-word removal are employed. 3.2 Experiments with feature selection Since there can be thousands or even tens of thousands of distinct words in the entire email corpus, the feature space can be very large, as we have seen in our baseline experiments (Table 2). This means that the computational load on a classification algorithm can be very high. Thus feature selection (FS) is desirable for reducing this load. However, it has been demonstrated in text classification tasks that FS can in fact improve classification performance as well (Yang and Pedersen, 1997). We investigate four FS methods that have been shown to be competitive in text classification (Yang and Pedersen, 1997; Forman, 2003; Gabrilovich and Markovitch, 2004), but have not been investigated in sentence classification. 3.2.1 Feature selection algorithms Chi-squared (χ 2 ). Measures the lack of statistical independence between a feature and a class (Seki and Mostafa, 2005). If the independence is high, then the feature is considered not predictive for the class. For each word, χ 2 is computed for each class, and the maximum score is taken as the χ 2 statistic for that word. Information Gain (IG). Measures the entropy when the feature is present versus the entropy when the feature is absent (Forman, 2003). It is quite similar to χ 2 in a sense that it considers the usefulness of a feature not only from its presence, but also from its absence in each class. Bi-Normal Separation (BNS). This is a relatively new FS method (Forman, 2003). It measures the separation along a Standard Normal Distribution of two thresholds that specify the prevalence rate of the feature in the positive class versus the negative class. It has been shown to be as competitive as χ 2 and IG (Forman, 2003; Gabrilovich and Markovitch, 2004), and superior when there is a large class skew, as there is in our corpus. 20 Sentence Frequency (SF). This is a baseline FS method, which simply removes features that are infrequent. The sentence frequency of a word is the number of sentences in which the word appears. Thus this method is much cheaper computationally than the others, but has been shown to be as competitive when at least 10% of the words are kept (Yang and Pedersen, 1997). 3.2.2 Results. We evaluate the various FS methods by inspecting the performance of the classifiers when trained with increasing number of features, where we retain the top features as determined by each FS method. Figure 1(a) shows the results obtained with the χ 2 method, reported using the macro F1 average, where the error bars correspond to the 95% confidence intervals of these averages. We can see from the figure that SVM and DT are far less sensitive to feature selection than NB. As conjectured in the previous sub-section, NB does not deal well with many features, and indeed we can see here that it performs poorly and inconsistently when many of the features are retained. As we filter out more features, its performance starts to improve and become more consistent. In contrast, the SVM seems to prefer more features: its performance degrades slightly if less than 300 features are retained (although it still outperforms the other classifiers), and levels out when at least 300 features are used. As well as having an overall better performance than the other two classifiers, it also has the smallest variability, indicating a more consistent and robust behaviour. SVMs have been shown in text classification to be more robust to many features (Joachims, 1998). When comparing the FS methods against each other, it seems their performance is not significantly distinguishable. Figure 1(b) shows the performance of the four methods for the NB classifier. We see that when at least 300 features are retained, the performances of the FS methods are indistinguishable, with IG and χ 2 slightly superior. When less than 300 features are retained, the performance of the SF method deteriorates compared to the others. This means that if we only want very few features to be retained, a frequency-based method is not advisable. This is due to the fact that we have small classes in our corpus, whose cue words are therefore infrequent, and therefore we need to select features more carefully. However, if we can afford to use many features, then this
macro F1 average 0.9 0.8 0.7 0.6 0.5 NB DT SVM 0.4 100 300 500 700 900 1100 num features retained 1300 1500 macro F1 average 0.9 0.8 0.7 0.6 0.5 0.4 CHI 2 IG BNS SF (a) (b) 100 300 500 700 900 1100 1300 1500 num features retained Figure 1: Results from feature selection: (a) the effect of the χ 2 method on different classifiers, (b) the effect of different feature selection methods on the Naive Bayes classifier. simple method is adequate. We have observed the pattern seen in Figure 1(b) also with the other classifiers, and with the micro F1 average (Anthony, 2006). Our observations are in line with those from text classification experiments: the four FS methods perform similarly, except when only a small proportion is retained, when the simple frequencybased method performs worse. However, we expected the BNS method to outperform the others given that we are dealing with classes with a high distributional skew. We offer two explanations for this result. First, the size of our corpus is smaller than the one used in the text classification experiments involving skewed datasets (Forman, 2003) (these experiments use established benchmark datasets consisting of large sets of labelled text documents, but there are no such datasets with labelled sentences). Our smaller corpus therefore results in a substantially fewer number of features (1622 using the our “best” representation in Table 2 compared with approximately 5000 in the text classification experiments). Thus, it is possible that the effect of BNS can only be observed when a more substantial number of features is presented to the selection algorithm. The second explanation is that sentences are less textually rich than documents, with fewer irrelevant and noisy features. They might not rely on feature selection to the extent that text classification tasks do. Indeed, our results show that as long as we retain a small proportion of the features, a simple FS method suffices. Therefore, the effect of BNS cannot be observed. 21 3.3 Class-by-class analysis So far we have presented average performances of the classifiers and FS methods. It is also interesting to look at the performance individually for each class. Table 3 shows how well each class was predicted by each classifier, using the macro F1 average and standard deviation in brackets. The standard deviation was calculated over 10 crossvalidation folds. These results are obtained with the “best” representation in Table 2, and with the χ 2 feature selection method retaining the top 300 features. We can see that a few classes have F1 of above 0.9, indicating that they were highly predictable. Some of these classes have obvious cue words to distinguish them from other classes. For instance, “inconvenience”, “sorry”, “apologize” and “apology” to discriminate APOLOGY class, “?” to discriminate QUESTION, “please” to discriminate REQUEST and “thank” to discriminate THANKING. It is more interesting to look at the less predictable classes, such as INSTRUCTION, INSTRUCTION-ITEM, SUGGESTION and SPECIFI- CATION. They are also the sentence classes that are considered more useful to know than some others, like THANKING, SALUTATION and so on. For instance, by knowing which sentences are instructions in the emails, they can be extracted into a to-do list of the email recipient. We have inspected the classification confusion matrix to better understand the less predictable classes. We saw that INSTRUCTION, INSTRUCTION-ITEM,
Page 1: Proceeding of the Australasian Lang
Page 4 and 5: Program Chairs: Committees Lawrence
Page 6 and 7: Towards the Evaluation of Referring
Page 8 and 9: The cat ate the hat that I made NP/
Page 10 and 11: Figure 2: Cells affected by adding
Page 12 and 13: were used because the accuracy for
Page 14 and 15: TIME ACC COVER FAIL RATE % secs % %
Page 16 and 17: model. We explore different estimat
Page 18 and 19: S.S. Source Sense Source S.S. Acc.
Page 20 and 21: acy. The difference in correctly as
Page 22 and 23: Experiments with Sentence Classific
Page 24 and 25: datasets with skewed distribution t
Page 28 and 29: Sentence Class NB DT SVM Apology 1.
Page 30 and 31: Computational Semantics in the Natu
Page 32 and 33: S[sem = ] -> NP[sem=?subj] VP[sem=?
Page 34 and 35: Any string which cannot be decompos
Page 36 and 37: satisfy(Formula,model(D,F),G,pos):n
Page 38 and 39: Classifying Speech Acts using Verba
Page 40 and 41: The principles are dichotomous —
Page 42 and 43: ance. Several additional example ut
Page 44 and 45: our results. In classifying only th
Page 46 and 47: Word Relatives in Context for Word
Page 48 and 49: create a collection of training exa
Page 50 and 51: Query Sense The nave was rebuilt in
Page 52 and 53: Algorithm Avg S2LS Avg S3LS Avg S2L
Page 54 and 55: B. Snyder and M. Palmer. 2004. The
Page 56 and 57: it is even used as a black box that
Page 58 and 59: ecognition in QA, so we will not us
Page 60 and 61: BPER ILOC IPER BLOC BLOC BDATE BLOC
Page 62 and 63: of noise introduced by the addition
Page 64 and 65: 2 Existing annotated corpora Much o
Page 66 and 67: Our FUSE|tel spectrum of HD|sta 738
Page 68 and 69: N Word Correct Tagged 23 OH mol non
Page 70 and 71: L ATEX typesetting information into
Page 72 and 73: in English, neither the determiner
Page 74 and 75: con 4 (Sagot et al., 2006) and Morp
Page 76 and 77:
Feature type Positions/description
Page 78 and 79:
Miriam Butt, Helge Dyvik, Tracy Hol
Page 80 and 81:
ently mostly solved by employing hu
Page 82 and 83:
Figure 1: Augmented SCT Lexicon Tok
Page 84 and 85:
medical terms can be composed by ad
Page 86 and 87:
References Aronson, A. R. (2001). E
Page 88 and 89:
technique to most question types, i
Page 90 and 91:
User Question Analyser QA System Se
Page 92 and 93:
Figure 2: Precision Figure 3: Cover
Page 94 and 95:
Analysis and Review. Springer-Verla
Page 96 and 97:
2 Background 2.1 Pronoun Reference
Page 98 and 99:
ous accounts of the effects of cohe
Page 100 and 101:
Table 1: Overall Results for Object
Page 102 and 103:
References Jennifer Arnold, Janet E
Page 104 and 105:
In order for appropriate documents
Page 106 and 107:
y Michael West. He found that his t
Page 108 and 109:
low-scoring documents are “Click
Page 110 and 111:
a) b) You must complete at least 9
Page 112 and 113:
uct). In this paper, we will only c
Page 114 and 115:
indicate the categories of substruc
Page 116 and 117:
Argument lowering Dependent geach
Page 118 and 119:
who [u : np] WH(np, s, s/?gq) λP.
Page 120 and 121:
algorithms, and report the performa
Page 122 and 123:
2.3 Coverage of the Human Data Out
Page 124 and 125:
and minimal subsets of the data. Of
Page 126 and 127:
output. Finally, and related to the
Page 128 and 129:
identifies, to avoid overwhelming t
Page 130 and 131:
y the student is first parsed, usin
Page 132 and 133:
a given word sequence Sorig as a pe
Page 134 and 135:
A problem arises with this scheme w
Page 136 and 137:
Example 1: Original Headline: Europ
Page 138 and 139:
|relations(sentence1) ∩ relations
Page 140 and 141:
2685 True False 1275 Bleu Dep Bleu
Page 142 and 143:
Features Acc. C1-prec. C1-recall C1
Page 144 and 145:
Given the lack of research in VSD u
Page 146 and 147:
ased features: Type 1 Original sibl
Page 148 and 149:
Preposition of the adjunct semantic
Page 150 and 151:
Algorithm 1 The algorithm of combin
Page 152 and 153:
References Timothy Baldwin and Fran
Page 154 and 155:
dependencies in German with respect
Page 156 and 157:
wij moeten dit probleem aanpakken w
Page 158 and 159:
a brevity penalty of BLEU n = 1 n =
Page 160 and 161:
plicitly matching the syntax of the
Page 162 and 163:
Probabilities improve stress-predic
Page 164 and 165:
Towards Cognitive Optimisation of a
Page 166 and 167:
Natural Language Processing and XML
Page 168 and 169:
Extracting Patient Clinical Profile
show all

Automatic Mapping Clinical Notes to Medical - RMIT University

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?