
them to Aim, Method, Results and Conclusions in this order. Using sentence ordering labels for unstructured abstracts is the main difference compared to earlier methods (Kim et al., 2011; Verbeke et al., 2012). We tried six combinations of features, which are discussed in the Results section.
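As a rough illustration only: one simple way to realise such ordering features is to map each sentence's relative position in the abstract to a coarse ordering label. The label set and feature names in the sketch below are our own assumptions for illustration, not necessarily how the system derives its ordering labels.

```python
# Minimal sketch: attach a coarse sentence-ordering label as an extra feature.
# The label inventory (Aim/Method/Results/Conclusions) follows the ordering
# mentioned above; all names here are illustrative assumptions.

ORDER_LABELS = ["Aim", "Method", "Results", "Conclusions"]

def ordering_features(sent_index, num_sents):
    """Map a sentence's relative position in its abstract to an ordering
    label plus a few simple positional features."""
    rel_pos = sent_index / max(num_sents - 1, 1)                    # 0.0 .. 1.0
    bucket = min(int(rel_pos * len(ORDER_LABELS)), len(ORDER_LABELS) - 1)
    return {
        "order_label": ORDER_LABELS[bucket],                        # coarse ordering label
        "rel_position": round(rel_pos, 2),                          # normalised position
        "is_first": sent_index == 0,
        "is_last": sent_index == num_sents - 1,
    }

# Example: third sentence of a five-sentence abstract
print(ordering_features(2, 5))
```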

Class          Kim et al.       Verbeke et al.    Our System
               S       U        S       U         S       U
Background     81.84   68.46    86.19   76.90     95.55   95.02
Intervention   20.25   12.68    26.05   16.14     23.71   50.79
Outcome        92.32   72.94    92.99   77.69     95.24   99.04
Population     56.25   39.8     35.62   21.58     42.11   60.36
Study Design   43.95    4.40    45.5     6.67      0.67    3.57
Other          69.98   24.28    87.98   24.42     83.71   91.76
Overall        80.9    66.9     84.29   67.14     81.7    89.2

Table 3: F-score per class for structured (S) and unstructured (U) abstracts (bold indicates an improvement over the other systems).

4.2 Algorithm

Our sentence classifier uses the CRF learning algorithm [4]. We also ran a few experiments using SVM and observed that CRF performed better on this dataset with our choice of features. Due to space constraints, we do not compare CRF and SVM results in this paper.
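For concreteness, the overall setup (one label per sentence, one sequence per abstract) can be sketched as follows. This is illustrative only: the actual system was trained with the CRF++ command-line tool (footnote 4), whereas the snippet uses the sklearn-crfsuite Python binding as a stand-in, with invented toy data and toy features.

```python
# Illustrative stand-in for the CRF++ setup: each abstract is a sequence of
# sentences, each sentence gets a feature dict and a class label.
import sklearn_crfsuite

def sentence_features(sentence, index, num_sents):
    """Toy per-sentence features; the real system combines BOW, lexical,
    structural and ordering features (see the Results section)."""
    words = sentence.lower().split()
    feats = {"rel_position": index / max(num_sents - 1, 1),
             "num_tokens": len(words)}
    feats.update({"bow:" + w: True for w in words})
    return feats

# Toy data: one labelled sentence sequence per abstract.
train_abstracts = [
    ["diabetes is a growing problem", "we enrolled 200 patients",
     "glucose levels improved significantly"],
]
train_labels = [["background", "population", "outcome"]]

X_train = [[sentence_features(s, i, len(a)) for i, s in enumerate(a)]
           for a in train_abstracts]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, train_labels)
print(crf.predict(X_train))
```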

For feature selection, we used the FSelector [5] package from the R system [6]. From the pool of features, we select the "meaningful" features according to a selection criterion. We tested several criteria, including (1) information gain, (2) OneR, (3) the chi-square test and (4) the Spearman test. Among them, information gain outperformed the others. We select the 700 best features from our pool based on their information gain scores.
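As a sketch of this step: the paper's ranking was done with FSelector in R, but the same information-gain style cutoff can be illustrated in Python with scikit-learn, where mutual information plays the role of information gain. The toy sentences below are invented; only the top-700 cutoff comes from the paper.

```python
# Keep the k best-scoring features from the feature pool, ranked by an
# information-gain style criterion (mutual information here).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

sentences = ["diabetes is a growing problem",
             "we enrolled 200 patients",
             "glucose levels improved significantly"]
labels = ["background", "population", "outcome"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)          # full feature pool (BOW here)

# The paper keeps the 700 highest-scoring features from its pool.
k = min(700, X.shape[1])
selector = SelectKBest(mutual_info_classif, k=k)
X_selected = selector.fit_transform(X, labels)

kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(len(kept), "features kept")
```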

Another technique we used for this shared task is bootstrapping. Our system performed very well on the training data but did not perform well on the test data, suggesting it suffered from over-fitting. To overcome this, we ran our current best model on the test data (without using the gold-standard labels) and then merged the result with the training data to obtain a new training set. In this way, under ROC evaluation, we improved our final scores by 3%.
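A minimal sketch of this bootstrapping step is given below: label the test sentences with the current best model, merge them into the training set, and retrain. A plain scikit-learn classifier stands in for the actual CRF model, and the tiny corpus is invented for the sake of a runnable example.

```python
# Self-training / bootstrapping sketch: pseudo-label the unlabelled test data
# with the current best model, then retrain on the union.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_sents = ["diabetes is a growing problem", "glucose levels improved"]
train_labels = ["background", "outcome"]
test_sents = ["we enrolled 200 patients", "outcomes were measured at 12 weeks"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_sents, train_labels)

# Pseudo-label the test data (no gold-standard labels are used).
pseudo_labels = model.predict(test_sents)

# Merge the pseudo-labelled test data into the training set and retrain.
merged_sents = train_sents + list(test_sents)
merged_labels = train_labels + list(pseudo_labels)
model.fit(merged_sents, merged_labels)
```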

In addition, we also pre-process the data. Since headings such as "AIM", "OUTCOME", "INTRODUCTION", etc. are always classified as "other" in the training data, whenever we find a sentence with fewer than 20 characters that is entirely in upper case (our notion of a heading), we directly classify it as "other" in the test data.
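For illustration, this heading heuristic amounts to the following check (the function name and exact string handling are our own):

```python
# Sentences shorter than 20 characters and written entirely in upper case are
# treated as section headings and classified directly as "other".
def is_heading(sentence: str) -> bool:
    stripped = sentence.strip()
    return len(stripped) < 20 and stripped.isupper()

for s in ["AIM", "OUTCOME:", "We enrolled 200 patients."]:
    label = "other" if is_heading(s) else "classify with CRF"
    print(f"{s!r} -> {label}")
```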

[4] We used the open-source CRF++ tool: http://crfpp.googlecode.com
[5] http://cran.r-project.org/web/packages/FSelector/index.html
[6] http://www.r-project.org/

5 Results

Features     B      I      O      P      SD     Oth    All
BOW           9.1    3.2   68.8    2.9   0      31.7   38.4
+lexical     18.2    7.0   71.6   11.1   0      65.2   55.3
+struct      60.7    8.3   87.7   17.1   0.6    57.4   62.2
+ordering    93.7   23.7   96.6   41.0   1.3    80.9   80.8
All          95.2   23.7   95.2   42.1   0.6    83.7   81.7
All+seq      95.5   23.7   94.9   44.2   0.6    82.9   81.4

Table 4: Analysis of structured abstracts: micro-averaged F-score; the best performance per column is given in bold.

Features     B      I      O      P      SD     Oth    All
BOW          13.0    0.7   79.1    1.8   0      14.3   38
+lexical     34.2    1.5   68.0    2.2   0      13.3   40.0
+struct      58.1    5.0   72.1   12.3   1.2    26.9   52.6
+ordering    93.7   40.2   99.2   52.4   1.2    96.6   88.0
All          95.0   50.7   99.0   60.3   3.5    91.7   89.2
All+seq      94.9   50.7   98.7   60.1   3.5    90.8   89.0

Table 5: Analysis of unstructured abstracts: micro-averaged F-score; the best performance per column is given in bold.

In this section, we present the analysis of results on structured and unstructured abstracts separately. In all our experiments, we performed 10-fold cross-validation on the given dataset. The shared task organisers used the Receiver Operating Characteristic (ROC) measure to evaluate the scores. According to ROC, our best system scored 93.78% (public board) and 92.16% (private board). However, we compare our results with (Kim et al., 2011) and (Verbeke et al., 2012) using micro-averaged F-scores, as in Table 3. Our system outperformed previous work on unstructured abstracts (22% higher than the state of the art).

Our system performed well in classifying Background, Outcome and Other for both structured and unstructured data. However, it performed poorly in classifying Study Design, as very few instances of this class are available in both the training and test data.
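For reference, the evaluation protocol above (10-fold cross-validation scored with micro-averaged F-score) can be sketched as follows. The scikit-learn pipeline is only a stand-in for our CRF system, and the toy corpus is invented purely so the snippet runs end to end.

```python
# 10-fold cross-validation with micro-averaged F-score, as reported in
# Tables 3-5; a simple bag-of-words classifier stands in for the CRF.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Tiny toy corpus, repeated so that 10-fold stratified splitting is possible.
base = [("background", "diabetes is a growing health problem"),
        ("population", "we enrolled two hundred adult patients"),
        ("outcome", "fasting glucose levels improved significantly")]
data = [(lab, f"{text} {i}") for i in range(10) for lab, text in base]
labels = [lab for lab, _ in data]
sentences = [text for _, text in data]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
predictions = cross_val_predict(model, sentences, labels, cv=10)

print("micro-averaged F-score:", f1_score(labels, predictions, average="micro"))
print("per-class F-scores:", f1_score(labels, predictions, average=None))
```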

We present the results of six systems learned using different feature sets: Table 4 for structured abstracts and Table 5 for unstructured abstracts. We choose bag-of-words (BOW) as the base features; +lexical includes BOW and lexical features, +struct includes BOW and structural features, and +ordering includes BOW and sentence ordering features.
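The way these configurations compose can be sketched as follows; the individual feature functions are placeholders for the real BOW, lexical, structural and ordering feature extractors, not the actual implementation.

```python
# Sketch of how the feature sets in Tables 4 and 5 are composed: every
# configuration merges the BOW base features with zero or more other groups.
def bow_features(sentence):
    return {"bow:" + w: True for w in sentence.lower().split()}

def lexical_features(sentence):
    return {"has_digit": any(c.isdigit() for c in sentence)}

def structural_features(sentence, index, num_sents):
    return {"rel_position": index / max(num_sents - 1, 1)}

def order_features(order_label):
    return {"order_label": order_label}

def build_features(sentence, index, num_sents, order_label, groups):
    feats = bow_features(sentence)                      # BOW is always the base
    if "lexical" in groups:
        feats.update(lexical_features(sentence))
    if "struct" in groups:
        feats.update(structural_features(sentence, index, num_sents))
    if "ordering" in groups:
        feats.update(order_features(order_label))
    return feats

# "+struct" configuration for the second sentence of a three-sentence abstract:
print(build_features("We enrolled 200 patients", 1, 3, "Method", {"struct"}))
```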

