all-SSC outperforms participants on GSC
Challenge I vs. Challenge II<br />
D. Rebholz-Schuhmann<br />
16th March 2011, CALBC Workshop II
CALBC workshop II<br />
Mar 16, 2011, EBI<br />
Stakes<br />
• Corpus alignment is doable: different semantic types, single document<br />
• Different similarity measures for harmonisation: exact, nested, cos.<br />
• SSC generation scales / generalisable: Iterative generation<br />
• Partner-SSC outperforms partners on GSC<br />
• all-SSC outperforms participants on GSC<br />
• SSC is the best option when no GSC available<br />
• Incremental improvement of SSC performance: Large number of contributions leads to higher performance<br />
• Performance of the SSC concerning the other types<br />
• Normalisation of the SSC: concept id annotations for mentions<br />
• Use cases of the SSC<br />
CALBC corpus development<br />
December 2010: 850K<br />
June 2010: 100K<br />
December 2009: 50K<br />
June 2009: 1.5K<br />
1st Challenge Rollout (Overview)<br />
2nd Challenge Rollout (Overview)<br />
• SSC-I: result of the Pilot Project<br />
• SSC-II: result of the first challenge<br />
New (850k + GSC)<br />
Timeline of the CALBC project<br />
[Figure: project timeline along a time axis. Labels: Project start; PMB; SAC; www.calbc.eu; Submission site; Test runs; Alignment of partners' contributions; Alignment solutions available; Invitation to challenge; Access to test data + submission site; Harmonised corpus available; Training data; Active participation; Final corpus from Pilot; 40 participants; X submissions; Closing First Challenge; CALBC Silver Standard Corpus]<br />
Timeline of the CALBC project<br />
[Figure: project timeline along a time axis. Labels: 15 participants; 22 submissions; Closing 1st Challenge; 27 participants; 1st CALBC Workshop; Invitation to challenge; Access to test data + submission site; Harmonised corpus available; Training data; Active participation; Final corpus from 1st Challenge; 16 participants; 54 submissions; Closing 2nd Challenge; CALBC Silver Standard Corpus]<br />
The Challenges<br />
Count | 1st challenge | 2nd challenge<br />
# registered participants | 40 | 28<br />
# active participants | 15 | 16<br />
# submissions | 18 | 54<br />
Partners & Participants<br />
Columns: Solution; PPs | CPs; Use of Training Data; PRGE; CHED; DISO; SPE<br />
Solution types listed: Dictionary-based concept recognition; Indexing of tokens and terms; Both trained & rule-based solutions; Case-based reasoning; CRF-based, trained NER solution<br />
P01 [ / ] UniProtKb Jochem UMLS NCBI taxonomy<br />
P02 [ / ]<br />
P04 [ / ]<br />
Different resources incl. UniProtKb, EntrezGene<br />
UniProtKb, EntrezGene<br />
Jochem UMLS NCBI taxonomy<br />
Jochem<br />
MeSH, MedDRA, NCI, SNOMED-CT<br />
P06 [ / ] UMLS<br />
P10 [ / ]<br />
P13 [ / ]<br />
UniProtKb, EntrezGene<br />
NCI, MeSH, SNOMED-CT<br />
NCBI taxonomy<br />
P15 [ / ] UMLS UMLS UMLS UMLS<br />
P03 [ / ]<br />
UniProtKb, EntrezGene<br />
Jochem UMLS NCBI taxonomy<br />
P09 [ / ] UMLS<br />
P07 [ / ]<br />
P16 [ / ] Genia UMLS<br />
P11 YES [ / ] [ / ] [ / ] [ / ]<br />
P12 YES [ / ] [ / ] [ / ] [ / ]<br />
P14 YES [ / ] [ / ] [ / ] [ / ]<br />
Assessment: PRGE, CHED<br />
[Figure: recall (y-axis, 0% to 100%) plotted against precision (x-axis) in four panels: PRGE and CHED, each for SSC-I and SSC-II]<br />
Assessment: PRGE, CHED<br />
[Figure: F-measure (y-axis, 0% to 90%) in four panels: SSC-I / PRGE, SSC-II / PRGE, SSC-I / CHEM, SSC-II / CHEM]<br />
• Performance of the partners' systems decreases for PRGE/CHEM from SSC-I to SSC-II<br />
• Performance of the participants' systems increases<br />
Assessment: SPE, DISO<br />
[Figure: F-measure (y-axis) in four panels: SSC-I / SPE, SSC-II / SPE, SSC-I / DISO, SSC-II / DISO; legend: Partners, Part.]<br />
• For SPE/DISO, the drop in performance of the PPs' systems is not very large<br />
• Performance of the participants' systems increases<br />
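The assessment slides report precision, recall and F-measure without spelling out how they are computed. The following is a minimal sketch, assuming exact matching (same document, same boundaries, same semantic type) and the balanced F-measure (F1); the `Annotation` record and its field names are hypothetical and the data is a toy example, not CALBC output.

```python
from typing import Iterable, NamedTuple, Tuple


class Annotation(NamedTuple):
    """Hypothetical annotation record; field names are illustrative only."""
    doc_id: str
    start: int
    end: int
    sem_type: str


def precision_recall_f1(
    predicted: Iterable[Annotation], reference: Iterable[Annotation]
) -> Tuple[float, float, float]:
    """Score predicted annotations against a reference under exact matching."""
    predicted_set, reference_set = set(predicted), set(reference)
    true_positives = len(predicted_set & reference_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    total = precision + recall
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, not taken from the CALBC corpora.
reference = [
    Annotation("doc1", 0, 5, "PRGE"),
    Annotation("doc1", 10, 17, "CHED"),
    Annotation("doc1", 20, 28, "DISO"),
    Annotation("doc2", 3, 9, "SPE"),
]
predicted = [
    Annotation("doc1", 0, 5, "PRGE"),    # exact hit
    Annotation("doc1", 10, 17, "CHED"),  # exact hit
    Annotation("doc2", 3, 8, "SPE"),     # boundary differs, so no exact match
]
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under nested or cosine matching the set intersection would have to be replaced by a pairwise comparison, so the scores for the same system can differ between matching schemes.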
Partners & Participants<br />
Semantic type | Nr. of annotations in the SSC-I | Nr. of CPs | Nr. of submissions from CPs | Average nr. of annotations from all CPs | Nr. of annotations in the SSC-II<br />
CHED | 228,622 | 6 | 11 | 233,398 | 238,431<br />
PRGE | 275,235 | 9 | 15 | 343,681 | 435,797<br />
DISO | 300,637 | 8 | 11 | 255,599 | 245,524<br />
SPE | 317,211 | 7 | 9 | 277,071 | 304,503<br />
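To make the shift between the two corpora easier to read, the annotation counts above can be turned into relative changes. This is only a small arithmetic sketch over the numbers in the table; the helper name `relative_change` is an illustrative choice, not part of any CALBC tooling.

```python
# Annotation counts copied from the table above: (SSC-I, SSC-II).
counts = {
    "CHED": (228_622, 238_431),
    "PRGE": (275_235, 435_797),
    "DISO": (300_637, 245_524),
    "SPE": (317_211, 304_503),
}


def relative_change(before: int, after: int) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return 100.0 * (after - before) / before


changes = {
    sem_type: relative_change(ssc1, ssc2)
    for sem_type, (ssc1, ssc2) in counts.items()
}
for sem_type, pct in changes.items():
    print(f"{sem_type}: {pct:+.1f}%")
```

On these figures the PRGE annotation count grows by more than half from SSC-I to SSC-II, while DISO shrinks by roughly a fifth; CHED and SPE change by only a few percent.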
Cos-98 P12 P11 P03 P04 P01 P02 P10 P14 P08 P15 P09 P06 P07 P09 P13 P16<br />
SPE 93% 93% 79% 83% 71% 69% 84% 69% 56% 42% 2%<br />
DISO 87% 89% 71% 69% 82% 76% 78% 62% 51% 32% 3% 73%<br />
CHEM 83% 84% 75% 82% 49% 68% 51% 20% 17% 3% 23%<br />
PRGE 81% 73% 77% 66% 66% 59% 40% 52% 12% 18% 2% 50% 11% 28%<br />
Avg. 86% 85% 76% 75% 67% 68% 68% 58% 35% 27% 2%<br />
Partners & Participants<br />
Confusion Matrix, exact matching<br />
Assessment \ Reference set | PGN | DISO | SPE | CHEM<br />
PGN | 412,866 | 2,673 | 395 | 106,560<br />
DISO | | 451,175 | 4,024 | 2,126<br />
SPE | | | 474,453 | 912<br />
CHEM | | | | 414,798<br />
Confusion Matrix, nested matching<br />
Assessment \ Reference set | PGN | DISO | SPE | CHEM<br />
PGN | 412,866 | 2,927 | 695 | 113,546<br />
DISO | 3,055 | 451,175 | 7,910 | 2,826<br />
SPE | 516 | 7,992 | 474,453 | 1,376<br />
CHEM | 107,436 | 2,859 | 1,155 | 414,798<br />
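The two matrices differ only in the matching rule used to pair an assessed annotation with a reference annotation. The sketch below shows one way such cross-type counts could be produced. It assumes that exact matching means identical boundaries and that nested matching means the assessed span lies inside the reference span; the slides do not define either rule, and the `Span` record and function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical annotation span; field names are illustrative only."""
    doc_id: str
    start: int
    end: int
    sem_type: str


def exact_match(assessed: Span, reference: Span) -> bool:
    """Assumed rule: same document and identical boundaries."""
    return (assessed.doc_id, assessed.start, assessed.end) == (
        reference.doc_id, reference.start, reference.end)


def nested_match(assessed: Span, reference: Span) -> bool:
    """Assumed rule: the assessed span lies within the reference span."""
    return (assessed.doc_id == reference.doc_id
            and reference.start <= assessed.start
            and assessed.end <= reference.end)


def confusion_counts(
    assessed: Iterable[Span],
    reference: Iterable[Span],
    match: Callable[[Span, Span], bool],
) -> Counter:
    """Count matching pairs, keyed by (assessed type, reference type)."""
    reference = list(reference)
    counts: Counter = Counter()
    # Quadratic pairwise comparison: fine for a sketch, too slow for
    # corpora with hundreds of thousands of annotations.
    for a in assessed:
        for r in reference:
            if match(a, r):
                counts[(a.sem_type, r.sem_type)] += 1
    return counts


# Toy data, not taken from the CALBC corpora.
assessed = [Span("d1", 0, 5, "PGN"), Span("d1", 12, 18, "CHEM")]
reference = [
    Span("d1", 0, 5, "CHEM"),
    Span("d1", 10, 20, "PGN"),
    Span("d1", 30, 35, "SPE"),
]
exact = confusion_counts(assessed, reference, exact_match)
nested = confusion_counts(assessed, reference, nested_match)
print(dict(exact))
print(dict(nested))
```

With a directional nested rule like this one, the resulting matrix need not be symmetric, which is consistent with the nested matrix above having different values on either side of the diagonal.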
Stakes<br />
• Corpus alignment is doable: different semantic types, single document<br />
• Different similarity measures for harmonisation: exact, nested, cos.<br />
• SSC generation scales / generalisable: Iterative generation<br />
• Partner-SSC outperforms partners on GSC<br />
• all-SSC outperforms participants on GSC<br />
• SSC is the best option when no GSC available<br />
• Incremental improvement of SSC performance: Large number of contributions leads to higher performance<br />
• Performance of the SSC concerning the other types<br />
• Normalisation of the SSC: concept id annotations for mentions<br />
• Use cases of the SSC<br />
End<br />