ontology-based information extraction systems (obies 2008)
in a breadth-first-search manner. Differences are due to minor modeling issues (see below), the slightly different evaluation of patterns based directly on support counts produced by apriori and, most importantly, the fact that learning is cut off after one hour per iteration. Indeed the standard implementation frequently reached this time limit of an hour, thus leading to better results for the FIM version of the algorithm which does not suffer from this time limit.
One example of slight modeling differences which influenced performance is the treatment of multi-word instances. The learner has to decide whether to insert one wildcard ∗ in an argument position (nearly always matching exactly one word) or two (allowing for two or more words). The classic version heuristically takes the number of words in the argument of the first occurrence used for pattern creation as sample for the wildcard structure. The FIM version encodes the fact that an argument has more than one word as an additional constraint. If this item is contained in a learned frequent itemset, a double wildcard is inserted. The stronger performance with the bornInYear (+48%), currencyOf (+10.9%) and productOf (+12%) relations can be explained in that way (compare Table 1). For example, the FIM version learns that person names have typically length 2 and birth years always have length 1 while the classic induction approach does not allow this additional constraint. This explains the decreased performance of the classic approach for the relations mentioned above for which at least one argument has a rather fixed length (e.g. years).
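The multi-word constraint described above can be illustrated with a minimal sketch. It assumes a simplified occurrence format (first argument, middle context, second argument) and a naive level-wise frequent itemset search; the item names `arg1_multiword` and `arg2_multiword`, the helper functions and the example sentences are all hypothetical and are not taken from the system evaluated here.

```python
from collections import Counter
from itertools import combinations


def encode(arg1, middle, arg2):
    """Turn one occurrence into a transaction (a set of constraint items)."""
    items = {("mid", pos, word) for pos, word in enumerate(middle.split())}
    # Hypothetical item names: mark arguments that span more than one word.
    if len(arg1.split()) > 1:
        items.add("arg1_multiword")
    if len(arg2.split()) > 1:
        items.add("arg2_multiword")
    return frozenset(items)


def frequent_itemsets(transactions, min_support):
    """Naive level-wise (Apriori-style) search; returns {itemset: support}."""
    singles = Counter(item for t in transactions for item in t)
    candidates = {frozenset([i]) for i, c in singles.items() if c >= min_support}
    found, size = {}, 1
    while candidates:
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                level[cand] = support
        found.update(level)
        size += 1
        # Join frequent itemsets of the previous size into larger candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
    return found


def build_pattern(itemset):
    """Render an itemset as a pattern; a multi-word item yields a double wildcard.

    Simplification: gaps between retained middle tokens are not marked.
    """
    middle = [word for _, _, word in sorted(i for i in itemset if isinstance(i, tuple))]
    left = "* *" if "arg1_multiword" in itemset else "*"
    right = "* *" if "arg2_multiword" in itemset else "*"
    return " ".join([left, *middle, right])


occurrences = [
    ("Albert Einstein", "was born in", "1879"),
    ("Marie Curie", "was born in", "1867"),
    ("Charles Darwin", "was born in", "1809"),
]
transactions = [encode(*occ) for occ in occurrences]
itemsets = frequent_itemsets(transactions, min_support=3)
best = max(itemsets, key=len)
print(build_pattern(best))  # * * was born in *
```

Because every sample name has two words, the multi-word item for the first argument is frequent and ends up in the largest itemset, so that position receives a double wildcard. The year argument never produces the item and keeps a single wildcard.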
As indicated in Figure 2, the clear benefit of the FIM abstraction step lies in its runtime behavior. The duration of a pattern generation process is plotted over the number of sample instances to be generalized. To measure these times, both learning modules were provided with the same sets of occurrences isolated from the rest of the induction procedure. The FIM shows a close to linear increase of processing duration for the given occurrence counts. Even though implemented with a number of optimizations (see [3]), the classic induction approach clearly shows a quadratic increase in computation time w.r.t. the number of input occurrences.
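One way to check a growth trend such as the one reported for Figure 2 is to fit a line to the measurements on a log-log scale: a slope near 1 indicates linear growth, a slope near 2 quadratic growth. The sketch below uses synthetic operation counts, not the measured durations. The pairwise cost model is only an assumption chosen to produce quadratic growth and does not describe how the classic induction is actually implemented.

```python
import math


def loglog_slope(sizes, costs):
    """Least-squares slope of log(cost) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def single_pass_ops(n):
    """Assumed cost model: each occurrence is touched once."""
    return n


def pairwise_ops(n):
    """Assumed cost model: every pair of occurrences is compared once."""
    return n * (n - 1) // 2


sizes = [100, 200, 400, 800]
linear_slope = loglog_slope(sizes, [single_pass_ops(n) for n in sizes])
quadratic_slope = loglog_slope(sizes, [pairwise_ops(n) for n in sizes])
print(round(linear_slope, 2), round(quadratic_slope, 2))
```

With real timings, the same function could be applied to the (occurrence count, duration) pairs of each learning module.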
5 Related Work<br />
The iterative induction of textual patterns is a method widely used in large-scale information extraction. Sergey Brin pioneered the use of Web search indices for this purpose [4]. Recent successful systems include KnowItAll which has been extended to automatic learning of patterns [9] and Espresso [12]. The precision of Espresso on various relations ranges between 49% and 85%, which is comparable to our range of precisions Pmanual. Concerning the standard restriction to binary relations, Xu et al. [17] have shown how approaches used for extracting binary relations can be applied to n-ary relations in a rather generic manner by considering binary relations as projections of these. These and the many other related systems vary considerably with respect to the representation of patterns and in the learning algorithms used for pattern induction. The methods used include Conditional Random Fields [16], vector space clustering [1], suffix trees [14] and minimizing edit distance [13]. In this paper, we have proposed to model different representational dimensions of a pattern such as word order, token at
a certain position, part-of-speech etc. as constraints. Our approach allows straightfor-