20.11.2012 Views

Ontology-Based Information Extraction Systems (OBIES 2008)



in a breadth-first-search manner. Differences are due to minor modeling issues (see below), the slightly different evaluation of patterns based directly on support counts produced by apriori and, most importantly, the fact that learning is cut off after one hour per iteration. Indeed, the standard implementation frequently reached this time limit of an hour, thus leading to better results for the FIM version of the algorithm, which does not suffer from this time limit.

One example of slight modeling differences which influenced performance is the treatment of multi-word instances. The learner has to decide whether to insert one wildcard ∗ in an argument position (nearly always matching exactly one word) or two (allowing for two or more words). The classic version heuristically takes the number of words in the argument of the first occurrence used for pattern creation as a sample for the wildcard structure. The FIM version encodes the fact that an argument has more than one word as an additional constraint. If this item is contained in a learned frequent itemset, a double wildcard is inserted. The stronger performance with the bornInYear (+48%), currencyOf (+10.9%) and productOf (+12%) relations can be explained in that way (compare Table 1). For example, the FIM version learns that person names typically have length 2 and birth years always have length 1, while the classic induction approach does not allow this additional constraint. This explains the decreased performance of the classic approach for the relations mentioned above, for which at least one argument has a rather fixed length (e.g. years).
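The multi-word constraint described above can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the function names (`encode`, `frequent_itemsets`, `wildcard_for`), the item layout and the toy occurrences are assumptions made for illustration, and a naive level-wise miner stands in for apriori.

```python
from collections import Counter
from itertools import combinations


def encode(occurrence):
    """Turn one occurrence into a set of constraint items (hypothetical layout)."""
    items = set()
    for pos, tok in enumerate(occurrence["context"]):
        items.add(("token", pos, tok))  # token at a certain position
    for arg_idx, arg in enumerate(occurrence["args"]):
        if len(arg.split()) > 1:
            items.add(("multiword", arg_idx))  # argument has more than one word
    return frozenset(items)


def frequent_itemsets(transactions, min_support, max_size=4):
    """Naive stand-in for apriori: keep itemsets found in >= min_support transactions."""
    singles = Counter(item for t in transactions for item in t)
    frequent_items = [i for i, c in singles.items() if c >= min_support]
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(frequent_items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                result[candidate] = support
    return result


def wildcard_for(itemset, arg_idx):
    """Double wildcard if the multi-word item was learned, single wildcard otherwise."""
    return "* *" if ("multiword", arg_idx) in itemset else "*"


# Toy occurrences for a bornInYear-like relation (illustrative data only).
occurrences = [
    {"args": ("Albert Einstein", "1879"), "context": ["was", "born", "in"]},
    {"args": ("Marie Curie", "1867"), "context": ["was", "born", "in"]},
    {"args": ("Alan Turing", "1912"), "context": ["was", "born", "in"]},
]

transactions = [encode(o) for o in occurrences]
itemsets = frequent_itemsets(transactions, min_support=2)
best = max(itemsets, key=len)  # largest frequent itemset in this toy example

# Render the pattern: argument wildcards around the position-sorted tokens.
tokens = [tok for (_, pos, tok) in sorted(i for i in best if i[0] == "token")]
pattern = " ".join([wildcard_for(best, 0), *tokens, wildcard_for(best, 1)])
print(pattern)
```

In this toy setting the person argument receives a double wildcard because the multi-word item is part of the frequent itemset, while the year argument keeps a single wildcard. The subset enumeration is exponential and only suitable for such small examples.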

As indicated in Figure 2, the clear benefit of the FIM abstraction step lies in its runtime behavior. The duration of a pattern generation process is plotted over the number of sample instances to be generalized. To measure these times, both learning modules were provided with the same sets of occurrences isolated from the rest of the induction procedure. The FIM shows a close to linear increase of processing duration for the given occurrence counts. Even though implemented with a number of optimizations (see [3]), the classic induction approach clearly shows a quadratic increase in computation time w.r.t. the number of input occurrences.
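One way to check a linear versus quadratic growth claim on one's own timing measurements is to estimate the exponent k in t ≈ c·n^k as the slope of log t against log n. The sketch below is a hedged illustration: `growth_exponent` is a hypothetical helper, and the timings are synthetic values generated from n and n², not the measurements behind Figure 2.

```python
import math


def growth_exponent(sizes, durations):
    """Least-squares slope of log(duration) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in durations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic timings for illustration only (not measured data).
sizes = [100, 200, 400, 800, 1600]
linear_like = [0.002 * n for n in sizes]           # t grows like n
quadratic_like = [0.00001 * n * n for n in sizes]  # t grows like n^2

k_linear = growth_exponent(sizes, linear_like)
k_quadratic = growth_exponent(sizes, quadratic_like)
print(round(k_linear, 3), round(k_quadratic, 3))
```

An estimated exponent near 1 indicates close to linear growth and an exponent near 2 indicates quadratic growth. On real measurements, constant overheads at small input sizes can bias the slope, so larger occurrence counts give a more reliable estimate.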

5 Related Work

The iterative induction of textual patterns is a method widely used in large-scale information extraction. Sergey Brin pioneered the use of Web search indices for this purpose [4]. Recent successful systems include KnowItAll, which has been extended to automatic learning of patterns [9], and Espresso [12]. The precision of Espresso on various relations ranges between 49% and 85%, which is comparable to our range of precisions Pmanual. Concerning the standard restriction to binary relations, Xu et al. [17] have shown how approaches used for extracting binary relations can be applied to n-ary relations in a rather generic manner by considering binary relations as projections of these. These and the many other related systems vary considerably with respect to the representation of patterns and in the learning algorithms used for pattern induction. The methods used include Conditional Random Fields [16], vector space clustering [1], suffix trees [14] and minimizing edit distance [13]. In this paper, we have proposed to model different representational dimensions of a pattern such as word order, token at a certain position, part-of-speech etc. as constraints. Our approach allows straightfor-
