ontology-based information extraction systems (obies 2008)

in a breadth-first-search manner. Differences are due to minor modeling issues (see below), the slightly different evaluation of patterns based directly on support counts produced by apriori and, most importantly, the fact that learning is cut off after one hour per iteration. Indeed the standard implementation frequently reached this time limit of an hour, thus leading to better results for the FIM version of the algorithm which does not suffer from this time limit.

One example of slight modeling differences which influenced performance is the treatment of multi-word instances. The learner has to decide whether to insert one wildcard ∗ in an argument position (nearly always matching exactly one word) or two (allowing for two or more words). The classic version heuristically takes the number of words in the argument of the first occurrence used for pattern creation as sample for the wildcard structure. The FIM version encodes the fact that an argument has more than one word as an additional constraint. If this item is contained in a learned frequent itemset, a double wildcard is inserted. The stronger performance with the bornInYear (+48%), currencyOf (+10.9%) and productOf (+12%) relations can be explained in that way (compare Table 1). For example, the FIM version learns that person names have typically length 2 and birth years always have length 1 while the classic induction approach does not allow this additional constraint. This explains the decreased performance of the classic approach for the relations mentioned above for which at least one argument has a rather fixed length (e.g. years).

As indicated in Figure 2, the clear benefit of the FIM abstraction step lies in its runtime behavior. The duration of a pattern generation process is plotted over the number of sample instances to be generalized. To measure these times, both learning modules were provided with the same sets of occurrences isolated from the rest of the induction procedure. The FIM shows a close to linear increase of processing duration for the given occurrence counts. Even though implemented with a number of optimizations (see [3]), the classic induction approach clearly shows a quadratic increase in computation time w.r.t. the number of input occurrences.
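The argument-length constraint described above can be illustrated with a small Python sketch. It is not the authors' implementation: the item names `ARG1_MULTI` and `ARG2_MULTI`, the helper functions, and the sample occurrences are hypothetical, and for brevity it only counts the support of single items instead of mining full itemsets. It shows how a frequent "argument has more than one word" item could decide between a single and a double wildcard.

```python
from collections import Counter

# Hypothetical item names marking that an argument spans more than one word.
ARG1_MULTI = "arg1:len>1"
ARG2_MULTI = "arg2:len>1"


def encode(arg1, arg2):
    """Encode only the argument-length constraints of one occurrence."""
    items = set()
    if len(arg1.split()) > 1:
        items.add(ARG1_MULTI)
    if len(arg2.split()) > 1:
        items.add(ARG2_MULTI)
    return frozenset(items)


def frequent_items(transactions, min_support):
    """Return every item that occurs in at least min_support transactions."""
    counts = Counter(item for t in transactions for item in t)
    return {item for item, n in counts.items() if n >= min_support}


def wildcard(item, frequent):
    """Double wildcard if the multi-word item is frequent, single otherwise."""
    return "* *" if item in frequent else "*"


def build_pattern(template, frequent):
    """Fill the argument slots of a pattern template with wildcards."""
    return template.format(
        arg1=wildcard(ARG1_MULTI, frequent),
        arg2=wildcard(ARG2_MULTI, frequent),
    )


# Made-up (person, year) occurrences, for illustration only.
occurrences = [
    ("Albert Einstein", "1879"),
    ("Marie Curie", "1867"),
    ("Alan Turing", "1912"),
    ("Plato", "427"),
]
transactions = [encode(a, b) for a, b in occurrences]
frequent = frequent_items(transactions, min_support=3)
pattern = build_pattern("{arg1} was born in {arg2}", frequent)
print(pattern)
```

With these sample occurrences the multi-word item for the first argument reaches the support threshold while the one for the second argument never occurs, so the person slot receives a double wildcard and the year slot a single one.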
5 Related Work

The iterative induction of textual patterns is a method widely used in large-scale information extraction. Sergey Brin pioneered the use of Web search indices for this purpose [4]. Recent successful systems include KnowItAll, which has been extended to automatic learning of patterns [9], and Espresso [12]. The precision of Espresso on various relations ranges between 49% and 85%, which is comparable to our range of precisions Pmanual. Concerning the standard restriction to binary relations, Xu et al. [17] have shown how approaches used for extracting binary relations can be applied to n-ary relations in a rather generic manner by considering binary relations as projections of these. These and the many other related systems vary considerably with respect to the representation of patterns and in the learning algorithms used for pattern induction. The methods used include Conditional Random Fields [16], vector space clustering [1], suffix trees [14] and minimizing edit distance [13]. In this paper, we have proposed to model different representational dimensions of a pattern such as word order, token at a certain position, part-of-speech etc. as constraints. Our approach makes it straightforward to represent all these dimensions through an appropriate encoding. Given such an encoding, we have shown how frequent itemset mining techniques can be used to efficiently find patterns with a minimal support.

Apart from pattern-based approaches, a variety of supervised and semi-supervised classification algorithms have been applied to relation extraction. The methods include kernel-based methods [18, 8] and graph-labeling techniques [6]. The advantage of such methods is that abstraction and partial matches are inherent features of the learning algorithm. In addition, kernels allow incorporating more complex structures like parse trees which cannot be reflected in text patterns. However, such classifiers require testing all possible relation instances, while with text patterns extraction can be significantly sped up using search indices. From the point of view of execution performance, a pattern-based approach is superior to a classifier which incorporates a learned model which cannot be straightforwardly used to query a large corpus such as the web. Classification thus requires linear-time processing of the corpus while search-patterns can lead to faster extraction. Recently, the

A similar approach to ours is the one by Jindal and Liu [10]. They use Sequential Pattern Mining – a modification of Frequent Itemset Mining – to derive textual patterns for classifying comparative sentences in product descriptions. While, like our approach, encoding sequence information, their model is not able to account for several constraints per word. Additionally, the scalability aspect has not been the focus of their study, as mining has only been performed on a corpus of 2684 sentences with a very limited alphabet. Another approach orthogonal to ours is presented by [7]. Each occurrence is abstracted over in a bottom-up manner, which saves pairwise occurrence comparison at the expense of evaluating the large amounts of pattern candidates with respect to the training set. The algorithm thus seems more appropriate for fully supervised settings of limited size.
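To make the idea of modeling representational dimensions as constraints more concrete, the following Python sketch encodes each occurrence as a set of (position, dimension, value) items and runs a naive level-wise, Apriori-style search for itemsets with a minimal support. It is only a sketch under stated assumptions: the item layout, the function names `encode`, `support` and `apriori`, the part-of-speech tags and the sample occurrences are all hypothetical, and the search enumerates every frequent itemset without the optimizations a real FIM implementation would use.

```python
from itertools import combinations


def encode(tagged_tokens):
    """Turn [(word, pos), ...] into a set of (position, dimension, value) items."""
    items = set()
    for i, (word, pos) in enumerate(tagged_tokens):
        items.add((i, "tok", word.lower()))  # token constraint at position i
        items.add((i, "pos", pos))           # part-of-speech constraint at position i
    return frozenset(items)


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_support):
    """Naive level-wise search for all itemsets with support >= min_support."""
    level = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in level if support(s, transactions) >= min_support}
    frequent = {}
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        size = len(next(iter(level))) + 1
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        level = {c for c in candidates if support(c, transactions) >= min_support}
    return frequent


# Made-up occurrences with argument placeholders and illustrative POS tags.
occurrences = [
    [("ARG1", "ARG"), ("was", "VBD"), ("born", "VBN"), ("in", "IN"), ("ARG2", "ARG")],
    [("ARG1", "ARG"), ("was", "VBD"), ("born", "VBN"), ("during", "IN"), ("ARG2", "ARG")],
    [("ARG1", "ARG"), ("is", "VBZ"), ("born", "VBN"), ("in", "IN"), ("ARG2", "ARG")],
]
transactions = [encode(o) for o in occurrences]
frequent = apriori(transactions, min_support=3)
largest = max(frequent, key=len)
for item in sorted(largest):
    print(item)
```

In this toy data the token at position 3 varies, but its part-of-speech tag is shared by all occurrences, so the largest frequent itemset keeps the part-of-speech constraint for that position and drops the token constraint. This is the kind of several-constraints-per-word abstraction that the encoding is meant to allow.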
6 Conclusion

Our contribution in this paper lies in the formulation of the pattern induction step as a well-known machine learning problem, i.e. the one of mining frequent itemsets. On the one hand, this formulation is elegant and advantageous as we can import all the results from the literature on association mining for further optimization (an overview of which is given in and [15]). On the other hand, we have shown that this formulation leads to a significant decrease in the running time of the extraction. In particular, we have shown that, compared with previous implementations, the running time behavior improves from quadratic to linear in the number of occurrences to be generalized. Further, we have also shown that the quality of the generated tuples even slightly increases in terms of F-measure compared to the standard pattern induction algorithm. This increase is mainly due to the modeling of argument length as an additional constraint which can be straightforwardly encoded in our FIM framework. Overall, modeling the different representational dimensions of a pattern as constraints is elegant as it makes it straightforward to add more information. In future work we plan to consider taxonomic as well as other linguistic knowledge.
