W10-09
strengths and weaknesses of the system in an effort to guide future work.
We structure our presentation as follows: in Section 2, we present previous research that has investigated the use of large web corpora for natural language processing (NLP) tasks. In Section 3, we describe an efficient method of automatically parsing weblog stories for discourse structure. In Section 4, we present a set of inference mechanisms that use the extracted discourse relations to generate open-domain textual inferences. We conclude, in Section 5, with insights into story-based envisionment that we hope will guide future work in this area.
2 Related work
Researchers have made many attempts to use the massive amount of linguistic content created by users of the World Wide Web. Progress and challenges in this area have spawned multiple workshops (e.g., those described by Gurevych and Zesch (2009) and Evert et al. (2008)) that specifically target the use of content that is collaboratively created by Internet users. Of particular relevance to the present work is the weblog corpus developed by Burton et al. (2009), which was used for the data challenge portion of the International Conference on Weblogs and Social Media (ICWSM). The ICWSM weblog corpus (referred to here as Spinn3r) is freely available and comprises tens of millions of weblog entries posted between August 1st, 2008 and October 1st, 2008.
Gordon et al. (2009) describe an approach to knowledge extraction over the Spinn3r corpus using techniques described by Schubert and Tong (2003). In this approach, logical propositions (known as factoids) are constructed via approximate interpretation of syntactic analyses. As an example, the system identified a factoid glossed as “doors to a room may be opened”. Gordon et al. (2009) found that the extracted factoids cover roughly half of the factoids present in the corresponding Wikipedia 2 articles.
We used a subset of the Spinn3r corpus in our work, but focused on discourse analyses of entire texts instead of syntactic analyses of single sentences. Our goal was to extract general causal and temporal propositions instead of the fine-grained properties expressed by many factoids extracted by Gordon et al. (2009).
2 http://en.wikipedia.org
Clark and Harrison (2009) pursued large-scale extraction of knowledge from text using a syntax-based approach that was also inspired by the work of Schubert and Tong (2003). The authors showed how the extracted knowledge tuples can be used to improve syntactic parsing and textual entailment recognition. Bar-Haim et al. (2009) present an efficient method of performing inference with such knowledge.
Our work is also related to the work of Persing and Ng (2009), in which the authors developed a semi-supervised method of identifying the causes of events described in aviation safety reports. Similarly, our system extracts causal (as well as temporal) knowledge; however, it does this in an open domain and does not place limitations on the types of causes to be identified. This greatly increases the complexity of the inference task, and our results exhibit a corresponding degradation; however, our evaluations provide important insights into the task.
3 Discourse parsing a corpus of stories
Gordon and Swanson (2009) developed a supervised classification-based approach for identifying personal stories within the Spinn3r corpus. Their method achieved 75% precision on the binary task of predicting story versus non-story on a held-out subset of the Spinn3r corpus. The extracted “story corpus” comprises 960,098 personal stories written by weblog users. Due to its large size and broad domain coverage, the story corpus offers unique opportunities to NLP researchers. For example, Swanson and Gordon (2008) showed how the corpus can be used to support open-domain collaborative story writing. 3
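To make the reported precision figure concrete, the following is a minimal sketch of how precision is computed for a binary story versus non-story task. The function name `precision`, the label strings, and the toy label lists are assumptions made for illustration; they are not the classifier or the data of Gordon and Swanson (2009).

```python
def precision(gold, predicted, positive="story"):
    """Fraction of items predicted as `positive` whose gold label is also `positive`."""
    true_pos = sum(
        1 for g, p in zip(gold, predicted) if p == positive and g == positive
    )
    pred_pos = sum(1 for p in predicted if p == positive)
    if pred_pos == 0:
        # No positive predictions: precision is undefined; report 0.0 by convention here.
        return 0.0
    return true_pos / pred_pos


# Toy labels invented for illustration only (not data from the paper).
gold = ["story", "non-story", "story", "story", "non-story", "story"]
predicted = ["story", "story", "story", "story", "non-story", "non-story"]

score = precision(gold, predicted)
print(score)  # 3 of the 4 entries predicted as "story" are stories: 0.75
```

Precision only measures how clean the predicted story set is; it says nothing about how many true stories were missed, which would be measured by recall.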
As described by Gordon and Swanson (2008), story identification is just the first step towards commonsense reasoning using personal stories. We addressed the second step - knowledge extraction - by parsing the corpus using a Rhetorical Structure Theory (Carlson and Marcu, 2001) parser based on the one described by Sagae (2009). The parser performs joint syntactic and discourse dependency
3 The system (called SayAnything) is available at http://sayanything.ict.usc.edu