30.05.2013 Views

W10-09


strengths and weaknesses of the system in an effort to guide future work.

We structure our presentation as follows: in Section 2, we present previous research that has investigated the use of large web corpora for natural language processing (NLP) tasks. In Section 3, we describe an efficient method of automatically parsing weblog stories for discourse structure. In Section 4, we present a set of inference mechanisms that use the extracted discourse relations to generate open-domain textual inferences. We conclude, in Section 5, with insights into story-based envisionment that we hope will guide future work in this area.

2 Related work

Researchers have made many attempts to use the massive amount of linguistic content created by users of the World Wide Web. Progress and challenges in this area have spawned multiple workshops (e.g., those described by Gurevych and Zesch (2009) and Evert et al. (2008)) that specifically target the use of content that is collaboratively created by Internet users. Of particular relevance to the present work is the weblog corpus developed by Burton et al. (2009), which was used for the data challenge portion of the International Conference on Weblogs and Social Media (ICWSM). The ICWSM weblog corpus (referred to here as Spinn3r) is freely available and comprises tens of millions of weblog entries posted between August 1st, 2008 and October 1st, 2008.

Gordon et al. (2009) describe an approach to knowledge extraction over the Spinn3r corpus using techniques described by Schubert and Tong (2003). In this approach, logical propositions (known as factoids) are constructed via approximate interpretation of syntactic analyses. As an example, the system identified a factoid glossed as “doors to a room may be opened”. Gordon et al. (2009) found that the extracted factoids cover roughly half of the factoids present in the corresponding Wikipedia² articles.

We used a subset of the Spinn3r corpus in our work, but focused on discourse analyses of entire texts instead of syntactic analyses of single sentences. Our goal was to extract general causal and temporal propositions instead of the fine-grained properties expressed by many factoids extracted by Gordon et al. (2009).

2 http://en.wikipedia.org

Clark and Harrison (2009) pursued large-scale extraction of knowledge from text using a syntax-based approach that was also inspired by the work of Schubert and Tong (2003). The authors showed how the extracted knowledge tuples can be used to improve syntactic parsing and textual entailment recognition. Bar-Haim et al. (2009) present an efficient method of performing inference with such knowledge.

Our work is also related to the work of Persing and Ng (2009), in which the authors developed a semi-supervised method of identifying the causes of events described in aviation safety reports. Similarly, our system extracts causal (as well as temporal) knowledge; however, it does this in an open domain and does not place limitations on the types of causes to be identified. This greatly increases the complexity of the inference task, and our results exhibit a corresponding degradation; however, our evaluations provide important insights into the task.

3 Discourse parsing a corpus of stories

Gordon and Swanson (2009) developed a supervised classification-based approach for identifying personal stories within the Spinn3r corpus. Their method achieved 75% precision on the binary task of predicting story versus non-story on a held-out subset of the Spinn3r corpus. The extracted “story corpus” comprises 960,098 personal stories written by weblog users. Due to its large size and broad domain coverage, the story corpus offers unique opportunities to NLP researchers. For example, Swanson and Gordon (2008) showed how the corpus can be used to support open-domain collaborative story writing.³
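To make the precision figure reported for story identification concrete, the following is a minimal sketch of how precision on a binary story versus non-story task can be computed from held-out labels. The function name, the label strings, and the toy data are assumptions made for illustration; they are not taken from Gordon and Swanson (2009) and do not reproduce their classifier or their evaluation data.

```python
def precision(gold, predicted, positive="story"):
    """Fraction of items predicted as `positive` whose gold label is also `positive`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count every prediction of the positive class.
    predicted_positive = sum(1 for p in predicted if p == positive)
    if predicted_positive == 0:
        raise ValueError("no positive predictions; precision is undefined")
    # Count the positive predictions that agree with the gold label.
    true_positive = sum(
        1 for g, p in zip(gold, predicted) if p == positive and g == positive
    )
    return true_positive / predicted_positive


# Hypothetical toy labels for five weblog entries (illustrative only).
gold = ["story", "non-story", "story", "story", "non-story"]
predicted = ["story", "story", "story", "non-story", "story"]
toy_precision = precision(gold, predicted)
```

Precision only measures how many of the entries labelled as stories really are stories. It says nothing about how many true stories were missed, which is why a corpus built this way can be large yet still contain some non-story entries.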

As described by Gordon and Swanson (2008), story identification is just the first step towards commonsense reasoning using personal stories. We addressed the second step - knowledge extraction - by parsing the corpus using a Rhetorical Structure Theory (Carlson and Marcu, 2001) parser based on the one described by Sagae (2009). The parser performs joint syntactic and discourse dependency

3 The system (called SayAnything) is available at http://sayanything.ict.usc.edu
