Natural Language Processing: Goals of the field of NLP

The need for NLP: "Databases" in 2003
- Web search, Peer-to-Peer, Agents, Collaborative Filtering, XML/Metadata, Data mining
- A lot of new things: lots of unstructured text/web information
- There is more of everything, it's more distributed, and it's less structured.
- Most of the information in most companies is material in human languages (reports, customer email, discussion papers, text) – not stuff in traditional databases
- Large textbases and information retrieval now have a big impact on everyday people (web search, portals, email)
- Lately, the hottest area in Information Retrieval …

Today: a need for Information → Knowledge
- Information that we'd like to turn into usable knowledge, e.g.:
  employs(stanfordUniversity, chrisManning)
  ∃e ∃x1 ∃x2 ∃t (employing(e) & employer(e, x1) & employed(e, x2) & name(x2, "Christopher Manning") & name(x1, "Stanford University") & at(e, t) & t ⊃ [1999, 2003])

Making progress on this problem…
- The task is difficult! What tools do we need?
  - Knowledge about language
  - Knowledge about the world
  - A way to combine knowledge sources
- The answer that's been getting traction: probabilistic models built from language data
  - P("maison" → "house") high
  - P("L'avocat general" → "the general avocado") low
- Some computer scientists think this is a new idea ("A.I."), but really it's an old idea that was stolen from the electrical engineers….

The Noisy Channel Model
- We started out with words, they were encoded as a signal, and we now wish to decode.
- Find the most likely sequence w of "words" given the sequence of observations a
- Use Bayes' law to create a generative model:
  ArgMax_w P(w|a) = ArgMax_w P(a|w) P(w) / P(a) = ArgMax_w P(a|w) P(w)
  (channel model P(a|w), language model P(w); a toy code sketch follows below)
- Examples: speech production, translating to English from French words, OCR/spelling with errors – in each case the language model is over English words

Language models
- Language models say how likely a sentence is, or how likely one word is to appear after another
- Traditional models just P(w_i | w_{i-1}, w_{i-2})
- Recently, great results with richer parsing models
- Useful for many things:
  - Speech, OCR, language identification
  - Context sensitive spelling correction
  - "Fluent" natural language generation, MT
  - Grammar checking

Embedding NLP
- Now that our computers are so fast and have so much disk, we should be embedding simple practical NLP into systems programs!
- Trivial example: the file command
    > file ~/current/NLP-notes
    /user/manning/current/NLP-notes: ASCII English text
    > file proposal.txt
    proposal.txt: data
- Uh oh! This is also English text. Just has a couple of funny characters ("smart quotes") in it somewhere….
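A minimal sketch of the noisy-channel ArgMax above: pick the w maximizing P(a|w)·P(w), dropping the constant P(a). The candidate words and probabilities below are invented toy numbers, not anything from the slides.

```python
# Minimal noisy-channel decoder sketch (hypothetical toy numbers).
# Goal: choose the most likely intended word w given an observed (noisy) word a,
# using ArgMax_w P(a|w) * P(w)  -- P(a) is constant and can be dropped.

# Language model P(w): how likely each candidate word is a priori.
p_w = {"house": 0.0008, "hose": 0.0001, "horse": 0.0005}

# Channel model P(a|w): how likely the observation "hous" arose from each word.
p_a_given_w = {"house": 0.20, "hose": 0.05, "horse": 0.01}

def decode(observation_probs, prior_probs):
    """Return the word maximizing P(a|w) * P(w)."""
    return max(prior_probs, key=lambda w: observation_probs[w] * prior_probs[w])

print(decode(p_a_given_w, p_w))  # -> 'house'
```

The same decomposition underlies speech recognition, MT, and OCR correction; only the channel model changes.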


Are these "documents" English text to file?
1. I am going to go visit Sonoma county this weekend.
2. THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
- No! Pure ASCII with the word 'the' or 'The'
- We can do better than this with language models!
- Character sequence level language models are very effective language recognizers (see the code sketch below).
- Far more robust: file is an overextended set of hacks…. Even Java vs. C++ can confuse it

Embedding NLP
- Why hasn't this happened?
- Systems people divorced from NLP people?
- Is the problem just cycles? [It has at Microsoft!]

Bill Gates, Remarks to Gartner Symposium, October 6, 1997:
"Applications always become more demanding. Until the computer can speak to you in perfect English and understand everything you say to it and learn in the same way that an assistant would learn – until it has the power to do that – we need all the cycles. We need to be optimized to do the best we can. Right now linguistics are right on the edge of what the processor can do. As we get another factor of two, then speech will start to be on the edge of what it can do."

Why NLP is difficult: newspaper headlines
- Ban on Nude Dancing on Governor's Desk
- Iraqi Head Seeks Arms
- Juvenile Court to Try Shooting Defendant
- Teacher Strikes Idle Kids
- Stolen Painting Found by Tree
- Local High School Dropouts Cut in Half
- Red Tape Holds Up New Bridges
- Clinton Wins on Budget, but More Lies Ahead
- Hospitals Are Sued by 7 Foot Doctors
- Kids Make Nutritious Snacks

GRE Analytic Section Questions (The former)
Six sculptures – C, D, E, F, G, H – are to be exhibited in rooms 1, 2, and 3 of an art gallery.
- If sculptures E and F are exhibited in the same room, no other sculpture may be exhibited in that room.
- Sculptures C and E may not be exhibited in the same room.
- Sculptures D and G must be exhibited in the same room.
- At least one sculpture must be exhibited in each room, and no more than three sculptures may be exhibited in any room.
If sculpture D is exhibited in room 3 and sculptures E and F are exhibited in room 1, which of the following may be true?
A. Sculpture C is exhibited in room 1
B. Sculpture H is exhibited in room 1
C. Sculpture G is exhibited in room 2
D. Sculptures C and H are exhibited in the same room
E. Sculptures G and F are exhibited in the same room
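As a concrete sketch of the character-level language model idea above ("is this English text?"), here is a toy character-bigram recognizer. The training text and the threshold are invented for illustration; a real recognizer would be trained on far more data, but the decision rule is the same.

```python
import math
from collections import defaultdict

# Minimal character-bigram "is this English text?" sketch, in the spirit of
# improving on the `file` heuristics above.  Toy training text and a
# hypothetical threshold.

TRAIN = ("the quick brown fox jumps over the lazy dog and then the dog "
         "went to sleep in the house while the cat sat on the mat ") * 10

counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(TRAIN, TRAIN[1:]):
    counts[a][b] += 1

def avg_logprob(text):
    """Average per-bigram log probability under the character bigram model."""
    text = text.lower()
    total = 0.0
    for a, b in zip(text, text[1:]):
        c_ab = counts[a][b]
        c_a = sum(counts[a].values())
        total += math.log((c_ab + 1) / (c_a + 256))  # add-one smoothing
    return total / max(len(text) - 1, 1)

THRESHOLD = -5.0  # hypothetical cutoff separating English-like from not

for doc in ["I am going to go visit Sonoma county this weekend.",
            "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.",
            "xqzv kjwq pfzx vvkq zzzp"]:
    # The first two documents score well above the cutoff (English); the
    # gibberish string falls below it.
    print(doc[:30], "->", "English" if avg_logprob(doc) > THRESHOLD else "not English")
```

Lowercasing before scoring is what lets the all-caps document through, which is exactly where the byte-pattern hacks in file fall over.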


Word sense disambiguation: the many meanings of interest [n.]
1. Readiness to give attention to or to learn about something
2. Quality of causing attention to be given
3. Activity, subject, etc., which one gives time and attention to
4. The advantage, advancement or favor of an individual or group
5. A stake or share (in a company, business, etc.)
6. Money paid regularly for the use of money
- Converse: words that mean (almost) the same: image, likeness, portrait, facsimile, picture

Naive-Bayes models for WSD
- Model: P(s | w1, …, wn) ∝ P(s) P(w1|s) … P(wn|s)
- [Figure: graphical model with the sense s generating the context words w1, w2, w3]
- Parameters: the priors P(s) and the word distribution P(w|s).
- Parameters usually set using relative frequency estimators (RFEs) with some smoothing (a toy code sketch follows below):
  P(s) = count(s)/total
  P(w|s) = count(w,s)/count(s)
- Moral: simple counts often work quite well as a substitute for AI-complete reasoning … though not perfectly

Naive Bayes WSD performance (data: Leacock et al.)
- [Figure: "Joint NB" accuracy vs. training set size (0–4000 examples); accuracy axis 0.50–0.90 for "line", 0.76–0.86 for "hard"]
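A minimal Naive Bayes WSD sketch in the spirit of the model above, with relative frequency estimates and add-one smoothing. The two senses and the tiny training contexts are invented for illustration.

```python
from collections import Counter, defaultdict

# Naive Bayes WSD: P(s | w1..wn) proportional to P(s) * prod_i P(wi | s),
# with relative-frequency estimates plus add-one smoothing.
train = [
    ("FINANCE",   "the bank raised its interest rate this week".split()),
    ("FINANCE",   "interest paid on the loan rose".split()),
    ("ATTENTION", "she showed great interest in the lecture".split()),
    ("ATTENTION", "his interest in music grew".split()),
]

sense_counts = Counter(s for s, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for s, words in train:
    word_counts[s].update(words)
    vocab.update(words)

def score(sense, context):
    """Unnormalized P(s) * prod_i P(w_i | s), with add-one smoothing."""
    p = sense_counts[sense] / sum(sense_counts.values())   # prior P(s)
    total_words = sum(word_counts[sense].values())
    for w in context:
        p *= (word_counts[sense][w] + 1) / (total_words + len(vocab))
    return p

context = "the interest rate rose".split()
print(max(sense_counts, key=lambda s: score(s, context)))  # expected: FINANCE
```

Even this crude counting scheme captures the slide's moral: counts plus smoothing go a long way before any deep reasoning is needed.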


Linear sequence models for POS
- You can model much of POS resolution as a flat sequence model:
  Sequence Data → Feature Extraction → Sequence Model Inference
- Classic model: Hidden Markov Models
- [Figure: HMM state sequence x1 … x_{t-1}, x_t, x_{t+1}, … x_T over observed tokens o1 … ("… the … sat on …"), with transition probabilities P(x_t | x_{t-1}) and token emission probabilities]

HMM formalism: HMM as a FSA
- HMM = probabilistic FSA
  - states s1, s2, … (special start state s1, special end state sn)
  - token alphabet a1, a2, …
  - state transition probs P(si | sj)
  - token emission probs P(wk | sj)
- Widely used in many language processing tasks, e.g., speech recognition, POS tagging, topic detection.

Feature-Based Models
- The decision about a data point is based only on the features active at that point.
- Examples:
  - Text Categorization – Data: "… Stocks hit a yearly low …"; Label: BUSINESS; Features: {…, stocks, hit, a, yearly, low, …}
  - Word-Sense Disambiguation – Data: "… bank … to restructure … debt."; Label: MONEY; Features: {…, P=restructure, N=debt, L=12, …}
  - POS Tagging – Data: "… the previous fall …"; Label: NN; Features: {W=fall, PW=previous, PT=JJ}

Local and sequence classifiers
- [Figure: Local Level (per data point: Data → Features → Label, with classifier type, optimization, and smoothing all local) vs. Sequence Level (the local decisions tied together into one sequence model)]

Features
- Features are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
- A feature has a real value: f: C × D → R
- Usually features are indicator functions of properties of the input and a particular class (they pick out a subset):
  f_i(c, d) ≡ [Φ(d) ∧ c = c_i]   [Value is 0 or 1]
- We also say that Φ(d) is a feature of the data d, when, for each c_i, the conjunction Φ(d) ∧ c = c_i is a feature of the data-class pair (c, d).

Features: POS tagging example
For example:
- f1(c, d) ≡ [c = "NN" ∧ islower(w0) ∧ ends(w0, "d")]
- f2(c, d) ≡ [c = "NN" ∧ w-1 = "to" ∧ t-1 = "TO"]
- f3(c, d) ≡ [c = "VB" ∧ islower(w0)]
- Example decision: tagging "aid" in "… to/TO aid …" (NN or VB?)
- Models will assign each feature a weight
- These are the parameters of a probability model
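A small sketch of indicator features like f1–f3 above. The datum layout and helper names are illustrative, not the slides' exact representation.

```python
# Sketch of indicator features f_i(c, d) for POS tagging, following the
# f1-f3 examples above.  A datum d here is just (w0, w_prev, t_prev).
def islower(w):
    return w.islower()

def ends(w, suffix):
    return w.endswith(suffix)

def f1(c, d):  # current word is lowercase and ends in "d", class NN
    w0, w_prev, t_prev = d
    return 1 if c == "NN" and islower(w0) and ends(w0, "d") else 0

def f2(c, d):  # previous word "to" tagged TO, class NN
    w0, w_prev, t_prev = d
    return 1 if c == "NN" and w_prev == "to" and t_prev == "TO" else 0

def f3(c, d):  # current word is lowercase, class VB
    w0, w_prev, t_prev = d
    return 1 if c == "VB" and islower(w0) else 0

d = ("aid", "to", "TO")            # "... to/TO aid ..."
print([f(c, d) for c in ("NN", "VB") for f in (f1, f2, f3)])
# -> [1, 1, 0, 0, 0, 1]: f1 and f2 fire with NN, f3 fires with VB
```

Each firing feature will later contribute its learned weight to the class it is conjoined with; the features themselves stay cheap 0/1 tests.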


Joint vs. Conditional models
- Joint (generative) models place probabilities over both the observed data and the hidden stuff (generate the observed data from hidden stuff): P(c,d)
  - All the best known StatNLP models: n-gram models, Naïve Bayes classifiers, hidden Markov models, probabilistic context-free grammars
- Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c|d)
  - Logistic regression, conditional loglinear models, maximum entropy markov models, (SVMs, perceptrons)

Discriminative models
- In the last few years there has been extensive use of conditional or discriminative probabilistic models in NLP, IR, and Speech
- Because:
  - They give high accuracy performance
  - They make it easy to incorporate lots of linguistically important features
  - They allow automatic building of language independent, retargetable NLP modules

Conditional models work well: Word Sense Disambiguation
- Even with exactly the same features, changing from joint to conditional estimation increases performance
- That is, we use the same smoothing, and the same word-class features; we just change the numbers (parameters of the model)
- Accuracy by training objective (Klein and Manning 2002, using Senseval-1 data):
  Training set: Joint Like. 86.8 | Cond. Like. 98.5
  Test set:     Joint Like. 73.6 | Cond. Like. 76.1

Bayes Net/Graphical models
- Bayes net diagrams draw circles for random variables, and lines for direct dependencies
- Some variables are observed; some are hidden
- Each node is a little classifier (conditional probability table) based on incoming arcs
- [Figure: Naïve Bayes (c → d1, d2, d3) and HMM (c1 → c2 → c3, each emitting a d) as generative models; Logistic Regression (d1, d2, d3 → c) as a discriminative model]

Feature-Based Classifiers
- Classify from feature sets {f_i} to classes {c}.
- "Linear" classifiers:
  - Assign a weight λ_i to each feature f_i.
  - For a pair (c,d), features vote with their weights: vote(c) = Σ λ_i f_i(c,d)
  - Example "… to/TO aid …": vote(NN) = 1.2 + (-1.8), vote(VB) = 0.3
  - Choose the class c which maximizes Σ λ_i f_i(c,d) = VB
- There are many ways to choose weights
  - Perceptron: find a currently misclassified example, and nudge weights in the direction of a correct classification

Feature-Based Classifiers: exponential models
- Exponential (log-linear, maxent, logistic, Gibbs) models use the linear combination Σ λ_i f_i(c,d) to produce a probabilistic model:
  P(c | d, λ) = exp(Σ_i λ_i f_i(c,d)) / Σ_c' exp(Σ_i λ_i f_i(c',d))
  (the exp makes votes positive; the denominator normalizes votes)
- P(NN | to, aid, TO) = e^1.2 e^-1.8 / (e^1.2 e^-1.8 + e^0.3) = 0.29
- P(VB | to, aid, TO) = e^0.3 / (e^1.2 e^-1.8 + e^0.3) = 0.71
- The weights are the parameters of the probability model, combined via a "soft max" function
- Given this model form, we will choose parameters {λ_i} that maximize the conditional likelihood of the data according to this model.
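The 0.29 / 0.71 numbers above can be checked directly. Here is a minimal soft-max sketch using the slide's weights (1.2 and -1.8 for the NN features, 0.3 for the VB feature).

```python
import math

# Soft-max combination of feature weights, reproducing the "to/TO aid" example:
# votes are sums of weights lambda_i for the features active with each class.
votes = {
    "NN": 1.2 + (-1.8),   # two NN features fire, weights 1.2 and -1.8
    "VB": 0.3,            # one VB feature fires, weight 0.3
}

def softmax(votes):
    """P(c|d) = exp(vote(c)) / sum_c' exp(vote(c'))."""
    z = sum(math.exp(v) for v in votes.values())
    return {c: math.exp(v) / z for c, v in votes.items()}

print({c: round(p, 2) for c, p in softmax(votes).items()})
# -> {'NN': 0.29, 'VB': 0.71}
```

Training then amounts to adjusting the weights so that these conditional probabilities of the observed labels are as high as possible.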


CMM POS tagging models
- Local discriminative decisions are chained together to give a conditional markov model (Ratnaparkhi 1996; Toutanova et al. 2003, etc.)
- [Figure: chain of tag decisions t1 … t3, each conditioned on the previous tag and the observed words]

Example: POS Tagging
- Features can include:
  - Current, previous, next words in isolation or together.
  - Previous (or next) one, two, three tags.
  - Word-internal features: word types, suffixes, dashes, etc.
- Local context at a decision point (tagging "22.6" in "The Dow fell 22.6 % …"):
  Position: -3   -2   -1    0     +1
  Word:     The  Dow  fell  22.6  %
  Tag:      DT   NNP  VBD   ???   ???
  Features: {W0=22.6, W+1=%, W-1=fell, T-1=VBD, T-1-T-2=NNP-VBD, hasDigit?=true, …}

Label bias and observation bias
- The bad independence assumptions of directional models can lead to label bias (Bottou 91, Lafferty 01) or observation bias (Klein & Manning 02)
- Example: "will to fight" – will {MD, NN}, to {TO}, fight {NN, VB, VBP}
- P(t1=MD, t2=TO | will, to) = P(MD | will, <start>) * P(TO | to, MD) = P(MD | will, <start>) * 1
- will will be mis-tagged as MD, because MD is the most common tagging (see the code sketch below)

Centered context is better: dependency networks
- Conditioning on both left and right tags fixes the problem
- [Figure: models L+L2 (t-2, t-1 → t0), R+R2 (t0 ← t1, t2), and L+R (t-1 → t0 ← t1), each also conditioned on w0]
- Results (BEST test set):
  Model | Features | Token  | Unknown | Sentence
  L+L2  | 32,935   | 96.05% | 85.92%  | 44.04%
  R+R2  | 33,423   | 95.25% | 84.49%  | 37.20%
  L+R   | 32,610   | 96.57% | 87.15%  | 49.50%
- Model L+R has 13.2% error reduction from Model L+L2

Final POS tagging test results
- Final model: 460,552 features; Token 97.24%, Unknown 89.04%, Sentence 56.34%
- Comparison to best published results – Collins 02: 4.4% error reduction in token error rate (statistically significant)
- [Comparison chart: token error rates 2.90%, 2.71%, 2.51%]

Named Entity Recognition task
- Task: Predict semantic label of each word in text
  Foreign   NNP  I-NP  ORG
  Ministry  NNP  I-NP  ORG
  spokesman NN   I-NP  O
  Shen      NNP  I-NP  PER
  Guofang   NNP  I-NP  PER
  told      VBD  I-VP  O
  Reuters   NNP  I-NP  ORG
  :         :    O     O
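The mis-tagging arithmetic above is easy to reproduce with toy numbers. The local conditional probabilities below are invented; only the structure, P(TO | to, t1) = 1 for every choice of t1, matches the slide's example.

```python
# Toy illustration of the left-to-right CMM example ("will to fight").
# Because the second local decision is the same no matter what t1 was chosen,
# the sequence score is decided entirely by the first local decision.
p_t1_given_will = {"MD": 0.7, "NN": 0.3}                 # MD is the more common tagging
p_t2_given_to = {("MD", "TO"): 1.0, ("NN", "TO"): 1.0}   # "to" is always TO

def sequence_score(t1, t2="TO"):
    return p_t1_given_will[t1] * p_t2_given_to[(t1, t2)]

scores = {t1: sequence_score(t1) for t1 in p_t1_given_will}
print(max(scores, key=scores.get), scores)
# -> 'MD' {'MD': 0.7, 'NN': 0.3}: "will" gets tagged MD even when, as in
# "will to fight", NN is correct -- later evidence cannot flow back.
```

Conditioning each tag on both its left and right neighbours, as in the L+R dependency-network model above, is what lets the evidence from "to fight" influence the tag of "will".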


NER in web search? Solving false name matches

Example: NER conditional models (Klein et al. 2003; also, Borthwick 1999, etc.)
- Sequence model across words; each word classified by a local model
- Features include the word, previous and next words, previous classes, previous, next, and current POS tag, character n-gram features and signature of word (see the code sketch below)
- [Figure: decision point for "Grace" in "… at Grace Road …" – Prev: at/IN (sig x), Cur: Grace/NNP (sig Xx), Next: Road/NNP (sig Xx); previous state: Other; class: ???]
- High performance (> 92% on English devtest set) comes from combining many informative features.
- With smoothing / regularization, more features never hurt! 800K features

Feature Weights (Klein et al. 2003)
- Example weights for the decision point above (classes PERS vs. LOC):
  Feature Type         | Feature | PERS  | LOC
  Previous word        | at      | -0.73 | 0.94
  Current word         | Grace   | 0.03  | 0.00
  Beginning bigram     | …       | …     | -0.04
  Previous state       | Other   | …     | …
  Current signature    | Xx      | 0.80  | 0.46
  Prev state, cur sig  | O-Xx    | 0.68  | 0.37
  Prev-cur-next sig    | x-Xx-Xx | -0.69 | 0.37
  P. state - p-cur sig | O-x-Xx  | -0.20 | 0.82
  …
  Total:               |         | -0.58 | 2.68

Character-level information
- Traditional linguistics idea (Saussure, etc.): form is usually uncorrelated with meaning – "the arbitrariness of the linguistic sign"
- Also inherent in common NLP word-level models
- But for names, this misses a valuable source of information
- People often classify PNPs by how they look:
  - Cotrimoxazole → drug
  - Wethersfield → place
  - Alien Fury: Countdown to Invasion → movie
  - John → person
  - … Inc → company
- [Table: Common Words and Letter Sequences characteristic of each class (e.g. "oxa" for drugs, ":" for movies, "field" for places, "John" for persons, "Inc" for companies), with corpus counts]

Final NER Results: English
- [Figure: bar chart of Precision, Recall and F1 for LOC, MISC, ORG, PER, and Overall]
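The word "signature" features above (Grace → Xx, at → x) can be sketched as a simple shape-collapsing rule. This version is illustrative only and not necessarily the exact mapping used by Klein et al. (2003).

```python
import re

def signature(word):
    """Collapse a word into a coarse shape: uppercase->X, lowercase->x, digit->0,
    then collapse repeated shape characters.  Illustrative sketch."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return re.sub(r"(.)\1+", r"\1", shape)   # "Xxxxx" -> "Xx", "00.0" -> "0.0"

for w in ["Grace", "at", "Road", "22.6", "Cotrimoxazole", "Wethersfield"]:
    print(w, "->", signature(w))
# Grace -> Xx, at -> x, Road -> Xx, 22.6 -> 0.0, Cotrimoxazole -> Xx, Wethersfield -> Xx
```

Signatures generalize across unseen names: a capitalized shape next to a known location word like Road pushes toward LOC even when the word itself was never seen in training.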


… through to Knowledge Representation

Modern statistical parsing
- A hugely increased ability to do accurate, robust, broad coverage parsing of sentences
- Achieved by converting parsing into a classification task and using probabilistic machine learning methods
- Statistical methods quite accurately resolve most structural and real world ambiguities
- Provide probabilistic models which can be integrated with speech recognition systems
- Quickly: find a good parse in a few seconds

Marked-up data
- Key resource for parsing: the Penn Treebank
  - 1 million words of hand-parsed WSJ newswire
  - And other stuff: Switchboard, partial Brown
- More recently, growing number of treebanks: Prague (Czech) dependency treebank, (Penn) Chinese Treebank
- (It shows that grammar induction is a hard problem … but one I'm interested in.)

Sparseness: 1 million words is nothing
- Something like 965,000 constituents, but only 66 WHADJP
  - Only 6 aren't how much or how many, but: how clever/original/incompetent (at {risk assessment and evaluation})
- Most of the probabilities that you would like to compute, you can't compute
- Parsers use complex models with a lot of parameters estimated from very sparse data
  - E.g., Gildea 01 Collins 97 reimplementation: 735,850 parameters [for Model 1]

Lexicalized PCFG parsing: Charniak (1997)

Charniak 1997 smoothing
- Factored models for lexicalized parsing
- Uses an interpolation of estimates of different specificities
- Aim is to use richly conditioned estimates when available, but to back off to coarser estimates when they're not available
- A lot of smoothing/backoff ("regularization")
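As a rough illustration of interpolating estimates of different specificities, here is a toy sketch. The conditioning contexts, counts, and interpolation weights are all invented; Charniak's actual model interpolates particular distributions over rule expansions and head words estimated from the treebank.

```python
# Sketch of backoff by linear interpolation across estimates of different
# specificities, in the spirit of the Charniak (1997) smoothing above.
# Toy counts and hypothetical weights.

counts_specific = {("plummeted", "stocks"): 2}   # fully lexicalized count
counts_medium   = {("plummeted", "NNS"): 15}     # backed off to the argument's tag
counts_coarse   = {"plummeted": 80}              # unconditioned head count

totals = {"specific": 3, "medium": 200, "coarse": 100000}
weights = (0.6, 0.3, 0.1)                        # hypothetical interpolation weights

def interpolated_estimate(head, arg_word, arg_tag):
    p_specific = counts_specific.get((head, arg_word), 0) / totals["specific"]
    p_medium   = counts_medium.get((head, arg_tag), 0) / totals["medium"]
    p_coarse   = counts_coarse.get(head, 0) / totals["coarse"]
    l1, l2, l3 = weights
    return l1 * p_specific + l2 * p_medium + l3 * p_coarse

# The richly conditioned estimate is available here, but the coarser terms
# still contribute, so pairs never seen together (e.g. "skyrocketed"/"stocks")
# would get non-zero probability from the backoff terms alone.
print(interpolated_estimate("plummeted", "stocks", "NNS"))
```

In practice the interpolation weights themselves are tuned (e.g. on held-out data) rather than fixed by hand.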


Parsing results
- Labeled precision/recall F measure: 86–90%
- Unlabeled dependency accuracy: 90+%

Sparseness of bilexical statistics
- Much work uses bilexical statistics: likelihoods of relationships between pairs of words
- Very sparse, even on topics central to the WSJ:
  - stocks plummeted: 2 occurrences
  - stocks stabilized: 1 occurrence
  - stocks skyrocketed: 0 occurrences
  - stocks discussed: 0 occurrences
- Gildea 01: You only lose 0.5% by eliminating bilexical statistics on WSJ; nothing cross-domain
- So far very little success in augmenting the treebank with extra unannotated materials or using semantic classes or clusters

Question answering from text
- An idea originating from the IR community
- With massive collections of on-line documents, manual translation of knowledge is impractical: we want answers from textbases [cf. bioinformatics]
- TREC 8+ QA competition (1999–)
- (Early on) evaluated output was 5 answers of 50 byte snippets of text drawn from a 3 Gb text collection. (IR think.) Get reciprocal points for highest correct answer (see the code sketch below).
- Mainly factoid 'Trivial Pursuit' questions

Pasca and Harabagiu (2001) – value from sophisticated NLP
- Good IR is needed: SMART paragraph retrieval
- Large taxonomy of question types and expected answer types is crucial
- Statistical parser used to parse questions and relevant text for answers, and to build KB
- Query expansion loops (morphological, lexical synonyms, and semantic relations) important
- Answer ranking by simple ML method
- Usable for web access. But still a little too slow….

Question Answering Example
- How hot does the inside of an active volcano get?
  get(TEMPERATURE, inside(volcano(active)))
- "lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit"
  fragments(lava, TEMPERATURE(degrees(300)), belched(out, mountain))
- volcano ISA mountain
- lava ISPARTOF volcano ■ lava inside volcano
- fragments of lava HAVEPROPERTIESOF lava
- The needed semantic information is in WordNet definitions, and was successfully translated into a form that can be used for rough 'proofs'
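The TREC scoring rule mentioned above ("reciprocal points for highest correct answer", i.e. mean reciprocal rank) is straightforward to sketch. The ranked judgments below are invented.

```python
# Mean reciprocal rank (MRR): each question earns 1/rank of the highest-ranked
# correct answer among its 5 returned snippets, 0 if none is correct.
def reciprocal_rank(judgments):
    for rank, correct in enumerate(judgments, start=1):
        if correct:
            return 1.0 / rank
    return 0.0

runs = [
    [False, True, False, False, False],   # first correct answer at rank 2 -> 0.5
    [True, False, False, False, False],   # rank 1 -> 1.0
    [False, False, False, False, False],  # no correct answer -> 0.0
]

mrr = sum(reciprocal_rank(r) for r in runs) / len(runs)
print(mrr)  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```

The metric rewards putting a correct snippet first far more than merely getting one somewhere in the top five, which is why answer ranking matters so much in these systems.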


Semantics: GRE Analytic Section Questions (the former)
Six sculptures – C, D, E, F, G, H – are to be exhibited in rooms 1, 2, and 3 of an art gallery.
- If sculptures E and F are exhibited in the same room, no other sculpture may be exhibited in that room.
- Sculptures C and E may not be exhibited in the same room.
- Sculptures D and G must be exhibited in the same room.
- At least one sculpture must be exhibited in each room, and no more than three sculptures may be exhibited in any room.
If sculpture D is exhibited in room 3 and sculptures E and F are exhibited in room 1, which of the following may be true?
A. Sculpture C is exhibited in room 1
B. Sculpture H is exhibited in room 1
C. Sculpture G is exhibited in room 2
D. Sculptures C and H are exhibited in the same room
E. Sculptures G and F are exhibited in the same room

GRE Analytic Section Questions (the former)
- "At least one sculpture must be exhibited in each room, and no more than three sculptures may be exhibited in any room."
- Has at least 2x4=8 readings; taking the second conjunct, "No more than three sculptures may be exhibited in any room":
  1. ¬(∃4 x (sculpture(x) ∧ ∃y (room(y) ∧ exhibit(x, y))))
  2. ¬(∃y (room(y) ∧ ∃4 x (sculpture(x) ∧ exhibit(x, y))))
  3. …
  4. …

Conclusion
- Human-level natural language interaction is still a distant goal
- But there are now practical and usable NLU systems applicable to many problems
- Statistical NLP methods have opened up new possibilities for robust high performance text understanding systems.
- People should be looking more for opportunities to embed NLP into systems!

Thank you!

The End
