The Modules and Methods of Topic Detection and Tracking

Niek Hoogma
University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science
n.hoogma@student.utwente.nl

ABSTRACT
This report presents the methods and techniques used to perform the tasks of Topic Detection and Tracking (TDT). It starts with an introduction to TDT and its five tasks: Story Segmentation, Topic Detection, Topic Tracking, First Story Detection and Link Detection. In order to characterize the performance of a task, two important measurement techniques are brought in. Each task is introduced by a brief description and its distinctive characteristics. In addition, each task is accompanied by the best-known related mathematical methods.

Keywords
Topic Detection, Topic Tracking, Story Segmentation, Link Detection, First Story Detection

1. INTRODUCTION
Topic Detection and Tracking (TDT) refers to a variety of automatic techniques for discovering and threading together topically related materials in streams of data. Automatic discovery and threading could be quite valuable in many applications where people need timely and efficient access to large quantities of information. Systems could alert users to new events and to new information about old events. By examining one or two stories, a user could decide whether to pay attention to the rest of an evolving thread.

1.1 About this paper
At the moment, a lot of information is available concerning specific elements of TDT. This information is suitable for researchers familiar with TDT. For the inexperienced, however, it is very hard to get an overview of the most important aspects of TDT. This paper is written to assist these people by giving the big picture of Topic Detection and Tracking.

This paper starts with an introduction to the TDT process as a whole. The goal of TDT is to generate efficient access to large quantities of information, and this paper describes how the current techniques achieve this. In addition, TDT can be split up into different (sub)tasks, which are combined to perform the overall task of TDT. Since each task can be considered an essential building block, each one is described by its most important characteristics and methods.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.
2nd Twente Student Conference on IT, Enschede, 21 January 2005
Copyright 2005, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science

This paper focuses on the most important aspects selected from a large collection of available information. A selection had to be made in order to address the right level of detail. The applications of the described techniques are not the focus of this paper, nor is it meant to give a full technical description. If you are looking for a discussion of the mathematical concepts in greater detail, consult the references.

1.2 Background
The basic idea for TDT originated in 1996, when the Defense Advanced Research Projects Agency (DARPA) realized it needed technology to determine the topical structure of news streams without human intervention. In 1997, a pilot study [ACD+98] laid the essential groundwork, serving as a solid base for further research. The ideal result would be a set of algorithms which are source, medium, domain, language and application independent.

TDT is sponsored by the U.S. Government and is explored in the context of four annual cooperatively competitive evaluations run by the National Institute of Standards and Technology (NIST).

1.3 Corpora
Good corpora are extremely important for both research and evaluation. There is currently one important corpus available: TDT-5, maintained by the Linguistic Data Consortium (LDC) [LDC04]. This corpus is considered to be the world standard and contains over 400,000 documents in three different languages: English, Arabic and Mandarin. Remarkably, the latest corpus, unlike its predecessors, contains only data from newswire.

In order to support the results of this paper, test runs with the TDT-5 corpus were initially planned. However, because this corpus is not freely available, this was unfortunately impossible. This is a small setback, because it would have made the results more intuitive.
1.4 TDT Tasks
Figure 1 gives an overview of the entire TDT process. The figure contains the stories of the source data (depicted by T-1 ... T-6) and the five TDT tasks (depicted by the arrows). The input of the entire process is a news stream containing a number of stories. The first task of TDT is to detect (the boundaries between) the stories within the stream. This is done by the segmentation process. This process is optional, since the source data can be supplied in segmented form, for example newswire.

The entire process then builds a set of clusters, each containing stories which discuss the same topic. The process of building these clusters is called Topic Detection. It assigns each story to one or possibly more clusters. A story discussing an unknown topic is detected by a part of the detection process called First Story Detection. This process will generate a new cluster for the unknown topic.


Figure 1. The tasks of Topic Detection and Tracking (T-1..T-6 are stories, C-1..C-3 are clusters/topics).

The process of Topic Tracking is the selection of a certain cluster specified by one or more example stories. There is one additional process, called Link Detection, which is not depicted in Figure 1. This process is basically a kernel function which establishes whether or not two stories are linked.

Summarizing, the five main research tasks are:
• Story Segmentation: detect changes between topically cohesive sections
• Topic Tracking: find the set of stories similar to a set of example stories
• Topic Detection: build clusters of stories that discuss the same topic
• First Story Detection: detect whether a story is the first story of an unknown topic
• Link Detection: detect whether or not two stories are topically linked

1.5 Dynamic aspects of TDT
Although the tasks of TDT have been approached with traditional Information Retrieval techniques, the settings of TDT present unusual problems that complicate the use of traditional techniques [MAMS04]. This is one of the reasons why TDT research was started as a separate thread. The following key aspects characterize the project:
• TDT systems run on-line and can make very few assumptions about the incoming data
• Topics often involve only a small number of documents that are encountered in a burst
• The essential vocabulary describing a topic may change drastically in a short time

There is a need for specialized methods for the TDT tasks. The following chapters give an overview of these methods, categorized per TDT task. Before each task is discussed, however, two frequently used measurement techniques are introduced.

2. EVALUATION MEASURES AND APPROACHES
The different TDT tasks can all be considered to be some sort of detection. Given an input and a hypothesis about the data, a TDT system decides whether that hypothesis holds [FDGM98]. There are two techniques, the cost function and DET curves, which characterize the detection performance in terms of the probabilities of miss and false alarm errors, P_miss and P_fa [ACD+98].

2.1 Set-Based Measures
Most filtering and tracking measures can be defined in terms of the following well-known contingency table [All02]:

Table 1. The contingency table.
            Retrieved   Not retrieved
On-topic    A           B
Off-topic   C           D

Table 2 expresses the commonly used measures from IR and TDT [All02]:

Table 2. Measures derived from the contingency table.
Recall      = A / (A + B)   Proportion of on-topic material that is retrieved
Precision   = A / (A + C)   Proportion of retrieved material that is on-topic
Miss        = B / (A + B)   Proportion of on-topic material that is not retrieved
False Alarm = C / (C + D)   Proportion of off-topic material that is retrieved
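The four measures of Table 2 translate directly into code. The following is a minimal sketch in Python; the function name and the example counts are illustrative, not taken from the paper:

```python
def set_based_measures(a, b, c, d):
    """Compute the Table 2 measures from the contingency counts:
    a = on-topic retrieved, b = on-topic not retrieved,
    c = off-topic retrieved, d = off-topic not retrieved."""
    return {
        "recall": a / (a + b),        # on-topic material that is retrieved
        "precision": a / (a + c),     # retrieved material that is on-topic
        "miss": b / (a + b),          # on-topic material that is not retrieved
        "false_alarm": c / (c + d),   # off-topic material that is retrieved
    }

# Hypothetical run: 80 on-topic retrieved, 20 missed, 30 false alarms, 870 rejected
print(set_based_measures(80, 20, 30, 870))
```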


2.2 Cost Function
TDT evaluations are carried out by using a cost function, which penalizes misses and false alarms [FDGM98]. The cost function is defined as a linear combination of P_miss and P_fa:

$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{fa} \cdot P_{fa} \cdot (1 - P_{target})$

where C_miss and C_fa are the costs of a missed detection and a false alarm respectively, and P_target is the prior probability of finding the target. These parameters are fixed constant values that are set by NIST.

2.3 DET Curves
One problem with the set-based measures of Section 2.1 is that they require careful selection of a cutoff mechanism for deciding which stories to include and which to omit [All02]. The well-known recall/precision graph portrays the quality of that threshold by showing how the measures trade off against each other as the threshold varies. A high threshold results in good precision: only a small number of off-topic stories are contained in the set. When a low threshold is used, only a small number of on-topic stories are missed (good recall).

A variation on the recall/precision graph is the DET (Detection Error Tradeoff) curve. A DET curve is generated by sweeping the decision threshold through the score space [MDK+97]. At each threshold the missed detection rate and the corresponding false alarm rate are calculated. Three example DET curves are plotted in Figure 2. The connected points form the DET curve, with the false alarm probability (P_fa) plotted on the horizontal axis and the corresponding miss probability (P_miss) on the vertical axis. A code sketch of both evaluation techniques follows below.

Figure 2. Plot of DET curves for a speaker recognition evaluation.
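To make both techniques concrete, here is a minimal sketch of the linear cost of Section 2.2 and of the threshold sweep that produces the points of a DET curve (Section 2.3). The constants C_miss = 1.0, C_fa = 0.1 and P_target = 0.02 are illustrative placeholders, not NIST's official settings:

```python
import numpy as np

def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Linear cost of Section 2.2; the constants are placeholders."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def det_points(scores, labels):
    """Sweep the decision threshold through the score space (Section 2.3)
    and return (P_fa, P_miss) pairs; plotted on normal-deviate axes these
    points form a DET curve."""
    order = np.argsort(scores)[::-1]      # stories by descending confidence
    labels = np.asarray(labels)[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    hits = np.cumsum(labels)              # targets accepted so far
    fas = np.cumsum(1 - labels)           # non-targets accepted so far
    p_miss = 1.0 - hits / n_pos
    p_fa = fas / n_neg
    return p_fa, p_miss

# Toy example: scores for 3 on-topic (1) and 3 off-topic (0) stories
p_fa, p_miss = det_points([0.9, 0.8, 0.6, 0.4, 0.3, 0.1], [1, 0, 1, 1, 0, 0])
print(min(detection_cost(m, f) for m, f in zip(p_miss, p_fa)))
```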
3. STORY SEGMENTATION
3.1 Characteristics
A multimedia stream of a news broadcast has certain characteristics which can help segmentation:
• States within a news broadcast: a complete news broadcast can be split up into several states. Identifying these states can enhance the segmentation. For example, at the end of a story there is likely to be a transition from the reporter or the topic expert to a speaking anchor.
• Pre-defined templates: typically, when a new topic is started and there is a switch to a reporter, the reporter will introduce himself. For example: "I am James Earl Jones, reporting from Amsterdam ...". Since the same patterns occur in every stream, it is possible to use a text tagging tool to perform pattern matching.

3.2 Methods
The approaches for Story Segmentation rely heavily on the use of combinations of methods. The following subsections present the approaches of two research institutes, IBM and MITRE.

3.2.1 IBM's approach
IBM's story segmentation uses a combination of decision trees and maximum entropy models. These take a variety of lexical, prosodic, semantic and structural features as their inputs. Both types of models are source-specific, and by combining them the cost for segmentation, C_seg, can be lowered substantially [DFM+02].

3.2.2 MITRE's approach
MITRE uses a naive Bayes classifier, which is known to greatly simplify learning by assuming that the features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers [Ris]; a toy illustration follows below.
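As an illustration of the naive Bayes idea, the following sketch classifies candidate story boundaries from binary features. The feature names, smoothing choice and training data are invented for the example and are not MITRE's actual setup:

```python
import math
from collections import defaultdict

def train_nb(samples):
    """samples: list of (features, label) pairs, where features is a set of
    binary indicators observed at a candidate boundary and label is
    'boundary' or 'no-boundary'."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for feats, label in samples:
        totals[label] += 1
        for f in feats:
            counts[label][f] += 1
    return counts, totals

def classify(feats, counts, totals, vocab):
    """Pick the label maximizing P(label) * prod P(feature|label), under
    the naive assumption that features are independent given the label."""
    n = sum(totals.values())
    best, best_lp = None, float("-inf")
    for label, t in totals.items():
        lp = math.log(t / n)                  # class prior
        for f in vocab:                       # add-one smoothed Bernoulli
            p = (counts[label][f] + 1) / (t + 2)
            lp += math.log(p if f in feats else 1.0 - p)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [({"pause", "anchor_speaks"}, "boundary"),
        ({"anchor_speaks"}, "boundary"),
        ({"reporter_speaks"}, "no-boundary")]
vocab = {"pause", "anchor_speaks", "reporter_speaks"}
counts, totals = train_nb(data)
print(classify({"pause", "anchor_speaks"}, counts, totals, vocab))  # 'boundary'
```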


4. TOPIC TRACKING
4.1 Introduction
The tracking task of TDT is fundamentally similar to TREC's filtering task [EV99]. The Text REtrieval Conference (TREC) is a series of workshops designed to foster research in text retrieval, and the work on the TDT tracking task is based upon the TREC filtering task. However, the TREC filtering task focuses on performance improvements driven by feedback from real-time relevance assessments. TDT systems, on the other hand, are designed to run autonomously without human feedback.

4.2 Characteristics
The input data of the topic tracking task is a representation of a topic, followed by a stream of arriving stories. The system makes a decision for each story. Stories are assigned a confidence score for the topic and, if the score is high enough, are tracked and/or retrieved. The latter part is achieved by a threshold determining a hard yes/no decision for each story. A tracking system consists of n separate binary classifiers [FWMZ01].

There are likely to be a small number of training stories that are known to be on the same topic. Stories may be assigned to more than one topic, or to none at all [FWMZ01]. By definition, no user feedback is allowed after the tracking process has started. Systems can adapt their guesses on a certain story, but they get no human confirmation in any form.

The tracking task is supervised, typically with 1-4 seed or training documents. A tracking system is provided with a small number, N, of on-topic training stories. The value of N is usually varied between one and eight, with four being the most commonly used value. The system's task is to analyze those stories and automatically identify the news topic being discussed. Each story being tracked is assigned a confidence score and a yes/no decision.

4.3 Methods
4.3.1 Vector space approach
The vector space approach uses methods based primarily on Information Filtering [APL98]. The stories are represented by vectors of features, found by applying a shallow tagger to the stories and selecting nouns, verbs, adjectives and numbers. Queries are represented by a similar vector of TF.IDF (Term Frequency-Inverse Document Frequency) weights, a term weighting approach commonly used in Information Retrieval [SB87]:

$sim(Q, D) = \frac{\sum_i q(i) \cdot d(i)}{\sum_i q(i)}, \quad d(i) = \frac{tf}{tf + 0.5} \cdot \log\frac{N}{df(i)}$

where q(i) is the weight of feature i in the query, d(i) is the weight of the feature in the story, tf is the number of times the feature occurs in the story, df(i) is the number of stories in the collection containing feature i, and N is the number of stories in the collection. The results of these methods depend strongly upon the selection of useful words and phrases from the training set.

4.3.2 Decision trees
Decision trees (d-trees) are classifiers built on the principle of a sequential greedy algorithm which at each step strives to maximize the reduction of system entropy [Qui86]. A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision [RN95]. The feature with the maximum information gain is placed at the root of the tree, and this process is recursively repeated for each branch. D-trees are commonly used because they are one of the simplest and yet most successful forms of learning algorithm.

4.3.3 K-Nearest Neighbor (kNN)
kNN is an instance-based classification method well known in pattern recognition and machine learning [Das91]. The system converts an input story into a vector as it arrives and compares it with the training stories. The next step is the selection of the k nearest neighbors based on the cosine similarity between the input story and the training stories. The confidence score s1 is computed by taking the difference between the summed similarity scores for the positive and negative stories:

$s_1(YES \mid x) = \sum_{d \in P(x,k)} \cos(d, x) - \sum_{d \in N(x,k)} \cos(d, x)$

where x is the input story, P(x,k) is the set of positive training stories in the k-neighborhood, and N(x,k) is the set of negative training stories in the k-neighborhood.
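A minimal sketch of the kNN confidence score s_1 of Section 4.3.3 under cosine similarity follows. The sparse dictionary representation of vectors, the value of k, and the toy stories are assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_confidence(story, training, k=5):
    """s1(YES|x): summed similarity to positive neighbors minus summed
    similarity to negative neighbors among the k nearest training stories.
    training: list of (vector, is_on_topic) pairs."""
    neighbors = sorted(training, key=lambda d: cosine(d[0], story),
                       reverse=True)[:k]
    return sum(cosine(d, story) if pos else -cosine(d, story)
               for d, pos in neighbors)

train = [({"earthquake": 2.0, "rescue": 1.0}, True),
         ({"election": 1.5, "vote": 2.0}, False)]
print(knn_confidence({"earthquake": 1.0, "damage": 1.0}, train, k=2))
```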
5. TOPIC DETECTION
5.1 Introduction
The TDT detection task effectively processes all news topics simultaneously [All02]. The goal of this task is to partition all arriving stories into "bins" depending on the topic being discussed. A bin consists of stories discussing the same topic. An important component is the recognition of the arrival of a new topic, i.e. a story that cannot be placed in any of the existing bins. This process is a separate TDT task called First Story Detection (Section 6).

5.2 Characteristics
The detection task is characterized by the lack of knowledge of the event to be detected [ACD+98]; it is completely unsupervised. Each news story is almost a unique combination of people, places and other facts not known prior to the news broadcast. The only training can be done with some pre-annotated training data that is likely to share only a few topics with the news being processed. There is no human feedback or correction to the system while it is running.

The input data of this task is a stream of stories, which are optionally separated by the segmentation task. The output data is a certain clustering of the stories into topics. The type of clustering determines whether it is possible to assign a story to multiple clusters. Since this module is highly connected with the First Story Detection task, there is an overlap between these two tasks.

5.3 Methods
5.3.1 Incremental Clustering
The Incremental Clustering algorithm, used by BBN, processes stories one at a time, sequentially, and for each story it executes a two-step process [WJSS99] (a code sketch follows at the end of this section):
• Selection: the topic cluster most similar to the story is selected
• Thresholding: the story is compared to that cluster, and the system decides whether to merge the story with the cluster or to start a new cluster

The big advantage of this approach is its dynamic character: there is no restriction on the number of clusters or on the cluster sizes. A drawback is that decisions can be made only once; early mistakes based on little information can be costly. Secondly, the computational requirement grows as more stories are processed.

It is important to determine the similarity between a story and a certain cluster. IBM does this using the symmetrized Okapi formula [DFM+]:

$Ok(d_1, d_2) = \sum_{w \in d_1 \cap d_2} t_w^{d_1} \, t_w^{d_2} \, idf(w)$

where d_1 and d_2 are the two documents and t_w^{d_i} are the term counts of word w in document d_i.

5.3.2 k-means clustering
This system is used by Dragon and operates as follows [YLS+99]:
• At any given point there are k story clusters, each characterized by a set of statistics.
• For the next available story, determine its distance to the closest cluster; if this distance is below a certain threshold, insert the story into the cluster and update the cluster's statistics. If the distance is above the threshold, create a new cluster.
• Loop through the stories again, but now consider switching each story from its present topic to each of the others, based on the same distance measure as before.
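The selection and thresholding loop of Section 5.3.1 could look as follows. This sketch reuses the cosine() helper from the kNN example; the cluster representation as a summed term-weight vector and the threshold value are assumptions, and the centroid update is plain summation rather than BBN's actual cluster statistics:

```python
def incremental_clustering(stories, threshold=0.3):
    """Process stories one at a time (Section 5.3.1): select the most
    similar existing cluster, then either merge or start a new cluster.
    Each cluster holds a summed term-weight vector and its stories."""
    clusters = []  # each cluster: {'centroid': dict, 'stories': list}
    for story in stories:
        best, best_sim = None, 0.0
        for c in clusters:                      # selection step
            s = cosine(c["centroid"], story)    # cosine() from the kNN sketch
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:  # thresholding step
            best["stories"].append(story)
            for t, w in story.items():          # merge: update the centroid
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        else:                                   # start a new cluster
            clusters.append({"centroid": dict(story), "stories": [story]})
    return clusters
```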


6. FIRST STORY DETECTION
6.1 Introduction
First Story Detection (FSD) operates in a strict on-line setting, processing stories from a news stream as they arrive [APL98]. The goal is to mark each story as either "first" or "not first", indicating whether or not it is the first one discussing a new topic [ALJ00].

6.2 Characteristics
This task should make a hard yes/no decision for each story in the input stream. In case an unknown topic is encountered, a new topic cluster is generated. FSD can be considered part of Topic Detection, since it only indicates whether a topic is mentioned for the first time [ALJ00].

Like Topic Detection, this task uses no supervised topic training. Research has shown that it is possible to reduce the TDT First Story Detection problem to the TDT Tracking problem. This is, however, only one of many research perspectives.

Another perspective is that the exploitation of time will lead to improved detection [APL98]. A side effect of broadcast news is that stories closer together in the stream are likely to discuss related events. A time penalty tp, based on the time between a query and a story, therefore increases the threshold. If the j-th story is compared to the query resulting from the i-th story, for i < j we have:

$\Theta(q_i, d_j) = 0.4 + p \cdot (eval(q_i, d_i) - 0.4) + tp \cdot (j - i)$

The threshold is determined between the query q and the document d using the following eval function:

$eval(q, d) = \frac{\sum_{i=1}^{N} w_i \cdot d_i}{\sum_{i=1}^{N} w_i}$

where w_i is the relative weight of query feature i and d_i is the belief that the feature's appearance in the document indicates a relevant query [APL98].

6.3 Methods
6.3.1 Single Pass Clustering [VR79]
A code sketch of this procedure follows at the end of this section.
• Use feature extraction and selection techniques to build a query representation of the story's content.
• Determine the query's initial threshold by evaluating the new story against the query.
• Compare the story against earlier queries in memory.
• If the story does not trigger any previous query by exceeding its threshold, the story contains a new event.
• If the story triggers an existing query, the story is flagged as not containing a new event.
• Add the new query to memory.

6.3.2 Topic Tracking Algorithms
Since this task can be considered a detection task, it can use the same methods as the other TDT tasks. Examples are k-nearest neighbors and the TF.IDF weighting scheme [ALMS99].
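The following sketch combines the single-pass procedure of Section 6.3.1 with the time-adjusted threshold of Section 6.2. It substitutes cosine similarity (from the kNN sketch) for the paper's eval() belief function, so the initial term eval(q_i, d_i) degenerates to 1.0; the values of p and tp and the toy stories are invented:

```python
def first_story_detection(stories, p=0.6, tp=0.002):
    """Flag each story as 'first' (new topic) or not. Each processed story
    spawns a query kept in memory; a new story that exceeds no existing
    query's time-adjusted threshold is a first story."""
    queries, flags = [], []
    for j, story in enumerate(stories):
        is_first = True
        for i, q in enumerate(queries):
            # Theta(q_i, d_j) = 0.4 + p*(eval(q_i, d_i) - 0.4) + tp*(j - i)
            theta = 0.4 + p * (q["init"] - 0.4) + tp * (j - i)
            if cosine(q["vec"], story) > theta:  # story triggers this query
                is_first = False
                break
        flags.append(is_first)
        # Under cosine, a story's similarity to its own query is 1.0
        queries.append({"vec": dict(story), "init": 1.0})
    return flags

docs = [{"quake": 1.0, "tokyo": 1.0},
        {"quake": 1.0, "tokyo": 1.0, "rescue": 0.1},  # near-duplicate follow-up
        {"election": 1.0, "vote": 1.0}]
print(first_story_detection(docs))  # [True, False, True]
```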
7. LINK DETECTION
7.1 Introduction
The goal of the Link Detection task is to detect whether two stories are "linked" by a common topic. Link Detection serves as a "kernel" function for other TDT tasks to rely on [ALMS99]. Link Detection is therefore not considered a separate end-user task, but a function that can be used by the other TDT tasks.

7.2 Characteristics
The input data are two documents; the output is a hard yes/no decision. As in the detection task, an internal threshold determines the final verdict. The task of Link Detection does not use supervised topic training.

This task uses several methods to determine the similarity between documents. The most important methods are the cosine similarity, the weighted sum, language models and the Kullback-Leibler divergence. Feature weighting can also be used, so there is a need for weighting schemes such as TF.IDF, TF and IDF.

7.3 Methods
7.3.1 Cosine Weighting
The critical property of the similarity function is its ability to separate stories that discuss the same topic from stories that discuss different topics. The cosine similarity is a classic measure used in Information Retrieval. It is represented by the angle between two vectors q and d:

$\cos(q, d) = \frac{\sum_i q_i d_i}{\sqrt{(\sum_i q_i^2) \cdot (\sum_i d_i^2)}}$

7.3.2 Weighted Sum
The weighted sum represents a linear combination of evidence, with weights representing the confidences associated with the various pieces of evidence:

$sim(q, d) = \frac{\sum_i q_i d_i}{\sum_i q_i}$

where q represents the query vector and d represents the document vector. A sketch of both similarity measures follows below.

7.3.3 Feature Weighting
An important issue is the weighting of the individual features (words) that occur in the stories [ALMS99]. The traditional weighting employed in most IR systems is a form of TF.IDF weighting.
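Both similarity measures of Sections 7.3.1 and 7.3.2 reduce to a few lines. The sketch below (reusing cosine() from the kNN example) wraps them in the hard yes/no decision of Section 7.2; the threshold of 0.2 and the example stories are invented:

```python
def weighted_sum(q, d):
    """Section 7.3.2: confidence-weighted linear combination of evidence,
    normalized by the total query weight."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    den = sum(q.values())
    return num / den if den else 0.0

def linked(story_a, story_b, threshold=0.2, measure=None):
    """Hard yes/no link decision (Section 7.2): compare a similarity
    score against an internal threshold."""
    measure = measure or cosine  # cosine() from the kNN sketch
    return measure(story_a, story_b) >= threshold

a = {"earthquake": 1.0, "tokyo": 1.0}
b = {"earthquake": 1.0, "damage": 1.0}
print(linked(a, b), weighted_sum(a, b))  # True 0.5
```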


8. CONCLUSION AND FUTURE WORK
8.1 Conclusion
This paper offers an overview of the available modules and methods for Topic Detection and Tracking. It shows that the tasks of TDT are hard to approach with traditional IR techniques. The goal of TDT research is to generate efficient access to large quantities of (news broadcast) information. TDT research is conducted by many research institutes under the control of a centralized organisation (NIST).

All TDT tasks can be considered to be some sort of detection. Each task uses an internal confidence score which is cut off using a threshold. Several measurement techniques, such as the cost function and DET curves, are widely used to evaluate the performance of a TDT method.

The TDT methods are based upon known techniques. In most cases a query or a document needs to be matched against another document. This can be done with cosine similarity, weighted sums and language models. Feature weighting, such as TF.IDF, is another widely used technique. These mathematical techniques provide a solid starting point for methods such as k-Nearest Neighbor (kNN) and vector space approaches.

8.2 Future Work
Although there appear to be transparent methods that could be used in multiple TDT tasks, the current systems have their own techniques for every task. The reason for this lies in the historical development of TDT. The role of the controlling organisation, NIST, should lead to a good problem specification. In the future it will be useful to generalize techniques, allowing them to be used in multiple tasks.

REFERENCES
[ACD+98] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study, 1998. URL citeseer.ist.psu.edu/article/allan98topic.html.
[ALJ00] James Allan, Victor Lavrenko, and Hubert Jin. First story detection in TDT is hard. In CIKM, pages 374-381, 2000. URL citeseer.ist.psu.edu/allan00first.html.
[All02] James Allan. Detection as multi-topic tracking. Inf. Retr., 5(2-3):139-157, 2002. ISSN 1386-4564. URL kluweronline.com/article.asp?PIPS=407257.
[ALMS99] James Allan, Victor Lavrenko, Daniella Malin, and Russell Swan. Detections, bounds, and timelines: UMass and TDT-3, 1999. URL ciir.cs.umass.edu/~lavrenko/pub/DetectionsBoundsTimelines.
[APL98] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 37-45. ACM Press, 1998. ISBN 1-58113-015-5.
[Das91] Belur V. Dasarathy. Nearest neighbor (NN) norms: NN pattern classification techniques, 1991. URL citeseer.ist.psu.edu/context/1204751/0.
[DFM+] S. Dharanipragada, M. Franz, J. S. McCarley, S. Roukos, and T. Ward. Story segmentation and topic detection in the broadcast news domain. URL www1.cs.columbia.edu/~smaskey/candidacy/cand_papers/dharanipragada_story_seg.pdf.
[DFM+02] S. Dharanipragada, M. Franz, J. S. McCarley, T. Ward, and W.-J. Zhu. Segmentation and detection at IBM: hybrid statistical models and two-tiered clustering. Pages 135-148, 2002. ISBN 0-7923-7664-1.
[EV99] E. Voorhees and D. Harman. Overview of the eighth Text REtrieval Conference, 1999. URL citeseer.ist.psu.edu/context/525613/0.
[FDGM98] J. Fiscus, G. Doddington, J. Garofolo, and A. Martin. NIST's 1998 topic detection and tracking evaluation, 1998. URL citeseer.ist.psu.edu/article/fiscus98nists.html.
[FWMZ01] Martin Franz, Todd Ward, J. Scott McCarley, and Wei-Jing Zhu. Unsupervised and supervised clustering for topic tracking. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 310-317. ACM Press, 2001. ISBN 1-58113-331-6.
[LDC04] LDC. TDT 2004: Annotation manual, 2004. URL ldc.upenn.edu/Projects/TDT2004.
[MAMS04] Juha Makkonen, Helena Ahonen-Myka, and Marko Salmenkivi.
Simple semantics in topic detection and tracking. Inf. Retr., 7(3-4):347-368, 2004. ISSN 1386-4564.
[MDK+97] Alvin Martin, George Doddington, Terri Kamm, Mark Ordowski, and Mark Przybocki. The DET curve in assessment of detection task performance. In Proc. Eurospeech '97, pages 1895-1898, Rhodes, Greece, 1997. URL citeseer.ist.psu.edu/martin97det.html.
[Qui86] J. R. Quinlan. Induction of decision trees, 1986.
[Ris] I. Rish. An empirical study of the naive Bayes classifier.
[RN95] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995, pp. 598-624.
[SB87] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Technical report, 1987. URL portal.acm.org/citation.cfm?id=866292.
[VR79] C. J. van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979. URL citeseer.ist.psu.edu/vanrijsbergen79information.html.
[WJSS99] Frederick Walls, Hubert Jin, Sreenivasa Sista, and Richard Schwartz. Topic detection in broadcast news, 1999. URL locutus.cs.dal.ca/~watters/courses/6403/tdbroadcast.pdf.
[YLS+99] J. P. Yamron, L. Gillick, S. Knecht, S. Lowe, and P. van Mulbregt. Statistical models for tracking and detection, 1999. Volume 34, Issue 1-3, special issue on natural language learning. ISSN 0885-6125.
