The Modules and Methods of Topic Detection and Tracking

Niek Hoogma
University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science
n.hoogma@student.utwente.nl

ABSTRACT
This report presents the methods and techniques used to perform the tasks of Topic Detection and Tracking (TDT). It starts with an introduction to TDT and its five tasks: Story Segmentation, Topic Detection, Topic Tracking, First Story Detection and Link Detection. In order to characterize the performance of a task, two important measurement techniques are brought in. Each task is introduced by a brief description and its distinctive characteristics. In addition, each task is accompanied by the best-known related mathematical methods.

Keywords
Topic Detection, Topic Tracking, Story Segmentation, Link Detection, First Story Detection

1. INTRODUCTION
Topic Detection and Tracking (TDT) refers to a variety of automatic techniques for discovering and threading together topically related materials in streams of data. Automatic discovery and threading could be quite valuable in many applications where people need timely and efficient access to large quantities of information. Systems could alert users to new events and to new information about old events. By examining one or two stories, a user could decide whether to pay attention to the rest of an evolving thread.

1.1 About this paper
At the moment, a lot of information is available concerning specific elements of TDT. This information is suitable for researchers familiar with TDT. For the inexperienced, however, it is very hard to get an overview of the most important aspects of TDT. This paper is written to assist these people by giving the big picture of Topic Detection and Tracking.

This paper starts with an introduction to the TDT process as a whole. The goal of TDT is to generate efficient access to large quantities of information, and this paper describes how the current techniques achieve this. In addition, TDT can be split up into different (sub)tasks, which are combined to perform the overall task of TDT. Since each task can be considered an essential building block, each one is described by its most important characteristics and methods.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.
2nd Twente Student Conference on IT, Enschede, 21 January 2005
Copyright 2005, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science

This paper focuses on the most important aspects selected from a large collection of available information. A selection had to be made in order to address the right level of detail. The applications of the described techniques are not the focus of this paper, nor is it meant to give a full technical description. If you are looking for a discussion of the mathematical concepts in greater detail, consult the references.

1.2 Background
The basic idea for TDT originated in 1996, when the Defense Advanced Research Projects Agency (DARPA) realized it needed technology to determine the topical structure of news streams without human intervention. In 1997, a pilot study [ACD+98] laid the essential groundwork, serving as a solid base for further research. The ideal result would be a set of algorithms which are source, medium, domain, language and application independent.

TDT is sponsored by the U.S. Government and is explored in the context of four annual cooperatively competitive evaluations run by the National Institute of Standards and Technology (NIST).

1.3 Corpora
Good corpora are extremely important for both research and evaluation. There is currently one important corpus available: TDT-5, maintained by the Linguistic Data Consortium (LDC) [LDC04]. This corpus is considered to be the world standard and contains over 400,000 documents in three different languages: English, Arabic and Mandarin. Remarkably, the latest corpus, unlike its predecessors, contains only data from newswire.

In order to support the results of this paper, test runs with the TDT-5 corpus were initially planned. However, because this corpus is not freely available, this was unfortunately impossible. This is a small setback, because it would have made the results more intuitive.
1.4 TDT Tasks
Figure 1 gives an overview of the entire TDT process. The figure contains the stories of the source data (depicted by T-1 ... T-6) and the five TDT tasks (depicted by the arrows). The input of the entire process is a news stream containing a number of stories. The first task of TDT is to detect (the boundaries between) the stories within the stream. This is done by the segmentation process. This process is optional, since the source data can be supplied in segmented form, for example newswire.

The entire process then builds a set of clusters, each containing stories which discuss the same topic. The process of building these clusters is called Topic Detection. It assigns each story to one or possibly more clusters. A story discussing an unknown topic is detected by a part of the detection process called First Story Detection. This process will generate a new cluster for the unknown topic.


Figure 1. The tasks of Topic Detection and Tracking (T-1..T-6 are stories, C-1..C-3 are clusters/topics).

The process of Topic Tracking is the selection of a certain cluster specified by one or more example stories. There is one additional process, called Link Detection, which is not depicted in Figure 1. This process is basically a kernel function which establishes whether or not two stories are linked.

Summarizing, the five main research tasks are:
• Story Segmentation: detect changes between topically cohesive sections
• Topic Tracking: find the set of stories similar to a set of example stories
• Topic Detection: build clusters of stories that discuss the same topic
• First Story Detection: detect whether a story is the first story of an unknown topic
• Link Detection: detect whether or not two stories are topically linked

1.5 Dynamic aspects of TDT
Although the tasks of TDT have been approached with traditional Information Retrieval techniques, the settings of TDT present unusual problems that complicate the use of traditional techniques [MAMS04]. This is one of the reasons why TDT research was started as a separate thread. The following key aspects characterize the project:
• TDT systems run on-line and can make very few assumptions about the incoming data
• Topics often involve only a small number of documents that are encountered in a burst
• The essential vocabulary describing a topic may change drastically in a short time

There is a need for specialized methods for the TDT tasks. The following chapters give an overview of these methods, categorized per TDT task. Before each task is discussed, however, two frequently used measurement techniques are introduced.

2. EVALUATION MEASURES AND APPROACHES
The different TDT tasks can all be considered to be some sort of detection. Given an input and a hypothesis about the data, a TDT system decides whether that hypothesis holds [FDGM98]. There are two techniques, the cost function and DET curves, which characterize the detection performance in terms of the probabilities of miss and false alarm errors, P_miss and P_fa [ACD+98].

2.1 Set-Based Measures
Most filtering and tracking measures can be defined in terms of the following well-known contingency table [All02]:

Table 1. The contingency table.
            Retrieved   Not retrieved
On-topic    A           B
Off-topic   C           D

Table 2 expresses the commonly used measures from IR and TDT [All02]:

Table 2. Measures derived from the contingency table.
Recall      = A / (A + B)   Proportion of on-topic material that is retrieved
Precision   = A / (A + C)   Proportion of retrieved material that is on-topic
Miss        = B / (A + B)   Proportion of on-topic material that is not retrieved
False Alarm = C / (C + D)   Proportion of off-topic material that is retrieved
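The four measures of Table 2 translate directly into code. The following is a minimal sketch in Python; the function name and the example counts are illustrative, not taken from the paper:

```python
def set_based_measures(a, b, c, d):
    """Compute the Table 2 measures from the contingency counts:
    a = on-topic retrieved, b = on-topic not retrieved,
    c = off-topic retrieved, d = off-topic not retrieved."""
    return {
        "recall": a / (a + b),        # on-topic material that is retrieved
        "precision": a / (a + c),     # retrieved material that is on-topic
        "miss": b / (a + b),          # on-topic material that is not retrieved
        "false_alarm": c / (c + d),   # off-topic material that is retrieved
    }

# Hypothetical run: 80 on-topic retrieved, 20 missed, 30 false alarms, 870 rejected
print(set_based_measures(80, 20, 30, 870))
```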


2.2 Cost Function
TDT evaluations are carried out by using a cost function, which penalizes misses and false alarms [FDGM98]. The cost function is defined as a linear combination of P_miss and P_fa:

$C_{det} = C_{miss} \cdot P_{miss} \cdot P_{target} + C_{fa} \cdot P_{fa} \cdot (1 - P_{target})$

where C_miss and C_fa are the costs of a missed detection and a false alarm respectively, and P_target is the prior probability of finding the target. These parameters are fixed constant values that are set by NIST.

2.3 DET Curves
One problem with the set-based measures of Section 2.1 is that they require careful selection of a cutoff mechanism for deciding which stories to include and which to omit [All02]. The well-known recall/precision graph portrays the quality of that threshold by showing how the measures trade off against each other as the threshold varies. A high threshold results in good precision: only a small number of off-topic stories are contained in the set. When a low threshold is used, only a small number of on-topic stories are missed (good recall).

A variation on the recall/precision graph is the DET (Detection Error Tradeoff) curve. A DET curve is generated by sweeping the decision threshold through the score space [MDK+97]. At each threshold the missed detection rate and the corresponding false alarm rate are calculated. Three example DET curves are plotted in Figure 2. The connected points form the DET curve, with the false alarm probability (P_fa) plotted on the horizontal axis and the corresponding miss probability (P_miss) on the vertical axis. A code sketch of both evaluation techniques follows below.

Figure 2. Plot of DET curves for a speaker recognition evaluation.
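To make both techniques concrete, here is a minimal sketch of the linear cost of Section 2.2 and of the threshold sweep that produces the points of a DET curve (Section 2.3). The constants C_miss = 1.0, C_fa = 0.1 and P_target = 0.02 are illustrative placeholders, not NIST's official settings:

```python
import numpy as np

def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Linear cost of Section 2.2; the constants are placeholders."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def det_points(scores, labels):
    """Sweep the decision threshold through the score space (Section 2.3)
    and return (P_fa, P_miss) pairs; plotted on normal-deviate axes these
    points form a DET curve."""
    order = np.argsort(scores)[::-1]      # stories by descending confidence
    labels = np.asarray(labels)[order]
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    hits = np.cumsum(labels)              # targets accepted so far
    fas = np.cumsum(1 - labels)           # non-targets accepted so far
    p_miss = 1.0 - hits / n_pos
    p_fa = fas / n_neg
    return p_fa, p_miss

# Toy example: scores for 3 on-topic (1) and 3 off-topic (0) stories
p_fa, p_miss = det_points([0.9, 0.8, 0.6, 0.4, 0.3, 0.1], [1, 0, 1, 1, 0, 0])
print(min(detection_cost(m, f) for m, f in zip(p_miss, p_fa)))
```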
3. STORY SEGMENTATION
3.1 Characteristics
A multimedia stream of a news broadcast has certain characteristics which can help segmentation:
• States within a news broadcast: a complete news broadcast can be split up into several states. Identifying these states can enhance the segmentation. For example, at the end of a story there is likely to be a transition from the reporter or the topic expert to a speaking anchor.
• Pre-defined templates: typically, when a new topic is started and there is a switch to a reporter, the reporter will introduce himself. For example: "I am James Earl Jones, reporting from Amsterdam ...". Since the same patterns occur in every stream, it is possible to use a text tagging tool to perform pattern matching.

3.2 Methods
The approaches for Story Segmentation rely heavily on the use of combinations of methods. The following subsections present the approaches of two research institutes, IBM and MITRE.

3.2.1 IBM's approach
IBM's story segmentation uses a combination of decision trees and maximum entropy models. These take a variety of lexical, prosodic, semantic and structural features as their inputs. Both types of models are source-specific, and by combining them the cost for segmentation, C_seg, can be lowered substantially [DFM+02].

3.2.2 MITRE's approach
MITRE uses a naive Bayes classifier, which is known to greatly simplify learning by assuming that the features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers [Ris]; a toy illustration follows below.
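As an illustration of the naive Bayes idea, the following sketch classifies candidate story boundaries from binary features. The feature names, smoothing choice and training data are invented for the example and are not MITRE's actual setup:

```python
import math
from collections import defaultdict

def train_nb(samples):
    """samples: list of (features, label) pairs, where features is a set of
    binary indicators observed at a candidate boundary and label is
    'boundary' or 'no-boundary'."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for feats, label in samples:
        totals[label] += 1
        for f in feats:
            counts[label][f] += 1
    return counts, totals

def classify(feats, counts, totals, vocab):
    """Pick the label maximizing P(label) * prod P(feature|label), under
    the naive assumption that features are independent given the label."""
    n = sum(totals.values())
    best, best_lp = None, float("-inf")
    for label, t in totals.items():
        lp = math.log(t / n)                  # class prior
        for f in vocab:                       # add-one smoothed Bernoulli
            p = (counts[label][f] + 1) / (t + 2)
            lp += math.log(p if f in feats else 1.0 - p)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [({"pause", "anchor_speaks"}, "boundary"),
        ({"anchor_speaks"}, "boundary"),
        ({"reporter_speaks"}, "no-boundary")]
vocab = {"pause", "anchor_speaks", "reporter_speaks"}
counts, totals = train_nb(data)
print(classify({"pause", "anchor_speaks"}, counts, totals, vocab))  # 'boundary'
```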


4. TOPIC TRACKING
4.1 Introduction
The tracking task of TDT is fundamentally similar to TREC's filtering task [EV99]. The Text REtrieval Conference (TREC) is a series of workshops designed to foster research in text retrieval, and the work on the TDT tracking task is based upon the TREC filtering task. However, the TREC filtering task focuses on performance improvements driven by feedback from real-time relevance assessments. TDT systems, on the other hand, are designed to run autonomously without human feedback.

4.2 Characteristics
The input data of the topic tracking task is a representation of a topic, followed by a stream of arriving stories. The system makes a decision for each story. Stories are assigned a confidence score for the topic and, if the score is high enough, are tracked and/or retrieved. The latter part is achieved by a threshold determining a hard yes/no decision for each story. A tracking system consists of n separate binary classifiers [FWMZ01].

There are likely to be a small number of training stories that are known to be on the same topic. Stories may be assigned to more than one topic, or to none at all [FWMZ01]. By definition, no user feedback is allowed after the tracking process has started. Systems can adapt their guesses on a certain story, but they get no human confirmation in any form.

The tracking task is supervised, typically with 1-4 seed or training documents. A tracking system is provided with a small number, N, of on-topic training stories. The value of N is usually varied between one and eight, with four being the most commonly used value. The system's task is to analyze those stories and automatically identify the news topic being discussed. Each story being tracked is assigned a confidence score and a yes/no decision.

4.3 Methods
4.3.1 Vector space approach
The vector space approach uses methods based primarily on Information Filtering [APL98]. The stories are represented by vectors of features, found by applying a shallow tagger to the stories and selecting nouns, verbs, adjectives and numbers. Queries are represented by a similar vector of TF.IDF (Term Frequency-Inverse Document Frequency) weights, a term weighting approach commonly used in Information Retrieval [SB87]:

$sim(Q, D) = \frac{\sum_i q(i) \cdot d(i)}{\sum_i q(i)}, \quad d(i) = \frac{tf}{tf + 0.5} \cdot \log\frac{N}{df(i)}$

where q(i) is the weight of feature i in the query, d(i) is the weight of the feature in the story, tf is the number of times the feature occurs in the story, df(i) is the number of stories in the collection containing feature i, and N is the number of stories in the collection. The results of these methods depend strongly upon the selection of useful words and phrases from the training set.

4.3.2 Decision trees
Decision trees (d-trees) are classifiers built on the principle of a sequential greedy algorithm which at each step strives to maximize the reduction of system entropy [Qui86]. A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision [RN95]. The feature with the maximum information gain is placed at the root of the tree, and this process is recursively repeated for each branch. D-trees are commonly used because they are one of the simplest and yet most successful forms of learning algorithm.

4.3.3 K-Nearest Neighbor (kNN)
kNN is an instance-based classification method well known in pattern recognition and machine learning [Das91]. The system converts an input story into a vector as it arrives and compares it with the training stories. The next step is the selection of the k nearest neighbors based on the cosine similarity between the input story and the training stories. The confidence score s1 is computed by taking the difference between the summed similarity scores for the positive and negative stories:

$s_1(YES \mid x) = \sum_{d \in P(x,k)} \cos(d, x) - \sum_{d \in N(x,k)} \cos(d, x)$

where x is the input story, P(x,k) is the set of positive training stories in the k-neighborhood, and N(x,k) is the set of negative training stories in the k-neighborhood.
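A minimal sketch of the kNN confidence score s_1 of Section 4.3.3 under cosine similarity follows. The sparse dictionary representation of vectors, the value of k, and the toy stories are assumptions for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_confidence(story, training, k=5):
    """s1(YES|x): summed similarity to positive neighbors minus summed
    similarity to negative neighbors among the k nearest training stories.
    training: list of (vector, is_on_topic) pairs."""
    neighbors = sorted(training, key=lambda d: cosine(d[0], story),
                       reverse=True)[:k]
    return sum(cosine(d, story) if pos else -cosine(d, story)
               for d, pos in neighbors)

train = [({"earthquake": 2.0, "rescue": 1.0}, True),
         ({"election": 1.5, "vote": 2.0}, False)]
print(knn_confidence({"earthquake": 1.0, "damage": 1.0}, train, k=2))
```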
5. TOPIC DETECTION
5.1 Introduction
The TDT detection task effectively processes all news topics simultaneously [All02]. The goal of this task is to partition all arriving stories into "bins" depending on the topic being discussed. A bin consists of stories discussing the same topic. An important component is the recognition of the arrival of a new topic, i.e. a story that cannot be placed in any of the existing bins. This process is a separate TDT task called First Story Detection (Section 6).

5.2 Characteristics
The detection task is characterized by the lack of knowledge of the event to be detected [ACD+98]; it is completely unsupervised. Each news story is almost a unique combination of people, places and other facts not known prior to the news broadcast. The only training can be done with some pre-annotated training data that is likely to share only a few topics with the news being processed. There is no human feedback or correction to the system while it is running.

The input data of this task is a stream of stories, which are optionally separated by the segmentation task. The output data is a certain clustering of the stories into topics. The type of clustering determines whether it is possible to assign a story to multiple clusters. Since this module is highly connected with the First Story Detection task, there is an overlap between these two tasks.

5.3 Methods
5.3.1 Incremental Clustering
The Incremental Clustering algorithm, used by BBN, processes stories one at a time, sequentially, and for each story it executes a two-step process [WJSS99] (a code sketch follows at the end of this section):
• Selection: the topic cluster most similar to the story is selected
• Thresholding: the story is compared to that cluster, and the system decides whether to merge the story with the cluster or to start a new cluster

The big advantage of this approach is its dynamic character: there is no restriction on the number of clusters or on the cluster sizes. A drawback is that decisions can be made only once; early mistakes based on little information can be costly. Secondly, the computational requirement grows as more stories are processed.

It is important to determine the similarity between a story and a certain cluster. IBM does this using the symmetrized Okapi formula [DFM+]:

$Ok(d_1, d_2) = \sum_{w \in d_1 \cap d_2} t_w^{d_1} \, t_w^{d_2} \, idf(w)$

where d_1 and d_2 are the two documents and t_w^{d_i} are the term counts of word w in document d_i.

5.3.2 k-means clustering
This system is used by Dragon and operates as follows [YLS+99]:
• At any given point there are k story clusters, each characterized by a set of statistics.
• For the next available story, determine its distance to the closest cluster; if this distance is below a certain threshold, insert the story into the cluster and update the cluster's statistics. If the distance is above the threshold, create a new cluster.
• Loop through the stories again, but now consider switching each story from its present topic to each of the others, based on the same distance measure as before.
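The selection and thresholding loop of Section 5.3.1 could look as follows. This sketch reuses the cosine() helper from the kNN example; the cluster representation as a summed term-weight vector and the threshold value are assumptions, and the centroid update is plain summation rather than BBN's actual cluster statistics:

```python
def incremental_clustering(stories, threshold=0.3):
    """Process stories one at a time (Section 5.3.1): select the most
    similar existing cluster, then either merge or start a new cluster.
    Each cluster holds a summed term-weight vector and its stories."""
    clusters = []  # each cluster: {'centroid': dict, 'stories': list}
    for story in stories:
        best, best_sim = None, 0.0
        for c in clusters:                      # selection step
            s = cosine(c["centroid"], story)    # cosine() from the kNN sketch
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:  # thresholding step
            best["stories"].append(story)
            for t, w in story.items():          # merge: update the centroid
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
        else:                                   # start a new cluster
            clusters.append({"centroid": dict(story), "stories": [story]})
    return clusters
```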


6. FIRST STORY DETECTION
6.1 Introduction
First Story Detection (FSD) operates in a strict on-line setting, processing stories from a news stream as they arrive [APL98]. The goal is to mark each story as either "first" or "not first", indicating whether or not it is the first one discussing a new topic [ALJ00].

6.2 Characteristics
This task should make a hard yes/no decision for each story in the input stream. In case an unknown topic is encountered, a new topic cluster is generated. FSD can be considered part of Topic Detection, since it only indicates whether a topic is mentioned for the first time [ALJ00].

Like Topic Detection, this task uses no supervised topic training. Research has shown that it is possible to reduce the TDT First Story Detection problem to the TDT Tracking problem. This is, however, only one of many research perspectives.

Another perspective is that the exploitation of time will lead to improved detection [APL98]. A side effect of broadcast news is that stories closer together in the stream are likely to discuss related events. A time penalty tp, based on the time between a query and a story, therefore increases the threshold. If the j-th story is compared to the query resulting from the i-th story, for i < j we have:

$\Theta(q_i, d_j) = 0.4 + p \cdot (eval(q_i, d_i) - 0.4) + tp \cdot (j - i)$

The threshold is determined between the query q and the document d using the following eval function:

$eval(q, d) = \frac{\sum_{i=1}^{N} w_i \cdot d_i}{\sum_{i=1}^{N} w_i}$

where w_i is the relative weight of query feature i and d_i is the belief that the feature's appearance in the document indicates a relevant query [APL98].

6.3 Methods
6.3.1 Single Pass Clustering [VR79]
A code sketch of this procedure follows at the end of this section.
• Use feature extraction and selection techniques to build a query representation of the story's content.
• Determine the query's initial threshold by evaluating the new story against the query.
• Compare the story against earlier queries in memory.
• If the story does not trigger any previous query by exceeding its threshold, the story contains a new event.
• If the story triggers an existing query, the story is flagged as not containing a new event.
• Add the new query to memory.

6.3.2 Topic Tracking Algorithms
Since this task can be considered a detection task, it can use the same methods as the other TDT tasks. Examples are k-nearest neighbors and the TF.IDF weighting scheme [ALMS99].
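The following sketch combines the single-pass procedure of Section 6.3.1 with the time-adjusted threshold of Section 6.2. It substitutes cosine similarity (from the kNN sketch) for the paper's eval() belief function, so the initial term eval(q_i, d_i) degenerates to 1.0; the values of p and tp and the toy stories are invented:

```python
def first_story_detection(stories, p=0.6, tp=0.002):
    """Flag each story as 'first' (new topic) or not. Each processed story
    spawns a query kept in memory; a new story that exceeds no existing
    query's time-adjusted threshold is a first story."""
    queries, flags = [], []
    for j, story in enumerate(stories):
        is_first = True
        for i, q in enumerate(queries):
            # Theta(q_i, d_j) = 0.4 + p*(eval(q_i, d_i) - 0.4) + tp*(j - i)
            theta = 0.4 + p * (q["init"] - 0.4) + tp * (j - i)
            if cosine(q["vec"], story) > theta:  # story triggers this query
                is_first = False
                break
        flags.append(is_first)
        # Under cosine, a story's similarity to its own query is 1.0
        queries.append({"vec": dict(story), "init": 1.0})
    return flags

docs = [{"quake": 1.0, "tokyo": 1.0},
        {"quake": 1.0, "tokyo": 1.0, "rescue": 0.1},  # near-duplicate follow-up
        {"election": 1.0, "vote": 1.0}]
print(first_story_detection(docs))  # [True, False, True]
```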
7. LINK DETECTION
7.1 Introduction
The goal of the Link Detection task is to detect whether two stories are "linked" by a common topic. Link Detection serves as a "kernel" function for other TDT tasks to rely on [ALMS99]. Link Detection is therefore not considered a separate end-user task, but a function that can be used by the other TDT tasks.

7.2 Characteristics
The input data are two documents; the output is a hard yes/no decision. As in the detection task, an internal threshold determines the final verdict. The task of Link Detection does not use supervised topic training.

This task uses several methods to determine the similarity between documents. The most important methods are the cosine similarity, the weighted sum, language models and the Kullback-Leibler divergence. Feature weighting can also be used, so there is a need for weighting schemes such as TF.IDF, TF and IDF.

7.3 Methods
7.3.1 Cosine Weighting
The critical property of the similarity function is its ability to separate stories that discuss the same topic from stories that discuss different topics. The cosine similarity is a classic measure used in Information Retrieval. It is represented by the angle between two vectors q and d:

$\cos(q, d) = \frac{\sum_i q_i d_i}{\sqrt{(\sum_i q_i^2) \cdot (\sum_i d_i^2)}}$

7.3.2 Weighted Sum
The weighted sum represents a linear combination of evidence, with weights representing the confidences associated with the various pieces of evidence:

$sim(q, d) = \frac{\sum_i q_i d_i}{\sum_i q_i}$

where q represents the query vector and d represents the document vector. A sketch of both similarity measures follows below.

7.3.3 Feature Weighting
An important issue is the weighting of the individual features (words) that occur in the stories [ALMS99]. The traditional weighting employed in most IR systems is a form of TF.IDF weighting.
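Both similarity measures of Sections 7.3.1 and 7.3.2 reduce to a few lines. The sketch below (reusing cosine() from the kNN example) wraps them in the hard yes/no decision of Section 7.2; the threshold of 0.2 and the example stories are invented:

```python
def weighted_sum(q, d):
    """Section 7.3.2: confidence-weighted linear combination of evidence,
    normalized by the total query weight."""
    num = sum(w * d.get(t, 0.0) for t, w in q.items())
    den = sum(q.values())
    return num / den if den else 0.0

def linked(story_a, story_b, threshold=0.2, measure=None):
    """Hard yes/no link decision (Section 7.2): compare a similarity
    score against an internal threshold."""
    measure = measure or cosine  # cosine() from the kNN sketch
    return measure(story_a, story_b) >= threshold

a = {"earthquake": 1.0, "tokyo": 1.0}
b = {"earthquake": 1.0, "damage": 1.0}
print(linked(a, b), weighted_sum(a, b))  # True 0.5
```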


8. CONCLUSION AND FUTURE WORK
8.1 Conclusion
This paper offers an overview of the available modules and methods for Topic Detection and Tracking. It shows that the tasks of TDT are hard to approach with traditional IR techniques. The goal of TDT research is to generate efficient access to large quantities of (news broadcast) information. TDT research is conducted by many research institutes under the control of a centralized organisation (NIST).

All TDT tasks can be considered to be some sort of detection. Each task uses an internal confidence score which is cut off using a threshold. Several measurement techniques, such as the cost function and DET curves, are widely used to evaluate the performance of a TDT method.

The TDT methods are based upon known techniques. In most cases a query or a document needs to be matched against another document. This can be done with cosine similarity, weighted sums and language models. Feature weighting, such as TF.IDF, is another widely used technique. These mathematical techniques provide a solid starting point for methods such as k-Nearest Neighbor (kNN) and vector space approaches.

8.2 Future Work
Although there appear to be transparent methods that could be used in multiple TDT tasks, the current systems have their own techniques for every task. The reason for this lies in the historical development of TDT. The role of the controlling organisation, NIST, should lead to a good problem specification. In the future it will be useful to generalize techniques, allowing them to be used in multiple tasks.

REFERENCES
[ACD+98] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic detection and tracking pilot study, 1998. URL citeseer.ist.psu.edu/article/allan98topic.html.
[ALJ00] James Allan, Victor Lavrenko, and Hubert Jin. First story detection in TDT is hard. In CIKM, pages 374-381, 2000. URL citeseer.ist.psu.edu/allan00first.html.
[All02] James Allan. Detection as multi-topic tracking. Inf. Retr., 5(2-3):139-157, 2002. ISSN 1386-4564. URL kluweronline.com/article.asp?PIPS=407257.
[ALMS99] James Allan, Victor Lavrenko, Daniella Malin, and Russell Swan. Detections, bounds, and timelines: UMass and TDT-3, 1999. URL ciir.cs.umass.edu/~lavrenko/pub/DetectionsBoundsTimelines.
[APL98] James Allan, Ron Papka, and Victor Lavrenko. On-line new event detection and tracking. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 37-45. ACM Press, 1998. ISBN 1-58113-015-5.
[Das91] Belur V. Dasarathy. Nearest neighbor (NN) norms: NN pattern classification techniques, 1991. URL citeseer.ist.psu.edu/context/1204751/0.
[DFM+] S. Dharanipragada, M. Franz, J. S. McCarley, S. Roukos, and T. Ward. Story segmentation and topic detection in the broadcast news domain. URL www1.cs.columbia.edu/~smaskey/candidacy/cand_papers/dharanipragada_story_seg.pdf.
[DFM+02] S. Dharanipragada, M. Franz, J. S. McCarley, T. Ward, and W.-J. Zhu. Segmentation and detection at IBM: hybrid statistical models and two-tiered clustering. Pages 135-148, 2002. ISBN 0-7923-7664-1.
[EV99] E. Voorhees and D. Harman. Overview of the eighth Text REtrieval Conference, 1999. URL citeseer.ist.psu.edu/context/525613/0.
[FDGM98] J. Fiscus, G. Doddington, J. Garofolo, and A. Martin. NIST's 1998 topic detection and tracking evaluation, 1998. URL citeseer.ist.psu.edu/article/fiscus98nists.html.
[FWMZ01] Martin Franz, Todd Ward, J. Scott McCarley, and Wei-Jing Zhu. Unsupervised and supervised clustering for topic tracking. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 310-317. ACM Press, 2001. ISBN 1-58113-331-6.
[LDC04] LDC. TDT 2004: Annotation manual, 2004. URL ldc.upenn.edu/Projects/TDT2004.
[MAMS04] Juha Makkonen, Helena Ahonen-Myka, and Marko Salmenkivi.
Simple semantics in topic detection and tracking. Inf. Retr., 7(3-4):347-368, 2004. ISSN 1386-4564.
[MDK+97] Alvin Martin, George Doddington, Terri Kamm, Mark Ordowski, and Mark Przybocki. The DET curve in assessment of detection task performance. In Proc. Eurospeech '97, pages 1895-1898, Rhodes, Greece, 1997. URL citeseer.ist.psu.edu/martin97det.html.
[Qui86] J. R. Quinlan. Induction of decision trees, 1986.
[Ris] I. Rish. An empirical study of the naive Bayes classifier.
[RN95] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995, pp. 598-624.
[SB87] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Technical report, 1987. URL portal.acm.org/citation.cfm?id=866292.
[VR79] C. J. van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979. URL citeseer.ist.psu.edu/vanrijsbergen79information.html.
[WJSS99] Frederick Walls, Hubert Jin, Sreenivasa Sista, and Richard Schwartz. Topic detection in broadcast news, 1999. URL locutus.cs.dal.ca/~watters/courses/6403/tdbroadcast.pdf.
[YLS+99] J. P. Yamron, L. Gillick, S. Knecht, S. Lowe, and P. van Mulbregt. Statistical models for tracking and detection, 1999. Volume 34, Issue 1-3, special issue on natural language learning. ISSN 0885-6125.
