A Recursive Algorithm for Spatial Cluster Detection

Xia Jiang, MS, Gregory F. Cooper, MD, PhD
Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA

Abstract

Spatial cluster detection involves finding spatial subregions of some larger region where clusters of some event are occurring. For example, in the case of disease outbreak detection, we want to find clusters of disease cases so as to pinpoint where the outbreak is occurring. When doing spatial cluster detection, we must first articulate the subregions of the region being analyzed. A simple approach is to represent the entire region by an n × n grid. Then we let every subset of cells in the grid represent a subregion. With this representation, the number of subregions is equal to $2^{n^2} - 1$. If n is not small, it is intractable to check every subregion. The time complexity of checking all the subregions that are rectangles is $\Theta(n^4)$. Neill et al. [8] performed Bayesian spatial cluster detection by only checking every rectangle. In the current paper, we develop a recursive algorithm which searches a richer set of subregions. We provide results of simulation experiments evaluating the detection power and accuracy of the algorithm.

Introduction

Spatial cluster detection consists of finding spatial subregions of some larger region where clusters of some event are occurring. For example, in the case of disease outbreak detection, we want to find clusters of disease cases so as to pinpoint where the outbreak is occurring. Other applications of spatial cluster detection include mining astronomical data, medical imaging, and military surveillance. When doing spatial cluster detection, we must first articulate the subregions of the region being analyzed. A simple approach is to represent the entire region by an n × n grid. Then we let every subset of cells in the grid represent a subregion. This is the approach taken in this paper.
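To make the two counts quoted in the abstract concrete, the following minimal Python sketch compares the number of arbitrary subregions of an n × n grid with the number of rectangles. The function names are illustrative assumptions and are not part of the paper; the brute-force check is only feasible for very small n.

```python
from itertools import combinations


def count_all_subregions(n: int) -> int:
    """Number of non-empty subsets of cells in an n x n grid: 2^(n^2) - 1."""
    return 2 ** (n * n) - 1


def count_rectangles(n: int) -> int:
    """Number of axis-aligned rectangles in an n x n grid.

    A rectangle is fixed by a row interval and a column interval, and there
    are n(n+1)/2 intervals in each direction, so the count grows as n^4.
    """
    intervals = n * (n + 1) // 2
    return intervals * intervals


def brute_force_subregions(n: int) -> int:
    """Enumerate every non-empty subset of cells (tiny n only)."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sum(1 for k in range(1, len(cells) + 1)
               for _ in combinations(cells, k))


if __name__ == "__main__":
    for n in (2, 3, 10):
        print(n, count_rectangles(n), count_all_subregions(n))
```

For the 10 × 10 grid used later in the experiments, this gives 3025 rectangles against $2^{100} - 1$ arbitrary subregions, which illustrates why exhaustive search over all subregions is not an option.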
With this representation, the number of subregions is equal to $2^{n^2} - 1$. If n is not small, it is intractable to check every subregion. The time complexity of only checking every subregion that is a rectangle is $\Theta(n^4)$. Neill et al. [8] performed Bayesian spatial cluster detection by only checking every rectangle. In the current paper, we develop an algorithm which searches a richer set of subregions. The algorithm can be used in any application of spatial cluster detection. However, we test it specifically in the context of disease outbreak detection. So next we describe disease outbreak detection.

Disease Outbreak Detection: Le Strat and Carrat [6] define an epidemic as the occurrence of a number of cases of a disease, in a given period of time in a given population, that exceeds the expected number. A disease outbreak is an epidemic limited to a localized increase, e.g., in a town or institution. If we can recognize an outbreak and its potential cost early, we can take appropriate measures to control it. Monitoring a community in order to recognize early the onset of a disease outbreak is called disease outbreak detection.

Often the count of some observable event increases during an outbreak. For example, since Cryptosporidium infection causes diarrhea, the count of over-the-counter (OTC) sales of antidiarrheal drugs ordinarily increases during a Cryptosporidium outbreak. Typically, during an outbreak, the number of new outbreak cases increases each day of the outbreak until a peak is reached, and then declines. Accordingly, the count of the observable event also increases during the outbreak. Therefore, a number of classical time-series methods have been applied to the detection of an outbreak based on the count of the observable event. Wong and Moore [9] review many such methods. Jiang and Wallstrom [4] describe a Bayesian network model for outbreak detection that also looks at daily counts.

Cooper et al. [1] took a different approach when developing PANDA.
Rather than analyzing data aggregated over the entire population, they modeled each individual in the population. PANDA consists of a large Bayesian network that contains a set of nodes for each individual in a region. These nodes represent properties of the individual such as age, gender, home location, and whether the individual visited the emergency department (ED) with respiratory symptoms. By modeling each individual, we can base our analysis on more information than that contained in a summary statistic such as over-the-counter sales of antidiarrheal drugs. PANDA is designed specifically for the detection of non-contagious outbreak diseases such as airborne anthrax or West Nile encephalitis. Cooper et al. [2] extended the PANDA system to model the CDC Category A diseases (see http://www.bt.cdc.gov/agent/agentlist-category.asp). This augmented system, which is called PANDA-CDCA, takes as input a time series of 54 possible ED chief complaints, and it outputs the posterior probability of each CDC Category A disease and several additional diseases.

AMIA 2007 Symposium Proceedings, page 369

In a given region being monitored, an outbreak may occur (or at least start) in some subregion of that region. For example, a Cryptosporidium outbreak might occur only in a subregion in close proximity to a contaminated water distribution. We want to determine that subregion, which can sometimes be accomplished by doing spatial cluster detection.

Spatial Cluster Disease Outbreak Detection: Traditional spatial cluster detection focuses on finding spatial subregions where the count of some observable event is significantly higher than expected. A frequentist method for spatial cluster detection is the spatial scan statistic developed by Kulldorff [5]. Neill et al. [8] developed a Bayesian version of the spatial scan statistic. In their experiments, they only considered the set of all subregions that are rectangles. This paper describes an algorithm that investigates a richer subset of subregions than the set of rectangles. We test the algorithm by using it to perform Bayesian spatial outbreak detection with PANDA-CDCA. Therefore, before describing the algorithm, we review PANDA-CDCA.

PANDA-CDCA

Figure 1 shows the Bayesian network in PANDA-CDCA. We briefly describe the nodes in the network. Node O represents whether an outbreak is currently taking place. Node OD represents which outbreak disease is occurring if there is an outbreak. Node F represents the hypothetical fraction of individuals in the population who are afflicted with the outbreak disease and go to the ED, given that an outbreak is occurring. This node indicates the extent of the outbreak, if one is occurring. For the sake of computational efficiency, we modeled F as a discrete variable. Furthermore, we assumed all outbreak types are equally likely to have the various levels of severity.
This assumption is not necessary, and there could be an edge from OD to F. Node D_r represents whether an individual arrives in the ED with a particular disease. There is one such node for each individual r in the population. One value is noED, which means the individual does not visit the ED. Node C_r represents each of the possible chief complaints the individual could have when arriving in the ED.

To do inference, we proceed as follows. On each day, we know the value of C_r for each individual r in the population. We call the set of all these values our Data. Using the network in Figure 1, we then compute P(OD=none|Data) and, for each outbreak disease d, P(OD=d|Data).

[Network diagram with nodes O, OD, F, D_r, and C_r; the conditional probability tables attached to the nodes are excerpted below.]

Node O:
    P(O = yes) = .05
    P(O = no) = .95

Node OD:
    P(OD = flu | O = yes) = .8
    P(OD = botulism | O = yes) = .01
    ...
    P(OD = none | O = yes) = 0
    P(OD = flu | O = no) = 0
    P(OD = botulism | O = no) = 0
    ...
    P(OD = none | O = no) = 1

Node F:
    P(F = .0000118) = .0667
    P(F = .0000236) = .0667
    ...

Node D_r:
    P(D_r = flu | OD = flu, F = .0000118) = .0000118
    P(D_r = botulism | OD = flu, F = .0000118) = 0
    ...
    P(D_r = other | OD = flu, F = .0000118) = .00203298
    P(D_r = noED | OD = flu, F = .0000118) = .99795522
    ...
    P(D_r = flu | OD = none, F = .0000118) = 0
    P(D_r = botulism | OD = none, F = .0000118) = 0
    ...
    P(D_r = other | OD = none, F = .0000118) = .002033
    P(D_r = noED | OD = none, F = .0000118) = .997967

Node C_r:
    P(C_r = chest pain | D_r = flu) = .022528
    P(C_r = diarrhea | D_r = flu) = .014422
    ...
    P(C_r = none | D_r = flu) = 0
    ...
    P(C_r = chest pain | D_r = noED) = 0
    P(C_r = diarrhea | D_r = noED) = 0
    ...
    P(C_r = none | D_r = noED) = 1

Figure 1. The PANDA-CDCA Bayesian network.

A Recursive Algorithm for Spatial Cluster Detection of Complex Subregions

Next we develop a new algorithm for spatial cluster detection of complex subregions, and we apply the algorithm to outbreak detection using PANDA-CDCA.
First we show how to compute the likelihood that a given subregion has an outbreak using PANDA-CDCA.

Computing the Likelihood of a Subregion: Let OS be a random variable that represents the outbreak subregion; its value is none if no outbreak is occurring, and S if an outbreak is occurring in subregion S. We want to compute P(Data|OS=none) and, for each subregion S, P(Data|OS=S).

When OS=none we assume the data is being generated according to the model shown in Figure 1 with OD set to none. Therefore, P(Data|OS=none) = P(Data|OD=none), which is computed by doing inference in the network in Figure 1. When OS=S we assume the data in subregion S is being generated according to the model in Figure 1 with OD set to one of the 13 diseases, and the data outside subregion S is being generated by a separate model with OD set to none. Let Data_in be the data concerning individuals in subregion S and Data_out be the data concerning individuals outside of subregion S. Then P(Data_in|OD=d,OS=S) and P(Data_out|OD=d,OS=S) are each computed by doing inference in the network in Figure 1 with the instantiations just mentioned. We then compute

$$P(\mathrm{Data} \mid OD = d, OS = S) = P(\mathrm{Data}_{in} \mid OD = d, OS = S) \times P(\mathrm{Data}_{out} \mid OD = d, OS = S).$$

Finally, we sum over OD to obtain the likelihood of subregion S.

Figure 2. The shaded area is a possible subregion discovered by algorithm refine.

Finding a Likely Subregion: We can do Bayesian spatial cluster detection by only considering subregions that are rectangles, and assigning the same prior probability to all rectangles. Then, after computing the likelihoods discussed in the previous subsection, we use Bayes' Theorem to calculate P(OS=none|Data) and P(OS=R|Data) for every rectangle R. The posterior probability of an outbreak is then equal to $\sum_R P(OS = R \mid \mathrm{Data})$. We can then base the detection of an outbreak on this posterior probability, and report the posterior probability of each rectangle. The most probable rectangle is then considered to be the most likely subregion where the outbreak is occurring. The algorithms described next assume that we have done this. They then search for a more likely subregion than the most probable rectangle.

For subregion S, let P(Data|OS=S) be the "score" of S. If we let Score_best be the score of the most probable rectangle, we can possibly find a higher scoring subregion by seeing if we can increase the score by joining other rectangles to this rectangle. Algorithm refine, given below, repeatedly finds the rectangle that most increases the score and joins that rectangle to our current subregion. It does this until no rectangle increases the score.
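The paper's own pseudocode for refine follows. As a concrete illustration of the same greedy idea, here is a minimal runnable Python sketch. It rests on several assumptions that are not in the paper: the function names, the representation of a subregion as a frozenset of (row, column) cells, and above all the toy score, which stands in for the PANDA-CDCA likelihood P(Data|OS=S). The complexity penalty factor mentioned in the paper is not modeled.

```python
from itertools import product


def all_rectangles(n):
    """Every axis-aligned rectangle of an n x n grid, as a frozenset of cells."""
    rects = []
    for r1, c1 in product(range(n), repeat=2):
        for r2, c2 in product(range(r1, n), range(c1, n)):
            rects.append(frozenset(product(range(r1, r2 + 1),
                                           range(c1, c2 + 1))))
    return rects


def refine(n, score):
    """Greedy search: start from the best rectangle, then keep joining the
    unused rectangle that raises the score the most, until none does."""
    rects = all_rectangles(n)
    best = max(rects, key=score)
    best_score = score(best)
    used = {best}                      # plays the role of "flagged" rectangles
    while True:
        pick, pick_score = None, best_score
        for rect in rects:
            if rect in used:
                continue
            trial_score = score(best | rect)
            if trial_score > pick_score:
                pick, pick_score = rect, trial_score
        if pick is None:               # no rectangle improves the score
            return best, best_score
        best, best_score = best | pick, pick_score
        used.add(pick)


# Hypothetical toy data: a T-shaped cluster in a 4 x 4 grid.
T_SHAPE = frozenset({(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)})


def toy_score(region):
    """Toy score (an assumption): +1 per cluster cell, -1 per other cell."""
    return sum(1 if cell in T_SHAPE else -1 for cell in region)


if __name__ == "__main__":
    region, value = refine(4, toy_score)
    print(sorted(region), value)
```

On this toy input the search starts from the top bar of the T (score 3), joins the stem, and stops at the full T shape with score 5, a subregion that no single rectangle can represent.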
By score(G, S) we mean the score of subregion S in grid G.

void refine (grid G; subregion& S_best)
{
    determine highest scoring rectangle R_best in G;
    S_best = R_best;
    Score_best = score(G, S_best);
    flag R_best;
    repeat
        found = false;
        for (each unflagged rectangle R in G) {
            S_try = S_best ∪ R;
            if (score(G, S_try) > Score_best) {
                found = true;
                Score_best = score(G, S_try);
                S_new = S_try;    // best union found in this pass
                T = R;
            }
        }
        if (found) {
            S_best = S_new;
            flag T;
        }
    until (not found);
}

The algorithm would be called as follows (G is the entire grid): refine(G, S_best). The worst case time complexity of the algorithm is $O(n^8)$. Figure 2 shows a possible subregion discovered by algorithm refine. In order to model that more complex subregions have a lower prior probability than less complex ones, in each iteration of the repeat loop we multiplied the score by a penalty factor.

We might do better if, when we find a rectangle R in our grid G that increases the score, we treat R as a grid, recursively call refine with R as the input grid, find the best subregion V_best in R, at the top level check if V_best increases the score in G more than R, and, if so, replace R by V_best. The algorithm that follows does this.

void refine2 (grid G; subregion& S_best, int level)
{
    if (level ≤ N) {    // N is the recursion depth.
        determine highest scoring rectangle R_best in G;
        S_best = R_best;
        Score_best = score(G, S_best);
        flag R_best;
        if (level < N) {
            refine2(S_best, V_best, level + 1);
            if (score(G, V_best) > Score_best) {
                S_best = V_best;
                Score_best = score(G, V_best);
            }
        }
        repeat
            found = false;
            for (each unflagged rectangle R in G) {
                S_try = S_best ∪ R;
                if (score(G, S_try) > Score_best) {
                    if (level < N) {
                        refine2(R, V_best, level + 1);
                        if (score(G, S_best ∪ V_best) > score(G, S_try))
                            S_try = S_best ∪ V_best;
                    }
                    found = true;
                    Score_best = score(G, S_try);
                    S_new = S_try;    // best union found in this pass
                    T = R;
                }
            }
            if (found) {
                S_best = S_new;
                flag T;
            }
        until (not found);
    }
}

The top-level call is as follows: refine2(G, S_best, 0). If the rectangles recursively become sufficiently small, algorithm refine2 can detect an outbreak of any shape.

Experiments

Method: We simulated a region covered by a 10×10 grid. Using a Poisson distribution with mean 9500, we randomly generated the number of people in each cell of the grid. Next, using this simulated population, the Bayesian network in PANDA-CDCA with the outbreak node O instantiated to no, and logic sampling [7], we simulated ED visits during a one-year period in which no outbreak was occurring. For each cell, we determined the mean and standard deviation σ of the number of ED visits for that cell. We simulated 3 types of 30-day influenza outbreaks: mild, moderate, and severe. To simulate a mild outbreak in a given cell, which reaches its peak on the 15th day, we assumed that 15σ extra ED visits (due to patients with influenza) occurred in the first 15 days in the cell, and then we solved

$$\Delta + 2\Delta + \cdots + 15\Delta = 15 \times \sigma$$

for Δ. Since 1 + 2 + ... + 15 = 120, this gives Δ = σ/8. We next injected Δ new ED visits in the cell on day 1, 2Δ on day 2, …, and tΔ on day t. We did this for 12 days. (Outbreaks were always detected by the 12th day.) To simulate moderate and severe outbreaks, we repeated this procedure with values of 2σ and 3σ. The following table shows the average value of Δ for each type of outbreak:

Outbreak Type    Stand. Deviations    Avg. Δ
mild             σ                    .443
moderate         2σ                   .886
severe           3σ                   1.329

The number of injected ED visits must be an integer. We rounded down when tΔ < .5, and up otherwise. Figure 3 shows a simulated outbreak in one cell.

[Chart: daily ED visits in one cell from 1/1/2005 to 1/12/2005; series: Injected, Background.]

Figure 3.
A simulated moderate outbreak.

We simulated outbreaks in six different types of subregions. The first was a T-shaped subregion, the second L-shaped, the third a cross, and the last three were three different separated rectangles. Figure 4 shows the T-shaped subregion and one of the separated-rectangles subregions. For each outbreak type, for each of the six subregion types, we did 12 simulations at different times during the one-year background period. This made a total of 72 simulations for each of the three outbreak types. We used algorithm refine2 with a recursion depth of 5 to determine the outbreak subregion.

Figure 4. The injected T-subregion is shown in (a), and one of the injected separated-rectangles subregions is shown in (b).

Results: To measure detection power, we used AMOC curves [3]. In such curves, the annual number of false positives is plotted on the x-axis and the mean day of detection on the y-axis. Figure 5 shows AMOC curves for each of the outbreak types. To measure detection accuracy, we used the following function:

$$\mathrm{similarity}(S_1, S_2) = \#(S_1 \cap S_2) \, / \, \#(S_1 \cup S_2),$$

where # returns the number of cells in a subregion. This function is 0 if and only if two subregions do not intersect, while it is 1 if and only if they are the same subregion. For each outbreak type, we determined the mean of the similarities between the detected subregions and the injected subregions on each day of the outbreaks. The graphs of these relationships appear in Figure 6. The mean similarity for mild outbreaks is about 0 on day 1, and for moderate and severe outbreaks it has about the same value on day 1. This may be due to rounding. For example, since for mild outbreaks the average Δ = .443, often no ED visits were injected on the first day of such outbreaks.

[Plot: Mean Day of Detection (y-axis) versus False Alarms Per Year (x-axis); curves for Mild, Moderate, and Severe outbreaks.]

Figure 5. AMOC curves.

[Plot: Mean Similarity (y-axis) versus Day of Outbreak (x-axis); curves for Mild, Moderate, and Severe outbreaks.]

Figure 6. Mean similarities between detected subregion and injected subregion.

Discussion and Conclusions

The results are encouraging. They indicate that, on average, we can detect 30-day severe, moderate, and mild outbreaks in complex subregions about 1.9, 2.2, and 4.0 days into the outbreak, respectively. Furthermore, the similarity between the detected subregion and the outbreak subregion averages about .7 by the 2nd, 3rd, and 8th days of severe, moderate, and mild outbreaks, respectively.

We presented a recursive algorithm for detecting outbreaks in complex subregions. The results reported here provide support that the algorithm is a promising method for detecting such outbreaks. In this preliminary evaluation, we used simulated data in order to test the inherent detection capability of the algorithm under well controlled conditions. Given that the results were promising, we next plan to evaluate the algorithm using real data and compare its results to those of other approaches.

Acknowledgements: This work was funded by a grant from the National Science Foundation (IIS-0325581).

References

1. Cooper, G.F., Dash, D.H., Levander, J.D., Wong, W.K., Hogan, W.R., Wagner, M.M. 2004. Bayesian Biosurveillance of Disease Outbreaks. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Arlington, VA.
2. Cooper, G.F., Dowling, J.N., Levander, J.D., Sutovsky, P. 2006. A Bayesian Algorithm for Detecting CDC Category A Outbreak Diseases from Emergency Department Chief Complaints. Proceedings of Syndromics 2006, Baltimore, MD.
3. Fawcett, T., Provost, F. 1999. Activity Monitoring: Noticing Interesting Changes in Behavior. In Proceedings of the Fifth SIGKDD Conference on Knowledge Discovery and Data Mining. San Diego, CA: ACM Press.
4. Jiang, X., Wallstrom, G.L. 2006. A Bayesian Network for Outbreak Detection and Prediction. In Proceedings of AAAI-06, Boston, MA.
5. Kulldorff, M. 1997. A Spatial Scan Statistic. Communications in Statistics: Theory and Methods 26, 6.
6. Le Strat, Y., Carrat, F. 1999. Monitoring Epidemiological Surveillance Data using Hidden Markov Models. Statistics in Medicine 18.
7. Neapolitan, R.E. 2004. Learning Bayesian Networks. Upper Saddle River, NJ: Prentice Hall.
8. Neill, D.B., Moore, A.W., Cooper, G.F. 2005. A Bayesian Spatial Scan Statistic. Advances in Neural Information Processing Systems 18.
9. Wong, W.K., Moore, A. 2006. Classical Time Series Methods for Biosurveillance. In Wagner, M. ed. Handbook of Biosurveillance, New York, NY: Elsevier.
