
Fast Automatic Unsupervised Image Segmentation and Curve Detection in Spatial Point Patterns

by

Derek C. Stanford

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington

1999

Program Authorized to Offer Degree: Statistics


University of Washington

This is to certify that I have examined this copy of a doctoral dissertation by Derek C. Stanford and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Chair of Supervisory Committee:
Adrian Raftery

Reading Committee:
Nayak Polissar
Paul Sampson
Werner Stuetzle

Date


In presenting this dissertation in partial fulfillment of the requirements for the Doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of the dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Bell and Howell Information and Learning, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

Signature

Date


University of Washington

Abstract

Fast Automatic Unsupervised Image Segmentation and Curve Detection in Spatial Point Patterns

by Derek C. Stanford

Chairperson of Supervisory Committee: Professor Adrian Raftery, Statistics Department

There is a growing need for image analysis methods which can process large image databases quickly and with limited human input. I propose a method for segmenting greyscale images which automatically estimates all necessary parameters, including choosing the number of segments. This method is both fast and general, and it does not require any training data. The EM and ICM algorithms are used to fit an image model and compute a pseudolikelihood; this pseudolikelihood is used in a modified form of the Bayesian Information Criterion (BIC) to automatically select the number of segments. A consistency result for this approach is proven and several example applications are shown. A method for automatically detecting curves in spatial point patterns is also presented. Principal curves are used to model curvilinear features; BIC is used to automatically select the amount of smoothing. Applications to simulated minefields and seismological data are shown.


TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
1.1 Motivation
1.2 Other Methods for Image Segmentation
1.3 Other Methods for Image Segmentation with Choice of the Number of Segments
1.4 Overview

Chapter 2: Clustering on Open Principal Curves
2.1 Introduction
2.2 Model, Estimation and Inference
2.2.1 Principal Curves
2.2.2 Probability Model
2.2.3 Estimation: The CEM-PCC Algorithm
2.2.4 Inference: Choosing the Number of Features and Their Smoothness Simultaneously
2.3 Initialization
2.3.1 Denoising and Initial Clustering
2.3.2 Hierarchical Principal Curve Clustering (HPCC)


2.4 Examples
2.4.1 A Simulated Two-Part Curvilinear Minefield
2.4.2 A Simulated Curvilinear Minefield
2.4.3 New Madrid Seismic Region
2.5 Discussion

Chapter 3: Marginal Segmentation
3.1 BIC
3.2 Mixture Models
3.3 BIC with Mixture Models
3.4 Classification
3.4.1 Mixture versus Componentwise Classification
    Mixture Classification
    Theorem 3.1: Optimality of Mixture Classification
    Proof of Theorem 3.1
    Componentwise Classification
    Theorem 3.2: Optimality of Componentwise Classification
    Proof of Theorem 3.2
3.4.2 Correspondence of Components with Segments

Chapter 4: Adjusting for Autoregressive Dependence
4.1 Adjusting BIC for the AR(1) Model
4.1.1 Loglikelihood Adjustment
    Independence Case
    Dependence Case
    Effect on Computation


4.1.2 Penalty Adjustment
    Independence Case
    Dependence Case
4.1.3 Computing BIC with the AR(1) Model
4.2 Adjusting BIC for the Raster Scan Autoregression (RSA) Model
4.2.1 Loglikelihood Adjustment
4.2.2 Penalty Adjustment
4.2.3 Computing BIC with the RSA Model
4.3 Mixture RSA Models
4.4 Fitting the Raster Scan Autoregression Model
4.5 Choosing the Number of Segments with BIC
4.6 Application of the RSA Model to Image Data

Chapter 5: Automatic Image Segmentation via BIC
5.1 Pseudolikelihood for Image Models
5.1.1 Potts Model
5.1.2 ICM
5.1.3 Pseudoposterior Distribution of the True Scene
5.1.4 Pseudolikelihood and BIC
5.1.5 Consistency of BIC_PL
    Theorem 5.1: Consistency of Choice of K
    Condition A
    Lemma 1: Integrability
    Proof of Lemma 1
    Lemma 2: Ergodicity
    Proof of Theorem 5.1


    Case 1: K_T = 1
    Case 2: K_T = 2, K = 1, and condition A
5.2 An Automatic Unsupervised Segmentation Method
5.2.1 Overview
5.2.2 Initialization
5.2.3 Marginal Segmentation via Mixture Models
    Parameter Estimation by EM
    M-Step
    E-Step
    Practical Issues
    Final Marginal Segmentation
5.2.4 ICM and Pseudolikelihood BIC
5.2.5 Determining the Number of Components
5.2.6 Morphological Smoothing (Optional)
5.3 Image Segmentation Examples
5.3.1 Simulated Two Segment Image
5.3.2 Simulated Three Segment Image
5.3.3 Ice Floes
5.3.4 Dog Lung
5.3.5 Washington Coast
5.3.6 Buoy

Chapter 6: Conclusions

References


Appendix A: Software Discussion
A.1 XV
A.2 C code
A.3 Splus code


LIST OF FIGURES

1.1 (a) Simulated image with 3 underlying segments and Gaussian noise. (b) Result of automatic segmentation.
2.1 (a) Simulated minefield with noise. (b) Final result.
2.2 Principal curve example.
2.3 Two part curvilinear minefield after denoising using nearest neighbor cleaning.
2.4 Initial clustering of two part curvilinear minefield
2.5 HPCC applied to the two-part curvilinear minefield.
2.6 Simulated curvilinear minefield.
2.7 Simulated curvilinear minefield after denoising using nearest neighbor cleaning.
2.8 Initial clustering of denoised curvilinear minefield.
2.9 CEM-PCC applied to the curvilinear minefield.
2.10 New Madrid earthquakes 1974-1992.
2.11 New Madrid data after denoising.
2.12 Initial clustering of denoised New Madrid earthquake data
2.13 HPCC applied to the New Madrid data.
2.14 CEM-PCC applied to the New Madrid data.


4.1 (a) Signal generated by an AR(1) process (R^2 = 0.93). (b) Signal consisting of two sequences of independent Gaussian noise (R^2 = 0.92).
4.2 Simulated two segment image.
5.1 Simulated two-segment image.
5.2 Scrambled version of figure 5.1.
5.3 Marginal histogram of the simulated image.
5.4 Simulation of a three segment image, before processing.
5.5 Marginal histogram of the simulated image, with the estimated 3 component mixture density.
5.6 Initial segmentation of the simulated image by Ward's method, using 3 segments.
5.7 Segmentation of the simulated image into 3 segments after EM.
5.8 Segmentation of the simulated image into 3 segments after ICM.
5.9 Segmentation of the simulated image into 3 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.10 Aerial image of ice floes.
5.11 Marginal histogram of the ice floe image, with the estimated 2 component mixture density.
5.12 Initial segmentation of the ice floe image by Ward's method, using 2 segments.
5.13 Segmentation of the ice floe image into 2 segments after EM.
5.14 Segmentation of the ice floe image into 2 segments after refinement by ICM.


5.15 Segmentation of the ice floe image into 3 segments after refinement by ICM.
5.16 Segmentation of the ice floe image into 2 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.17 PET image of a dog lung, before processing.
5.18 Marginal histogram of the dog lung image, with the estimated 4 component mixture density.
5.19 Initial segmentation of the dog lung image by Ward's method, using 4 segments.
5.20 Segmentation of the dog lung image into 4 segments after EM.
5.21 Segmentation of the dog lung image into 4 segments after ICM.
5.22 Segmentation of the dog lung image into 4 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.23 Satellite image of Washington coast, before processing.
5.24 Marginal histogram of the Washington coast image, with the estimated 6 component mixture density.
5.25 Initial segmentation of the Washington coast image by Ward's method, using 6 segments.
5.26 Segmentation of the Washington coast image into 6 segments after EM.
5.27 Segmentation of the Washington coast image into 6 segments after refinement by ICM.
5.28 Segmentation of the Washington coast image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.29 Aerial image of a buoy, before processing.
5.30 Buoy image after initial smoothing to mitigate the scan line artifact.
5.31 Marginal histogram of the buoy image, with the estimated 6 component mixture density.
5.32 Initial segmentation of the buoy image by Ward's method, using 6 segments.
5.33 Segmentation of the buoy image into 6 segments after EM.
5.34 Segmentation of the buoy image into 6 segments after ICM.
5.35 Segmentation of the buoy image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


LIST OF TABLES

2.1 BIC results for the two part curvilinear minefield.
2.2 BIC results for simulated sine wave minefield.
2.3 BIC results for New Madrid seismic data.
4.1 Loglikelihood and BIC results for the data of figure 4.1B.
4.2 Loglikelihood and BIC results for the image of figure 4.2.
5.1 BIC_PL and BIC_IND results for the simulated two-segment image and the scrambled image.
5.2 Logpseudolikelihood and BIC_PL results for the simulated image. A missing value, noted with †, is discussed in the text.
5.3 EM-based parameter estimates for the simulated image.
5.4 Logpseudolikelihood and BIC_PL results for the ice floe image.
5.5 EM-based parameter estimates for the ice floe image.
5.6 Logpseudolikelihood and BIC_PL results for the dog lung image.
5.7 EM-based parameter estimates for the dog lung image.
5.8 Logpseudolikelihood and BIC_PL results for the Washington coast image. Missing values, noted with †, are discussed in the text.
5.9 EM-based parameter estimates for the Washington coast image.
5.10 Logpseudolikelihood and BIC_PL results for the buoy image.
5.11 EM-based parameter estimates for the buoy image.


DEDICATION

This work is dedicated to my family, to my friends, and to anyone who finds it useful. Caveat Emptor.


Chapter 1

INTRODUCTION

1.1 Motivation

Image segmentation is the process of classifying each pixel of an image into a set of classes, where the number of classes is much smaller than the number of unique pixel values. The goal of image segmentation is to separate features from each other and from background, where features are items of interest in an image. For example, we might want to separate different tissue types in a brain image: grey matter, white matter, bone, blood, and so on.

To illustrate this idea, a simple simulated image is shown in figure 1.1, along with a segmented version. In this simulation, it is visually clear that there are three segments. We want the computer to be able to detect these segments automatically; in this case, the underlying segments are reconstructed perfectly using the algorithm described in chapter 5. The details of the analysis of this simulation can be found in section 5.3.2.

Segmentation can be accomplished manually by a human expert who simply looks at an image, determines borders between regions, and classifies each region. This is perhaps the most reliable and accurate method of image segmentation, because the human visual system is immensely complex and well suited to the task. However, modern data acquisition methods create a huge amount of image data for which manual analysis would be prohibitively expensive and time-consuming.


Figure 1.1: (a) Simulated image with 3 underlying segments and Gaussian noise. (b) Result of automatic segmentation.

This leads to the current goal: to develop a general method for segmenting images quickly and entirely automatically. Furthermore, I use no training data; in applications where training data are available, this method could be an initial step in a more complex classification process. Image segmentation is needed in a wide variety of disciplines with many different imaging modalities. Some examples are multispectral satellite or aerial images for geoscience or military reconnaissance; medical imaging with PET, CAT, MRI, and ultrasound; and real-time image streams for quality control in manufacturing.

1.2 Other Methods for Image Segmentation

Image segmentation is a well studied problem for which many methods have been developed. Comprehensive surveys of the image segmentation literature are available in Haralick and Shapiro (1985) and more recently in Pal and Pal (1993). I restrict my attention to the automatic, unsupervised case; these methods work


without training data (unsupervised) and without the need for manual fine-tuning of parameters (automatic). Although many such methods exist, most assume that K, the number of segments, is known in advance. The basic approaches to image segmentation consist of three types of methods: region growing, edge finding, and pixel classification. Here, I give a brief description of these areas.

Region growing methods seek homogeneous regions, and then grow and merge these regions until the desired number of segments is reached. The growing or merging of regions is typically controlled by a homogeneity measure, such as an entropy criterion or a least squares measure. A well-known example of the latter is Hartigan's K-means clustering (Hartigan, 1975). In addition to the homogeneity measure, regions can be characterized by color, shape, size, and so on; these measurements can be incorporated into the region growing algorithm or used in a subsequent processing step. For example, Campbell et al. (1997) generate an initial segmentation using a simple K-means approach, and then use texture, color, and shape to refine and classify the regions. Although the classification step requires extensive training data, this approach achieves impressive results, correctly classifying over 90% of the pixels in a set of outdoor urban test images into 11 object classes.

Edge finding methods identify edges in a scene; after linking or extending these edges to form closed regions, edges can be removed until the desired number of segments is attained. A wide variety of edge finding and edge enhancing methods are available, with a corresponding range of computational complexity. On the simpler end of the spectrum, high-pass filtering methods can be implemented as convolution operations with simple kernels (for an introductory review of simple convolution methods, see Burdick, 1997); these approaches are similar to mathematical morphology. Usually, more complicated convolution approaches are used, such as Canny's edge detection (Canny, 1986). Most convolution methods are very


fast because they can be implemented in the Fourier domain. More complicated models usually include some sort of distributional assumptions about noise in the image; for example, Bovik and Munson (1986) show that when both Gaussian and impulse noise are present, an edge detector based on local median values is more robust than a similar algorithm using mean values. A similar distributional assumption is made by Kundu (1990), who develops a multi-stage approach to deal with the different noise types.

Pixel classification methods attempt to classify individual pixels using either the pixel value or the pixel value and the values of adjacent or nearby pixels (also known as a neighborhood). A simple pixel classification method is given in chapter 3; this assumes that the pixel values follow a Gaussian mixture distribution and ignores spatial information. The pixels are then classified into the component of the mixture from which they are most likely to have arisen. Methods which make use of neighborhood information include Markov random field models; an early example of segmentation with Markov random field models is given by Hansen and Elliott (1982), who achieve good results, especially considering the limited computing power available at the time.
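To make the marginal pixel-classification idea concrete, here is a small Python sketch that fits a Gaussian mixture to the pixel values of a greyscale image and labels each pixel with its most probable component. It is only an illustration of the general approach described above, not the dissertation's own software (which is the Splus and C code discussed in Appendix A); the toy image, the choice of three components, and all names are assumptions of the example.

```python
# Marginal (non-spatial) pixel classification: fit a K-component Gaussian
# mixture to the pixel values, then label each pixel with the component
# under which its value is most probable.
import numpy as np
from sklearn.mixture import GaussianMixture

def marginal_segmentation(img, n_segments=3, seed=0):
    """Classify pixels of a greyscale image by a Gaussian mixture fit to
    the marginal pixel values; spatial information is ignored."""
    values = img.reshape(-1, 1).astype(float)
    gmm = GaussianMixture(n_components=n_segments, random_state=seed)
    labels = gmm.fit_predict(values)            # most probable component per pixel
    return labels.reshape(img.shape), gmm

if __name__ == "__main__":
    # A toy 60x60 image with three noisy intensity levels (an assumption).
    rng = np.random.default_rng(0)
    img = rng.normal(loc=np.repeat([50, 120, 200], 1200), scale=15).reshape(60, 60)
    seg, model = marginal_segmentation(img, n_segments=3)
    print("estimated component means:", np.sort(model.means_.ravel()))
```

Chapter 3 develops this marginal mixture model formally, and chapter 5 adds spatial information through a Markov random field refinement.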


1.3 Other Methods for Image Segmentation with Choice of the Number of Segments

In this dissertation I present an automatic and unsupervised method for image segmentation including the choice of K, based on mixture models and the Bayesian Information Criterion (BIC). Other methods which address the problem of choosing K can be divided into two categories: ad hoc procedures, and procedures which require tuning parameters. Although the ad hoc procedures may give good results for some applications, it is doubtful that they would be applicable in general. Similarly, methods which use an arbitrarily chosen tuning parameter may not be generally applicable, though often one can set the tuning parameter to a value which is reasonable for a large class of images. Also, it is usually easier to choose a tuning parameter than to choose K directly; the tuning parameter may let K vary rather than forcing a single value of K. Below are a few recent papers which address the automatic choice of K.

Dingle and Morrison (1996) use local empirical density functions to characterize each segment. They begin with K=1, and then create "outlier" regions by choosing an arbitrary threshold on total variation to determine when two density functions are different. Outlier regions become segments (thereby increasing K) when their size is larger than another arbitrarily chosen threshold.

Chen and Kundu (1993) use a hidden Markov model (HMM) approach to texture segmentation. They define a distance between HMMs called the discrimination information (DI). A split-and-merge procedure is used in which an HMM is fit to each region, and regions are merged if their corresponding HMMs have a DI below a certain threshold. This threshold is chosen by a convoluted ad hoc procedure which depends on 3 arbitrarily chosen parameters.

Johnson (1994) defines a Gibbs distribution on region identifiers in order to allow inference. An arbitrary parameter in the potential function for the Gibbs distribution is used to penalize results with many segments.

Given some choice for K, there are several estimation methods which can be used to fit a mixture model to the data. The EM algorithm (Dempster et al., 1977) can be used to estimate parameters in a Gaussian or Poisson model (Hathaway, 1986). Many variations of this approach have been developed: CEM (Celeux and Govaert, 1992), SEM (Masson and Pieczinsky, 1993), and NEM (Ambroise et al., 1996). CEM is an adaptation of EM for hard classification; the other two methods take some account of spatial information. A similar but nonparametric segmentation method was developed by Letts (1978) and extended to the case of


multidimensional observations at each pixel.

1.4 Overview

I try to make only minimal assumptions about the imaging method, but some decisions are needed. For instance, raw data from PET images are modeled much better by Poisson mixtures than by Gaussian mixtures, and vice versa for aerial photography. In general, different features may be distinguished by color or texture. I do not consider texture differences here. The examples and discussion will focus on greyscale images, though these methods can be extended to color or multispectral images.

Chapter 2 presents a method for automatic detection of curves in spatial point processes; this method uses principal curves to model features and the BIC to choose the amount of smoothing. Chapter 3 discusses marginal segmentation methods, which I use to find an initial segmentation of the image. Chapter 4 presents two models for autoregressive dependence, the AR(1) model and the raster scan autoregression (RSA) model, and discusses how BIC can be adjusted to accommodate these models. In chapter 5, I discuss a Markov random field model for images, and I describe an algorithm for automatic, unsupervised image segmentation. Examples follow at the end of the chapter. Conclusions and further discussion are given in chapter 6. A discussion of the software developed with this dissertation is given in Appendix A.


Chapter 2

CLUSTERING ON OPEN PRINCIPAL CURVES

Clustering about principal curves combines parametric modeling of noise with nonparametric modeling of feature shape. This is useful for detecting curvilinear features in spatial point patterns, with or without background noise. Applications include the detection of curvilinear minefields from reconnaissance images, some of the points in which represent false detections, and the detection of seismic faults from earthquake catalogs.

Our algorithm for principal curve clustering is in two steps: the first is hierarchical and agglomerative (HPCC), and the second consists of iterative relocation based on the Classification EM algorithm (CEM-PCC). HPCC is used to combine potential feature clusters, while CEM-PCC refines the results and deals with background noise. It is important to have a good starting point for the algorithm: this can be found manually or automatically using, for example, nearest neighbor clutter removal or model-based clustering. We choose the number of features and the amount of smoothing simultaneously using approximate Bayes factors.

2.1 Introduction

We wish to detect curvilinear features in spatial point processes automatically. We must deal with two kinds of noise: background noise, in the form of observed points which are not part of the features, and feature noise, which is the deviation of observed feature points from an underlying "true" feature curve. One such problem is the detection of curvilinear minefields in aerial reconnaissance images.


Figure 2.1(a) is a simulation of such an image, and Figure 2.1(b) shows the features detected by our method.

Figure 2.1: (a) Simulated minefield with noise. (b) Final result.

In Section 2.2, we give some background on principal curves and introduce our probability model and clustering algorithm. Section 2.2.3 presents our method for clustering on open principal curves, and Section 2.2.4 describes our use of approximate Bayes factors to choose the number of features and the amount of smoothing simultaneously and automatically. Initialization methods, including our hierarchical principal curve clustering (HPCC) algorithm, are discussed in Section 2.3. Examples are presented in Section 2.4, and in Section 2.5 we discuss other approaches and areas of further work.


2.2 Model, Estimation and Inference

2.2.1 Principal Curves

A principal curve is a smooth, curvilinear summary of n-dimensional data; it is a nonlinear generalization of the first principal component line. Principal curves were introduced by Hastie and Stuetzle (1989) and discussed in the clustering context by Banfield and Raftery (1992). The curve f is a principal curve of h if

E(X | λ_f(X) = λ) = f(λ)    (2.1)

for almost all λ, where X is a random vector with density h in R^n and λ_f is the function which projects points in R^n orthogonally onto f. When this holds, f is also said to be self-consistent for h. A principal curve f is parametrized by λ, the arc length along the curve.

The algorithm for fitting a principal curve from data involves iteratively applying the definition (2.1), where the conditional expectation is replaced by a scatterplot smoother. The choice of the smoothing parameter is discussed in Section 2.2.4. Each data point, x_j, has an associated projection point f(λ_j) on the curve, which is the point on the curve closest to x_j (see Figure 2.2). The line segment from x_j to f(λ_j) is orthogonal to the curve at f(λ_j), unless f(λ_j) is an endpoint of the curve. Bias correction for closed principal curves (Banfield and Raftery, 1992) can also be extended to the open principal curves that we use here.
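To make the projection index concrete, the following Python sketch projects data points onto a fitted curve represented as an ordered polyline, returning each point's projection point f(λ), its arc length λ, and its orthogonal distance to the curve (points lying beyond an end of the open curve are projected to that endpoint). This is an illustration only, not the Splus principal.curve routine used in the dissertation; the polyline representation and all names are assumptions of the sketch.

```python
# Projection of data points onto a curve discretized as an ordered polyline.
import numpy as np

def project_onto_curve(points, curve):
    """points: (N, d) array; curve: (M, d) ordered vertices of the fitted curve.
    Returns (projection points, arc lengths lambda, orthogonal distances)."""
    seg_start, seg_end = curve[:-1], curve[1:]
    seg_vec = seg_end - seg_start
    seg_len2 = np.maximum((seg_vec ** 2).sum(axis=1), 1e-12)
    seg_len = np.sqrt(seg_len2)
    arc0 = np.concatenate([[0.0], np.cumsum(seg_len)])[:-1]   # arc length at each segment start

    proj_pts, lambdas, dists = [], [], []
    for x in points:
        t = ((x - seg_start) * seg_vec).sum(axis=1) / seg_len2
        t = np.clip(t, 0.0, 1.0)                               # clamp: open curve has endpoints
        cand = seg_start + t[:, None] * seg_vec                # candidate projection per segment
        d2 = ((x - cand) ** 2).sum(axis=1)
        k = int(np.argmin(d2))                                 # nearest segment wins
        proj_pts.append(cand[k])
        lambdas.append(arc0[k] + t[k] * seg_len[k])
        dists.append(np.sqrt(d2[k]))
    return np.array(proj_pts), np.array(lambdas), np.array(dists)
```

The orthogonal distances returned here are exactly the "distance about the curve" used in the probability model below.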


Figure 2.2: Principal curve example.

2.2.2 Probability Model

We model a noisy spatial point process by making distributional assumptions about the background noise and the feature noise. Suppose that X is a set of observations x_1, . . . , x_N, and C is a partition consisting of clusters C_0, C_1, . . . , C_K, where cluster C_j contains N_j points. The noise cluster is denoted by C_0; we assume that the background noise is uniformly distributed over the region of the image (this is equivalent to Poisson background noise). We assume that the feature points are distributed uniformly along the true underlying feature; that is, their projections onto the feature's principal curve are drawn randomly from a uniform distribution U(0, ν_j), where ν_j is the length of the j-th curve. We assume that the feature points are distributed normally about the true underlying feature, with mean zero and variance σ_j^2. Distance about the curve is the orthogonal distance from a point to the curve; if the point projects to an endpoint of the curve, it is simply the distance from the point to the curve endpoint. The (K + 1) clusters are combined in a mixture model, and we denote the unconditional probability of belonging to the j-th feature by π_j (j = 0, 1, . . . , K).

Let θ denote the entire set of parameters, {ν_j, σ_j^2, π_j : j = 1, . . . , K}, not including the curves themselves. Then the likelihood is L(X|θ) = ∏_{i=1}^N L(x_i|θ),


where L(x_i|θ) = ∑_{j=0}^K π_j L(x_i|θ, x_i ∈ C_j). For feature clusters,

L(x_i|θ, x_i ∈ C_j) = (1/ν_j) (1/(√(2π) σ_j)) exp( −||x_i − f(λ_ij)||^2 / (2σ_j^2) ),

where ||x_i − f(λ_ij)|| is the Euclidean distance from the point x_i to its projection point f(λ_ij) on curve j. For the noise cluster,

L(x_i|θ, x_i ∈ C_0) = 1/Area,

where Area is the area of the image.

2.2.3 Estimation: The CEM-PCC Algorithm

The CEM-PCC algorithm refines a given clustering by using the Classification EM algorithm (Celeux and Govaert, 1992), which is a version of the well known EM algorithm (Dempster et al., 1977), and the probability model of Section 2.2.2. We start with an initial clustering from the methods discussed in Section 2.3.

Overview of the CEM-PCC algorithm (a code sketch of this loop follows the list):

1. Begin with an initial clustering (features and noise).
2. (M-step) Conditional on the current clustering, fit a principal curve to each feature cluster and then compute estimates of the parameters (ν_j, σ_j^2, and π_j).
3. (E-step) Conditional on the current curves and parameter estimates, calculate the likelihood of each point being in each cluster.
4. (Classification step) Reclassify each point into its most likely cluster.
5. Check for convergence; end or return to step 2.
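The loop can be sketched in Python as follows. This is an illustration of one CEM-PCC iteration under the probability model of Section 2.2.2, not the dissertation's Splus implementation: fit_curve (a principal curve or spline fit) and project (for instance, the projection sketch at the end of Section 2.2.1) are assumed helpers, area is the image area, min_var is the lower bound on the variance discussed below, and each feature cluster is assumed non-empty.

```python
# One M/E/classification pass of CEM-PCC; label 0 is background noise,
# labels 1..K are feature clusters.
import numpy as np

def cem_pcc_iteration(points, labels, area, fit_curve, project, K, min_var=1e-6):
    N = len(points)
    curves, nu, sigma2, pi = [], [], [], []
    for j in range(1, K + 1):                          # M-step
        cluster = points[labels == j]
        curve = fit_curve(cluster)                     # assumed helper: ordered polyline
        _, lam, dist = project(cluster, curve)
        curves.append(curve)
        nu.append(max(lam.max() - lam.min(), 1e-12))   # rough estimate of curve length nu_j
        sigma2.append(max(float(np.mean(dist ** 2)), min_var))  # variance about the curve, bounded below
        pi.append(len(cluster) / N)
    pi0 = 1.0 - sum(pi)                                # mixing proportion of the noise cluster

    lik = np.empty((N, K + 1))                         # E-step: weighted likelihoods
    lik[:, 0] = pi0 / area                             # uniform background noise over the image
    for j in range(1, K + 1):
        _, _, dist = project(points, curves[j - 1])
        dens = (np.exp(-dist ** 2 / (2.0 * sigma2[j - 1]))
                / (nu[j - 1] * np.sqrt(2.0 * np.pi * sigma2[j - 1])))
        lik[:, j] = pi[j - 1] * dens

    new_labels = lik.argmax(axis=1)                    # classification step
    loglik = float(np.log(lik.sum(axis=1)).sum())      # overall likelihood L(X|theta)
    return new_labels, curves, loglik
```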


Once we have calculated the probability of each point being in each cluster based on the estimates of the parameters at the current iteration, we reclassify each point into the cluster for which it has the highest likelihood. At the end of each iteration, we compute the overall likelihood L(X|θ). This process is executed for a predetermined number of iterations, at which point we choose as the final result the clustering with the highest overall likelihood (CEM iterations sometimes decrease the likelihood).

We have found that it is useful to impose a lower bound on the estimate of the variance about the curve. If the variance is allowed to decrease without bound, the likelihood can grow without bound. This can be a problem when there are small clusters, since the smoothing is almost able to interpolate the data points. We impose a bound based on the assumption that the data are not known with absolute precision. For instance, if we assume the data are precise to three significant digits, then we can find a lower bound on the resolution of the data and translate this to a lower bound on the variance.

2.2.4 Inference: Choosing the Number of Features and Their Smoothness Simultaneously

Since the number of clusters affects the overall amount of smoothing, we select the smoothing parameter and the number of clusters simultaneously. The amount of smoothing in each feature cluster is measured by the degrees of freedom (DF) used in fitting the principal curve to that cluster. We use a cubic B-spline (Wold, 1974) in fitting the principal curves; specifically, we use the function principal.curve (obtained from Statlib) which calls the Splus function smooth.spline. The DF of a cubic B-spline is given by the trace of the implicit smoother matrix S; S is the matrix which yields the fitted values at the observed data points (Tibshirani and Hastie, 1987; Hastie and Tibshirani, 1990).


Each combination of number of features and degrees of freedom (i.e., smoothness of a feature) considered is viewed as specifying a possible model for the data, and the competing models are compared using Bayes factors (Kass and Raftery, 1995). We approximate the Bayes factor using the Bayesian Information Criterion (BIC; Schwarz, 1978); the difference between the BIC values for two models is approximately equal to twice the log Bayes factor when unit information priors for the model parameters are used (Kass and Wasserman, 1995). These are priors that contain about the same amount of information as a single typical observation. This approach has been found to work well for mixture models (Roeder and Wasserman, 1997).

The BIC for a model with K features and background noise is defined by

BIC = 2 log(L(X|θ)) − M · log(N),

where M = K(DF + 2) + K + 1 is the number of parameters. The number of feature clusters is K; for each feature cluster we estimate ν_j and σ_j, and we fit a curve using DF degrees of freedom. The mixing proportions add K parameters, and the estimate of the image area used in the noise density is one more parameter. The larger the BIC, the more the model is favored by the data. Conventionally, differences of 2–6 between BIC values for models represent positive evidence, differences of 6–10 correspond to strong evidence, while differences greater than 10 indicate very strong evidence (Kass and Raftery, 1995).
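In code, the criterion and the joint choice of K and DF look like this. The Python fragment below is only an illustration, not the dissertation's software; fit_model, standing in for a full HPCC/CEM-PCC fit at a given number of features and DF, is an assumed helper, and all names are hypothetical.

```python
# BIC for a principal-curve clustering with K features and background noise:
#   BIC = 2*loglik - M*log(N),  with  M = K*(DF + 2) + K + 1.
import numpy as np

def pcc_bic(loglik, n_points, n_features, df):
    n_params = n_features * (df + 2) + n_features + 1
    return 2.0 * loglik - n_params * np.log(n_points)

def select_model(points, area, fit_model, feature_range, df_range):
    """Scan a grid of (K, DF) pairs and keep the fit with the largest BIC.
    fit_model(points, area, K, df) -> (labels, loglik) is an assumed helper."""
    best = None
    for K in feature_range:
        for df in df_range:
            labels, loglik = fit_model(points, area, K, df)
            bic = pcc_bic(loglik, len(points), K, df)
            if best is None or bic > best[0]:
                best = (bic, K, df, labels)
    return best  # (BIC, number of features, DF, classification)
```

Tables 2.1, 2.2, and 2.3 later in this chapter report exactly such grids of BIC values.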


2.3 Initialization

2.3.1 Denoising and Initial Clustering

The performance of the CEM-PCC algorithm can be sensitive to the starting value, so it is important to have a good starting value. The initial clustering used to obtain this should accomplish two objectives: separate the feature points from background noise, and provide an initial clustering of the feature points. The first of these can be done by a human, or by various automatic methods such as nonparametric maximum likelihood using the Voronoi tessellation (Allard and Fraley, 1997), or Kth nearest neighbor clutter removal (Byers and Raftery, 1998). This step does not need to be perfect, since CEM-PCC will examine the noise points to determine if they should be included in the features, and vice versa.

Once the noise points have been removed, we need an initial clustering of the feature points so that a curve can be fit to each cluster. We recommend that there be at least seven points in each cluster; when there are fewer than seven, we fit a principal component line instead of a curve. The feature points can be clustered using model-based clustering as implemented in the MCLUST software (Banfield and Raftery, 1993; Dasgupta and Raftery, 1998; Fraley and Raftery, 1998). A simpler method is to fit a minimum spanning tree to the feature points and cut the longest edges, which will work well if the main clusters are well separated (Roeder and Wasserman, 1997; Zahn, 1971).

2.3.2 Hierarchical Principal Curve Clustering (HPCC)

Clustering on closed principal curves was introduced by Banfield and Raftery (1992). Their clustering criterion (V*) is based on a weighted sum of the squared distances about the curve and the squared distances along the curve, and they state that it is optimal when the data points are normally distributed about the curve (conditional on the estimated curves and assuming that α is chosen properly). It is defined by

V* = V_About + α V_Along,

where V_About = ∑_{j=1}^N (x_j − f(λ_j))^2, V_Along = (1/2) ∑_{j=1}^N (ε_j − ε̄)^2, and ε_j = f(λ_j) − f(λ_{j+1}).


The V_About term measures the spread of observations about the curve (in orthogonal distance to the curve), while the V_Along term measures the variance in arc length distances between projection points on the curve. Minimizing ∑ V* (where the sum is over all clusters) will lead to clusters with points regularly spaced along the curve and tightly grouped around it. Large values of α will cause the algorithm to avoid clusters with gaps, while small values will favor thinner clusters. Clustering stops when merging clusters would lead to an increase in ∑ V*.

We extend the method to open principal curves by changing V_Along so that the sum goes only to (N − 1) instead of to N. This is because the closed curves could wrap around, whereas the open curve stops at its end points.

Overview of HPCC:

1. Make a first estimate of the noise points and remove them.
2. Form an initial clustering with at least seven points in each cluster.
3. Fit a principal curve to each cluster.
4. Calculate ∑ V* for each possible merge.
5. Perform the merge which leads to the lowest ∑ V*.
6. Keep merging until the desired number of clusters is reached.

Deciding when to stop clustering is more difficult for open curves than for closed curves. In the closed curve case, clustering stops when any merge would lead to an increase in ∑ V* (Banfield and Raftery, 1993). For open curves, this method leads to an overfitting problem in which we end up with too many clusters. V* can be made arbitrarily close to zero by increasing the number of clusters. We overcame this problem by using approximate Bayes factors (Section 2.2.4).
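As an illustration of the criterion, the fragment below evaluates V_About, V_Along (open-curve form, with N − 1 spacing terms), and V* for a single cluster, given projections such as those produced by the sketch at the end of Section 2.2.1. It interprets ε_j as the spacing between successive projection points along the curve; that reading, the default α, and all names are assumptions of this sketch rather than part of the dissertation's Splus code.

```python
# V* = V_About + alpha * V_Along for one cluster, open-curve version.
import numpy as np

def v_star(points, proj_points, lambdas, alpha=1.0):
    """points, proj_points: (N, d) arrays; lambdas: arc lengths of the projections."""
    v_about = float(((points - proj_points) ** 2).sum())      # squared distances about the curve
    order = np.argsort(lambdas)                               # walk the projections along the curve
    steps = np.diff(proj_points[order], axis=0)               # eps_j, j = 1, ..., N-1
    gaps = np.linalg.norm(steps, axis=1)                      # spacing between successive projections
    v_along = 0.5 * float(((gaps - gaps.mean()) ** 2).sum())  # variability of the spacings
    return v_about, v_along, v_about + alpha * v_along
```

In HPCC, this quantity is summed over all clusters, and the merge yielding the smallest total ∑ V* is performed at each step.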


2.4 Examples

2.4.1 A Simulated Two-Part Curvilinear Minefield

The simulated minefield shown in Figure 2.1(a) contains 100 points in each of the two curves and 200 points of background noise (400 points total). This simulation was created using offset semicircles as the true underlying features, and background noise was generated uniformly over the image area. Note that some of the background noise points will fall inside the regions of feature points; these noise points will be indistinguishable from feature points.

The first step is to separate the features from the noise, which we did using 9th nearest neighbor denoising (Byers and Raftery, 1998); the resulting feature points are shown in Figure 2.3. We then used MCLUST (Banfield and Raftery, 1993) to provide an initial clustering into 9 clusters; this is shown in Figure 2.4. We used 9 clusters for the initial clustering because this is the largest number of clusters for which MCLUST returns a clustering in which each cluster has at least 7 points. HPCC was applied to obtain 2 clusters, shown in Figure 2.5. The noise points were then returned to the image with the HPCC clustering, and CEM-PCC was used to refine the clustering. The final result is shown in Figure 2.1(b).

Table 2.1 shows the BIC values for 1 to 3 features with a variety of DF values. The BIC is maximized for 2 features with 5 DF. The approximate Bayes factors identified the correct number of features quite decisively in this example.
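The denoising step can be illustrated as follows. Byers and Raftery (1998) classify points by fitting a two-component mixture model to each point's distance to its Kth nearest neighbor; for brevity, the Python sketch below uses the same Kth nearest neighbor distances but separates the two groups with a simple two-means split instead of the full EM fit, so it is only a rough stand-in for the published method, and the function names are illustrative.

```python
# Simplified Kth-nearest-neighbor denoising: feature points lie in dense
# regions, so their Kth-NN distances are small; clutter points have large ones.
import numpy as np

def knn_distances(points, k):
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)
    return d[:, k]                               # column 0 is the distance to the point itself

def denoise(points, k=9):
    dk = knn_distances(points, k)
    lo, hi = dk.min(), dk.max()
    for _ in range(100):                         # two-means split of the 1-D distances
        near_lo = np.abs(dk - lo) <= np.abs(dk - hi)
        if near_lo.all() or not near_lo.any():
            break
        new_lo, new_hi = dk[near_lo].mean(), dk[~near_lo].mean()
        if np.isclose(new_lo, lo) and np.isclose(new_hi, hi):
            break
        lo, hi = new_lo, new_hi
    near_lo = np.abs(dk - lo) <= np.abs(dk - hi)
    return points[near_lo], points[~near_lo]     # (feature points, clutter)
```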


Figure 2.3: Two part curvilinear minefield after denoising using nearest neighbor cleaning.

Figure 2.4: Initial clustering of two part curvilinear minefield using MCLUST. There are nine clusters.


Figure 2.5: HPCC applied to the two-part curvilinear minefield.


Table 2.1: BIC results for the two part curvilinear minefield.

DF   0 Features   1 Feature   2 Features   3 Features
 2      -1984       -1880       -1846        -1745
 3      -1984       -1850       -1748        -1721
 4      -1984       -1861       -1648        -1648
 5      -1984       -1845       -1628        -1658
 6      -1984       -1803       -1632        -1670
 7      -1984       -1761       -1641        -1685
 8      -1984       -1726       -1643        -1693
 9      -1984       -1702       -1648        -1703
10      -1984       -1692       -1660        -1718
11      -1984       -1689       -1671        -1735
12      -1984       -1689       -1688        -1749
13      -1984       -1680       -1703        -1770
14      -1984       -1721       -1718        -1792
15      -1984       -1748       -1727        -1810
16      -1984       -1777       -1733        -1807
17      -1984       -1755       -1739        -1827
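Programmatically, choosing the model from such a table is a single argmax over the (DF, number of features) grid. The BIC values in the small Python check below are copied from Table 2.1; everything else is illustrative.

```python
import numpy as np

dfs = np.arange(2, 18)
# Columns: 0, 1, 2, 3 features; rows follow dfs.  Values copied from Table 2.1.
bic = np.array([
    [-1984, -1880, -1846, -1745], [-1984, -1850, -1748, -1721],
    [-1984, -1861, -1648, -1648], [-1984, -1845, -1628, -1658],
    [-1984, -1803, -1632, -1670], [-1984, -1761, -1641, -1685],
    [-1984, -1726, -1643, -1693], [-1984, -1702, -1648, -1703],
    [-1984, -1692, -1660, -1718], [-1984, -1689, -1671, -1735],
    [-1984, -1689, -1688, -1749], [-1984, -1680, -1703, -1770],
    [-1984, -1721, -1718, -1792], [-1984, -1748, -1727, -1810],
    [-1984, -1777, -1733, -1807], [-1984, -1755, -1739, -1827],
])
row, col = np.unravel_index(bic.argmax(), bic.shape)
print(f"best model: {col} features with DF = {dfs[row]}")   # 2 features, DF = 5
```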


2.4.2 A Simulated Curvilinear Minefield

Table 2.2: BIC results for simulated sine wave minefield.

DF   0 Features   1 Feature   2 Features   3 Features
 2      -1031       -1034       -1018         -991
 3      -1031       -1016        -938         -941
 4      -1031        -975        -889         -908
 5      -1031        -911        -883         -904
 6      -1031        -884        -887         -901
 7      -1031        -873        -894         -901
 8      -1031        -868        -901         -907
 9      -1031        -869        -908         -921
10      -1031        -872        -913         -929
11      -1031        -873        -919         -940
12      -1031        -877        -926         -951

Figure 2.7: Simulated curvilinear minefield after denoising using nearest neighbor cleaning.


Figure 2.8: Initial clustering of denoised curvilinear minefield using MCLUST. There are seven clusters.

Figure 2.9: CEM-PCC applied to the curvilinear minefield.


2.4.3 New Madrid Seismic Region

Data on 219 earthquakes in the New Madrid seismic region were obtained from the Center for Earthquake Research and Information (CERI) World Wide Web site http://samwise.ceri.memphis.edu. We have included all earthquakes in the New Madrid catalog from 1974 to 1992 with a magnitude of 2.5 and above. This time period was chosen because the data collection methods were consistent; data prior to this period are available but become sparser and less reliable as one goes back in time. The New Madrid region extends from Illinois to Arkansas: latitude 35 to 38 and longitude -91 to -88. These data are displayed in Figure 2.10, and the BIC results are shown in Table 2.3. Figures 2.11 to 2.14 show each step of our process; the final result (Figure 2.14) corresponds to the parameters which yield the maximum BIC value (3 features, each with 10 degrees of freedom).

This example illustrates some strengths and limitations of our method. We can see in Figure 2.11 that the most striking features in this dataset are a combination of lines and blobs. While our method does a good job of picking out the curvilinear features, blobs are not very well modeled by curves, causing the rather awkward-looking result for the rightmost feature.


Figure 2.10: New Madrid earthquakes 1974-1992.

Figure 2.11: New Madrid data after denoising.


Table 2.3: BIC results for New Madrid seismic data.

DF   0 Features   1 Feature   2 Features   3 Features   4 Features
 2       -861        -548        -447         -334         -349
 3       -861        -544        -403         -336         -361
 4       -861        -524        -380         -335         -365
 5       -861        -492        -367         -328         -360
 6       -861        -458        -363         -330         -363
 7       -861        -426        -362         -337         -367
 8       -861        -401        -362         -337         -362
 9       -861        -386        -363         -332         -347
10       -861        -378        -364         -306         -365
11       -861        -375        -372         -320         -383
12       -861        -372        -377         -333         -400
13       -861        -369        -381         -345         -417
14       -861        -367        -387         -355         -431
15       -861        -364        -391         -366         -446
16       -861        -363        -404         -375         -458
17       -861        -360        -408         -383         -468
18       -861        -361        -410         -393         -482
19       -861        -362        -414         -404         -495
20       -861        -359        -420         -410         -507
21       -861        -361        -418         -416         -518
22       -861        -363        -424         -426         -532
23       -861        -365        -424         -430         -541
24       -861        -371        -426         -439         -554


Figure 2.12: Initial clustering of denoised New Madrid earthquake data using MCLUST. There are four clusters.

Figure 2.13: HPCC applied to the New Madrid data.


Figure 2.14: CEM-PCC applied to the New Madrid data.


2.5 Discussion

We have introduced a probability model for noisy spatial point process data with curvilinear features. We use the CEM algorithm to estimate it and classify the points, and we use approximate Bayes factors to find the number of features and the optimal amount of smoothing, simultaneously and automatically. The hierarchical principal curve clustering method of Banfield and Raftery (1992) is extended to open principal curves (HPCC), and we describe an iterative relocation method (CEM-PCC) for refining a principal curve clustering based on the Classification EM algorithm (Celeux and Govaert, 1992). In combination with the denoising method of Byers and Raftery (1998) and an initial clustering method such as MCLUST (Banfield and Raftery, 1993), we have an approach which takes noisy spatial point process data and automatically extracts curvilinear features. The method appears to work well in simulated and real examples.

One way that this kind of data may arise is from image processing. There are many methods for edge detection in images, but most of these methods yield noisy results. These edge detector results can be viewed as a point process and analyzed with principal curve clustering; this would enhance edge and boundary detection in images by reducing noise and looking for larger scale structures. Note that this requires an important additional step to convert the edge detector results into a binary image which can then be regarded as a point process. This is illustrated with ice floe images in Banfield and Raftery (1992).

The Hough transform is a well known and widely used method for fitting a parametric curve to a point set (Illingworth and Kittler, 1988; Hough, 1962). One limitation of the Hough transform is that it fits only parametric curves, so that the form of the curve must be specified in advance. We wish to address the situation in which little is known about the true shape of the features in a point


set, so we want to avoid assumptions about the parametric form of the features. In this paper, we use open principal curves to model underlying curves in the data. Principal curves provide a data-driven, nonparametric summary of feature shape; they are characterized by the number of degrees of freedom allowed. For example, a principal curve with 15 degrees of freedom could look like a line, an arc, a sinusoid, a spiral, or some combination of these. In order to give a parametric curve as much flexibility as a principal curve, many parameters would be needed, which would greatly increase the already large computational requirements of the Hough transform.

Preconditioning on the parameter domain is used in (Hansen and Toft, 1996) to improve the speed of the Radon transform, a generalization of the Hough transform. Although this approach can greatly reduce the size of the parameter space, it has two drawbacks. First, several sensitivity parameters must be specified by the user. This means that the use of this method needs to be interactive, with the user trying various parameter values until a good result is obtained. Second, only parametric curves are allowed. Our principal curve clustering method uses BIC to automatically choose the number of features and the amount of smoothing; the curve shape is estimated adaptively and nonparametrically, so there is no need to search over a large parameter space.

A curve detection method which makes no parametric assumptions about the curve shape is given in (Steger, 1998). Candidate points for the curves underlying the features are detected using local differential operators. These points are then linked into curves; a user-specified threshold is used to determine which candidates are used. The curves are subsequently modeled as two edges with an interior region. This allows determination of curve width, and a bias reduction step is used to improve the result. Because the curves detected by this method are nonparametric, they are much more general than curves which can be fitted using a Hough


transform type of approach. The examples in (Steger, 1998) show that the curves fit the data quite well, but it is still up to the user to interactively choose the sensitivity parameter. The method does not include a formal way to choose the number of features.

Spatial point process data arise in visual defect metrology, and the Hough transform has been used previously to detect linear features in these data (Cunningham and MacKinnon, 1998). In this application of the Hough transform, several parameters and thresholds must be specified in advance by the user. Although users experienced with this technique may well be able to find reasonable values for all of the needed parameters, it seems more satisfactory to estimate parameters from the data. Principal curve clustering allows automatic detection of both linear and nonlinear features without the need for ad hoc parameter specification.

The Kohonen self-organizing feature map (SOFM) is another data-driven approach to feature detection (Kohonen, 1982; Ambroise and Govaert, 1996; Murtagh, 1995). Neither principal curve clustering nor the SOFM approach requires prior specification of feature shape, and both algorithms are hierarchical in nature. Like principal curves, the SOFM can be combined with further clustering methods to produce a more powerful clustering algorithm (Murtagh, 1995). However, unlike our method, the SOFM approach does not provide an explicit estimate of feature shape.

Tibshirani (1992) proposes an alternate definition of principal curves based on mixture models and a new algorithm for fitting principal curves based on the EM algorithm. It is argued that this definition avoids the bias problems inherent in the approach of Hastie and Stuetzle (1989), and an example is presented showing that these principal curves can be different in practice from curves of the Hastie and Stuetzle type. It would be of interest to see what effect this alternate definition would have on our results.


The examples we have presented in this paper consist of two-dimensional point patterns, but our methods can be generalized to higher dimensions. Principal curves could be used in higher dimensions, or our approach could be modified to use a different model as the basis for features, such as principal surfaces (Hastie and Stuetzle, 1989) or adaptive principal surfaces (LeBlanc and Tibshirani, 1994).

Many variations of the EM algorithm are available. Green (1990) introduced the One Step Late (OSL) algorithm, a version of EM for use with penalized likelihoods. Silverman, Jones, Wilson and Nychka (1990) added a smoothing step to the EM algorithm, and similarities between this approach and maximum penalized likelihood were discussed by Nychka (1990). Lu (1995) replaced the M-step with a smoothing step. Theoretical properties of smoothing in the EM algorithm were discussed by Latham and Anderssen (1994) and Latham (1995). Our use of principal curves in CEM-PCC is similar to these ideas of smoothing. Although the curves themselves are not smoothed across CEM iterations, the curves can be viewed as smoothing the pointwise likelihoods, and thus indirectly smoothing the parameter estimates.

In addition to approximate Bayes factors, we explored cross-validation as a method for choosing the number of clusters and amount of smoothing. This involves iteratively leaving out one data point, recomputing the entire clustering, and then calculating the likelihood for the left-out point. We found that the results are quite similar to the Bayes factor results, but that the cross-validation approach involves much more computation.

Our model assumes that the principal curve underlying each feature has the same smoothness. This may not always be realistic, and it would be possible and worthwhile to relax this assumption. Furthermore, as seems to be the case in the New Madrid data set, one or more of the features might be circular (or hyperspherical in higher dimensions), concentrated about a point rather than a curve. Extending the method to accommodate this possibility explicitly would also be worthwhile.

Splus source files for HPCC and CEMPCC are available at http://www.stat.washington.edu/stanford/princlust.html. Statlib has Splus functions available for fitting principal curves, Kth nearest neighbor denoising, and model-based clustering (http://lib.stat.cmu.edu/S/principal.curve, http://lib.stat.cmu.edu/S/NNclean, and http://lib.stat.cmu.edu/S/emclust).


Chapter 3

MARGINAL SEGMENTATION

In this chapter I present a discussion of image segmentation based on the marginal (without spatial information) pixel values. I begin by presenting some background on the Bayesian Information Criterion (BIC), and then I introduce the mixture model. I discuss the use of BIC for model selection in this context, and I examine two schemes for classification of pixels after model selection has been done.

3.1 BIC

The Bayesian Information Criterion (BIC) was first given by Schwarz (1978) in the context of model selection with IID observations from a certain class of densities, and Haughton (1988) extended the class of densities to curved exponential families. The difference in the BIC value between two models is an approximation of twice the log of the Bayes factor comparing the two models; the BIC has the advantage of being relatively easy to compute. The basic idea of the BIC is to use Laplace's method to approximate the integrated likelihood in the Bayes factor, and then ignore terms which do not increase quickly with N. In this section I present a derivation of the BIC, which largely follows the discussions found in Kass and Raftery (1995) and Raftery (1995). The following sections show how the BIC can be adjusted for use with the AR(1) model and for the raster scan autoregression (RSA) model.

A common approach to comparing two models or hypotheses, say M_2 vs. M_1, is to use a likelihood ratio test (LRT). The test statistic is

\[ 2\log\!\left(\frac{p_{MLE}(X \mid M_2)}{p_{MLE}(X \mid M_1)}\right) \sim \chi^2_{D_2 - D_1} \tag{3.1} \]

where p_{MLE}(X|M) is the maximized likelihood of the data X given the model M, and (D_2 − D_1) is the difference in degrees of freedom between the two models. The LRT requires that M_1 must be nested in M_2; the Bayes factor approach does not have this restriction.

The Bayes factor B_{21} for comparing the same two models has a form similar to the LRT.

\[ B_{21} = \frac{p(X \mid M_2)}{p(X \mid M_1)} \tag{3.2} \]

Here, p(X|M) denotes the integrated likelihood rather than the maximized likelihood. If θ is the set of parameters in p(X|M) (so θ may be a vector), then the integrated likelihood is

\[ p(X \mid M_i) = \int p(X \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta \tag{3.3} \]

where p(θ|M_i) is the prior density of θ given M_i.

Denote log(p(X|θ, M_i) p(θ|M_i)) by g(θ|M_i), and rewrite the integrated likelihood as

\[ p(X \mid M_i) = \int \exp(g(\theta \mid M_i))\, d\theta \tag{3.4} \]

Suppose θ̃ is the posterior mode of θ, i.e. the value which is the mode of the posterior distribution of θ. The maximum likelihood estimate θ̂ and the posterior mode θ̃ converge to the same value as the sample size increases to infinity. Replace the inner term in equation 3.4 by the first few terms of a Taylor series expansion about θ̃; this is a good approximation as long as N is large enough that g(θ) is highly peaked.

\[ p(X \mid M_i) \approx \int \exp\!\Big(g(\tilde\theta \mid M_i) + (\theta - \tilde\theta)^T g'(\tilde\theta \mid M_i) + \tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.5} \]

Since θ̃ is a maximum, g′(θ̃) = 0. We now have

\[ p(X \mid M_i) \approx \int \exp\!\Big(g(\tilde\theta \mid M_i) + \tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.6} \]

\[ = \exp(g(\tilde\theta \mid M_i)) \int \exp\!\Big(\tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.7} \]

The integral has the form of a multivariate normal density with covariance equal to the inverse of −g″(θ̃|M_i).

\[ p(X \mid M_i) \approx \exp(g(\tilde\theta \mid M_i))\,(2\pi)^{D_i/2}\,|-g''(\tilde\theta)|^{-1/2} \tag{3.8} \]

Recall that g(θ|M_i) = log(p(X|θ, M_i) p(θ|M_i)).

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) + \log(p(\tilde\theta \mid M_i)) + (D_i/2)\log(2\pi) - \tfrac{1}{2}\log(|-g''(\tilde\theta)|) \tag{3.9} \]

If N is large, then −g″(θ̃|M_i) ≈ E[−g″(θ̃|M_i)]. This is the Fisher information for the data Y, which will be equal to N times the Fisher information for one observation. Let I denote the Fisher information matrix for a single observation; this will be a D_i by D_i matrix. Now we have |−g″(θ̃)| ≈ N^{D_i} |I|.

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) + \log(p(\tilde\theta \mid M_i)) + (D_i/2)\log(2\pi) - (D_i/2)\log(N) - \tfrac{1}{2}\log(|I|) \tag{3.10} \]


The error of the approximation in equation 3.10 is O(N^{-1/2}).

At this point, we could drop terms which do not increase with N from equation 3.10 and obtain equation 3.11. The error of our approximation would then be O(1). However, if we consider the prior on θ more closely, we find that the approximation is actually better for a certain prior.

Suppose p(θ|M_i) is multivariate normal with mean θ̃ and covariance matrix equal to I^{-1}. On average, this would give the prior about the same impact on log(p(X|M_i)) as a single observation. We can compute p(θ̃|M_i) by finding the density of this multivariate normal evaluated at its mean; this is (2π)^{-D_i/2} |I^{-1}|^{-1/2}. Recall that for any nonsingular matrix A, |A^{-1}| = |A|^{-1}. Substituting into equation 3.10, we see that several terms cancel.

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) - \frac{D_i}{2}\log(N) \tag{3.11} \]

In going from equation 3.10 to equation 3.11, we have simply chosen a certain prior which conveniently cancels a few other terms. Thus, the error of the approximation is still O(N^{-1/2}).

We now multiply equation 3.11 by 2 and substitute the MLE of θ for the posterior mode. Denote the maximized loglikelihood of the data given model i by L(X|M_i). We arrive at the usual formulation of BIC in equation 3.12.

\[ BIC(M_i) = 2L(X \mid M_i) - D_i\log(N) \tag{3.12} \]

Returning to the Bayes factor idea of equation 3.2, we now have a relatively easy way to compute an approximate Bayes factor.

\[ 2\log(B_{21}) = 2\log(p(X \mid M_2)) - 2\log(p(X \mid M_1)) \tag{3.13} \]

\[ \approx BIC(M_2) - BIC(M_1) \tag{3.14} \]
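As a concrete illustration of equations 3.12 to 3.14, the following sketch computes the BIC for two fitted models from their maximized loglikelihoods and parameter counts, and reads the difference as an approximation to twice the log Bayes factor. The loglikelihood values, parameter counts, and sample size below are placeholders chosen for illustration, not results from any data set analyzed in this dissertation.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """BIC as in equation 3.12: 2*L(X|M) - D*log(N)."""
    return 2.0 * loglik - n_params * np.log(n_obs)

# Placeholder maximized loglikelihoods and parameter counts for two models.
loglik_m1, d1 = -1540.2, 2   # e.g. a single Gaussian: mean and variance
loglik_m2, d2 = -1502.7, 5   # e.g. a two-component mixture: 2 means, 2 variances, 1 proportion
n = 1000                     # number of observations

bic1, bic2 = bic(loglik_m1, d1, n), bic(loglik_m2, d2, n)

# Equation 3.14: the BIC difference approximates 2*log(B21).
print(bic1, bic2, bic2 - bic1)
```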


3.2 Mixture Models

The mixture density provides a natural way of modeling the observed “mixture” of features in an image. We can model the marginal (without spatial information) distribution of greyscale values in an image with a mixture density, shown in equation 3.15. We use the Gaussian density for Φ when the data are given in an image format, but the Poisson density is more appropriate when raw data are available for gamma camera images (e.g. PET). In the Poisson case, we allow the Poisson parameter λ to take on the value zero, which gives a probability mass at zero; this allows easy modeling of zero-intensity background regions, a common and usually artifactual feature of medical images.

\[ f(Y_i \mid K, \theta) = \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \tag{3.15} \]

Here, Y_i is the greyscale value of pixel i, P_j is the mixture probability of component j, Φ is the single component density (e.g. Gaussian), θ_j is the vector of parameters for the jth density, and K is the number of components.

Suppose we have a given value of K. Let Z_i denote the true component which generated the observation for pixel Y_i, and suppose we have an initial classification of each pixel into one of the K components (i.e. an initial estimate Z̃_i). Consider the Z_i to be missing data; now we have cast the problem of estimating the θ_j and P_j parameters into a missing data problem which can be approached by the EM algorithm (Dempster et al., 1977). Details of the estimation are presented in chapter 5.2.

Once we have estimated θ_j and P_j, we can use equation 3.15 to compute a likelihood for each pixel. Under the assumption that all pixels are independent, the likelihood of the whole image is the product of the pixel likelihoods, which makes computation of the loglikelihood of the image relatively easy. The loglikelihood of the image with this independence assumption is given in equation 3.16.

\[ L(Y \mid K, \theta) = \sum_{i=1}^{N} \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \Big) \tag{3.16} \]

However, pixels are typically not independent. Chapter 4 explores the impact of autoregressive dependence on the BIC in both one-dimensional and two-dimensional data, and presents a combined mixture model and raster-scan autoregression (RSA) model which allows for autoregressive dependence with the mixture model. In chapter 5, I use a Markov random field model for the spatial dependence.

3.3 BIC with Mixture Models

We wish to use BIC for model selection with mixture models which have different numbers of components. BIC can be computed by using the likelihood from equation 3.16 in the BIC formula from equation 3.12. The resulting BIC formula is shown in equation 3.17.

\[ BIC(K) = 2L(Y \mid K) - D_K\log(N) \tag{3.17} \]

Recall from section 3.1 that the BIC approximation is based on the use of Laplace's method to approximate the integrated likelihood in a Bayes factor. A regularity condition for Laplace's method is that the parameters must be in the interior of the parameter space; as they approach the boundary of parameter space, the approximation breaks down. This presents a problem for the use of BIC in the mixture model context. Suppose we fit a model with K segments when the true number of segments K_0 is smaller than K. In this case, the true mixture proportion (P_j in equation 3.15) for each extra segment would be zero, which is at the boundary of parameter space. Also, there would be no information available for estimating the model parameters such as µ and σ for the extra segments; in fact, µ and σ for the extra segments would have no meaning.

Regardless of this, the BIC has been used previously for model selection in the mixture context with some success (Banfield and Raftery, 1993; Banfield and Raftery, 1992; Fraley and Raftery, 1998). One possible reason for the apparent success of BIC in these situations is the complexity of the data. Real data is often not truly a Gaussian mixture, so additional components in the mixture model beyond those which represent the main features of the data may be justified on a smaller scale. In this case, the BIC is not choosing a “true” model out of a set of possible models. Instead, the BIC is choosing a parsimonious model which captures the main features of the data, even though there may be smaller or more subtle features present. The goal then ceases to be one of selecting the correct model and becomes one of choosing a model which parsimoniously captures the important features in the data. The definition of an important feature is necessarily vague; this can vary between different applications. As a general method, the BIC seems to perform reasonably well.

3.4 Classification

Once the parameters of the mixture model have been estimated, we want to classify each pixel in the image into one of the K classes. This seemingly simple step requires careful consideration because of two issues. First, the classification which maximizes the mixture likelihood corresponds to a particular utility function, and this may not be the one we really want to use. Second, we have so far regarded components of the mixture as being synonymous with distinct features (or background) in the image, which may not be the case.


3.4.1 Mixture versus Componentwise Classification

The appropriate method for making our final classification depends on how we will evaluate the classification when it is done. For instance, we might count the number of pixels in a feature of interest which are correctly classified; clearly, a classification which places all pixels into that feature regardless of all the estimated parameters would be optimal, though rather unuseful. A more reasonable evaluation criterion would be to simply count the number of pixels correctly classified in the whole image; the optimal classification method for this case would lead to a very different result than the previous example. The point here is that we can think of many different evaluation methods; these may be driven by concerns of a particular application, or they may just be different common sense approaches. In this section I consider two sensible evaluation criteria, mixture and componentwise, and give the classification methods which are appropriate for each.

Mixture Classification

First let us consider the case mentioned above in which we want to maximize the number of pixels which are correctly classified. This is the mixture classification case.

Theorem 3.1: Optimality of Mixture Classification

To maximize the number of correctly classified pixels, the optimal classification rule is to assign each pixel to the segment which has the largest posterior probability, as shown in equation 3.18.

\[ C_i = \arg\max_m P_m\,\Phi(Y_i \mid \theta_m) \tag{3.18} \]

In equation 3.18, Y is the observed image and C is the estimated classification. The ith pixel of X or C is denoted by subscript. The parameter vector θ, the mixture proportion P, and the density Φ are defined in equation 3.15 of section 3.2.

Proof of Theorem 3.1

We define a utility function g(X, C) where X is the true image (i.e. the unobservable true classification). Equation 3.19 shows the utility function for mixture classification.

\[ g(C \mid X) = (1/N)\sum_i I(X_i, C_i) \tag{3.19} \]

Here, I(A, B) = 1 if A = B and 0 otherwise, and X_i is the true (unobserved) value underlying the observation Y_i. We wish to find the value of C to maximize the utility.

\[ \arg\max_C g(C \mid X) = \arg\max_C (1/N)\sum_i I(X_i, C_i) \tag{3.20} \]

Since we are not able to observe X, we must restate this in probabilistic terms. Let P(X_i = C_i | Y_i) denote the probability that X_i = C_i given that we observe Y_i.

\[ \arg\max_C g(C \mid Y) = \arg\max_C (1/N)\sum_i P(X_i = C_i \mid Y_i) \tag{3.21} \]

Inside the sum, each term depends only on a single pixel of C, so maximization will be accomplished by choosing as the value of C_i that value which maximizes the probability that X_i = C_i. Note that X_i and C_i can take on only the class values of 1...K, so the maximization becomes

\[ C_i = \arg\max_m P(X_i = m \mid Y_i) \tag{3.22} \]

Applying Bayes theorem,

\[ P(X_i = m \mid Y_i) \propto P(Y_i \mid X_i = m)\,P(X_i = m) \tag{3.23} \]

As shown in equation 3.15, we have modeled the distribution of Y_i, and we can estimate the prior P(X_i = m) by using the mixture proportion P_m. Substituting into equation 3.22,

\[ C_i = \arg\max_m P_m\,\Phi(Y_i \mid \theta_m) \tag{3.24} \]

In other words, pixel i is assigned to the component for which it has the highest likelihood, and we include the mixture proportions (P_m) in the likelihood. This seems intuitively reasonable since we are using the mixture density given by equation 3.15.

End of Proof.

Componentwise Classification

Although maximizing the number of correctly classified pixels is both reasonable and consistent with our presumed mixture model, a componentwise approach to this problem may be more useful in some circumstances. Consider a case in which we have a large background and only a few small features of interest. The mixture proportions will be dominated by the component describing the background, giving inordinate weight to classification of pixels as background. As the proportion of pixels in the background increases, classification by the mixture likelihood may lead to classifying all pixels as background even when Φ(Y_i|θ_j) for the feature is several orders of magnitude larger than Φ(Y_i|θ_j) for the background. In this case, we would want to use componentwise classification so that the feature components would receive a more reasonable share of influence on the classification.


In componentwise classification, our aim is to maximize the sum of the proportions of correctly classified pixels for each component. This gives equal weight to each component. I now derive the classification rule which corresponds to the componentwise approach.

Theorem 3.2: Optimality of Componentwise Classification

To maximize the sum of the proportions of correctly classified pixels for each component, the optimal classification rule is to assign each pixel to the segment which has the largest component likelihood, as shown in equation 3.25.

\[ C_i = \arg\max_m \Phi(Y_i \mid \theta_m) \tag{3.25} \]

Proof of Theorem 3.2

To find the appropriate classification rule for the componentwise approach, we begin by stating its utility function.

\[ g(C \mid X) = \sum_{j=1}^{K} \frac{\sum_i I(X_i, j)\,I(X_i, C_i)}{\sum_i I(X_i, j)} \tag{3.26} \]

As before, I(A, B) = 1 if A = B and 0 otherwise, and X_i is the true (unobserved) value underlying the observation Y_i. Note that the term in the denominator of equation 3.26 is constant with respect to C. The form of equation 3.26 is meant to be conceptually clear, but an equivalent and more computationally useful form is obtained by moving the denominator into the sum in the numerator and interchanging the order of summation.

\[ g(C \mid X) = \sum_i \sum_{j=1}^{K} \left( \frac{1}{\sum_q I(X_q, j)} \right) I(X_i, j)\,I(X_i, C_i) \tag{3.27} \]


In finding argmax_C g(C|X), we can now consider each pixel separately since each term in the sum over i only involves pixel i. Let N_j denote the number of pixels in class j. The maximization takes the following form.

\[ C_i = \arg\max_m \sum_{j=1}^{K} \left( \frac{1}{N_j} \right) I(X_i, j)\,I(X_i, m) \tag{3.28} \]

Because of the two indicator functions, terms in the sum will be nonzero only when j = m. Note that I(X_i, m)^2 = I(X_i, m).

\[ C_i = \arg\max_m \left( \frac{1}{N_m} \right) I(X_i, m) \tag{3.29} \]

We can now return to our modeling assumptions and multiply by N/N.

\[ C_i = \arg\max_m \left( \frac{1}{N} \right)\left( \frac{N}{N_m} \right) P(X_i = m \mid Y_i) \tag{3.30} \]

Note that N/N_m = 1/P(X_i = m), and apply Bayes' theorem as in equation 3.23.

\[ C_i = \arg\max_m \left( \frac{1}{N} \right)\left( \frac{1}{P(X_i = m)} \right) P(Y_i \mid X_i = m)\,P(X_i = m) \tag{3.31} \]

\[ C_i = \arg\max_m P(Y_i \mid X_i = m) \tag{3.32} \]

As shown in equation 3.15, we have modeled the distribution of Y_i, so we can substitute into equation 3.32 to obtain equation 3.33.

\[ C_i = \arg\max_m \Phi(Y_i \mid \theta_m) \tag{3.33} \]

End of Proof.

Application of equation 3.33 means that we should classify each pixel into its most likely component, without regard for the mixture proportion of each component. This classification procedure differs from the one suggested by equation 3.24, unless the mixture proportions for all components happen to be equal.

In this section, I have examined two different classification goals by expressing them as different utility functions and deriving the appropriate classification strategy. Choice of the classification goals is something which must be considered outside the context of an algorithm, since the appropriate algorithm must be chosen to fit the task. Of the two methods I have presented here, the second (componentwise classification) is more appropriate when there is interest in smaller features in the data, since these might be washed out by the dominance of a few large features if the mixture classification method is used. The componentwise classification method is used in my algorithm in chapter 5.2.

3.4.2 Correspondence of Components with Segments

We have so far been using the ideas of segments (in the image) and components (in the mixture model) almost interchangeably. However, treating the mixture model this way sometimes yields results which are difficult to interpret. For example, two components might have the same mean and different variances. If we are considering the mean to define a feature, then we should consider the two components to represent one segment. Similarly, when one component has the highest likelihood for two unconnected (in greyscale value) groups of pixels, it is unclear whether we should consider this to be one segment or two.

It is important to make a distinction between segments and components. For images in general, we need to recognize that a segment can be modeled by one or more components and a component can represent one or more segments. Although it might be difficult to account for this in an automatic method, our current approach of equating each component with exactly one segment could be seen as a specific case in this more general view. Particular applications might require fine tuning of this. For instance, if we know from the application that each feature should have a very consistent and unique grey value, then we should split a segment which contains two or more disjoint sets of grey values. This will not change the model fitting process; only the final pixel classification will be changed.
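To make the contrast between the two classification rules of section 3.4.1 concrete, the sketch below assumes a fitted two-component Gaussian mixture (the proportions, means, and standard deviations are invented for illustration) and applies the mixture rule of equation 3.24 and the componentwise rule of equation 3.33 to a few greyscale values. Pixels lying between a dominant background component and a small feature component are assigned differently by the two rules.

```python
import numpy as np
from scipy.stats import norm

# Invented mixture parameters: a dominant background and a small bright feature.
P     = np.array([0.95, 0.05])   # mixture proportions P_j
mu    = np.array([10.0, 14.0])   # component means
sigma = np.array([2.0, 1.0])     # component standard deviations

y = np.array([9.5, 12.0, 13.5, 15.0])                     # a few pixel greyscale values
dens = norm.pdf(y[:, None], mu[None, :], sigma[None, :])  # Phi(Y_i | theta_m) for each component

mixture_rule       = np.argmax(P[None, :] * dens, axis=1)  # equation 3.24
componentwise_rule = np.argmax(dens, axis=1)               # equation 3.33

print(mixture_rule)        # the large background proportion pulls borderline pixels to class 0
print(componentwise_rule)  # the feature component claims those pixels instead
```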


Chapter 4

ADJUSTING FOR AUTOREGRESSIVE DEPENDENCE

In this chapter I present a theoretical approach to model selection for data with autoregressive dependence. I show how the BIC should be modified to deal with certain kinds of dependence in the data. First, I address the simplest case, which is the AR(1) model with one-dimensional data. I then present the raster-scan autoregression (RSA) model, which is a special case of an AR(P) model. I show that for data with autoregressive dependence, a simple modification to the usual BIC formula is necessary, and the appropriate “boundary” data points must be excluded from analysis.

4.1 Adjusting BIC for the AR(1) Model

In the following sections, I use an AR(1) model to examine the effect of spatial (or temporal) dependence on the BIC (equation 3.12). I approach this in two parts, first considering the loglikelihood term and then the penalty term.

4.1.1 Loglikelihood Adjustment

Consider data from the AR(1) model

\[ Y_i = C + \beta Y_{i-1} + \epsilon_i \tag{4.1} \]

where |β| < 1 and the ε_i are IID N(0, σ²_ε). Note that it is important to distinguish between the variance of ε_i, which is σ²_ε, and the variance of Y_i, which is σ²_Y.

Independence Case

When β = 0, the Y_i are independent. In this case the following statements are true:

\[ E[Y_i] = C \tag{4.2} \]

\[ VAR[Y_i] = \sigma^2_Y = \sigma^2_\epsilon \tag{4.3} \]

\[ E[\bar Y] = E[Y_i] = C \tag{4.4} \]

\[ VAR[\bar Y] = \frac{\sigma^2_Y}{N} = \frac{\sigma^2_\epsilon}{N} \tag{4.5} \]

Furthermore, we can write down the loglikelihood by using the usual product of independent Gaussian densities.

\[ L(Y \mid M) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2_Y) - \frac{1}{2\sigma^2_Y}\sum_{i=1}^{N}(Y_i - C)^2 \tag{4.6} \]

Because we will later want to condition on the value of Y_1, we do this here also:

\[ L(Y_{-1} \mid M, Y_1) = -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_Y) - \frac{1}{2\sigma^2_Y}\sum_{i=2}^{N}(Y_i - C)^2 \tag{4.7} \]

As N increases, the contribution of Y_1 to the loglikelihood becomes proportionately smaller.

Equations 4.2 to 4.5 are actually special cases of equations 4.8 to 4.11, which are shown in the next section. When β = 0 is substituted into equations 4.8 to 4.11, they reduce to equations 4.2 to 4.5.


Dependence Case

When |β| < 1, then the following statements hold (Hamilton, 1994).

\[ E[Y_i] = C/(1-\beta) \tag{4.8} \]

\[ VAR[Y_i] = \sigma^2_Y = \sigma^2_\epsilon/(1-\beta^2) \tag{4.9} \]

\[ E[\bar Y] = E[Y_i] = C/(1-\beta) \tag{4.10} \]

\[ VAR(\bar Y) = \left(\frac{1+\beta}{1-\beta}\right)\frac{\sigma^2_\epsilon}{N} \tag{4.11} \]

We are able to observe the Y_{i-1} value preceding each Y_i for every observation except the first. Not surprisingly, the contribution of the first observation to the overall loglikelihood is different than that of the other observations, and this makes the loglikelihood equation more difficult to analyze. Since the proportion of the contribution of Y_1 to the loglikelihood becomes small as N increases, we choose to condition on its value, which has the effect of simplifying the formula for the loglikelihood. Furthermore, estimates based on this conditional loglikelihood are consistent and asymptotically equal to estimates based on the exact loglikelihood. The exact loglikelihood is

\[ L(Y \mid M) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2_\epsilon/(1-\beta^2)) - \frac{(Y_1 - C/(1-\beta))^2}{2\sigma^2_\epsilon/(1-\beta^2)} - \frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - C - \beta Y_{i-1})^2 \tag{4.12} \]

The first three terms in equation 4.12 can be thought of as the contribution of Y_1, with the remaining terms resulting from Y_2...Y_N. Conditioning on Y_1 we obtain

\[ L(Y_{-1} \mid M, Y_1) = -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - C - \beta Y_{i-1})^2 \tag{4.13} \]

As we would expect, equation 4.13 is equivalent to equation 4.7 when β = 0.

Effect on Computation

We wish to investigate the differences between equation 4.13 and equation 4.7. Since we are interested in the practical effects of these differences, I begin by restating these equations in their empirical form, that is, the equations which would be used to calculate the loglikelihoods from the data. I use the standard convention of denoting a maximum likelihood estimate with a hat.

If we were to assume independence (i.e. β = 0), then the loglikelihood would be computed as

\[ L(Y_{-1} \mid M, Y_1) \approx -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\hat\sigma^2_Y) - \frac{1}{2\hat\sigma^2_Y}\sum_{i=2}^{N}(Y_i - \bar Y)^2 \tag{4.14} \]

With the more general AR(1) model, the loglikelihood would be computed as

\[ L(Y_{-1} \mid M, Y_1) \approx -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\hat\sigma^2_\epsilon) - \frac{1}{2\hat\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - \hat C - \hat\beta Y_{i-1})^2 \tag{4.15} \]

The difference between equation 4.14 and equation 4.15 is the use in the second term of σ_Y as opposed to σ_ε. We can correct this by using equation 4.16.

\[ \hat\sigma^2_Y = \hat\sigma^2_\epsilon/(1 - R^2) \tag{4.16} \]

The R² value in equation 4.16 refers to the regression model suggested by equation 4.1. In general, the R² value gives the relationship between the variation in the response values and the variation in the residuals, so equation 4.16 can be


For practical purposes, equation 4.21 will allow fast estimation of the dependence loglikelihood as long as we can compute the value of R². For the AR(1) case, such an estimate can be obtained through an ordinary least squares regression.

For large datasets, an adequate estimate of R² might be obtained by subsampling the data. A random set of points would be sampled (forming the dependent variable), along with the points preceding each sampled point (forming the predictor variable). From these vectors, least squares regression can be performed to estimate the β coefficient and yield a value for R². This assumes that β is constant over the whole dataset.

4.1.2 Penalty Adjustment

In this section I derive the adjustment to N which follows from use of the BIC with the AR(1) model. In section 3.1, we saw how the loglikelihood and penalty term in equation 3.11 arise from equation 3.9. In the previous section I derived the adjustment to the loglikelihood term which we need with the AR(1) model, so our interest now lies in how the term log(|−g″(θ̃|M_i)|) should be computed for the AR(1) case (the other terms in equation 3.9 are constant with respect to N).

We again use the approximation that θ̃ ≈ θ̂ when N is large. Dropping the subscript on M since it is not of interest here, we wish to compute

\[ \log(|-g''(\hat\theta \mid M)|) \tag{4.22} \]

where g(θ) = log(p(Y|θ, M) p(θ|M)). Simplifying g, we have

\[ g(\theta) = \log(p(Y \mid \theta, M)) + \log(p(\theta \mid M)) \tag{4.23} \]

The second term and its derivatives are constant with respect to N; since we will later drop terms which do not increase with N we can exclude it from further analysis. We now consider g″(θ), the matrix of second partial derivatives of g(θ) with respect to θ. Note that θ = (µ, σ), so g″(θ) will be a 2x2 matrix. The (i, j)th element of this matrix is given by equation 4.24.

\[ g''(\theta)_{ij} \approx \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\theta_i\,\partial\theta_j} \tag{4.24} \]

We assume that N is large, so that −g″(θ|M) ≈ E[−g″(θ|M)]. This expected value is also known as the Fisher Information I(Y, θ).

Independence Case

To compute I(Y, θ) for a set Y = (Y_1...Y_N) of independent Normal(µ, σ²) observations, we need the second derivatives of the Gaussian loglikelihood.

\[ g(\theta) \approx \log(p(Y \mid \theta, M)) \tag{4.25} \]

\[ = \sum_{i=1}^{N}\left( -\frac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2\sigma^2}(Y_i - \mu)^2 \right) \tag{4.26} \]

The derivatives are shown below.

\[ \frac{\partial \log(p(Y \mid \theta, M))}{\partial\mu} = \sum_{i=1}^{N} -\frac{1}{\sigma^2}(-Y_i + \mu) \tag{4.27} \]

\[ \frac{\partial \log(p(Y \mid \theta, M))}{\partial\sigma} = \sum_{i=1}^{N} -\frac{1}{\sigma} + \frac{1}{\sigma^3}(Y_i - \mu)^2 \tag{4.28} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\mu\,\partial\sigma} = \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) \tag{4.29} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\sigma^2} = \sum_{i=1}^{N} \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(Y_i - \mu)^2 \tag{4.30} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\mu^2} = \sum_{i=1}^{N} -\frac{1}{\sigma^2} \tag{4.31} \]

This leads to

\[ g''(\theta) \approx \begin{pmatrix} \sum_{i=1}^{N} -\frac{1}{\sigma^2} & \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) \\[4pt] \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) & \sum_{i=1}^{N} \left( \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(Y_i - \mu)^2 \right) \end{pmatrix} \tag{4.32} \]

The Fisher information is E[−g″(θ|M)], which we can now compute.

\[ I(Y, \theta) \approx \begin{pmatrix} \dfrac{N}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2N}{\sigma^2} \end{pmatrix} \tag{4.33} \]

Denote the dimension of the Fisher Information matrix by D (in this case D = 2). We can factor out N from the determinant of I(Y, θ).

\[ |I(Y, \theta)| = N^D |I(Y_i, \theta)| \tag{4.34} \]

We now return to equation 4.22 and drop terms which do not increase with N to obtain the usual BIC result.

\[ \log(|-g''(\hat\theta \mid M)|) \approx \log(N^D |I(Y_i, \theta)|) \tag{4.35} \]

\[ = D\log(N) + \log(|I(Y_i, \theta)|) \tag{4.36} \]

\[ \approx D\log(N) \tag{4.37} \]

Dependence Case

We now assume that the data Y are generated by the AR(1) model of equation 4.1. In this case, θ = (C, β, σ_ε) so the Fisher Information matrix will be 3x3 rather than 2x2 as it was in the independence case. For the more general AR(P) case, θ would contain C, all of the autoregressive coefficients, and σ_ε.

We begin by stating the loglikelihood for Y under the AR(1) model (conditioning on Y_1) and finding its first and second derivatives. Let N_0 denote the total number of observations minus the number on which we are conditioning; for the AR(1) case, N_0 = N − 1.


\[ g(\theta) \approx \log(p(Y \mid \theta, M)) \tag{4.38} \]

\[ = \sum_{i=2}^{N}\left( -\frac{1}{2}\log(2\pi) - \log(\sigma_\epsilon) - \frac{1}{2\sigma^2_\epsilon}(Y_i - C - \beta Y_{i-1})^2 \right) \tag{4.39} \]

The derivatives are shown below.

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C^2} = \sum_{i=2}^{N} \frac{-1}{\sigma^2_\epsilon} \tag{4.40} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C\,\partial\beta} = \sum_{i=2}^{N} \frac{-Y_{i-1}}{\sigma^2_\epsilon} \tag{4.41} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C\,\partial\sigma_\epsilon} = \sum_{i=2}^{N} \frac{-2}{\sigma^3_\epsilon}(Y_i - C - \beta Y_{i-1}) \tag{4.42} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\beta^2} = \sum_{i=2}^{N} \frac{-Y^2_{i-1}}{\sigma^2_\epsilon} \tag{4.43} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\beta\,\partial\sigma_\epsilon} = \sum_{i=2}^{N} \frac{-2}{\sigma^3_\epsilon}(Y_{i-1})(Y_i - C - \beta Y_{i-1}) \tag{4.44} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\sigma^2_\epsilon} = \sum_{i=2}^{N} \left( \frac{1}{\sigma^2_\epsilon} - \frac{3}{\sigma^4_\epsilon}(Y_i - C - \beta Y_{i-1})^2 \right) \tag{4.45} \]

Following analogously to section 4.1.2, we find

\[ I(Y, \theta) \approx \begin{pmatrix} \dfrac{N_0}{\sigma^2_\epsilon} & \dfrac{N_0\,C}{(1-\beta)\sigma^2_\epsilon} & 0 \\[6pt] \dfrac{N_0\,C}{(1-\beta)\sigma^2_\epsilon} & \dfrac{N_0}{\sigma^2_\epsilon}\left( \dfrac{\sigma^2_\epsilon}{1-\beta^2} + \dfrac{C^2}{(1-\beta)^2} \right) & 0 \\[6pt] 0 & 0 & \dfrac{2N_0}{\sigma^2_\epsilon} \end{pmatrix} \tag{4.46} \]

As in the independence case, we can factor out N_0 and drop the terms which do not increase with N.

\[ \log(|-g''(\hat\theta \mid M)|) \approx \log(N_0^D |I(Y_i, \theta)|) \tag{4.47} \]

\[ = D\log(N_0) + \log(|I(Y_i, \theta)|) \tag{4.48} \]

\[ \approx D\log(N_0) \tag{4.49} \]


4.1.3 Computing BIC with the AR(1) Model

We can now construct the correct form of the BIC for the AR(1) model. Suppose M_i is the model under consideration, and M_i has D_i parameters (degrees of freedom). Let B denote the beginning of the sequence (or border of an image) on which we condition in order to estimate parameters in the dependence model. Y_{-B} is the data not including the set B, and N_0 is the number of data points in Y_{-B}. L(Y_{-B}|M_i) is the maximized loglikelihood of the data under model M_i. To avoid confusion, we must distinguish between the BIC with the dependence correction and the usual form of the BIC, which assumes independence among all observations. I will use BIC_IND and L_IND to denote the BIC and loglikelihood with the independence assumption, and BIC_ADJ and L_ADJ will be the BIC and loglikelihood with adjustment for dependence. The version of BIC which should be used for the AR(1) model is given by equation 4.52.

\[ L_{ADJ}(Y_{-1} \mid M, Y_1) \approx L_{IND}(Y_{-1} \mid M) - \frac{N-1}{2}\log(1 - R^2) \tag{4.50} \]

\[ BIC_{ADJ}(M_i) = 2L_{ADJ}(Y_{-B} \mid M_i) - D_i\log(N_0) \tag{4.51} \]

\[ BIC_{ADJ}(M_i) = 2\left( L_{IND}(Y_{-B} \mid M_i) - \frac{N-1}{2}\log(1 - R^2) \right) - D_i\log(N_0) \tag{4.52} \]

Note that the value of D_i needs to include the count of the autoregressive parameters. For example, in the AR(1) case described above we have D = 3 parameters in the model: the mean (or equivalently the C parameter from the autoregressive model), the variance, and the autoregressive coefficient.
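A minimal sketch of equation 4.52 for a univariate series follows. It assumes that R² is obtained from the ordinary least squares regression of Y_i on Y_{i−1} (conditioning on Y_1), as described in section 4.1.1; the function and variable names are illustrative rather than part of any existing software, and the simulated series in the usage example is a placeholder.

```python
import numpy as np

def ar1_adjusted_bic(y, loglik_ind, n_params):
    """Adjusted BIC of equation 4.52 for an AR(1) dependence model.

    loglik_ind is the loglikelihood computed under independence, conditioning
    on Y_1 (equation 4.14), and n_params counts the mean, variance, and
    autoregressive coefficient (D = 3 for the AR(1) model)."""
    resp, pred = y[1:], y[:-1]                       # regress Y_i on Y_{i-1}, i = 2..N
    X = np.column_stack([np.ones_like(pred), pred])
    beta_hat, *_ = np.linalg.lstsq(X, resp, rcond=None)
    resid = resp - X @ beta_hat
    r2 = 1.0 - resid.var() / resp.var()              # R^2 of the AR(1) regression

    n0 = len(resp)                                   # N - 1 points after conditioning on Y_1
    loglik_adj = loglik_ind - (n0 / 2.0) * np.log(1.0 - r2)   # equation 4.50
    return 2.0 * loglik_adj - n_params * np.log(n0)           # equation 4.52

# Example: simulate an AR(1) series and compute the adjusted BIC of a single-mean model.
rng = np.random.default_rng(0)
y = np.zeros(500)
for i in range(1, 500):
    y[i] = 0.5 + 0.7 * y[i - 1] + rng.normal()

y2 = y[1:]
loglik_ind = np.sum(-0.5 * np.log(2 * np.pi) - np.log(y2.std())
                    - 0.5 * ((y2 - y2.mean()) / y2.std()) ** 2)
print(ar1_adjusted_bic(y, loglik_ind, n_params=3))
```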


4.2 Adjusting BIC for the Raster Scan Autoregression (RSA) Model

I present the raster scan autoregression (RSA) model as a generalization of the AR(1) model which I investigated in the previous section. Instead of modeling spatial dependence with only the preceding (to the left) pixel, the RSA model uses 4 neighbors. These are the 4 pixels which both precede the current pixel in raster scan order and are adjacent to it (raster scan order is the ordering of pixels from left to right, then top to bottom of the image). In other words, these 4 neighbors are the adjacent neighbors to the left, above left, above, and above right in relation to the current pixel.

Suppose we have an image which is H pixels high and W pixels wide. The RSA model is given by equation 4.53.

\[ Y_i = C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1} + \epsilon_i \tag{4.53} \]

Here, the ε_i are IID Normal(0, σ²_ε), C is a constant, and the β values are the autoregressive coefficients which characterize the dependence structure. This is equivalent to an AR(W+1) model, with the additional constraint that only the autoregressive coefficients in equation 4.53 are nonzero. I consider only covariance stationary AR processes, so the autoregressive coefficients must satisfy certain regularity conditions (the interested reader is referred to (Priestley, 1981)).

As in section 4.1, I wish to find the appropriate formulation of BIC for this dependence model, and I will proceed by examining the loglikelihood and penalty terms of the BIC separately.

4.2.1 Loglikelihood Adjustment

Following analogously to section 4.1.1, I need to find the loglikelihood for the 4-neighbor RSA model. Due to the spatial nature of the data, I condition on the upper, left, and right borders of the image; I will refer to these borders as the boundary set B. The loglikelihood for one pixel is given in equation 4.54.

\[ L(Y_i \mid M, B) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\big(Y_i - (C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1})\big)^2 \tag{4.54} \]

Since this is an AR(P) model, we can write the loglikelihood of the whole image (conditional on B) as follows.

\[ L(Y_{-B} \mid M, B) = \sum_{i \in \bar B} L(Y_i \mid M, B) \tag{4.55} \]

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i \in \bar B}\big(Y_i - (C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1})\big)^2 \tag{4.56} \]

N_0 is the total number of data points (N) minus the number of points on which I condition (the number of points in the set B). The summation over B̄ means that we sum over all of the data points except those in B. M denotes a particular model (i.e. a set of coefficient values).

Rewriting equation 4.56 in terms of estimation, we obtain the following.

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_\epsilon) - \frac{1}{2\hat\sigma^2_\epsilon}\sum_{i \in \bar B}\big(Y_i - (\hat C + \hat\beta_{W+1} Y_{i-(W+1)} + \hat\beta_W Y_{i-W} + \hat\beta_{W-1} Y_{i-(W-1)} + \hat\beta_1 Y_{i-1})\big)^2 \tag{4.57} \]

Note that the last of the three terms in equation 4.57 can be greatly simplified because both the denominator and 1/(N_0 − 1) times the numerator are estimators of σ²_ε:


\[ \hat\sigma^2_\epsilon = \frac{1}{N_0 - 1}\sum_{i \in \bar B}\big(Y_i - (\hat C + \hat\beta_{W+1} Y_{i-(W+1)} + \hat\beta_W Y_{i-W} + \hat\beta_{W-1} Y_{i-(W-1)} + \hat\beta_1 Y_{i-1})\big)^2 \tag{4.58} \]

Substituting this into equation 4.57, we obtain

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_\epsilon) - \frac{N_0 - 1}{2} \tag{4.59} \]

Let L_IND denote the loglikelihood as it would be computed with the assumption that all pixels are independent. As shown in equation 4.60, the independence loglikelihood is quite similar to the dependence loglikelihood, except for the second term.

\[ L_{IND}(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_Y) - \frac{N_0 - 1}{2} \tag{4.60} \]

We now need to examine the relation between σ²_ε and σ²_Y so that an adjustment term relating equations 4.59 and 4.60 can be found (analogous to equation 4.21). We saw in equations 4.9 and 4.16 that a simple relation exists between these two quantities for the AR(1) case. As mentioned in section 4.1.1, equation 4.16 is valid for any AR(P) model; in particular, it is valid for the RSA model. This means that with the RSA model we can correct the loglikelihood in the same way that it can be corrected with the AR(1) model. I restate equation 4.16 as equation 4.61.

\[ \hat\sigma^2_Y = \hat\sigma^2_\epsilon/(1 - R^2) \tag{4.61} \]

The R² value in equation 4.61 is from the RSA autoregression model, that is, the least squares regression with Y_{-B} as the response vector and lagged values of Y (corresponding to the four adjacent neighbors preceding each Y_i in raster scan order) as the predictors. Combining equations 4.59, 4.60, and 4.61, we arrive at equation 4.62.


\[ L(Y_{-B} \mid M, B) = L_{IND}(Y_{-B} \mid M) - \frac{N_0}{2}\log(1 - R^2) \tag{4.62} \]

Equation 4.62 shows that the loglikelihood with the RSA model can be computed by adjusting the independence loglikelihood with an additive term based on R².

I now proceed to show that σ²_Y can also be expressed as a function of the model coefficients (the β values) and σ²_ε. This is not needed for the BIC; it is presented only for completeness. The rest of this section can be skipped without loss of continuity.

Begin with the RSA model of equation 4.53. Take expectations and let µ denote the expected value of Y_i.

\[ E[Y_i] = \mu = \frac{C}{1 - \beta_{W+1} - \beta_W - \beta_{W-1} - \beta_1} \tag{4.63} \]

\[ C = \mu(1 - \beta_{W+1} - \beta_W - \beta_{W-1} - \beta_1) \tag{4.64} \]

Subtract µ from both sides of equation 4.53 and substitute in for C.

\[ (Y_i - \mu) = \beta_{W+1}(Y_{i-(W+1)} - \mu) + \beta_W(Y_{i-W} - \mu) + \beta_{W-1}(Y_{i-(W-1)} - \mu) + \beta_1(Y_{i-1} - \mu) + \epsilon_i \tag{4.65} \]

Multiply both sides by (Y_{i-j} − µ) and take expectations.


\[ E[(Y_i - \mu)(Y_{i-j} - \mu)] = E\big[\beta_{W+1}(Y_{i-(W+1)} - \mu)(Y_{i-j} - \mu) + \beta_W(Y_{i-W} - \mu)(Y_{i-j} - \mu) + \beta_{W-1}(Y_{i-(W-1)} - \mu)(Y_{i-j} - \mu) + \beta_1(Y_{i-1} - \mu)(Y_{i-j} - \mu) + \epsilon_i(Y_{i-j} - \mu)\big] \tag{4.66} \]

Let γ_j denote the covariance of Y_i and Y_{i-j}. Note that γ_j = γ_{-j}. Since E[Y_i − µ] = 0, equation 4.66 is actually equal to γ_j. We can now rewrite equation 4.66 in terms of γ.

\[ \gamma_j = \beta_{W+1}\gamma_{W+1-j} + \beta_W\gamma_{W-j} + \beta_{W-1}\gamma_{W-1-j} + \beta_1\gamma_{1-j} + I(j = 0)\,\sigma^2_\epsilon \tag{4.67} \]

where I(j = 0) is an indicator function equal to 1 when j = 0 and 0 otherwise.

We now have an equation for γ_j for arbitrary j in terms of β values, σ²_ε, and other γ_j values. The equations for γ_0 to γ_{W+1} form a system of W + 2 equations with W + 2 unknowns (treating the β and σ²_ε values as constant). This can be solved for γ_0; this will yield an equation of the form given in equation 4.68.

\[ \gamma_0 = h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)\,\sigma^2_\epsilon \tag{4.68} \]

Here, h(β_{W+1}, β_W, β_{W-1}, β_1) is a function which can be found algebraically for any given value of W. For the more general AR(P) model, equation 4.68 will continue to hold, except that h will be a function of β_1 to β_P. Since γ_0 is the variance of Y, we can rewrite equation 4.68 as equation 4.69.

\[ \sigma^2_\epsilon = \frac{\sigma^2_Y}{h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)} \tag{4.69} \]


We now return to equation 4.56 and substitute in equation 4.69.

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log\!\left(\frac{\sigma^2_Y}{h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)}\right) - \frac{N_0 - 1}{2} \tag{4.70} \]

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\sigma^2_Y) + \frac{N_0}{2}\log(h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)) - \frac{N_0 - 1}{2} \tag{4.71} \]

Let L_IND denote the loglikelihood as it would be calculated assuming independence among all the pixels. Equation 4.71 can be rewritten in terms of the independence loglikelihood in order to show that there is only an additive correction term between the independence loglikelihood and the correct (with dependence) loglikelihood of equation 4.56.

\[ L(Y_{-B} \mid M, B) = L_{IND}(Y_{-B} \mid M, B) + \frac{N_0}{2}\log(h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)) \tag{4.72} \]

Equation 4.72 shows the adjustment needed to correct the independence loglikelihood for dependence. Since h(β_{W+1}, β_W, β_{W-1}, β_1) is usually difficult to find, I recommend using equation 4.62 instead.

4.2.2 Penalty Adjustment

Although section 4.1.2 focuses on the AR(1) model, the arguments are equally valid for other dependence models in which the loglikelihood can be expressed as a sum over the data points. In both the AR(P) model and the RSA model, the loglikelihood is given by such a sum (see equation 4.55).

As shown in section 4.1.2, the adjustments to the BIC penalty term when dependence is present consist of using values of the degrees of freedom D and the number of data points N which are consistent with the dependence model.


The dependence model uses additional parameters, so the count of these must be included in D. For the RSA model, 4 autoregressive coefficients are used, as well as a mean and a variance, so the overall number of degrees of freedom is D = 6. We exclude the boundary points from computation, so N is reduced to N_0, the number of data points excluding the boundary. Thus the penalty term for the adjusted BIC is D log(N_0), where D includes the autoregressive coefficients.

4.2.3 Computing BIC with the RSA Model

Let BIC_ADJ denote the BIC as it would be computed assuming dependence among the data. I have shown in section 4.2.1 that the independence loglikelihood L_IND can be adjusted for dependence using equation 4.62. This, combined with the simple penalty adjustment, means that we can write BIC_ADJ in terms of L_IND and a correction term based on R² from the RSA model. This is shown in equation 4.73.

\[ BIC_{ADJ}(M_i) = 2\left( L_{IND}(Y_{-B} \mid M_i) - \frac{N_0}{2}\log(1 - R^2) \right) - D_i\log(N_0) \tag{4.73} \]

In equation 4.73, Y_{-B} indicates the data Y excluding the boundary B. N_0 is the number of data points in Y_{-B}, and D_i is the number of parameters in model M_i, including autoregressive coefficients. The R² value is from the RSA autoregression, as described in section 4.2.1.

4.3 Mixture RSA Models

The mixture model and the RSA model can be combined by using the mixture model to estimate the mean structure and the RSA model to estimate the dependence structure. I begin by estimating the parameters of the mixture model and then computing a mean-corrected version of the image. This mean-corrected image is used as data for the RSA model.

Suppose we have an estimate of the classification Z̃ and parameter estimates for each segment θ. Recall from above that Z_i = j means that pixel i is classified into segment j, and that we have arranged the pixels in raster scan order. Now, the mean-corrected image M can be formed with the following equation.

\[ M_i = Y_i - \mu_{Z_i} \tag{4.74} \]

Here, µ_j is the mean for segment j (µ_j will be one of the elements of θ_j). For each pixel, I am removing the mean of the segment into which the pixel has been classified. I now fit the RSA model with the mean-corrected image. Recall that W is the width of the image.

\[ M_i = \beta_{W+1} M_{i-(W+1)} + \beta_W M_{i-W} + \beta_{W-1} M_{i-(W-1)} + \beta_1 M_{i-1} + \epsilon_i \tag{4.75} \]

Since the mean has been removed from every pixel, E(M_i) = 0 and so there is no constant term in equation 4.75. The β coefficients can be estimated by fitting a least-squares regression in which the M_i values (excluding the boundary) are the response and appropriately lagged M_i values are the predictors. This regression provides an R² value for use in equation 4.76; the results of this regression can be examined using the same methods and diagnostics as would be applied to any AR(P) model.

To the extent that BIC can be used with mixture models, it can also be used when autoregressive dependence is present. The formula derived in section 4.2.3 for BIC with an RSA model (equation 4.73) is restated here for the case where the model is a mixture RSA model.


\[ BIC(K) = 2\left( L_{IND}(Y_{-B} \mid \hat\theta_K, K) - \frac{N_0}{2}\log(1 - R^2_K) \right) - D_K\log(N_0) \tag{4.76} \]

In equation 4.76, Y_{-B} is the image Y excluding the boundary B. K is the number of segments, and θ̂_K is the vector of estimated parameters for the model with K segments. The 4 autoregressive coefficients are included in θ̂_K, as well as (K − 1) mixture proportions, K mean parameters, and, for the Gaussian case, K variance parameters. The R²_K value is from the autoregression described in section 4.4. N_0 is the number of pixels in Y_{-B}, and D_K is the number of parameters in θ̂_K.

4.4 Fitting the Raster Scan Autoregression Model

After performing a segmentation of the image into K segments, the estimated segmentation Z̃ can be used to create a mean-corrected version of the image M. This procedure is described in section 4.3. After finding M, we need to fit the model given in equation 4.77.

\[ M_i = \beta_{W+1} M_{i-(W+1)} + \beta_W M_{i-W} + \beta_{W-1} M_{i-(W-1)} + \beta_1 M_{i-1} + \epsilon_i \tag{4.77} \]

It is quite straightforward to compute this model with standard least squares regression software. The response variable is the vector of M_i values in raster scan order, excluding the observations on the image boundary. Since the RSA model involves the four neighbors which precede each observation in raster scan order, the four predictor vectors are these four lagged values for each pixel. The predictor vectors will each contain some of the boundary pixels. When a least squares regression is computed from this model, the coefficients of the 4 predictors are exactly the four estimated β values for the RSA model. The R² value from this regression, which can be written as R²_K to emphasize the fact that it corresponds to a segmentation with K segments, can then be used in equation 4.80.

4.5 Choosing the Number of Segments with BIC

After running the EM algorithm, we have estimates of the model parameters and a final value of the loglikelihood of the data with the estimated mixture model. This loglikelihood L is calculated under the assumption that all pixels are independent, but excluding the pixels on the edge of the image. Denote the interior (non-edge) pixels by I.

\[ \log(L(Y \mid K, \hat\theta)) = \sum_{i \in I} \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \Big) \tag{4.78} \]

As with the parameter estimation equations, this loglikelihood computation can be made faster by only computing each term once for each unique data value. Equation 4.79 is equivalent to equation 4.78, but equation 4.79 only requires iteration over the unique data values instead of iteration over the entire data set (here V_i denotes the ith of the C unique data values and H_i is the number of pixels taking that value).

\[ \log(L(Y \mid K, \hat\theta)) = \sum_{i=1}^{C} H_i \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(V_i \mid \theta_j) \Big) \tag{4.79} \]

As shown in previous chapters, we can use the BIC based on this loglikelihood with an adjustment. Computation of the BIC at this point is straightforward, as shown in equation 4.80. I use the notation BIC(K) to emphasize that this BIC value is computed for a particular value of K and all other parameters are estimated automatically in the segmentation algorithm.

\[ BIC(K) = 2\log(L(Y \mid K, \hat\theta)) - N_0\log(1 - R^2_K) - D_K\log(N_0) \tag{4.80} \]

The R²_K value is from the autoregression with K segments described in section 4.4, and N_0 is the number of pixels in the interior of the image (i.e., the total number of pixels minus the number of boundary pixels). The number of degrees of freedom D_K in this equation is equal to the number of parameters estimated; in other words, D_K is equal to the number of elements in θ, which contains the 4 autoregressive parameters, K − 1 mixture proportions, and the density parameters. For a Gaussian mixture, D_K = 3K + 3; for a Poisson mixture, D_K = 2K + 3.

To choose the number of segments, we compute BIC(K) for several values of K, starting with K = 1. We then increase K until we find a local maximum in BIC(K); the value of K at this first local maximum is chosen as the optimal number of segments. This scheme for choosing K is reasonable when one expects a small number of segments; it avoids problems with the model failing to hold when large numbers of segments are fitted.

Unfortunately, although the results presented in this chapter for using BIC with autoregressive data are valid, the raster scan autoregression model is not good for image segmentation, as shown in the next section.
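The following sketch puts sections 4.4 and 4.5 together under simplifying assumptions: given an image, a current classification, and the segment means, it forms the mean-corrected image of equation 4.74, fits the four-neighbour raster-scan regression of equation 4.77 by least squares to obtain R²_K, and evaluates BIC(K) as in equation 4.80. The mixture loglikelihood over the interior pixels is assumed to be supplied by the EM fit, the boundary handling (top row plus left and right columns) is a simplification, and all function names are illustrative.

```python
import numpy as np

def rsa_r2(image, means, labels):
    """R^2 of the raster-scan autoregression (equation 4.77) fitted to the
    mean-corrected image M_i = Y_i - mu_{Z_i} of equation 4.74.
    Conditions on the top row and the left and right columns (a simplified
    boundary set); returns R^2 and the number of response pixels."""
    m = image - means[labels]
    h, w = m.shape
    resp = m[1:, 1:w - 1].ravel()
    preds = np.column_stack([
        m[:-1, :w - 2].ravel(),   # above-left neighbour
        m[:-1, 1:w - 1].ravel(),  # above neighbour
        m[:-1, 2:].ravel(),       # above-right neighbour
        m[1:, :w - 2].ravel(),    # left neighbour
    ])
    beta, *_ = np.linalg.lstsq(preds, resp, rcond=None)  # no intercept, since E(M_i) = 0
    resid = resp - preds @ beta
    return 1.0 - resid.var() / resp.var(), resp.size

def bic_k(loglik_interior, r2_k, n0, k, gaussian=True):
    """Equation 4.80: adjusted BIC for a K-segment mixture RSA model."""
    d_k = 3 * k + 3 if gaussian else 2 * k + 3
    return 2.0 * loglik_interior - n0 * np.log(1.0 - r2_k) - d_k * np.log(n0)
```

In use, BIC(K) would be evaluated for K = 1, 2, ... from successive segmentations, stopping at the first local maximum as described above.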


68resulting R 2 values are 0.93 for the true AR(1) data and 0.92 for the change pointdata. For the first signal, it makes sense to try fitting an AR(1) model. For thesecond signal, the AR(1) model is not a good model because the change point hastoo much influence on the fit.Value0 5 10 15 20•• •••••••• • ••••• •• • • •• ••• •••• • • •• • •• ••• •••• •• •• • • ••• • • • •• •••• • • ••••• ••• • • ••• ••••• ••• • ••• • •• • • • •• • • •• •• •• •••• • •• • •• • •••••• • • • •• •• ••••• • • •••••• • •••• • • •• ••• ••• ••• ••• •• • •• •• •• • • ••• ••• •• •••• • •Value0 5 10 15 20•• ••••••• • • •• • •• • •••• • •••• •••••• •••• •• ••••• ••••••• • ••• •••• •• • • •••• • ••• • •••••• • •• ••• • •• • • •••••••• •• •• • ••• • •• • ••• •• • ••• • • •• • • ••• • • • • ••• •• ••• • •• •••• •• ••• •• • • •• ••• •••••• • • • ••• •• ••• • •• •• • ••••••• • •0 50 100 150 200Time0 50 100 150 200Time(a)(b)Figure 4.1: (a) Signal generated by an AR(1) process (R 2 = 0.93). (b) Signal consisting <strong>of</strong> twosequences <strong>of</strong> independent Gaussian noise (R 2 = 0.92).Table 4.1: Loglikelihood and BIC results for the data <strong>of</strong> figure 4.1B.Number <strong>of</strong> Mixture Unadjusted AdjustedSegments Loglikelihood BIC BIC1 -607.2 -1230.2 -727.62 -273.1 -572.8 -566.7When we compute the adjusted BIC shown in equation 4.80, the R 2 adjustmentterm does not distinguish between true autoregressive dependence (as in figure4.1A) and apparent dependence due to fitting a model with too few segments (as


69in figure 4.1B). This problem causes the adjusted BIC to tend to underestimatethe number <strong>of</strong> segments because smaller models yield large R 2 values due to thelack <strong>of</strong> fit <strong>of</strong> the model.The BIC and adjusted BIC values for figure 4.1B are shown in table 4.1. Becausethe difference in loglikelihood is so large in this case, both the BIC and theadjusted BIC make the correct choice between one and two segments, though theunadjusted BIC is more decisive.An example <strong>of</strong> this problem with image data is given by figure 4.2 and table4.2. The image clearly has two segments, but the adjusted BIC (equation 4.80)incorrectly chooses one segment. An unadjusted BIC, i.e. equation 4.80 withoutthe R 2 term, correctly selects two segments in this relatively easy example. Themixture loglikelihood shows little change as more segments are added beyond two,which is consistent with the fact that only two segments are needed for this image.Table 4.2: Loglikelihood and BIC results for the image <strong>of</strong> figure 4.2.Number <strong>of</strong> Mixture Unadjusted AdjustedSegments Loglikelihood BIC BIC1 -2095 -4203 -32672 -1771 -3572 -36163 -1770 -3589 -36474 -1770 -3606 -3679


Figure 4.2: Simulated two segment image.
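The behaviour behind figure 4.1(b) and table 4.1 is easy to reproduce numerically. The sketch below simulates two runs of independent Gaussian noise with different means, fits the AR(1) regression of Y_i on Y_{i−1} by least squares, and prints the resulting R², which comes out large even though there is no autoregressive dependence within either segment. The simulated values are illustrative and will not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two segments of independent Gaussian noise (means 5 and 15, sd 1), as in figure 4.1(b).
y = np.concatenate([rng.normal(5.0, 1.0, 100), rng.normal(15.0, 1.0, 100)])

# Least squares AR(1) fit: regress Y_i on Y_{i-1}.
resp, pred = y[1:], y[:-1]
X = np.column_stack([np.ones_like(pred), pred])
beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
resid = resp - X @ beta

r2 = 1.0 - resid.var() / resp.var()
print(r2)   # large, even though the data have no within-segment dependence
```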


Chapter 5AUTOMATIC IMAGE SEGMENTATION VIA BIC5.1 Pseudolikelihood for Image Models5.1.1 Potts ModelWe can model the spatial dependence in an image by using a Markov randomfield to model the true state <strong>of</strong> each pixel. We assume that each pixel has a truehidden state X i , where X i is an integer denoting one <strong>of</strong> the K states, and that thetrue state <strong>of</strong> a pixel is likely to be similar to the states <strong>of</strong> its neighbors. DefineI(X i , X j ) as an indicator function equal to 1 when X i = X j and zero otherwise.Let N(X i ) be the neighbors <strong>of</strong> X i (that is, the 8 pixels adjacent to pixel X i ),and let U(N(X i ), k) denote the number <strong>of</strong> points in N(X i ) which have state k(so U(N(X i ), X i ) is the number <strong>of</strong> neighbors <strong>of</strong> pixel i which have the same stateas pixel i). The Potts model is characterized by the joint distribution given inequation 5.1, in which the sum is over all neighbor pairs.p(X) ∝ exp(φ ∑ i jI(X i , X j )) (5.1)Equation 5.1 leads to the conditional distribution in equation 5.2.p(X i = j|N(X i ), φ) =exp(φU(N(X i), j))∑k exp(φU(N(X i ), k))(5.2)The parameter φ expresses the amount <strong>of</strong> spatial homogeneity in the model.A positive value <strong>of</strong> φ means that neighboring pixels tend to be similar, while a


Note that pixels on the boundary of the image will not have a full set of observed neighbors. For simplicity, and because the boundary is only a small fraction of the data, the remainder of this chapter assumes that boundary pixels are excluded from analysis except in their use as neighbors of interior pixels. That is, any pixel of interest can be assumed to be an interior pixel.

5.1.2 ICM

The Iterated Conditional Modes (ICM) algorithm was introduced by Besag (1986) as a method of image reconstruction when local characteristics of the true image can be modeled as a Markov random field. In particular, this can be used with the Potts model described in equation 5.1. The algorithm begins with an initial estimate of the true scene X, and proceeds iteratively to estimate all necessary parameters, as well as estimating X.

Recall that we do not observe the X values, but we do observe Y_i for each pixel. We assume that the density of Y_i conditional on its true state X_i = j is Gaussian with mean µ_j and variance σ_j², and it follows that the Y values are conditionally independent given the X values, as shown in equations 5.3 and 5.4. Let θ_j denote the vector of parameters (µ_j, σ_j²) for state j.

    f(Y \mid X) = \prod_i f(Y_i \mid X_i)    (5.3)

    f(Y_i \mid X_i) = f(Y_i \mid \theta_{X_i})    (5.4)

To initialize the ICM algorithm, we use marginal segmentation to find a first estimate of the scene X̂.


The algorithm proceeds by first updating the estimate of the Gaussian parameters θ̂, which is done by maximizing the likelihood in equation 5.3 given the current X̂. In other words, the usual maximum likelihood estimators of µ and σ² are computed, using X̂ as the assignment of each pixel to one of the true states.

The next step is to estimate φ using maximum pseudolikelihood. Again, we use the current estimate X̂ for the true scene. The function to be maximized is the product over all pixels of equation 5.2.

    PL(\hat X \mid \phi) = \prod_i p(\hat X_i \mid N(\hat X_i), \phi)    (5.5)

    \hat\phi = \arg\max_\phi PL(\hat X \mid \phi)    (5.6)

We have one φ parameter for the image, which means we are assuming that each segment has the same amount of spatial cohesion. In estimating the value of φ, we are examining segments which are estimated correctly (high cohesion), incorrectly combined (high cohesion), or incorrectly subdivided (low cohesion). This means that when the number of groups K is less than or equal to the true number of groups K_True, then φ is estimated correctly. As K grows larger than K_True, we expect to underestimate φ. The most extreme example of this occurs when K_True = 1 and we fit models with K > 1. In these cases, the unconstrained estimate of φ could be negative, even if the true scene has positive φ. Because of this, we constrain φ to be nonnegative; if φ̂ is negative, then we reset it to zero before continuing.

At this point, θ̂ and φ̂ have been updated, and we now update X̂. This is done by considering each pixel in turn and replacing X̂_i with the state which maximizes equation 5.7.

    \hat X_i = \arg\max_j f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i), \hat\phi)    (5.7)
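One full pass of the procedure just described can be sketched as follows, assuming a greyscale image Y and a current label image Xhat stored as NumPy arrays. This is a synchronous variant (all labels are updated from the previous estimate at once) and it estimates φ̂ by a simple grid search over the pseudolikelihood of equation 5.5; the actual implementations discussed in this chapter update pixels in turn, so the sketch is illustrative rather than a reproduction of them.

    import numpy as np

    def gaussian_logpdf(y, mu, sigma2):
        return -0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

    def neighbor_counts(X, K):
        """U(N(X_i), k) for every pixel and state k, using the 8-neighborhood.
        Edges are wrapped, so only interior pixels should be used downstream."""
        U = np.zeros((K,) + X.shape)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di or dj:
                    shifted = np.roll(np.roll(X, di, axis=0), dj, axis=1)
                    for k in range(K):
                        U[k] += (shifted == k)
        return U

    def icm_sweep(Y, Xhat, K, phi_grid=np.linspace(0.0, 3.0, 31)):
        # 1. Update the Gaussian parameters by maximum likelihood given Xhat
        #    (assumes every state is present in the current labeling).
        mu = np.array([Y[Xhat == k].mean() for k in range(K)])
        sigma2 = np.array([Y[Xhat == k].var() for k in range(K)])
        # 2. Update phi by maximizing the pseudolikelihood (5.5), constrained to phi >= 0.
        U = neighbor_counts(Xhat, K)
        same = np.take_along_axis(U, Xhat[None], axis=0)[0]        # U(N(X_i), X_i)
        logZ = lambda phi: np.log(np.exp(phi * U).sum(axis=0))
        pl = [(phi * same - logZ(phi))[1:-1, 1:-1].sum() for phi in phi_grid]
        phi_hat = phi_grid[int(np.argmax(pl))]
        # 3. Update each interior label to the state maximizing equation 5.7.
        loglik = np.stack([gaussian_logpdf(Y, mu[k], sigma2[k]) for k in range(K)])
        Xnew = np.argmax(loglik + phi_hat * U, axis=0)
        Xnew[0, :], Xnew[-1, :] = Xhat[0, :], Xhat[-1, :]           # keep boundary labels fixed
        Xnew[:, 0], Xnew[:, -1] = Xhat[:, 0], Xhat[:, -1]
        return Xnew, mu, sigma2, phi_hat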


Once each pixel has been updated, we have a new X̂. We can now check for convergence of X̂, or stop after a predetermined number of iterations. At the end of ICM, we also have θ̂ and φ̂. I now comment on the posterior distribution of X, which might be used for inference in some applications, before I continue on to a discussion of inference for the number of segments K.

5.1.3 Pseudoposterior Distribution of the True Scene

This section presents a pseudolikelihood-based expression for the posterior distribution of the true segmentation X, which is the distribution of X conditional on the observed Y. Recall that we do not observe the X values, but we do observe Y_i for each pixel. We assume that the density of Y_i conditional on its true state X_i = j is Gaussian with mean µ_j and variance σ_j², and it follows that the Y values are conditionally independent given the X values.

    f(X \mid Y) = f(Y \mid X) f(X) / f(Y) \propto f(Y \mid X) f(X)    (5.8)

The first term is easy to compute, since f(Y|X) = \prod_i f(Y_i|X_i), and this is just a Gaussian density. Here we replace the second term by the pseudolikelihood PL(X). Dependence on the parameter φ is made explicit in the equations which follow.

    PL(X, \phi) = \prod_i p(X_i \mid N(X_i), \phi)    (5.9)

    p(X_i \mid N(X_i), \phi) = \frac{\exp(\phi\, U(N(X_i), X_i))}{\sum_k \exp(\phi\, U(N(X_i), k))}    (5.10)

Rewriting this in computational terms, we obtain the following expression for the pseudoposterior distribution of X, based on the pseudolikelihood.


    f(\hat X \mid Y, \hat\phi, \hat\theta) \propto \prod_i f(Y_i \mid \hat X_i, \hat\theta)\, p(\hat X_i \mid N(\hat X_i), \hat\phi)    (5.11)

In some situations, one might wish to conduct inference based on this pseudoposterior. However, we use the ICM reconstruction of X, so our only remaining goal is to conduct inference for K, the number of segments in the image.

5.1.4 Pseudolikelihood and BIC

We wish to conduct inference for K, the number of segments in the image. An initial thought would be to use BIC, as discussed in section 3.1. However, this would require evaluation of the likelihood of the observed data, L(Y|K), which is shown in equation 5.12.

    L(Y \mid K) = \sum_x f(Y \mid X = x, K)\, p(X = x \mid K)    (5.12)

The sum in equation 5.12 involves all possible configurations of the hidden states. With N pixels and K states, there are K^N possible configurations, making this approach intractable. Instead, we replace the likelihood term with a pseudolikelihood which maintains computational feasibility.

An alternative to this pseudolikelihood method is to explore the space of all possible configurations of hidden states. This can be done using reversible jump Markov chain Monte Carlo (Green, 1995), which would yield an estimate of the posterior probability of each K value. The main drawback of reversible jump MCMC is its large computational demand. In most cases, we will not really be interested in the posterior probabilities of values of K; instead, we simply want a single best K value. In the context of the consistency result shown below, we would expect that as the amount of data increases, the choice of the single best K should be the same for both the reversible jump MCMC method and the pseudolikelihood-based BIC method.


The basic idea of this pseudolikelihood approach is that instead of summing over all possible configurations of X we will consider only configurations which are close to the ICM estimate of X, denoted by X̂. Specifically, we consider each pixel Y_i in turn and condition on X̂_{−i}, which is X̂ excluding the value at X_i. The likelihood of the ith pixel observation is L(Y_i|K), shown in equation 5.13.

    L(Y_i \mid K) = \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j)    (5.13)

The sum in equation 5.13 is now over the K possible values of X_i. Conditioning on X̂_{−i}, we obtain the conditional likelihood shown in equation 5.14, in which N(X̂_i) denotes the neighbors of X̂_i.

    L(Y_i \mid \hat X_{-i}, K) = \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i))    (5.14)

The first term in the sum, f(Y_i|X_i = j), simply requires evaluation of a Gaussian density; the second term, p(X_i = j|N(X̂_i)), is evaluated using equation 5.2. The conditional likelihoods from equation 5.14 are combined to form the pseudolikelihood of the image, L_X̂(Y|K), shown in equation 5.15. Forming the product in this way makes intuitive sense because the Y_i values are independent conditional on the underlying hidden states.

    L_{\hat X}(Y \mid K) = \prod_i f(Y_i \mid \hat X_{-i}, \hat\phi) = \prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i), \hat\phi)    (5.15)

Recall that Y_i is the ith observed pixel, X_i is the hidden state of pixel i, φ̂ is the estimate of φ (from the ICM algorithm), and N(X̂_i) is the estimate of the hidden state of each neighbor of pixel i. I use L_X̂(Y|K) to denote the quantity in equation 5.15, since it is a likelihood integrated over the approximate posterior distribution of a set of models near the MAP estimate X̂.


After running ICM, we have an estimate of φ, as well as estimates of the µ and σ parameters for each segment. We can compute the log of the quantity in equation 5.15, and, since this is an approximation of the intractable L(Y|K), we use it in place of the loglikelihood in the BIC, as shown in equation 5.16. I use the notation BIC_PL(K) to differentiate this equation from the usual BIC.

    BIC_{PL}(K) = 2 \log(L_{\hat X}(Y \mid K)) - D_K \log(N)    (5.16)

Ideally, one could compute BIC_PL(K) for a large range of K values and then choose K to maximize BIC_PL(K). However, this would require an excessive amount of computation, and we do not expect the model assumptions to hold for values of K very far from the true value. Because of this, we adopt a sequential testing approach. We begin by computing BIC_PL(K) for K = 1, and then incrementally increase the value of K. At each step, we compare BIC_PL(K) with BIC_PL(K − 1), and stop the process when the larger model is rejected. In other words, as we increase K incrementally from K = 1, we take the first local maximum of BIC_PL(K) to be our choice of the number of segments K.
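Given the ICM output (the label image X̂ and the estimates θ̂ and φ̂), the pseudolikelihood BIC of equation 5.16 can be computed directly from equations 5.14 and 5.15. The sketch below is a self-contained, illustrative Python version, evaluated over interior pixels only and with D_K = 3K as discussed in section 5.2.4; it is not the Splus or XV implementation referred to elsewhere in this chapter.

    import numpy as np

    def bic_pl(Y, Xhat, mu, sigma2, phi, K):
        """Pseudolikelihood BIC of equation 5.16, using the 8-neighborhood Potts model."""
        # U(N(X_i), k): number of neighbors of each pixel in each state.
        U = np.zeros((K,) + Xhat.shape)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di or dj:
                    shifted = np.roll(np.roll(Xhat, di, axis=0), dj, axis=1)
                    for k in range(K):
                        U[k] += (shifted == k)
        # Potts conditional probabilities p(X_i = j | N(Xhat_i), phi), equation 5.2.
        w = np.exp(phi * U)
        p = w / w.sum(axis=0, keepdims=True)
        # Gaussian densities f(Y_i | X_i = j) for every pixel and state.
        dens = np.exp(-0.5 * (Y - mu[:, None, None]) ** 2 / sigma2[:, None, None]) \
               / np.sqrt(2 * np.pi * sigma2[:, None, None])
        # Conditional likelihood of each pixel (equation 5.14) and the log pseudolikelihood (5.15).
        loglik = np.log((dens * p).sum(axis=0)[1:-1, 1:-1]).sum()
        N = (Y.shape[0] - 2) * (Y.shape[1] - 2)      # interior pixels only
        return 2.0 * loglik - 3 * K * np.log(N)       # D_K = 3K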


5.1.5 Consistency of BIC_PL

Equation 5.16 gives the formula for BIC_PL(K) which I use for model selection. In this section I present a consistency result for BIC_PL(K). First, I refine the notation. Let K_T denote the true number of segments. Let X̂^K be the estimated X given that there are K segments, and similarly let θ̂^K and φ̂^K be the parameter estimates given K segments. N(X_i) is the neighborhood of X_i, which consists of the 8 pixels adjacent to X_i, and B_i is the union of X_i and its neighborhood. In other words, B_i is a three by three block of pixels centered at pixel i; omission of the subscript will indicate an arbitrary three by three block of pixels. I assume that f(Y_i|X_i) is a Gaussian distribution.

The consistency result presented here is shown for a limited case, but it is hoped that future work might extend the result to more general cases. I first state the theorem, and then two lemmas precede the proof.

Theorem 5.1: Consistency of Choice of K

We observe an image Y consisting of pixels Y_i which we assume are each generated from a Gaussian distribution which depends on the true state X_i of each pixel. We define i to index only the interior pixels of the image. We assume that the local characteristics of the true state image X can be modeled as a Markov random field; in particular, we assume the Potts model given by equation 5.1. Let K_T denote the true number of segments in the image, and let K denote a hypothesized number of segments. Suppose that one of the following cases holds.

Case 1: K_T = 1 and K > 1.

Case 2: K_T = 2, K = 1, and condition A: log(σ_K) − log(σ_1) − 8φ > 0, where σ_K is the standard deviation from the K = 1 fit and σ_1 is the larger of the two standard deviations from the K_T = 2 fit.

In case 1 or case 2, BIC_PL(K) is consistent for K; that is, as N → ∞ in such a way that the size of the image increases in both dimensions,

    P_{K_T}\big( BIC_{PL}(K) < BIC_{PL}(K_T) \big) \to 1    (5.17)

Condition A

This condition is relevant only when K_T = 2 and K = 1. Denote the true density parameters in θ^{K_T} by µ_1, µ_2, σ_1², and σ_2².


Similarly, let the parameters in θ^K be denoted by µ_K and σ_K²; equations 5.18 and 5.19 give formulas for µ_K and σ_K² in terms of the θ^{K_T} parameters. Let P_1 be the proportion of pixels for which the true (unobservable) state X_i is 1, and similarly let P_2 be the proportion of pixels in state 2.

    \mu_K = P_1 \mu_1 + P_2 \mu_2    (5.18)

    \sigma_K^2 = P_1(\sigma_1^2 + \mu_1^2 - 2P_1\mu_1^2 - 2P_2\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2) + P_2(\sigma_2^2 + \mu_2^2 - 2P_2\mu_2^2 - 2P_1\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2)    (5.19)

Suppose, without loss of generality, that σ_1 > σ_2. Condition A is given by equation 5.20.

    \log(\sigma_K) - \log(\sigma_1) - 8\phi > 0    (5.20)

Note from equation 5.19 that σ_K becomes larger as the two true mixture components become more separated; that is, σ_K can be made arbitrarily large by moving µ_1 and µ_2 farther apart. Thus, condition A can be thought of as a regularity condition which requires a certain amount of separability between the two true segments.

When φ = 0 (the spatial independence case), condition A reduces to log(σ_K) − log(σ_1) > 0, assuming σ_1 > σ_2. If, in addition, we have σ_1 = σ_2, then condition A is guaranteed to hold as long as µ_1 ≠ µ_2.
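Equations 5.18 and 5.19 are simply the mean and variance of the two-component mixture that the single-segment fit converges to; as an informal check (not part of the dissertation's argument), they can be verified numerically against the usual mixture-moment formulas E[Y] = P_1 µ_1 + P_2 µ_2 and Var[Y] = P_1(σ_1² + µ_1²) + P_2(σ_2² + µ_2²) − µ_K² for arbitrary test values:

    import numpy as np

    P1, P2, mu1, mu2, s1, s2 = 0.3, 0.7, 2.0, 5.0, 1.0, 1.5      # arbitrary test values

    muK = P1 * mu1 + P2 * mu2                                     # equation 5.18
    cross = P1**2 * mu1**2 + P2**2 * mu2**2 + 2 * P1 * P2 * mu1 * mu2
    sK2 = (P1 * (s1**2 + mu1**2 - 2 * P1 * mu1**2 - 2 * P2 * mu1 * mu2 + cross)
           + P2 * (s2**2 + mu2**2 - 2 * P2 * mu2**2 - 2 * P1 * mu1 * mu2 + cross))   # equation 5.19

    sK2_check = P1 * (s1**2 + mu1**2) + P2 * (s2**2 + mu2**2) - muK**2
    print(np.isclose(sK2, sK2_check))                              # True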


Lemma 1: Integrability

Suppose we define g_i, a function of Y_i, as shown in equation 5.21.

    g_i(Y_i) = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p(X_i = j \mid N_1(X_i), \phi_1)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p(X_i = j \mid N_2(X_i), \phi_2)} \right)    (5.21)

Here, θ_1, θ_2, φ_1, and φ_2 are fixed parameter values; N_1(X_i) and N_2(X_i) are fixed neighborhoods of X_i. Let us further assume that f(Y_i) denotes a Gaussian or Gaussian mixture density. Then g_i ∈ L¹, that is, \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i < \infty.

Proof of Lemma 1.

Since g_i is a function of Y_i, I need to show that \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i < \infty. Now N_1(X_i) and N_2(X_i) are fixed, so p(X_i = j|N_1(X_i), φ_1) and p(X_i = j|N_2(X_i), φ_2) are also fixed for a given i, and I denote them by p_j^1 and p_j^2. I can now write g_i as follows.

    g_i(Y_i) = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2} \right)    (5.22)

Note that both the numerator and denominator inside the logarithm in equation 5.22 are Gaussian mixtures. I assume that all variances are nonzero. Thus g_i(Y_i) is finite over any finite interval, and its integral is finite when integrated over any finite interval.

Let S_1 denote the state with largest variance in the numerator of equation 5.22, and similarly S_2 for the denominator. These terms will dominate in the tails; in other words, there exists a value Y_A such that for all Y_i > Y_A (and also for all Y_i < −Y_A), the following two inequalities will hold.

    \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 < f(Y_A \mid X_i = S_1, \theta_1)    (5.23)

    \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 < f(Y_A \mid X_i = S_2, \theta_2)    (5.24)

We can now write the following, in which C denotes a constant.


    \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i = \int_{-\infty}^{-Y_A} |g_i(Y_i)| f(Y_i)\, dY_i + \int_{Y_A}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i + C    (5.25)

Now note that g_i(Y_i) is the log of a fraction; making use of equations 5.23 and 5.24, the inequality of equation 5.26 will hold when Y_i > Y_A or Y_i < −Y_A.

    |g_i(Y_i)| = \left| \log\Big( \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 \Big) - \log\Big( \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 \Big) \right|
               \le \left| \log\Big( \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 \Big) \right| + \left| \log\Big( \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 \Big) \right|
               \le |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))|    (5.26)

Combining the inequality of equation 5.26 with equation 5.25, we obtain equation 5.27, in which C_1 and C_2 are irrelevant constants.

    \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i
      \le \int_{-\infty}^{-Y_A} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i
        + \int_{Y_A}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i + C_1
      \le \int_{-\infty}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i + C_2    (5.27)

At this point, showing that g_i ∈ L¹ reduces to showing that the inequality in equation 5.28 holds.

    \int_{-\infty}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i < \infty    (5.28)

The log terms in the integral can be simplified further, where C_1, C_2, and C_3 are irrelevant constants and S is either S_1 or S_2.


    |\log(f(Y_i \mid X_i = S, \theta_1))| \le C_1 + C_2 Y_i + C_3 Y_i^2    (5.29)

Thus, equation 5.28 becomes 5.30; again, C_1, C_2, and C_3 are irrelevant constants.

    \int_{-\infty}^{\infty} (C_1 + C_2 Y_i + C_3 Y_i^2) f(Y_i)\, dY_i < \infty    (5.30)

The inequality in equation 5.30 holds if f(Y_i) is Gaussian or a Gaussian mixture; therefore g_i ∈ L¹.

End of Proof of Lemma 1.

My second lemma is theorem 3.1.1 from Guyon (1995), which I state below without proof. In the lemma, X is a process defined over Z^d. X has some distribution f; for our purposes, we need only consider Z², so f would be a distribution function for the possible realizations of X on an infinite plane. X is assumed to be stationary, which means that its distribution is invariant under translations τ_i, i ∈ Z^d. It is further assumed to be in L^p, meaning that the expected value of |X_i|^p is finite. The I in the lemma is the σ-algebra of invariant sets, which is defined by A ∈ I if and only if τ_i(A) = A for all i. In other words, if there are non-invariant sets (e.g. initial conditions or boundary conditions), then their influence is excluded from the expectation in the lemma. This makes intuitive sense from the viewpoint that, for a stationary process, dependence on initial conditions should die out as the size of the data goes to infinity. Let (D_M) be a sequence of bounded convex sets; d(D_M) denotes the interior diameter of D_M, which is the diameter of the largest ball around a point in D_M which is entirely contained in D_M.


Lemma 2: Ergodicity

Let X = {X_i, i ∈ Z^d} be a stationary process in L^p, 1 ≤ p < ∞. If (D_M) is a sequence of bounded convex sets such that d(D_M) → ∞, and if \bar X_M = |D_M|^{-1} \sum_{D_M} X_i, then it follows that \lim_{M \to \infty} \bar X_M = E(X_0 \mid I) in L^p.

I now describe a sequence of bounded convex sets (D_M) which satisfy the requirements of the lemma. Consider a sequence of rectangular subsets of an infinitely large image, with lower left hand corner at the origin and each side of length M. For the sequence of rectangular sets I have defined (and in fact for any sequence which increases in size in both dimensions as M increases), it is clear that d(D_M) → ∞ as M → ∞. Furthermore, |D_M| = M², so \bar X_M defined in the lemma is the usual sample average.

Proof of Theorem 5.1

I will examine each case in turn.

Case 1: K_T = 1

The inequality in equation 5.17 can be rewritten as follows.

    2\log(L_{\hat X}(Y \mid K)) - D_K \log(N) < 2\log(L_{\hat X}(Y \mid K_T)) - D_{K_T} \log(N)    (5.31)

    \Rightarrow \log(L_{\hat X}(Y \mid K)) - (D_K - D_{K_T})\log(N)/2 < \log(L_{\hat X}(Y \mid K_T))    (5.32)

    \Rightarrow L_{\hat X}(Y \mid K)\, \exp\big( -(D_K - D_{K_T})\log(N)/2 \big) < L_{\hat X}(Y \mid K_T)    (5.33)


    \Rightarrow \frac{L_{\hat X}(Y \mid K)\, \exp(-(D_K - D_{K_T})\log(N)/2)}{L_{\hat X}(Y \mid K_T)} < 1    (5.34)

    \Rightarrow \frac{L_{\hat X}(Y \mid K)}{L_{\hat X}(Y \mid K_T)} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.35)

    \Rightarrow \frac{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.36)

    \Rightarrow \frac{1}{N} \sum_i \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right) < \frac{(D_K - D_{K_T})\log(N)/2}{N}    (5.37)

Define h_i as shown in equation 5.38.

    h_i = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right)    (5.38)

Let Z_Y² be the subset of Z² on which Y is defined, and define a process H = {h_i, i ∈ Z_Y²}. Consider a translation in Z² denoted by τ, such that the distribution of H does not change under τ. The terms in h_i are Gaussian densities and local characteristics of a Markov random field process; we are using the Potts model of equation 5.2 to model the local characteristics of this process. The Gaussian densities model the spatially independent noise at each pixel, while the Markov random field terms capture the spatial dependence of the image. Now, these local characteristics do not change for any interior pixel; they differ only at the boundaries, which (as noted previously) have been excluded from this analysis.


In excluding the boundaries (as well as in letting the image size increase to infinity in both dimensions) we are asymptotically dealing with an image with infinite size. For such an infinite image, there is no translation which will move H to a boundary, and so τ can be any translation in Z². In other words, the distribution of H is invariant under translations τ ∈ Z²; since this is the definition of a stationary process, it follows that H is stationary.

I will now show that h_i ∈ L¹; in other words, I need to show that \int |h_i| f(Y)\, dY < \infty, where f(Y) is the true joint distribution of Y. Now h_i depends on values in Y other than Y_i only through the parameter estimates θ̂^K, φ̂^K, N(X̂^K), θ̂^{K_T}, φ̂^{K_T}, and N(X̂^{K_T}). Conditional on these parameter estimates, regardless of their values, h_i will have the form of g_i in lemma 1; furthermore, it will depend on Y only through the marginal distribution of Y_i. This marginal distribution is a Gaussian mixture, since Y_i depends on other Y values through the dependence between X_i and X. Whatever the probability of X_i taking on each state, Y_i will be distributed as a Gaussian mixture. Applying lemma 1, h_i ∈ L¹.

I now apply lemma 2, using the sequence of sets (D_M) described above. Since h_i satisfies the requirements for X_i in the lemma, we obtain the result that, as N → ∞ in such a way that the size of the image is increasing in both dimensions,

    \frac{1}{N} \sum_{i=1}^{N} h_i = E(h)    (5.39)

We can now conclude that the limit of the left hand side of equation 5.37, as N → ∞ in such a way that the size of the image is increasing in both dimensions, is equal to equation 5.40.

    E_{K_T}\left[ \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right) \right]    (5.40)


86the object <strong>of</strong> the expectation is constant.⎛ ⎛log ⎝E KT⎝∑ Kj=1f(Y i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞⎞K )j=1 f(Y i |X i = j, ˆθ K T )p(Xi = j|N( ˆX K Ti ), ˆφ⎠⎠ (5.41)K T )∑ KTIf the inequality in equation 5.42 holds, then consistency is implied.⎛E KT⎝∑ Kj=1f(Y i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞K )j=1 f(Y i |X i = j, ˆθ K T )p(Xi = j|N( ˆX K Ti ), ˆφ⎠K T )( )((DK − D KT ) log(N)/2)≤ expN∑ KT(5.42)When K T = 1, there is only one possible configuration for X; in other words,X is constant. Equation 5.43 follows.f(Y ) = f(Y |X) = ∏ if(Y i |X i ) = ∏ if(Y i ) (5.43)Note that f(Y i |ˆθ K T , Xi ) is equal to f(Y i |ˆθ K T ) when KT = 1, and this in turn isasymptotically equal to f(Y i |θ K T ), since in this case ˆθ is just the usual maximumlikelihood estimate with independent data. The expected value in equation 5.42can be rewritten as equation 5.44.⎛∑ Kj=1f(YE KT⎝i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞K )⎠ (5.44)f(Y i |ˆθ K T )I now show that equation 5.44 simplifies to equation 5.45. Consider the value <strong>of</strong>ˆφ K when K T = 1 and K > K T . The estimate ˆφ K is a maximum pseudolikelihoodestimate. Maximum pseudolikelihood estimates were shown to be consistent byGeman and Graffigne (1986), so we know that ˆφ K → φ K Twhen K T = 1 all points are independent, so φ K Tas N → ∞. However,= 0. When φ = 0, the probability<strong>of</strong> any X i is (1/K), regardless <strong>of</strong> its neighbors. It follows that the expectation inequation 5.44 is equal to the expectation in equation 5.45.


    E_{K_T}\left[ \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)}{f(Y_i \mid \hat\theta^{K_T})} \right]    (5.45)

This expectation can be found by integrating over Y_i with respect to the true density of Y_i.

    \int_{Y_i} \left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)}{f(Y_i \mid \hat\theta^{K_T})} \right) f(Y_i \mid \theta^{K_T})\, dY_i    (5.46)

As noted previously, f(Y_i|θ^{K_T}) cancels with the term in the denominator, so equation 5.46 becomes equation 5.47.

    \int_{Y_i} \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)\, dY_i = 1    (5.47)

Thus I have shown that the expectation on the left hand side of equation 5.42 is equal to 1. The right hand side of the inequality in equation 5.42 is equal to the following.

    \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right)    (5.48)

Since D_K > D_{K_T}, we know that the following inequalities hold.

    \frac{(D_K - D_{K_T})\log(N)/2}{N} > 0    (5.49)

    \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right) > 1    (5.50)

In the limit as N → ∞, the inequality in equation 5.50 becomes an equality. Combining equations 5.47 and 5.50, we see the following.

    \int_{Y_i} \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)\, dY_i = 1 < \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right)    (5.51)


Thus, equation 5.42 holds, so BIC_PL(K) is consistent for K in case 1. Note that in the limit as N → ∞, the inequalities in equations 5.50 and 5.51 become equalities; equation 5.42 still holds since its inequality is not strict.

A comment about the proof for case 1 is needed. In equation 5.45, the numerator corresponds to a certain mixture density with K components, while the denominator has K_T = 1 component. I emphasize that this consistency result does not hold for the general Gaussian mixture model in which mixture proportions are estimated along with θ, since the mixture implied by the numerator of equation 5.45 has the mixture proportions held constant at 1/K.

Case 2: K_T = 2, K = 1, and condition A

Begin as in case 1, up to equation 5.36, which is rewritten here as equation 5.52.

    \frac{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.52)

Inverting the fraction, we obtain equation 5.53.

    \frac{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)} > \exp\big( (D_{K_T} - D_K)\log(N)/2 \big)    (5.53)

Since K = 1, this simplifies to equation 5.54.

    \frac{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{\prod_i f(Y_i \mid \hat\theta^K)} > \exp\big( (D_{K_T} - D_K)\log(N)/2 \big)    (5.54)

The inequality in equation 5.54 is equivalent to the inequality in equation 5.55.


    \sum_i \log\left( \frac{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{f(Y_i \mid \hat\theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.55)

Recall from case 1 that as N → ∞, it follows that θ̂^{K_T} → θ^{K_T} and φ̂^{K_T} → φ^{K_T}. Some additional notation is needed at this point. Denote the true density parameters in θ^{K_T} by µ_1, µ_2, σ_1², and σ_2². Similarly, let the parameters in θ^K (which in this case consists of only one component) be denoted by µ_K and σ_K².

We can deduce the asymptotic values of µ_K and σ_K² in terms of the true parameters. Let P_1 be the proportion of pixels for which the true (unobservable) state X_i is 1, and similarly let P_2 be the proportion of pixels in state 2. Then µ_K and σ_K² will be given by equations 5.56 and 5.57.

    \mu_K = P_1 \mu_1 + P_2 \mu_2    (5.56)

    \sigma_K^2 = P_1(\sigma_1^2 + \mu_1^2 - 2P_1\mu_1^2 - 2P_2\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2) + P_2(\sigma_2^2 + \mu_2^2 - 2P_2\mu_2^2 - 2P_1\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2)    (5.57)

The inequality in equation 5.55 holds if the inequality of equation 5.58 holds; this is true for any set of values of m_i, since the left hand side of equation 5.58 is less than the left hand side of equation 5.55.

    \sum_i \log\left( \frac{f(Y_i \mid X_i = m_i, \theta^{K_T})\, p(X_i = m_i \mid N(\hat X_i^{K_T}), \phi^{K_T})}{f(Y_i \mid \theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.58)

Now construct two sets, S_1 and S_2, as follows. The set S_1 consists of all i such that (Y_i − µ_1)²/2σ_1² < (Y_i − µ_2)²/2σ_2², and similarly S_2 is the set of i such that (Y_i − µ_1)²/2σ_1² > (Y_i − µ_2)²/2σ_2².


Combining this with equation 5.58, we find that the inequality in equation 5.55 holds if the inequality in equation 5.59 holds. From this point onward I will use p(X_i = m) to denote p(X_i = m \mid N(\hat X_i^{K_T}), \hat\phi^{K_T}).

    \sum_{i \in S_1} \log\left( \frac{f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1)}{f(Y_i \mid \theta^K)} \right) + \sum_{i \in S_2} \log\left( \frac{f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2)}{f(Y_i \mid \theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.59)

I will now examine the left hand side of equation 5.59 in order to show that under condition A the inequality in equation 5.59 is guaranteed to hold, thus showing consistency. The left hand side of equation 5.59 can be written as shown in equation 5.60.

    \sum_{i \in S_1} \log\big( f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1) \big) + \sum_{i \in S_2} \log\big( f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2) \big) - \sum_i \log\big( f(Y_i \mid \theta^K) \big)    (5.60)

Let |S_1| and |S_2| denote the sizes of the sets S_1 and S_2; note that |S_1| + |S_2| = N, where N is the number of pixels under consideration. The following three identities hold asymptotically.

    \sum_{i \in S_1} \log\big( f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1) \big) = -|S_1|\log(\sqrt{2\pi}) - |S_1|\log(\sigma_1) + \sum_{i \in S_1} \log(p(X_i = 1)) - \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2}    (5.61)

    \sum_{i \in S_2} \log\big( f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2) \big) = -|S_2|\log(\sqrt{2\pi}) - |S_2|\log(\sigma_2) + \sum_{i \in S_2} \log(p(X_i = 2)) - \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2}    (5.62)


    \sum_i \log\big( f(Y_i \mid \theta^K) \big) = -N\log(\sqrt{2\pi}) - N\log(\sigma_K) - N/2    (5.63)

Combining equations 5.61 to 5.63, we find that equation 5.60 is equal to equation 5.64.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2)) - \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2} - \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2} + N/2    (5.64)

From the construction of the sets S_1 and S_2, it is clear that the inequality in equation 5.65 holds.

    \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2} + \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2} < \sum_i \frac{(Y_i - \mu_1)^2}{2\sigma_1^2}    (5.65)

Thus, the quantity in equation 5.66 is less than the quantity in equation 5.64.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2))    (5.66)

Recalling equation 5.60, consistency is implied if the inequality in equation 5.67 holds.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2)) > (D_{K_T} - D_K)\log(N)/2    (5.67)

Note that the minimum value of log(p(X_i = 1)) or log(p(X_i = 2)) is −log(e^{8φ} + 1), which is approximately equal to −8φ. Suppose, without loss of generality, that σ_1 > σ_2. The inequality in equation 5.67 will be assured when equation 5.68 holds.


    N\log(\sigma_K) - N\log(\sigma_1) - N(8\phi) > (D_{K_T} - D_K)\log(N)/2    (5.68)

Equation 5.68 is equivalent to equation 5.69.

    N\big( \log(\sigma_K) - \log(\sigma_1) - 8\phi \big) > (D_{K_T} - D_K)\log(N)/2    (5.69)

The left hand side of equation 5.69 increases with order O(N) as long as log(σ_K) − log(σ_1) − 8φ > 0. This inequality is condition A. Since the right hand side of equation 5.69 only increases with order O(log(N)), it is clear that as N → ∞ the inequality of equation 5.69 will hold, thus implying consistency.

End of Proof.

Case 1 is the case in which we are comparing models with K > 1 components to the true model, K_T = 1. No restriction is imposed on K in this case. Case 2 assumes that K_T = 2 and imposes the restriction that the hypothesized K must be 1. Although this at first seems a rather extreme restriction, it is similar to a nested model restriction. Suppose the estimated segmentation with K + 1 segments is nested in the segmentation with K segments, in the sense that both segmentations are the same except for one of the K segments which has been subdivided in the segmentation with K + 1 segments. In this case, the main difference in BIC_PL between the two models will be attributable to the subdivided segment. Consideration of this segment alone becomes a comparison of a 2 segment model with a 1 segment model. Thus, the consistency result for case 2 addresses a basic comparison (2 segments versus 1 segment) which may be the main driving force in many other comparisons.


5.2 An Automatic Unsupervised Segmentation Method

5.2.1 Overview

This automatic unsupervised image segmentation method consists of several steps:

1. Initialize.
2. Use EM to fit a mixture model and find a marginal segmentation through maximum likelihood classification.
3. Use ICM to refine the segmentation and estimate parameters.
4. Choose the number of segments (K) using pseudolikelihood BIC.
5. (Optional) Morphological smoothing.

For the first and last steps I present several possible methods, though these are not critical and could be replaced with methods tailored to particular applications. Similarly, additional steps could be added following the segmentation to allow use of training data or other application-specific knowledge.

5.2.2 Initialization

An initial segmentation of the image is needed before we can use EM. I have explored several methods of initialization: Ward's method (Ward, 1963), random parameter estimates, and a method based on histogram equalization.

Ward's method is an agglomerative, hierarchical clustering method. It begins by considering each unique greyscale level observed in the image as a single cluster.


94final number <strong>of</strong> clusters. The choice <strong>of</strong> which clusters to merge is based on minimizinga sum <strong>of</strong> squares criterion at each step. When this process is complete,we have divided the greyscale levels into K groups, and we use this as the initialsegmentation. Because <strong>of</strong> the sum <strong>of</strong> squares criterion used in Ward’s method, itis appropriate when one intends to fit a Gaussian mixture model.One disadvantage <strong>of</strong> Ward’s method is the amount <strong>of</strong> computing it requires; asan alternative to Ward’s method, the initial parameter estimates can be randomlygenerated. This is quite fast. Since the EM algorithm typically moves very quicklytoward good parameter values, it is reasonably robust to the initial parameterestimates. However, random estimates may sometimes lead EM into undesiredlocal maxima in the likelihood. This can be alleviated by starting from severaldifferent sets <strong>of</strong> random parameter values, though this solution has the drawbackthat it once again increases computation time. Also, the use <strong>of</strong> random values isintuitively somewhat unsatisfying.Histogram equalization provides the basis for a fast and robust method <strong>of</strong>finding an initial segmentation. Consider a histogram <strong>of</strong> the greyscale values in animage (without regard to spatial information). The idea <strong>of</strong> histogram equalizationis to divide the greyscale levels into K bins which contain roughly equal numbers<strong>of</strong> pixels. Graphically, this means that we adjust the bin sizes (bins are not allthe same size) until the histogram is flat. I compute the histogram equalizationwith an iterative algorithm. First, I divide the number <strong>of</strong> data points N by thedesired number <strong>of</strong> bins K; the result N/K is the number <strong>of</strong> data points we wouldlike to have in each bin. Let T 0 denote the threshold number <strong>of</strong> pixels to allocateto each bin, so we set T 0 = N/K. Beginning with greyscale level 0, I allocate allpixels with that greyscale level into the first bin. Continuing with greyscale levels1, 2, and so on, I allocate each into the first bin until I have at least T 0 pixels inthe bin. I then allocate the following greyscale levels into the second bin until it


I then allocate the following greyscale levels into the second bin until it has at least T_0 pixels. The process continues until all greyscale levels have been allocated. Note that there may not be enough pixels to fill all of the bins. If empty bins remain at the end of the process, then it is computed again except that T_0 is replaced by T_1 = C·T_0, where C is a fraction between 0 and 1. I usually use a value of C = 2/3. The process is iterated, replacing the threshold each time with T_i = C·T_{i−1}, until all K bins have pixels. Even when the data are extremely skewed or concentrated on just a couple of grey values, this algorithm is guaranteed to converge as long as K is less than or equal to the number of distinct grey values in the data.

A common use of histogram equalization is in gamma correction. Gamma correction refers to adjusting the brightness with which each greyscale value is displayed. For instance, consider a greyscale image which contains a few pixels with grey value 250 and all the rest of the pixels with grey values less than 30. At first glance, the image might appear mostly black, even though there may be features in the dimmer pixels. The problem is that the sensitivity of the human eye is not linear with brightness. We can tell light grey from dark grey, but not light black from dark black. By changing the mapping from pixel values into displayed brightness, we can move or stretch the region of pixel values which is displayed at an appropriate brightness for human viewing. For the example I just mentioned, it would be appropriate to map the values from 1 to 30 to the range of dark to light so that the features would be visible. Clearly, this depends entirely on the image under consideration. Histogram equalization is an automatic way of selecting this gamma correction; it attempts to map equal numbers of pixels into the darker and lighter parts of the display spectrum.
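The iterative binning described above translates into a short routine. The sketch below is an illustrative Python version that returns a mapping from each observed grey level to one of K initial segments; the function name and return format are my own choices, and it stands in for, rather than reproduces, the implementations discussed in this chapter.

    import numpy as np

    def equalization_bins(image, K, C=2.0 / 3.0):
        """Assign each observed grey level to one of K bins holding at least T pixels,
        shrinking T by the factor C until no bin is left empty (section 5.2.2)."""
        levels, counts = np.unique(image, return_counts=True)
        assert K <= len(levels), "K may not exceed the number of distinct grey values"
        T = image.size / K                       # T_0 = N / K
        while True:
            labels = np.zeros(len(levels), dtype=int)
            bin_id, filled = 0, 0
            for i, c in enumerate(counts):       # sweep grey levels in increasing order
                if filled >= T and bin_id < K - 1:
                    bin_id += 1
                    filled = 0
                labels[i] = bin_id
                filled += c
            if len(np.unique(labels)) == K:      # every bin received some pixels
                return dict(zip(levels.tolist(), labels.tolist()))
            T *= C                               # T_i = C * T_{i-1}; try again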


Histogram equalization and Ward's method are both good at picking out high density regions in the greyscale histogram, but histogram equalization is much faster. My implementation of automatic segmentation in XV uses histogram equalization to initialize (see section A.1), while my Splus implementation uses Ward's method.

5.2.3 Marginal Segmentation via Mixture Models

Parameter Estimation by EM

I use the EM algorithm to estimate the parameters of the mixture density. Let Q denote the pixel probabilities, where Q_ij is the probability that pixel i is from component j. The initial segmentation is viewed as providing an initial estimate of Q; specifically, this initial estimate will consist of Q_ij = 1 if pixel i is initially classified in component j, and zero otherwise. After initialization, the algorithm iterates between the M-step, computing maximum likelihood estimates of the mixture parameters θ conditional on Q, and the E-step, estimating Q conditional on θ (the name of this step comes from the fact that Q_ij is the expected value of I(Z_i = j), which is an indicator function equal to 1 if pixel i is generated by component j and zero otherwise).

At each iteration, I compute the overall loglikelihood of the data Y given the parameters θ, using the assumption of independent pixels. In general, the EM process is repeated until the loglikelihood converges. My Splus implementation allows a user-specified limit on the number of iterations. The parameters θ which must be estimated are the density parameters from each of the K components of the mixture distribution and K − 1 mixture proportions (the mixture proportions sum to 1, so one of the K proportions is fixed given the other K − 1 proportions).


97is the number <strong>of</strong> components (this is the same as equation 3.15 <strong>of</strong> section 3.2).K∑f(Y i |K, θ) = P j Φ(Y i |θ j ) (5.70)j=1Under the assumption that all pixels are independent, the likelihood for thewhole image is given by equation 5.71, where N is the number <strong>of</strong> data points.f(Y |K, θ) =M-Step⎛⎞N∏ K∑⎝ P j Φ(Y i |θ j ) ⎠ (5.71)i=1 j=1The M-step consists <strong>of</strong> finding the maximum likelihood estimate <strong>of</strong> θ conditionalon Q. Think <strong>of</strong> Q as a matrix with one row for each pixel and one column foreach component. The mixture proportions are easily obtainable from Q as shownin equation 5.72.ˆP j = 1 NN∑ˆQ ij (5.72)Since Q ij is the probability <strong>of</strong> pixel i being in component j, we know that eachrow <strong>of</strong> Q must sum to 1. The sum <strong>of</strong> all the elements <strong>of</strong> Q will be equal to N.This agrees with the fact that ∑ j P j = 1.We now find estimates for the density parameters. For a Gaussian mixture,i=1we need estimates for the mean µ j and variance σ 2 jfor each component j. For aPoisson mixture, only the mean is needed. Formulas for the mean and varianceestimates are gven in equations 5.73 and 5.74. In essence, these are simply weightedversions <strong>of</strong> the usual maximum likelihood estimators, with the weights given byQ.ˆµ j =∑ Ni=1 ˆQij Y i∑i ˆQ ij(5.73)


    \hat\sigma_j^2 = \frac{\sum_{i=1}^{N} \hat Q_{ij} (Y_i - \hat\mu_j)^2}{\sum_i \hat Q_{ij}}    (5.74)

Significant computational savings can be achieved by calculating these estimates in a slightly different way. With equations 5.72, 5.73, and 5.74, an iteration over all N pixels is needed. This will make the algorithm take time proportional to the number of pixels. However, this time can be reduced to a time proportional to the number of unique values in the data. This is done by creating a list of unique data values and counting the number of pixels with each value (like a histogram). The parameter estimates can be computed by iterating over the unique data values instead of the pixels. This makes the algorithm operate in constant time with respect to the number of pixels; the time is linear in the number of unique data values. The time saved by this change is enormous. For instance, with a 256-level greyscale image, we will have at most 256 unique values, regardless of whether there are thousands or millions of pixels. The equations for computing estimates in this way are shown in equations 5.75, 5.76, and 5.77. My Splus implementation and my XV implementation both use this time-saving approach.

Suppose there are C unique data values (e.g. C = 256 grey levels). Let V denote the unique data values, so V is a vector of length C. Let H_i be the number of pixels with the ith unique data value, so H is like a histogram of the data. Define R the same way as Q, except that each row is for a unique data value rather than a particular pixel. In other words, R_ij is the probability that the ith unique data value was generated by the jth component. As with Q, each row of R must sum to 1. The sum of all elements of R will be equal to C.

    \hat P_j = \frac{1}{N} \sum_{i=1}^{C} H_i \hat R_{ij}    (5.75)


    \hat\mu_j = \frac{\sum_{i=1}^{C} H_i \hat R_{ij} V_i}{\sum_{i=1}^{C} H_i \hat R_{ij}}    (5.76)

    \hat\sigma_j^2 = \frac{\sum_{i=1}^{C} H_i \hat R_{ij} (V_i - \hat\mu_j)^2}{\sum_{i=1}^{C} H_i \hat R_{ij}}    (5.77)

E-Step

In the E-step, we update the estimate of Q (or R) conditional on the current estimate of θ. Equations 5.78 and 5.79 show how to compute Q̂ or R̂. In a particular implementation of the algorithm, only one approach is needed; I recommend the use of R (equations 5.75, 5.76, 5.77, and 5.79) instead of Q (equations 5.72, 5.73, 5.74, and 5.78). The two approaches are numerically equivalent, but the R method is much faster.

    \hat Q_{ij} = \frac{\hat P_j\, \Phi(Y_i \mid \hat\mu_j, \hat\sigma_j^2)}{\sum_j \hat P_j\, \Phi(Y_i \mid \hat\mu_j, \hat\sigma_j^2)}    (5.78)

    \hat R_{ij} = \frac{\hat P_j\, \Phi(V_i \mid \hat\mu_j, \hat\sigma_j^2)}{\sum_j \hat P_j\, \Phi(V_i \mid \hat\mu_j, \hat\sigma_j^2)}    (5.79)
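The E- and M-steps in their histogram form (equations 5.75 to 5.77 and 5.79) can be written compactly, as in the Python sketch below. The crude equally spaced initialization and the fixed iteration count are illustrative stand-ins for the initialization methods of section 5.2.2 and the convergence rules discussed under "Practical Issues"; the lower bound on σ anticipates the constraint described there.

    import numpy as np

    def em_greyscale_mixture(image, K, n_iter=20, min_sigma=0.5):
        """Fit a K-component Gaussian mixture to the grey levels of an image,
        iterating over unique grey values rather than pixels."""
        V, H = np.unique(image, return_counts=True)      # unique values and their counts
        V = V.astype(float)
        N = image.size
        P = np.full(K, 1.0 / K)                          # mixture proportions
        mu = np.linspace(V.min(), V.max(), K)            # crude initial means
        sigma2 = np.full(K, max(V.var() / K, min_sigma**2))
        for _ in range(n_iter):
            # E-step (equation 5.79): responsibilities R_ij for each grey value.
            dens = np.exp(-0.5 * (V[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
            R = P * dens
            R /= R.sum(axis=1, keepdims=True)
            # M-step (equations 5.75-5.77): histogram-weighted maximum likelihood estimates.
            W = H[:, None] * R                           # H_i * R_ij
            P = W.sum(axis=0) / N
            mu = (W * V[:, None]).sum(axis=0) / W.sum(axis=0)
            sigma2 = (W * (V[:, None] - mu) ** 2).sum(axis=0) / W.sum(axis=0)
            sigma2 = np.maximum(sigma2, min_sigma**2)    # lower bound of 0.5 on sigma
        return P, mu, sigma2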


Practical Issues

There are several practical problems which arise in the use of this algorithm. In this section I discuss these problems and the solutions I have implemented.

When Gaussian mixtures are used, we must keep in mind that the data are in fact discrete. The discrepancy between the continuous Gaussian distribution and the discrete data becomes more pronounced when there are fewer unique data values or when the number of components K is increased. Trouble occurs when the variance of a component gets too small. This can easily happen if there are many pixels with one particular data value, which causes a spike in the histogram of data values. Since the spike consists of a single discrete data value, the variance of a Gaussian component might shrink as the component tries to model this single value. If the variance is allowed to approach zero, then the likelihood will approach infinity. This problem is solved by imposing a lower bound on the variance of a component. If we imagine that the observed data are a discretized version of a true Gaussian mixture, then an observation Y_i can be thought of as having a round-off error which is unobserved. This means that an observed value of Y_i could arise from a true value in the range (Y_i − 0.5) to (Y_i + 0.5). To capture this variability, I impose a lower bound of 0.5 on the estimate of σ for each component.

With Poisson mixtures, there is only a mean parameter µ, so the variance constraint is not needed. However, Poisson mixtures sometimes have an identifiability problem. If the µ_j values for two components become too similar, then they are modeling the same feature of the data and their mixture proportions become arbitrary. That is, if components A and B have the same mean, then P_A and P_B are not uniquely defined.

There is a milder problem with means in Gaussian mixtures as well. If two components have means which are close, then the component with the larger variance will be split. That is, points close to the common mean of the two segments will be classified into the segment with smaller variance, since it has a higher likelihood. Points farther from the mean, both high and low, will be classified into the component with larger variance; this component will then contain sets of points which are disjoint in grey level. This problem was discussed in section 3.4.2.

There is some question of how to determine when the EM algorithm has converged. Typically, one looks for convergence in the loglikelihood, so the algorithm is stopped when the change in loglikelihood from one iteration to the next is below a certain threshold. It is not always clear how this threshold should be chosen.


However, experience with the EM algorithm suggests that it makes large steps at first, and then takes longer to converge once it is in the vicinity of a solution. For classification purposes, we do not really need extremely accurate estimates of the parameters of each component (especially in light of the inherent uncertainty in the data due to discretization, as discussed above). An adaptive threshold can be found by considering the contribution of each pixel to the loglikelihood. For instance, if the change in loglikelihood (from iteration i to iteration i + 1) for each pixel was less than 0.00001, then the overall change in loglikelihood would be less than 0.00001N, where N is the number of pixels. Of course, some pixels might have a larger or smaller change than 0.00001. My XV implementation uses this approach, so the convergence criterion for EM is a change in the loglikelihood of less than 0.00001N. If 0.00001N is larger than 1, then 1 is used as the criterion instead. My Splus implementation runs much more slowly than XV, so I simply allow a user-definable limit on the number of iterations for each execution of the EM algorithm. Inspection of output reveals that 20 iterations is usually sufficient for the parameter estimates to stabilize, so this is the value that I typically use when running the algorithm in Splus.

Final Marginal Segmentation

Once we have used EM to estimate the mixture model with K components, all that remains is to classify each pixel into one of the K segments. In section 3.4, I discussed two different methods for performing this classification: mixture classification and componentwise classification. In either method, we consider each pixel (or each unique data value) in turn, and classify it into the segment for which it has the highest likelihood (using the parameter estimates from the last iteration of EM).


The difference between the two methods is that in mixture classification the mixture proportions are included in the likelihood, while in componentwise classification they are not. Each of these approaches to classification is optimal for a particular utility function. I use componentwise classification for reasons discussed in section 3.4.

Let Ẑ_i denote the estimate of the true classification for pixel i. The classification is characterized by equation 5.80.

    \hat Z_i = \arg\max_j \Phi(Y_i \mid \hat\theta_j)    (5.80)

In equation 5.80, j indexes the components (segments), Ẑ_i is the estimated class for pixel i, Y_i is the observed value of pixel i, and θ̂_j is the vector of estimated parameters for component j.

5.2.4 ICM and Pseudolikelihood BIC

The final marginal segmentation is used as a starting point for the ICM algorithm. The ICM algorithm has two goals: to produce a final segmentation assuming a particular value for K, and to estimate the parameters needed for inference about K. Section 5.1.2 describes the ICM algorithm in detail; in particular, it shows how to obtain estimates of the density parameters θ̂, the Markov random field parameter φ̂, and the hidden Markov random field X̂.

To perform inference for K, I use BIC_PL(K), which is the BIC with the likelihood integrated over models near the posterior mode of X. Equations 5.81 to 5.83 show how to compute this quantity for a given K. Recall that f(Y_i|X_i = g) is simply a Gaussian density with parameters µ_g and σ_g², and p(X_i = g|N(X̂_i), φ̂) is given by the Potts model in equation 5.2.

    \log(f(Y_i \mid \hat X_{-i}, \hat\phi)) = \log\left( \sum_{g=1}^{K} f(Y_i \mid X_i = g)\, p(X_i = g \mid N(\hat X_i), \hat\phi) \right)    (5.81)


    \log L_{\hat X}(Y \mid K) = \sum_i \log(f(Y_i \mid \hat X_{-i}, \hat\phi))    (5.82)

    BIC_{PL}(K) = 2 \log(L_{\hat X}(Y \mid K)) - D_K \log(N)    (5.83)

In equation 5.83, N is the number of pixels (excluding the boundary of the image), and D_K is the number of parameters in the K segment model. The parameters are φ, a mean and a variance for each segment, and a mixture proportion for all but one segment; this results in D_K = 3K.

5.2.5 Determining the Number of Components

The number of components is determined by simply comparing the BIC values for several choices of K. In some cases we are interested in whether there are any features at all (in addition to the background), so a reasonable starting value for K is 1. An obvious upper bound for K is the number of unique values in the data; however, the segmentation which results when K takes on this value would not be useful, since it would be identical to the original image. It is also doubtful that the model would hold with very large K; for instance, a Gaussian mixture is not a good model of discrete data when the number of components in the mixture is close to the number of unique data values. In general, interest usually lies in finding just a few salient features in the data, so there is no need to explore the entire parameter space of K. Since models with smaller values of K can be fitted more quickly than those with large values of K, it makes sense to start with small K and then increase it. I begin with K = 1 and then increase K only as long as the BIC value increases. In other words, I am sequentially comparing a model with K components to a model with K + 1 components, until the K + 1 component model fails to outperform the K component model, as judged by the BIC.
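The stopping rule of this section amounts to a short driver loop. In the sketch below, fit_segmentation is a hypothetical callable standing in for the full EM-then-ICM fit of sections 5.2.2 to 5.2.4; it is assumed to take the image and a candidate K and return the corresponding BIC_PL value of equation 5.83.

    def choose_number_of_segments(image, fit_segmentation, K_max=10):
        """Increase K from 1 and stop at the first local maximum of BIC_PL."""
        best_K, best_bic = 1, fit_segmentation(image, 1)
        for K in range(2, K_max + 1):
            bic = fit_segmentation(image, K)
            if bic <= best_bic:      # the larger model fails to outperform the smaller one
                return best_K, best_bic
            best_K, best_bic = K, bic
        return best_K, best_bic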


5.2.6 Morphological Smoothing (Optional)

Application of smoothing based on mathematical morphology can enhance the spatial contiguity of regions in the final segmentation. However, one must be wary when looking for small features, since overuse of smoothing may remove the very features one is seeking. The smoothing step, although often extremely useful, is nonetheless something which must be carefully considered before application.

Mathematical morphology is a local smoothing method which lends itself to fast computation. In morphology, each pixel is examined and an operation is performed on it based on the contents of a small neighborhood of pixels around it. In this section I will discuss the basic operations which can be used and the specification of the pixel neighborhood.

A structuring element defines the size and shape of the neighborhood of each pixel. The structuring element is simply a specification of neighboring pixels in relation to the pixel under consideration. For example, a commonly used structuring element is a 3 pixel by 3 pixel square, with the pixel under consideration at the center. With this structuring element, the neighborhood of a pixel consists of the pixel itself and its 8 adjacent neighbors. In my XV implementation of mathematical morphology, I use this 3 by 3 structuring element by default. Changing the shape of the structuring element can enhance certain features in the data; for instance, use of a structuring element consisting of a short vertical line of pixels would accentuate vertical features in the image.

The two basic morphology operations are erosion and dilation. Consider pixel i and its neighbors N(i). The neighborhood is defined by an arbitrary structuring element; note that the neighborhood can include pixel i. Erosion consists of setting the value of pixel i to the minimum of the pixels in N(i), while dilation (a dual of erosion) is accomplished by setting pixel i to the maximum of the pixels in N(i).


Let Y denote the original image, and let A be the image after the morphology operation. Then erosion can be expressed by equation 5.84, while dilation is given by equation 5.85.

    A_i = \min(Y_{N(i)})    (5.84)

    A_i = \max(Y_{N(i)})    (5.85)

These two basic morphology operations are usually not performed singly. If an image is eroded (or dilated) repeatedly, it will eventually become a solid color equal to the minimum (or maximum) pixel value in the image. To retain the major features in the image while smoothing the noise, the two operations can be performed in sequence. An erosion followed by a dilation is called an opening, while a dilation followed by an erosion is called a closing. The opening and closing operations are idempotent; that is, repeating an opening (or closing) operation will have no effect (Serra, 1982).

Other operations can be substituted in place of the minimum and maximum in equations 5.84 and 5.85. Two other common operations are median (the median filter) and mean (blurring).

The use of smoothing operations is something which requires application-specific knowledge. In some cases, no smoothing at all may be the best way to keep features of interest, while in other cases the structuring element can be tailored to help detect certain feature shapes. In the examples I consider in section 5.3, the morphological smoothing step consists of an opening followed by a closing, using a 3 pixel by 3 pixel structuring element.
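These operations reduce to taking a minimum or maximum over each pixel's neighborhood, and the opening and closing used in section 5.3 are just compositions of the two. The Python sketch below uses the default 3 by 3 structuring element; the edge-replication boundary handling and the helper name are assumptions of mine, since the text does not specify how borders are treated.

    import numpy as np

    def _neighborhood_reduce(X, op, size=3):
        """Apply op (e.g. np.min or np.max) over a size-by-size window around each pixel."""
        pad = size // 2
        P = np.pad(X, pad, mode="edge")
        windows = [P[di:di + X.shape[0], dj:dj + X.shape[1]]
                   for di in range(size) for dj in range(size)]
        return op(np.stack(windows), axis=0)

    def erode(X):   return _neighborhood_reduce(X, np.min)   # equation 5.84
    def dilate(X):  return _neighborhood_reduce(X, np.max)   # equation 5.85
    def opening(X): return dilate(erode(X))                  # erosion followed by dilation
    def closing(X): return erode(dilate(X))                  # dilation followed by erosion

    # The smoothing step used in section 5.3 corresponds to closing(opening(Xhat)).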


5.3 Image Segmentation Examples

5.3.1 Simulated Two-Segment Image

This simulation illustrates the use of spatial information by BIC_PL, compared to the use of only marginal information by BIC_IND, where BIC_IND denotes the usual BIC value: twice the loglikelihood (assuming spatial independence) minus the penalty term, which is the number of degrees of freedom in the model multiplied by the log of the number of data points. A more complete description of the segmentation algorithm is given in section 5.3.2.

The simulated two-segment image shown in figure 5.1 is composed of two solid bands, with mean greyscale values of 120 and 140. Independent Gaussian noise (mean = 0, variance = 100) is added to each pixel, and then the values are rounded to integers. Compare this to the scrambled image in figure 5.2, which is a random reordering of the pixels from figure 5.1. Since these two images contain exactly the same pixels (just in a different order), their marginal information is the same. For instance, a histogram of the values in one image will be the same as that of the other. Such a histogram is shown in figure 5.3.

Although it is visually clear that there are two segments in figure 5.1, there is enough noise in the image that the histogram of greyscale values, shown in figure 5.3, is unimodal. Since the marginal information of the histogram is the basis of BIC_IND, it is not surprising that it selects only one segment, as shown by the values in table 5.1. Note that the BIC_IND results are identical for the two-segment image and the scrambled image, since they have the same marginal information. However, the BIC_PL values favor two segments for the two-segment image and one segment for the scrambled image. The estimated φ values used in computing BIC_PL are also shown in the table.


The large φ for the two-segment fit of the two-segment image indicates that a large degree of spatial homogeneity is found; this results in a much higher value of BIC_PL than is obtained for either one or three segments.

Table 5.1: BIC_PL and BIC_IND results for the simulated two-segment image and the scrambled image.

                         Two-Segment Image                    Scrambled Image
Segments      Est. φ    BIC_PL       BIC_IND      Est. φ    BIC_PL       BIC_IND
    1           –       -11731.28    -12998.79      –       -11756.49    -12998.79
    2          2.06     -10784.69    -13002.17     0.05     -11843.80    -13002.17
    3          0.09     -11115.64    -13021.64     0.00     -11932.16    -13021.64
    4          0.00     -10989.76    -13062.93     0.00     -11660.82    -13062.93
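For reference, the BIC_IND criterion described above can be written out explicitly. Under the independence assumption the marginal model is just the Gaussian mixture fitted by EM, so, writing π_j, μ_j, and σ_j² for the mixing proportions, means, and variances, N for the number of pixels, and D_K for the number of free parameters (as in the BIC_PL penalty), the criterion implied by that description is

$$\mathrm{BIC}_{IND}(K) = 2 \sum_{i=1}^{N} \log\Bigl( \sum_{j=1}^{K} \pi_j \, f(Y_i \mid \mu_j, \sigma_j^2) \Bigr) - D_K \log N,$$

where f(· | μ, σ²) denotes the univariate Gaussian density. This is a sketch of the formula implied by the prose description above, not a restatement of the exact definition given earlier in the chapter.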


Figure 5.1: Simulated two-segment image.


Figure 5.2: Scrambled version of figure 5.1.


Figure 5.3: Marginal histogram of the simulated image.


5.3.2 Simulated Three-Segment Image

The simulated image shown in figure 5.4 is composed of three solid bands, with greyscale values of 70, 140, and 210. Independent Gaussian noise (mean = 0, variance = 225) is added to each pixel, and then the values are rounded to integers. This simulation is meant to provide a simple illustration of the algorithm. It is visually clear that there are three segments; examination of the marginal histogram (figure 5.5) reinforces this.

From table 5.2, we see that BIC_PL correctly chooses three segments for this image. The BIC penalty term plays an important role in this example; the logpseudolikelihood is maximized at four segments, but the penalty term changes the choice to three segments. One reason that the logpseudolikelihood is so similar between the three- and four-segment cases is that the final segmentation with four segments is extremely similar to the three-segment one; it simply has the middle segment subdivided.

Values of BIC_IND are also shown in table 5.2. BIC_IND is computed from the marginal (without spatial information) greyscale values of the image, using the parameters estimated by EM. For this simulation, BIC_IND chooses 3 segments. One entry in the table, denoted by †, is missing because the final EM segmentation converges to a classification with only 4 segments present, which is not a valid 5-segment result.

The parameter estimates in table 5.3 show that the true parameter values are estimated quite accurately in the three-segment solution.

The initial segmentation by Ward's method is shown in figure 5.6; after EM, the marginal segmentation is shown in figure 5.7. It is clear from the marginal histogram in figure 5.5 that it would be very difficult for any other marginal method to improve on these results. Significant improvement can only be found by taking account of spatial information.


After refining the segmentation with ICM, only three pixels remain incorrectly classified. One of these is on the border between two segments, where spatial information is less useful, and the other two are on edges of the image. The morphological smoothing step corrects these three pixels, resulting in perfect restoration of the true image.

Table 5.2: Logpseudolikelihood and BIC_PL results for the simulated image. A missing value, noted with †, is discussed in the text.

Segments   Est. φ   Logpseudolikelihood      BIC_PL       BIC_IND
    1        –          -18474.83          -36974.22     -39608.7
    2       2.92        -15979.14          -32007.40     -38537.4
    3       1.86        -13853.21          -27780.11     -37421.7
    4       3.81        -13847.57          -27793.41     -37480.2
    5       3.26        -13848.28          -27819.38         †
    6       0.21        -13970.14          -28087.67     -37532.8
    7       0.00        -14086.51          -28344.99     -37565.4
    8       0.00        -14062.93          -28322.39     -37590.0
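As a rough check on how the penalty flips the choice from four segments to three, the penalty can be backed out of Table 5.2. Assuming the simulated image is 60 by 60, so that N = 3600 and log N ≈ 8.19, the three- and four-segment rows give

$$2(-13853.21) - (-27780.11) \approx 73.7 \approx 9 \log N, \qquad 2(-13847.57) - (-27793.41) \approx 98.3 \approx 12 \log N.$$

Moving from three to four segments therefore gains only about 11.3 in twice the logpseudolikelihood but costs about 24.6 in penalty, so BIC_PL prefers three segments. The implied counts of 9 and 12 are consistent with three additional free parameters per added segment, although the precise definition of D_K is the one given earlier in the chapter.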


Figure 5.4: Simulation of a three-segment image, before processing.


Table 5.3: EM-based parameter estimates for the simulated image.

8 segments
  Means: 211.59 196.48 157.62 223.93 138.95 125.97 62.24 79.02
  SDs:   8.39 8.68 11.49 10.62 11.73 14.91 12.38 10.73
  Probs: 0.119 0.106 0.069 0.104 0.201 0.073 0.187 0.14

7 segments
  Means: 215.42 197.42 157.37 138.89 125.95 62.25 79.02
  SDs:   12.83 9.84 11.05 11.69 14.89 12.38 10.73
  Probs: 0.24 0.091 0.068 0.201 0.073 0.187 0.14

6 segments
  Means: 215.4 197.61 141.13 125.15 62.13 78.69
  SDs:   12.87 10.12 14.89 24 12.33 10.43
  Probs: 0.239 0.092 0.302 0.045 0.187 0.135

5 segments
  Means: 215.42 197.58 142.77 128.43 69.95
  SDs:   12.85 9.98 14.88 11.34 14.86
  Probs: 0.239 0.092 0.283 0.053 0.333

4 segments
  Means: 210.17 142.77 127.3 69.97
  SDs:   14.86 14.38 11.18 14.87
  Probs: 0.333 0.28 0.054 0.333

3 segments
  Means: 210.2 140.14 69.85
  SDs:   14.85 15.26 14.78
  Probs: 0.333 0.335 0.332

2 segments
  Means: 212.02 110
  SDs:   13.58 42.57
  Probs: 0.296 0.704

1 segment
  Means: 140.17
  SDs:   59.15
  Probs: 1


Figure 5.5: Marginal histogram of the simulated image, with the estimated 3-component mixture density (axis labels: Greyscale, Percent).


Figure 5.6: Initial segmentation of the simulated image by Ward's method, using 3 segments.


Figure 5.7: Segmentation of the simulated image into 3 segments after EM.


Figure 5.8: Segmentation of the simulated image into 3 segments after ICM.


Figure 5.9: Segmentation of the simulated image into 3 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.3 Ice Floes

Figure 5.10 shows an aerial image of ice floes. This is a 256-level greyscale image. From table 5.4, we see that the first maximum in BIC_PL occurs at K = 2 segments. Parameter estimates for 1 to 8 segments are shown in table 5.5. For comparison, values of BIC_IND are also given in table 5.4. These are based on the marginal segmentation from the EM step, assuming spatial independence. A choice of 4 segments is given by BIC_IND.

The marginal histogram of this image is shown in figure 5.11, along with the estimated mixture density for K = 2. The histogram is clearly bimodal, so the 2-segment model makes sense intuitively. However, a large proportion of the data values occur between the modes, rather than outside the modes. In other words, each mode is skewed; this explains why the two fitted components appear not to be centered on the modes.

From the marginal histogram, it may appear that three segments should fit better than two. This may well be the case for the marginal values, but spatial information plays a large role in BIC_PL in this example. The large changes in BIC_PL values are attributable in part to the large changes in φ. For example, compare figures 5.14 and 5.15. The two-segment version consists almost entirely of large patches of solid color, whether white or black; the three-segment version still has some large patches of white and black, but there is more clutter and the grey segment lacks spatial contiguity, resulting in a large decrease in the overall φ value.

The segmentation process is illustrated in figures 5.12 to 5.16. The initial segmentation by Ward's method (figure 5.12) contains a bit of clutter in the water, with very little clutter in the ice floe interiors. After using the EM algorithm to fit a Gaussian mixture (figure 5.13), there is a small increase in clutter in the floe interiors, but there is much less clutter in the water.


The ICM refinement (figure 5.14) does a very good job of eliminating clutter in both the water and the ice; only one possible melt pool is evident in the main floe in the image. The melt pool is smoothed out by the morphological smoothing step (figure 5.16), along with some of the smaller bits of ice in the water; however, this step unfortunately links the main floe with the floe on the right side of the image. Depending on the goals of a particular analysis, one might end the processing before the morphological smoothing step.

Table 5.4: Logpseudolikelihood and BIC_PL results for the ice floe image.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -55649.19          -111326.33     -115800.6
    2       1.28        -46342.76           -92741.44     -110028.8
    3       0.67        -59865.64          -119815.16     -110002.0
    4       0.45        -42259.93           -84631.71     -109388.2
    5       0.33        -41835.65           -83811.10     -109394.3
    6       0.28        -40094.29           -80356.34     -109330.8
    7       0.22        -39428.42           -79052.56     -109345.3
    8       0.13        -39537.04           -79297.76     -109375.4


Figure 5.10: Aerial image of ice floes.


Table 5.5: EM-based parameter estimates for the ice floe image.

8 segments
  Means: 83.88 38.15 52.85 110.16 131.16 146.59 67.44 163.08
  SDs:   8.63 8.28 8.42 9.91 9.23 7.09 7.4 6.96
  Probs: 0.097 0.083 0.123 0.098 0.131 0.329 0.06 0.079

7 segments
  Means: 79.63 38.27 53.69 111.91 131.43 146.61 163.08
  SDs:   13.44 8.35 9.12 9.38 9.14 7.1 6.96
  Probs: 0.165 0.082 0.129 0.087 0.129 0.329 0.079

6 segments
  Means: 79.89 48.05 112.07 131.45 146.61 163.08
  SDs:   13.69 11.94 9.24 9.1 7.09 6.96
  Probs: 0.164 0.214 0.085 0.129 0.329 0.079

5 segments
  Means: 79.99 48.07 112.39 133.78 147.61
  SDs:   13.69 11.95 9.13 12.7 11.13
  Probs: 0.165 0.214 0.082 0.094 0.445

4 segments
  Means: 77.1 47.61 119.82 147.72
  SDs:   13.26 11.76 17.94 10.9
  Probs: 0.148 0.207 0.2 0.445

3 segments
  Means: 55.53 112.67 147.28
  SDs:   16.47 23.02 10.97
  Probs: 0.302 0.248 0.45

2 segments
  Means: 67.82 143.73
  SDs:   24.93 13.77
  Probs: 0.431 0.569

1 segment
  Means: 110.99
  SDs:   42.3
  Probs: 1


Figure 5.11: Marginal histogram of the ice floe image, with the estimated 2-component mixture density (axis labels: Greyscale, Percent).


Figure 5.12: Initial segmentation of the ice floe image by Ward's method, using 2 segments.


Figure 5.13: Segmentation of the ice floe image into 2 segments after EM.


Figure 5.14: Segmentation of the ice floe image into 2 segments after refinement by ICM.


Figure 5.15: Segmentation of the ice floe image into 3 segments after refinement by ICM.


Figure 5.16: Segmentation of the ice floe image into 2 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.4 Dog Lung

Figure 5.17 presents a PET image of a dog lung. This image was obtained from Dr. H. T. Robertson at the University of Washington Division of Pulmonary and Critical Care. Dr. Robertson also provided an expert examination of these results, finding that the final segmentation was sufficient for separating out the lung as a region of interest for further analysis.

It is clear in figure 5.17 that the actual image area is circular, with the corners of the image filled in with a constant grey value. This sort of artifact can be removed quite easily with a mixture model; one of the components converges to a spike at that grey level, effectively separating it from the rest of the data. This can be seen clearly in figure 5.18, and it is also apparent in the parameter estimates in table 5.7.

The results in table 5.6 show that BIC_PL chooses four segments for this image. In the context of PET imagery, the choice of four segments is quite reasonable. Two segments are needed for the background in order to model the spike (due to the corner artifact) and the general background. Since the image is constructed from radioactive emissions from gas in the lung, it is not at all surprising to see two segments for the lung itself, accounting for the high gas density in the interior of the lung and the somewhat lower gas density around the periphery. For this image, BIC_IND also chooses 4 segments using only the marginal (without spatial information) greyscale values.

The initial segmentation by Ward's method is shown in figure 5.19, and the segmentation after the EM algorithm is given in figure 5.20. In this case, Ward's method does a reasonable job of separating the lung from the background; we would expect this, since the lung is visually very bright compared to the relatively low-level background.


The EM step fills in some of the voids in the lung apparent in the initial segmentation, but at the cost of incorporating more of the background artifacts as well.

The ICM refinement, shown in figure 5.21, does a good job of reducing clutter in the image, but it leaves an erroneous section sticking out of the top of the lung. Morphological smoothing, shown in figure 5.22, removes the worst of this artifact, as well as most of the other clutter in the image. The small spot separate from and below the lung is easily removed by simply taking the lung to be the largest connected component. The small void in the center of the lung is not artifactual; it is real.

This final segmentation shows that the method is sufficient for the purpose of processing a database of lung images to separate the region of interest (the lung) from the background. Currently, the most widely used approach is for a human expert to manually outline the lung with an interactive computer program, a process which is tedious and can take a long time for a large database of images. The automatic segmentation algorithm can obviate the need for this manual process, requiring only human inspection of the results.


Table 5.6: Logpseudolikelihood and BIC_PL results for the dog lung image.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -80641.02          -161311.13     -166007.1
    2       1.14        -61081.56          -122221.33     -137193.0
    3       1.09        -59536.76          -119160.82     -137019.8
    4       0.52        -42147.64           -84411.69     -128746.2
    5       0.71        -48914.07           -97973.64     -135323.5
    6       0.69        -48508.12           -97190.83     -135333.6
    7       0.47        -36602.26           -73408.23     -128800.6
    8       0.42        -36254.37           -72741.54     -128825.8

Figure 5.17: PET image of a dog lung, before processing.


Table 5.7: EM-based parameter estimates for the dog lung image.

8 segments
  Means: 41 36.24 46.28 78.28 120.06 147.19 174.8 214.25
  SDs:   0.5 9.14 9.25 18.22 15.01 9.15 13.93 15.13
  Probs: 0.241 0.285 0.336 0.04 0.028 0.019 0.036 0.016

7 segments
  Means: 41 36.26 46.27 78.25 135.39 175.21 216.28
  SDs:   0.5 9.13 9.26 20.39 23.58 16.97 14.58
  Probs: 0.241 0.285 0.335 0.042 0.05 0.034 0.014

6 segments
  Means: 40.86 42.67 90.68 137.66 175.45 216.31
  SDs:   3.2 12.29 16.57 22.71 17.05 14.59
  Probs: 0.398 0.479 0.029 0.048 0.033 0.013

5 segments
  Means: 40.86 42.65 90.23 138.2 182.43
  SDs:   3.2 12.27 17.38 25.81 28.69
  Probs: 0.398 0.478 0.029 0.046 0.05

4 segments
  Means: 41 41.88 104.15 177.47
  SDs:   0.5 10.56 38.01 29.58
  Probs: 0.24 0.622 0.079 0.059

3 segments
  Means: 41.3 67.95 166.35
  SDs:   7.99 30.79 32.89
  Probs: 0.816 0.098 0.086

2 segments
  Means: 41.49 127.11
  SDs:   8.47 54.57
  Probs: 0.846 0.154

1 segment
  Means: 54.64
  SDs:   38.35
  Probs: 1


Figure 5.18: Marginal histogram of the dog lung image, with the estimated 4-component mixture density (axis labels: Greyscale, Percent).


Figure 5.19: Initial segmentation of the dog lung image by Ward's method, using 4 segments.


Figure 5.20: Segmentation of the dog lung image into 4 segments after EM.


Figure 5.21: Segmentation of the dog lung image into 4 segments after ICM.


Figure 5.22: Segmentation of the dog lung image into 4 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.5 Washington Coast

Figure 5.23 is a satellite image of a section of the Pacific coast of Washington state's Olympic peninsula; this image, provided by the USGS as part of the National Aerial Photography Program (NAPP), was obtained from the Terraserver web page, terraserver.microsoft.com. The resolution is 8 meters per pixel, and it is a 256-level greyscale image.

Table 5.8 shows that the first maximum in BIC_PL occurs at K = 6 segments, so we regard this as the optimal choice of K for this model. Parameter estimates are given in table 5.9. The marginal histogram of this image is shown in figure 5.24, along with the estimated mixture density for K = 6. The large spike is due to the small variance in the greyscale value of the water.

For comparison, BIC_IND values are also shown in the table; the first local maximum occurs at 4 segments. The BIC_IND values are computed based only on marginal (without spatial information) greyscale data from the image. Two of the table entries, denoted by †, are missing. The entry for 7 segments is missing because the EM algorithm converges to a classification which contains fewer segments than the number of components in the mixture; in other words, we do not obtain a valid segmentation into 7 segments from the final EM classification. The entry for 8 segments is missing due to slow convergence, which is not surprising in light of the result for 7 segments.

Figures 5.25 to 5.28 illustrate the steps of the segmentation algorithm for this image. An initial segmentation is created by using Ward's method to cluster the marginal greyscale values of the image; this is shown in figure 5.25. The water is quite well classified as a single segment, while the land is mostly contained in two segments. The tideline accounts for most of the variability in the image. Starting from the initial segmentation, the EM algorithm is used to fit a Gaussian mixture; the resulting segmentation is displayed in figure 5.26.


Most of the pixels interior to the land which had initially been classified into the water segment have now been properly classified into one of the land segments. Figure 5.27 gives the ICM refinement of the segmentation; we can see that much of the noise in the land interior has been smoothed out, while the tideline is still quite clear. The morphological smoothing step, shown in figure 5.28, smooths the land further, but the tideline becomes more obscure.

In both the ICM classification (figure 5.27) and the post-morphology classification (figure 5.28), the water is well characterized by the darkest segment. The second-darkest segment corresponds to shallow tidewater near the bright beach, and it combines with the third-darkest segment to characterize most of the dry land. Although the second-darkest segment comprises both dry land and shallow tidewater, these two land types are spatially separated by the beach. The two brightest segments correspond to the beach, which is quite reflective. The third-brightest segment is transitional; it fills in between dry land and water in regions where the beach is not evident, as well as capturing the small hill in the upper right hand corner of the image.

Comparison of the classifications in figures 5.27 and 5.28 provides a good example of the fact that one needs to know the goals of the analysis in order to know how much smoothing is reasonable. Figure 5.27 yields good spatial separation of the water and land, along with detail in the tidal area; however, the interior of the land and the water near the tideline are both a bit cluttered. Figure 5.28 smooths out much of the clutter in the image while preserving the location of the tideline, but detail in the tideline is lost.


Table 5.8: Logpseudolikelihood and BIC_PL results for the Washington coast image. Missing values, noted with †, are discussed in the text.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -64521.78          -129072.20     -133475.0
    2       1.23        -56721.70          -113500.70     -125273.0
    3       0.74        -45882.79           -91851.53     -117607.8
    4       0.58        -45303.55           -90721.70     -117528.6
    5       0.45        -42517.74           -85178.71     -117564.9
    6       0.39        -42312.05           -84795.99     -117537.5
    7       0.34        -55201.01          -110602.56         †
    8       0.30        -51279.68          -102788.55         †

Figure 5.23: Satellite image of the Washington coast, before processing.


Table 5.9: EM-based parameter estimates for the Washington coast image.

8 segments
  Means: 105.09 87.29 154.81 184.37 223.56 78.79 55.71 67.53
  SDs:   25.61 6.6 14.62 14.65 17.39 5.52 2.25 4.63
  Probs: 0.125 0.173 0.018 0.014 0.006 0.187 0.354 0.123

7 segments
  Means: 112.1 87.56 153.29 141.34 78.98 55.71 67.52
  SDs:   15.71 7.07 21.44 46.16 5.62 2.27 4.65
  Probs: 0.052 0.188 0.015 0.065 0.197 0.356 0.128

6 segments
  Means: 114.08 85.85 152.2 182.73 74.36 55.66
  SDs:   15.8 7.65 16.96 31.42 10.23 2.17
  Probs: 0.069 0.186 0.025 0.026 0.361 0.334

5 segments
  Means: 113.01 86.11 179.14 74.1 55.67
  SDs:   26.91 7.26 31 9.61 2.18
  Probs: 0.114 0.181 0.03 0.338 0.336

4 segments
  Means: 120.1 78.8 178.59 55.68
  SDs:   25.04 11.08 31.64 2.19
  Probs: 0.089 0.542 0.03 0.338

3 segments
  Means: 132.05 79.07 55.69
  SDs:   39.48 11.09 2.19
  Probs: 0.123 0.539 0.338

2 segments
  Means: 130.82 70.04
  SDs:   39.83 14.44
  Probs: 0.126 0.874

1 segment
  Means: 77.69
  SDs:   28.08
  Probs: 1


Figure 5.24: Marginal histogram of the Washington coast image, with the estimated 6-component mixture density (axis labels: Greyscale, Percent).


Figure 5.25: Initial segmentation of the Washington coast image by Ward's method, using 6 segments.


Figure 5.26: Segmentation of the Washington coast image into 6 segments after EM.


Figure 5.27: Segmentation of the Washington coast image into 6 segments after refinement by ICM.


Figure 5.28: Segmentation of the Washington coast image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.6 Buoy

Figure 5.29 is an aerial image of a buoy against a background of dark water. The horizontal scan lines from the imaging process form a quite visible artifact. A simple detrending process is used to remove most of the scan line artifact; it consists of renormalizing each row of pixels to smooth the row means (a sketch of one possible implementation is given at the end of this discussion). Figure 5.30 shows the image after this simple detrending, and further analysis begins with the detrended image. The need for an ad hoc preprocessing step in this example is illustrative of the common situation in image analysis in which the data contain known artifacts which must be removed prior to processing by more general methods.

Table 5.10 shows that BIC_PL chooses 6 segments; the parameter estimates for 1 to 8 segments are displayed in table 5.11. Figure 5.31 gives the marginal histogram of greyscale values in the buoy image (figure 5.30), along with the fitted mixture density. At a large scale, it is clear that there are two main groups of values in the histogram, at grey values of approximately 90 and 210. These two groups correspond to the buoy and the water background in the image. It is important to note that since we are using a Gaussian mixture, more than one component in the mixture might be needed to model a single non-Gaussian segment; this is discussed in section 3.4.2. Additional components may also be needed due to subtle features or artifacts. A mixture of more flexible distributions would be more appropriate than the Gaussian mixture when we are interested in features which do not have a Gaussian distribution.

Results using BIC_IND, shown in table 5.10, give a choice of 3 segments. This is based on the marginal segmentation of the image, before ICM.


Proceeding through the steps of the segmentation into 6 segments, we see that the buoy is quite clearly separated from the water in the initial segmentation by Ward's method (figure 5.32); some influence from the scan line artifact is still visible, both in jagged horizontal bits and in the darker background area running down the middle of the image. There is some slight improvement after the EM step, shown in figure 5.33. The ICM refinement does a good job of smoothing the edges of the buoy; the two brightest segments at this point would give a reasonable segmentation of the buoy. However, the water background appears quite noisy, when in fact it does not contain any real features. The morphological smoothing step seems unwarranted here, if not misleading; by reducing the amount of noise in the water segments, it may make the remaining artifacts appear to be features.

This example shows some of the limitations of the method. With or without morphological smoothing, the final segmentation is quite noisy and contains more segments than are really needed to separate the buoy feature from the water background. This image has several difficult aspects. First, the water background is quite noisy, due largely to artifacts of the imaging process which are difficult to remove without extensive knowledge of the particular application at hand. Second, the level of spatial homogeneity is quite different between the feature and the background. That is, the buoy is a single contiguous region, which on its own might have a very high φ value due to its high level of spatial homogeneity. The background, on the other hand, would have a low φ value because it does not contain any large single-segment regions. In combination, it is difficult for the model to fit properly, since the model contains only one φ parameter. Third, the buoy itself is a relatively small proportion of the image, so it is not surprising that even subtle variations in the background have a great deal of influence on the model.
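The following is a minimal sketch of one plausible reading of the row-wise detrending described at the start of this subsection: compute the mean of each row, smooth the resulting profile of row means (a simple moving average is used here purely for illustration), and shift each row so that its mean matches the smoothed value. The window length, the clamping to the 0-255 range, and the function name are assumptions made for this sketch, not details of the original preprocessing.

    #include <stdlib.h>

    /* Row-wise detrending of a width x height greyscale image: each row is
     * shifted so that its mean follows a smoothed version of the row-mean
     * profile, which suppresses horizontal scan-line artifacts. */
    static void detrend_rows(unsigned char *img, int width, int height,
                             int half_window)
    {
        double *rowmean = malloc(sizeof(double) * height);
        double *smooth  = malloc(sizeof(double) * height);
        if (rowmean == NULL || smooth == NULL) { free(rowmean); free(smooth); return; }

        /* Mean of each row. */
        for (int y = 0; y < height; y++) {
            double s = 0.0;
            for (int x = 0; x < width; x++)
                s += img[y * width + x];
            rowmean[y] = s / width;
        }

        /* Moving average of the row means (window clipped at the ends). */
        for (int y = 0; y < height; y++) {
            double s = 0.0;
            int n = 0;
            for (int k = y - half_window; k <= y + half_window; k++) {
                if (k < 0 || k >= height) continue;
                s += rowmean[k];
                n++;
            }
            smooth[y] = s / n;
        }

        /* Shift each row so its mean matches the smoothed profile. */
        for (int y = 0; y < height; y++) {
            double shift = smooth[y] - rowmean[y];
            for (int x = 0; x < width; x++) {
                double v = img[y * width + x] + shift;
                if (v < 0.0)   v = 0.0;     /* clamp to the greyscale range */
                if (v > 255.0) v = 255.0;
                img[y * width + x] = (unsigned char)(v + 0.5);
            }
        }
        free(rowmean);
        free(smooth);
    }

A larger half_window smooths the row-mean profile more aggressively and therefore removes more of the scan-line variation, at the risk of also flattening genuine vertical trends in the scene.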


Table 5.10: Logpseudolikelihood and BIC_PL results for the buoy image.

Segments   Est. φ   Logpseudolikelihood     BIC_PL       BIC_IND
    1        –          -47504.28          -95036.51    -98338.82
    2       1.05        -32799.86          -65655.61    -72682.93
    3       1.04        -31999.44          -64082.71    -72286.60
    4       0.63        -30331.81          -60775.39    -72335.39
    5       0.53        -29886.69          -59913.11    -72276.06
    6       0.25        -29332.52          -58832.69    -72347.41
    7       0.19        -29404.65          -59004.90    -72359.49
    8       0.00        -29887.55          -59998.65    -72385.19

Figure 5.29: Aerial image of a buoy, before processing.


Table 5.11: EM-based parameter estimates for the buoy image.

8 segments
  Means: 90.97 87.58 83.6 96.73 114.42 144.02 183.64 211.52
  SDs:   3.37 2.47 2.92 6.3 10.34 14.88 18.12 11.72
  Probs: 0.491 0.246 0.144 0.063 0.018 0.008 0.012 0.017

7 segments
  Means: 90.64 86.69 97.16 115.51 144.58 183.74 211.53
  SDs:   3.55 3.72 6.46 10.13 14.74 18.09 11.72
  Probs: 0.482 0.404 0.06 0.017 0.008 0.012 0.017

6 segments
  Means: 90.65 86.69 96.96 114.44 169.52 210.18
  SDs:   3.54 3.71 6.51 14 23.07 12.29
  Probs: 0.482 0.403 0.058 0.023 0.015 0.019

5 segments
  Means: 90.48 86.99 102.18 151.5 208.75
  SDs:   3.96 3.79 10.64 28.15 13.07
  Probs: 0.532 0.379 0.048 0.02 0.021

4 segments
  Means: 90.5 87.11 106.12 192.83
  SDs:   4.09 3.76 13.84 25.96
  Probs: 0.535 0.382 0.048 0.034

3 segments
  Means: 89.08 106.11 192.43
  SDs:   4.29 13.43 26.29
  Probs: 0.917 0.048 0.034

2 segments
  Means: 89.21 152.97
  SDs:   4.44 45.97
  Probs: 0.933 0.067

1 segment
  Means: 93.45
  SDs:   20.29
  Probs: 1


Figure 5.30: Buoy image after initial smoothing to mitigate the scan line artifact.


Figure 5.31: Marginal histogram of the buoy image, with the estimated 6-component mixture density (axis labels: Greyscale, Percent).


Figure 5.32: Initial segmentation of the buoy image by Ward's method, using 6 segments.


Figure 5.33: Segmentation of the buoy image into 6 segments after EM.


Figure 5.34: Segmentation of the buoy image into 6 segments after ICM.


Figure 5.35: Segmentation of the buoy image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


Chapter 6

CONCLUSIONS

This dissertation presents an automatic and unsupervised method for segmenting images which attempts to be quite general and computationally fast. The automatic choice of the number of segments based on BIC_PL is an advance over most other automatic segmentation methods, and this procedure is supported by a theoretical consistency result. Another contribution of this dissertation is the presentation of a clustering procedure for spatial point processes with nonlinear features.

Clustering with open principal curves can be extended by considering other distributions for both the background noise and the distribution of feature points along and about the curve. Although the BIC is used with some success for model selection with principal curve clustering, theoretical results would provide a more solid basis for its use. The principal curve clustering examples presented are all 2-dimensional; the method can be extended to higher dimensions, though bias problems in fitting the principal curves might increase in higher dimensions. For instance, a bias correction step was considered in two dimensions, which involved smoothing the residuals from the principal curve fit; however, this requires determining which side of the curve the residuals are on, which is not defined in higher dimensions. Note that this bias problem is due to principal curves in general, and is separate from the clustering method. Another way of characterizing principal curves is given by Delicado (1998); his approach generalizes more easily to higher dimensions.


The image segmentation examples in this dissertation all involve greyscale images. However, the methods presented here can be extended to color (3-band) images, or more generally to multiband images. This should yield improved sensitivity and specificity with regard to image features. Perhaps more importantly, it would allow the computer to "view" all bands of an image simultaneously; with more than 3 bands, a human is forced to resort to flipping back and forth between many images.

This extension would require some modifications to the pseudolikelihood calculation, which I present here. Recall equation 5.16, the pseudolikelihood-based BIC, and equation 5.15, the pseudolikelihood of the image. These are rewritten here as equations 6.1 and 6.2.

$$\mathrm{BIC}_{PL}(K) = 2 \log\bigl(L_{\hat{X}}(Y \mid K)\bigr) - D_K \log(N) \qquad (6.1)$$

$$L_{\hat{X}}(Y \mid K) = \prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p\bigl(X_i = j \mid N(\hat{X}_i), \hat{\phi}\bigr) \qquad (6.2)$$

In order to compute BIC_PL(K) with multiband images, the computation of the pseudolikelihood in equation 6.2 must reflect the multivariate nature of the image Y. Since the hidden states X are still univariate, the Markov random field term in equation 6.2 is unchanged. The Gaussian likelihood term f(Y_i | X_i = j) now becomes a multivariate Gaussian likelihood. On a practical level, this means that the EM algorithm used in the initial marginal segmentation would need to use multivariate Gaussian densities. Similarly, computation of the pseudolikelihood in the ICM algorithm would require evaluation of a multivariate Gaussian density for each pixel. For an Splus implementation, it may be possible to use the latest version of the "mclust" function (available on the web at www.stat.washington.edu/fraley/mclust/home.html) to do EM, though some sort of subsampling would have to be used for all but the smallest images.
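To make the required change explicit: in the greyscale case f(Y_i | X_i = j) is a univariate Gaussian density with mean μ_j and variance σ_j²; for a B-band image it would become the multivariate Gaussian density (the mean-vector and covariance-matrix notation here is introduced only for illustration),

$$f(Y_i \mid X_i = j) = (2\pi)^{-B/2}\, |\Sigma_j|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2} (Y_i - \mu_j)^{\top} \Sigma_j^{-1} (Y_i - \mu_j) \Bigr\},$$

where Y_i is now the B-vector of band values at pixel i, and μ_j and Σ_j are the mean vector and covariance matrix of segment j. The parameter count D_K in equation 6.1 would grow accordingly, since each segment then contributes a mean vector and a covariance matrix rather than a single mean and variance.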


Once the automatic unsupervised segmentation is accomplished using the methods presented here, other methods which make use of prior knowledge could be applied. My method would then play the role of reducing noise, which would assist shape-based feature finding methods, such as deformable templates. For example, in the dog lung example (section 5.3.4), a shape-based method could start from an initial estimate of the lung boundary given by the union of the two brightest segments. An alternate approach would be to find a correlation peak between the image and a template, which can be done very quickly with the Fourier transform; a discussion of the Fourier transform can be found in most elementary engineering texts, such as Oppenheim et al. (1983).

Shape-based methods would likely require extraction of the spatially connected components in the final segmentation; each connected component could then be compared to predefined templates (perhaps with parameters for rotation and scaling) to determine the presence or absence of features of interest in the image.

If training data are available, the results of this segmentation method could be used to automatically search an image database for features of interest. The segmentation would be performed on the training data first, perhaps with interactive choice of the number of segments. Characteristics of the feature of interest would be noted; these could include model parameters, such as the mean and variance of a particular component in the Gaussian mixture, or nonmodel observations, such as the size and shape of one or more connected components representing the feature of interest. Then, new data would be segmented in an automatic mode. The model parameters of the resulting segmentation could be compared to the training data results; connected components could also be extracted for analysis and comparison to the training data.
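As a hedged sketch of the connected-component extraction mentioned above, the following labels the 4-connected components of a final segmentation, using an explicit stack rather than recursion; the 4-neighbor convention and the function name are assumptions made for illustration, and an 8-neighbor version would differ only in the list of neighbor offsets.

    #include <stdlib.h>

    /* Label the 4-connected components of a segment map.  labels[] holds the
     * segment index of each pixel; comp[] receives a component id (0, 1, 2, ...)
     * such that two pixels share an id exactly when they are joined by a
     * 4-connected path of pixels with the same segment label.
     * Returns the number of components found, or -1 on allocation failure. */
    static int connected_components(const int *labels, int *comp,
                                    int width, int height)
    {
        int npix = width * height;
        int *stack = malloc(sizeof(int) * npix);
        if (stack == NULL)
            return -1;

        for (int i = 0; i < npix; i++)
            comp[i] = -1;                      /* -1 means "not yet visited" */

        int ncomp = 0;
        for (int seed = 0; seed < npix; seed++) {
            if (comp[seed] != -1)
                continue;
            /* Flood fill from the seed pixel. */
            int top = 0;
            stack[top++] = seed;
            comp[seed] = ncomp;
            while (top > 0) {
                int p = stack[--top];
                int x = p % width, y = p / width;
                const int nx[4] = { x - 1, x + 1, x,     x     };
                const int ny[4] = { y,     y,     y - 1, y + 1 };
                for (int k = 0; k < 4; k++) {
                    if (nx[k] < 0 || nx[k] >= width || ny[k] < 0 || ny[k] >= height)
                        continue;
                    int q = ny[k] * width + nx[k];
                    if (comp[q] == -1 && labels[q] == labels[p]) {
                        comp[q] = ncomp;
                        stack[top++] = q;
                    }
                }
            }
            ncomp++;
        }
        free(stack);
        return ncomp;
    }

The size, mean brightness, or bounding box of each component could then be computed in one further pass and compared to templates, or used as inputs to the predictive models discussed next.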


With sufficient training data, one could fit a CART or logistic regression model to provide predictive inference for the presence of features of interest. For example, if there are many bright connected components in the training data, then a model can be formed to predict the probability that a single connected component is a feature of interest, based on size, shape, intensity, and so on. This model can then be used with new data to assign a probability to the presence of a feature of interest or to determine confidence limits on such predictions.

If a feature of interest can be isolated into one segment by the segmentation method, then the rest of the image can be ignored in further analysis. This in itself can be a purpose for segmentation, or it can be part of the analyses described above. For example, an analysis of the connected components present in a segmentation might be restricted to the connected components of only the brightest segment. This can result in a considerable increase in speed when connected components are being analyzed.

The morphological smoothing step can be tailored to specific applications. As shown in the example of the Washington coast image, in which the morphology step obscures the tideline, sometimes it is not appropriate to do any morphological smoothing, since it can smooth out features of interest. The size and shape of the structuring element can be adjusted to emphasize different types of features; for example, thin vertical or horizontal structuring elements can emphasize thin vertical or horizontal features. In the examples of section 5.3, I used a sequence of an opening followed by a closing, but other steps could be used, such as inserting a median smooth between the erosions and dilations. Customization of the morphological smooth is something of an art, and might be done interactively. Alternatively, Forbes and Raftery (1999) present a morphological smoothing method cast in the context of ICM.


REFERENCES

Allard, D., and Fraley, C. (1997), "Nonparametric Maximum Likelihood Estimation of Features in Spatial Point Processes Using Voronoi Tessellation," Journal of the American Statistical Association, 92, 1485-1493.

Ambroise, C., Dang, M., and Govaert, G. (1996), "Clustering of Spatial Data by the EM Algorithm," unpublished manuscript.

Ambroise, C., and Govaert, G. (1996), "Constrained Clustering and Kohonen Self-Organizing Maps," Journal of Classification, 13, 299-313.

Banfield, J.D., and Raftery, A.E. (1992), "Ice Floe Identification in Satellite Images Using Mathematical Morphology and Clustering about Principal Curves," Journal of the American Statistical Association, 87, 7-16.

Banfield, J.D., and Raftery, A.E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.

Besag, J. (1974), "Spatial Interaction and the Statistical Analysis of Lattice Systems," Journal of the Royal Statistical Society, Series B, 36, 192-236.

Besag, J. (1986), "Statistical Analysis of Dirty Pictures," Journal of the Royal Statistical Society, Series B, 48, 259-302.

Bovik, A.C., and Munson, D.C. (1986), "Edge Detection Using Median Comparisons," Computer Vision, Graphics, and Image Processing, 33, 377-389.

Burdick, H. (1997), Digital Imaging, McGraw-Hill: New York.

Byers, S.D., and Raftery, A.E. (1998), "Nearest Neighbor Clutter Removal for Estimating Features in Spatial Point Processes," Journal of the American Statistical Association, 93, 577-584.


Campbell, N.W., Mackeown, W.P.J., Thomas, B.T., and Troscianko, T. (1997), "Interpreting Image Databases by Region Classification," Pattern Recognition, 30, 555-563.

Canny, J. (1986), "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, 679-698.

Carstensen, J. (1992), Description and Simulation of Visual Texture, PhD Thesis, Imsor: Denmark.

Celeux, G., and Govaert, G. (1992), "A Classification EM Algorithm and Two Stochastic Versions," Computational Statistics and Data Analysis, 14, 315-332.

Chickering, D.M., and Heckerman, D. (1996), "Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables," Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, 158-168.

Cunningham, S., and MacKinnon, S. (1998), "Statistical Methods for Visual Defect Metrology," IEEE Transactions on Semiconductor Manufacturing, 11, 48-53.

Dasgupta, A., and Raftery, A.E. (1998), "Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering," Journal of the American Statistical Association, 93, 294-302.

Delicado, P. (1998), "Another Look at Principal Curves and Surfaces," Working Paper 309, Departament d'Economia i Empresa, Universitat Pompeu Fabra.

Dempster, A., Laird, N., and Rubin, D. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society, Series B, 39, 1-38.

Forbes, F., and Raftery, A.E. (1999), "Bayesian Morphology: Fast Unsupervised Bayesian Image Analysis," Journal of the American Statistical Association, 94, 555-568.


Fraley, C., and Raftery, A.E. (1998), "How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis," Computer Journal, 41, 578-588.

Fraley, C., and Raftery, A.E. (1999), "MCLUST: Software for Model-Based Clustering and Discriminant Analysis," Journal of Classification, to appear.

Geman, S., and Graffigne, C. (1986), "Markov Random Field Image Models and Their Applications to Computer Vision," Proceedings of the International Congress of Mathematicians, 1496-1517.

Gray, R.M. (1988), Probability, Random Processes, and Ergodic Properties, Springer-Verlag: New York.

Green, P. (1990), "On the Use of the EM Algorithm for Penalized Likelihood Estimation," Journal of the Royal Statistical Society, Series B, 52, 443-452.

Green, P. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711-732.

Guyon, X. (1995), Random Fields on a Network: Modeling, Statistics, and Applications, Springer-Verlag: New York.

Hamilton, J.D. (1994), Time Series Analysis, Princeton University Press: Princeton.

Hansen, F.R., and Elliott, H. (1982), "Image Segmentation Using Simple Markov Field Models," Computer Graphics and Image Processing, 20, 101-132.

Hansen, K.V., and Toft, P.A. (1996), "Fast Curve Estimation Using Preconditioned Generalized Radon Transform," IEEE Transactions on Image Processing, 5, 1651-1661.

Haralick, R.M., and Shapiro, L.G. (1985), "Survey: Image Segmentation Techniques," Computer Vision, Graphics, and Image Processing, 29, 100-132.


Hartigan, J.A. (1975), Clustering Algorithms, Wiley: New York.

Hastie, T., and Stuetzle, W. (1989), "Principal Curves," Journal of the American Statistical Association, 84, 502-516.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, Chapman & Hall: New York.

Hathaway, R. (1986), "Another Interpretation of the EM Algorithm for Mixture Distributions," Statistics and Probability Letters, 4, 53-56.

Heijmans, H.J.A.M. (1994), "Mathematical Morphology: A Modern Approach in Image Processing Based on Algebra and Geometry," SIAM Review, 37, 1-36.

Hough, P.V.C. (1962), "A Method and Means for Recognizing Complex Patterns," U.S. Patent 3,069,654.

Hsiao, K. (1997), "Approximate Bayes Factors When a Mode Occurs on the Boundary," Journal of the American Statistical Association, 92, 656-663.

Illingworth, J., and Kittler, J. (1988), "A Survey of the Hough Transform," Computer Vision, Graphics, and Image Processing, 44, 87-116.

Ji, C., and Seymour, L. (1996), "A Consistent Model Selection Procedure for Markov Random Fields Based on Penalized Pseudolikelihood," Annals of Applied Probability, 6, 423-443.

Johnson, V. (1994), "A Model for Segmentation and Analysis of Noisy Images," Journal of the American Statistical Association, 89, 230-241.

Kass, R.E., and Raftery, A.E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.

Kass, R.E., and Wasserman, L. (1995), "A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, 928-934.

Kohonen, T. (1982), "Self-organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, 43, 59-69.


Kundu, A. (1990), "Robust Edge Detection," Pattern Recognition, 23, 423-440.

Latham, G., and Anderssen, R. (1994), "Assessing Quantification for the EM Algorithm," Linear Algebra and its Applications, 210, 89-122.

Latham, G. (1995), "Existence of EMS Solutions and A-Priori Estimates," SIAM Journal on Matrix Analysis and Applications, 16, 943-953.

LeBlanc, M., and Tibshirani, R. (1994), "Adaptive Principal Surfaces," Journal of the American Statistical Association, 89, 53-64.

Letts, P.A. (1978), "Unsupervised Classification in the Aries Image Analysis System," Proceedings of the 5th Canadian Symposium on Remote Sensing, 61-71.

Lu, W. (1995), "The Expectation-Smoothing Approach for Indirect Curve Estimation," unpublished manuscript.

Masson, P., and Pieczynski, W. (1993), "SEM Algorithm and Unsupervised Statistical Segmentation of Satellite Images," IEEE Transactions on Geoscience and Remote Sensing, 31, 618-633.

Murtagh, F. (1995), "Interpreting the Kohonen Self-Organization Feature Map Using Contiguity Constrained Clustering," Pattern Recognition Letters, 16, 399-408.

Nychka, D. (1990), "Some Properties of Adding a Smoothing Step to the EM Algorithm," Statistics and Probability Letters, 9, 187-193.

Oppenheim, A.V., Willsky, A.S., and Young, I.T. (1983), Signals and Systems, Prentice Hall: Englewood Cliffs, New Jersey.

Pal, N., and Pal, S. (1993), "A Review on Image Segmentation Techniques," Pattern Recognition, 26, 1277-1294.

Priestley, M.B. (1981), Spectral Analysis and Time Series, Academic Press: New York.


Prim, R. (1957), "Shortest Connection Networks and Some Generalizations," Bell System Technical Journal, 1389-1401.

Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in Sociological Methodology 1995, ed. P.V. Marsden, Blackwells: Cambridge, 111-163.

Richards, J.A. (1986), Remote Sensing Digital Image Analysis, Springer-Verlag: New York.

Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894-902.

Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.

Serra, J. (1982), Image Analysis and Mathematical Morphology, Academic Press: New York.

Silverman, B., Jones, M., Wilson, J., and Nychka, D. (1990), "A Smoothed EM Approach to Indirect Estimation Problems, with Particular Reference to Stereology and Emission Tomography (with Discussion)," Journal of the Royal Statistical Society, Series B, 52, 271-324.

Steger, C. (1998), "An Unbiased Detector of Curvilinear Structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 113-125.

Tibshirani, R. (1992), "Principal Curves Revisited," Statistics and Computing, 2, 183-190.

Tibshirani, R., and Hastie, T. (1987), "Local Likelihood Estimation," Journal of the American Statistical Association, 82, 559-568.

Ward, J. (1963), "Hierarchical Groupings to Optimize an Objective Function," Journal of the American Statistical Association, 58, 234-244.

Wold, S. (1974), "Spline Functions in Data Analysis," Technometrics, 16, 1-11.


Zahn, C. (1971), "Graph-Theoretical Methods for Detecting and Describing Gestalt Structures," IEEE Transactions on Computers, C-20, 68-86.


Appendix A

SOFTWARE DISCUSSION

A.1 XV

The main image segmentation methods discussed in this dissertation have been added to a modified version of XV, a popular and widely used UNIX image viewing program. XV provides an ideal platform for implementing image processing algorithms, because code for new algorithms can be added in a relatively modular way. XV has an easy-to-understand graphical user interface, with controls for most aspects of the image display (size, colormap, cropping, and so on). It reads most common image formats. The internal representation of images in XV is the same regardless of the image file type, which alleviates the need to convert between formats. A few image processing algorithms are built in to XV; these can be accessed using the "algorithms" button. I have added several new algorithms under this button: threshold, erode, dilate, segment, and autosegment.

The threshold algorithm is a simple thresholding of the image based on a user-specified threshold value.

Erode and dilate carry out the erosion and dilation operations of mathematical morphology using a user-specified structuring element. The structuring element is specified by supplying a file which defines it. The most common structuring element is probably the simple 3x3 square; a sample file specifying this structuring element is included in the same directory as the executable, and the source code includes instructions on how to create a structuring element file.


Segment and autosegment are implementations of automatic unsupervised image segmentation. They use a method based on histogram equalization to generate an initial segmentation. Then, EM is used to fit a Gaussian mixture with K components. For the segment algorithm, K is specified by the user; for autosegment, the user specifies a maximum value of K and the program chooses the best K based on a modified version of BIC (note that this does not use the pseudolikelihood).

All of these methods are designed for greyscale images; if the image is color, then the red color band is used as the image. These methods are quite fast; autosegment can examine values of K from 2 to 12 for a typical 256x256 image in under a minute. There are some restrictions on image size due to memory constraints. These limits are hard-coded into the modified XV version, but they can be altered by editing the source code and recompiling.

A.2 C code

greyhist.c - This program finds the marginal histogram of an image. Input files: greyhistaux.txt (width, height), greyhistin.asc (ASCII greyscale integers only). Output file: greyhistout.txt.

covascii.c - This program computes the covariance matrix of the image and 4 lags (the 4 adjacent neighbors preceding each pixel in raster scan order). Input files: covinput.txt (width, height), covimagein.txt (ASCII greyscale integers only). Output file: covoutput.txt.

emheqpoisson.c - This program uses EM to fit a Poisson mixture model for segmentation, using histogram equalization as the initial segmentation. Final classification is done without mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output files: emoutput.txt, emimageout.pgm.


emnoinitpoisson.c - This program uses EM to fit a Poisson mixture model for segmentation; a user-supplied initial segmentation is required. Final classification is done without mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only), eminitin.asc (integers indexed from zero). Output files: emoutput.txt, emimageout.pgm.

emnoinit.c - This program uses EM to fit a Gaussian mixture model for segmentation; a user-supplied initial segmentation is required. Final classification is done without mixture probabilities. The minimum SD constraint is set to 0.25; emnoinit2.c has the constraint set to 0.5. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only), eminitin.asc (integers indexed from zero). Output files: emoutput.txt, emimageout.pgm.

emseg.c - This program uses EM to fit a Gaussian mixture model for segmentation, using histogram equalization as the initial segmentation. Final classification is done with mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output files: emoutput.txt, emimageout.pgm.

emsegloop.c - This program uses EM to fit several Gaussian mixture models for segmentation; iteration is done over values of K (the number of segments) from 2 to 15. Histogram equalization is used to find the initial segmentation. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output file: emoutput.txt.

mcmcpseudologlik.c - This program performs MCMC to compute the pseudolikelihood. Input file: mcmcinput.txt. Output file: mcmcoutput.txt.


A.3 Splus code

hpcc - Hierarchical clustering on open principal curves is performed on point process data; some plotting capability is also provided.

clust.var.spline - This function is called by hpcc and is not intended for direct use.

plot.pclust - This function is called by hpcc and is not intended for direct use.

penalty - This function is called by hpcc and is not intended for direct use.

vdist - This function is called by hpcc and is not intended for direct use.

cempcc - This function refines a clustering on open principal curves by using the CEM algorithm; some plotting capability is also provided.

calc.likelihood - This function is called by cempcc and is not intended for direct use.

autoseg8 - This function fits greyscale image segmentation models for 1-8 segments. It calls segment.marginal, estimate.icm.phi0, and pseudol.unorder. It returns the initial segmentation (by Ward's method), the EM segmentation, and the ICM segmentation, as well as all parameter estimates.

segment.marginal - This function finds an initial segmentation by using Ward's method (calling ward.initclass) and then refines this marginal segmentation with the EM algorithm (calling emfast.alg).

ward.initclass - This function takes a matrix and a specified number of segments as input and returns a classification with the requisite number of segments based on Ward's method.

emfast.alg - This function uses the EM algorithm to fit a Gaussian mixture. An initial classification of the data is required. The run time is linear in the number of unique data values, so it can handle large images in reasonable time as long as the number of unique data values is not large (as is the case with 256-level greyscale images); a sketch of this weighting idea is given at the end of this appendix.


estimate.icm.phi0 - This function refines a segmentation by using ICM with parameter estimation. The phi parameter is constrained to be non-negative. This function calls the icm function.

estimate.icm - The same as estimate.icm.phi0, but without the constraint on phi.

icm - This function performs ICM without parameter estimation.

pseudol.unorder - This function computes a pseudolikelihood for the unordered colors model, conditional on parameter estimates.

linehist - This function produces a histogram using vertical lines, which is useful for image data.

mixplot - This function plots a Gaussian mixture density.

flip - This function flips a matrix (not a transpose) for plotting.

morphit - This function performs a greyscale opening and closing on an integer matrix using a hard-coded 3x3 structuring element.
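To illustrate why a run time linear in the number of unique grey values is possible (as noted for emfast.alg above), here is a minimal sketch of one EM iteration for a univariate Gaussian mixture driven entirely by the 256-bin greyscale histogram. It is written in C to match the style of appendix A.2, but the function name, the fixed component limit, and the handling of degenerate cases are assumptions for this sketch rather than a transcription of the code described above; the variance floor mirrors the minimum-SD constraint of 0.25 mentioned for emnoinit.c.

    #include <math.h>

    #define NLEVELS 256
    #define KMAX    16

    /* One EM iteration for a K-component univariate Gaussian mixture, using
     * the image only through its greyscale histogram count[v], v = 0..255.
     * The cost is O(K * NLEVELS) regardless of how many pixels the image has.
     * The sd[] values must be positive on entry. */
    static void em_step_hist(const long count[NLEVELS], int K,
                             double mean[], double sd[], double prob[])
    {
        double sw[KMAX], swx[KMAX], swxx[KMAX], resp[KMAX];
        long ntot = 0;

        if (K < 1 || K > KMAX)
            return;
        for (int j = 0; j < K; j++)
            sw[j] = swx[j] = swxx[j] = 0.0;
        for (int v = 0; v < NLEVELS; v++)
            ntot += count[v];
        if (ntot == 0)
            return;

        for (int v = 0; v < NLEVELS; v++) {
            if (count[v] == 0)
                continue;
            /* E-step: (unnormalized) responsibilities for grey level v;
             * the constant 1/sqrt(2*pi) cancels in the normalization. */
            double tot = 0.0;
            for (int j = 0; j < K; j++) {
                double z = (v - mean[j]) / sd[j];
                resp[j] = prob[j] * exp(-0.5 * z * z) / sd[j];
                tot += resp[j];
            }
            if (tot <= 0.0)
                continue;
            /* Accumulate M-step sums, weighted by the histogram count. */
            for (int j = 0; j < K; j++) {
                double w = count[v] * resp[j] / tot;
                sw[j]   += w;
                swx[j]  += w * v;
                swxx[j] += w * (double)v * v;
            }
        }

        /* M-step: update mixing proportions, means, and standard deviations. */
        for (int j = 0; j < K; j++) {
            if (sw[j] <= 0.0)
                continue;
            prob[j] = sw[j] / ntot;
            mean[j] = swx[j] / sw[j];
            double var = swxx[j] / sw[j] - mean[j] * mean[j];
            if (var < 0.0625)          /* floor: SD >= 0.25 */
                var = 0.0625;
            sd[j] = sqrt(var);
        }
    }

One full EM pass therefore costs on the order of K times 256 arithmetic operations, after a single pass over the image to build the histogram.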


VITA

1993  B.S. Mathematics, Harvey Mudd College
1993  M.S. Mathematics, Harvey Mudd College and Claremont Graduate University
