
Fast Automatic Unsupervised Image Segmentation and Curve Detection in Spatial Point Patterns

by

Derek C. Stanford

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington

1999

Program Authorized to Offer Degree: Statistics


University of Washington

This is to certify that I have examined this copy of a doctoral dissertation by Derek C. Stanford and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made.

Chair of Supervisory Committee:
Adrian Raftery

Reading Committee:
Nayak Polissar
Paul Sampson
Werner Stuetzle

Date


In presenting this dissertation in partial fulfillment of the requirements for the Doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of the dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to Bell and Howell Information and Learning, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform."

Signature

Date


University of Washington

Abstract

Fast Automatic Unsupervised Image Segmentation and Curve Detection in Spatial Point Patterns

by Derek C. Stanford

Chairperson of Supervisory Committee: Professor Adrian Raftery, Statistics Department

There is a growing need for image analysis methods which can process large image databases quickly and with limited human input. I propose a method for segmenting greyscale images which automatically estimates all necessary parameters, including choosing the number of segments. This method is both fast and general, and it does not require any training data. The EM and ICM algorithms are used to fit an image model and compute a pseudolikelihood; this pseudolikelihood is used in a modified form of the Bayesian Information Criterion (BIC) to automatically select the number of segments. A consistency result for this approach is proven and several example applications are shown. A method for automatically detecting curves in spatial point patterns is also presented. Principal curves are used to model curvilinear features; BIC is used to automatically select the amount of smoothing. Applications to simulated minefields and seismological data are shown.


TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction
1.1 Motivation
1.2 Other Methods for Image Segmentation
1.3 Other Methods for Image Segmentation with Choice of the Number of Segments
1.4 Overview

Chapter 2: Clustering on Open Principal Curves
2.1 Introduction
2.2 Model, Estimation and Inference
2.2.1 Principal Curves
2.2.2 Probability Model
2.2.3 Estimation: The CEM-PCC Algorithm
2.2.4 Inference: Choosing the Number of Features and Their Smoothness Simultaneously
2.3 Initialization
2.3.1 Denoising and Initial Clustering
2.3.2 Hierarchical Principal Curve Clustering (HPCC)


2.4 Examples
2.4.1 A Simulated Two-Part Curvilinear Minefield
2.4.2 A Simulated Curvilinear Minefield
2.4.3 New Madrid Seismic Region
2.5 Discussion

Chapter 3: Marginal Segmentation
3.1 BIC
3.2 Mixture Models
3.3 BIC with Mixture Models
3.4 Classification
3.4.1 Mixture versus Componentwise Classification
    Mixture Classification
    Theorem 3.1: Optimality of Mixture Classification
    Proof of Theorem 3.1
    Componentwise Classification
    Theorem 3.2: Optimality of Componentwise Classification
    Proof of Theorem 3.2
3.4.2 Correspondence of Components with Segments

Chapter 4: Adjusting for Autoregressive Dependence
4.1 Adjusting BIC for the AR(1) Model
4.1.1 Loglikelihood Adjustment
    Independence Case
    Dependence Case
    Effect on Computation


4.1.2 Penalty Adjustment
    Independence Case
    Dependence Case
4.1.3 Computing BIC with the AR(1) Model
4.2 Adjusting BIC for the Raster Scan Autoregression (RSA) Model
4.2.1 Loglikelihood Adjustment
4.2.2 Penalty Adjustment
4.2.3 Computing BIC with the RSA Model
4.3 Mixture RSA Models
4.4 Fitting the Raster Scan Autoregression Model
4.5 Choosing the Number of Segments with BIC
4.6 Application of the RSA Model to Image Data

Chapter 5: Automatic Image Segmentation via BIC
5.1 Pseudolikelihood for Image Models
5.1.1 Potts Model
5.1.2 ICM
5.1.3 Pseudoposterior Distribution of the True Scene
5.1.4 Pseudolikelihood and BIC
5.1.5 Consistency of BIC_PL
    Theorem 5.1: Consistency of Choice of K
    Condition A
    Lemma 1: Integrability
    Proof of Lemma 1
    Lemma 2: Ergodicity
    Proof of Theorem 5.1


    Case 1: K_T = 1
    Case 2: K_T = 2, K = 1, and condition A
5.2 An Automatic Unsupervised Segmentation Method
5.2.1 Overview
5.2.2 Initialization
5.2.3 Marginal Segmentation via Mixture Models
    Parameter Estimation by EM
    M-Step
    E-Step
    Practical Issues
    Final Marginal Segmentation
5.2.4 ICM and Pseudolikelihood BIC
5.2.5 Determining the Number of Components
5.2.6 Morphological Smoothing (Optional)
5.3 Image Segmentation Examples
5.3.1 Simulated Two Segment Image
5.3.2 Simulated Three Segment Image
5.3.3 Ice Floes
5.3.4 Dog Lung
5.3.5 Washington Coast
5.3.6 Buoy

Chapter 6: Conclusions

References


Appendix A: Software Discussion
A.1 XV
A.2 C code
A.3 Splus code


LIST OF FIGURES

1.1 (a) Simulated image with 3 underlying segments and Gaussian noise. (b) Result of automatic segmentation.
2.1 (a) Simulated minefield with noise. (b) Final result.
2.2 Principal curve example.
2.3 Two part curvilinear minefield after denoising using nearest neighbor cleaning.
2.4 Initial clustering of two part curvilinear minefield
2.5 HPCC applied to the two-part curvilinear minefield.
2.6 Simulated curvilinear minefield.
2.7 Simulated curvilinear minefield after denoising using nearest neighbor cleaning.
2.8 Initial clustering of denoised curvilinear minefield.
2.9 CEM-PCC applied to the curvilinear minefield.
2.10 New Madrid earthquakes 1974-1992.
2.11 New Madrid data after denoising.
2.12 Initial clustering of denoised New Madrid earthquake data
2.13 HPCC applied to the New Madrid data.
2.14 CEM-PCC applied to the New Madrid data.


4.1 (a) Signal generated by an AR(1) process (R^2 = 0.93). (b) Signal consisting of two sequences of independent Gaussian noise (R^2 = 0.92).
4.2 Simulated two segment image.
5.1 Simulated two-segment image.
5.2 Scrambled version of figure 5.1.
5.3 Marginal histogram of the simulated image.
5.4 Simulation of a three segment image, before processing.
5.5 Marginal histogram of the simulated image, with the estimated 3 component mixture density.
5.6 Initial segmentation of the simulated image by Ward's method, using 3 segments.
5.7 Segmentation of the simulated image into 3 segments after EM.
5.8 Segmentation of the simulated image into 3 segments after ICM.
5.9 Segmentation of the simulated image into 3 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.10 Aerial image of ice floes.
5.11 Marginal histogram of the ice floe image, with the estimated 2 component mixture density.
5.12 Initial segmentation of the ice floe image by Ward's method, using 2 segments.
5.13 Segmentation of the ice floe image into 2 segments after EM.
5.14 Segmentation of the ice floe image into 2 segments after refinement by ICM.


5.15 Segmentation of the ice floe image into 3 segments after refinement by ICM.
5.16 Segmentation of the ice floe image into 2 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.17 PET image of a dog lung, before processing.
5.18 Marginal histogram of the dog lung image, with the estimated 4 component mixture density.
5.19 Initial segmentation of the dog lung image by Ward's method, using 4 segments.
5.20 Segmentation of the dog lung image into 4 segments after EM.
5.21 Segmentation of the dog lung image into 4 segments after ICM.
5.22 Segmentation of the dog lung image into 4 segments after morphological smoothing (opening and closing, conditional on the edge pixels).
5.23 Satellite image of Washington coast, before processing.
5.24 Marginal histogram of the Washington coast image, with the estimated 6 component mixture density.
5.25 Initial segmentation of the Washington coast image by Ward's method, using 6 segments.
5.26 Segmentation of the Washington coast image into 6 segments after EM.
5.27 Segmentation of the Washington coast image into 6 segments after refinement by ICM.
5.28 Segmentation of the Washington coast image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.29 Aerial image of a buoy, before processing.
5.30 Buoy image after initial smoothing to mitigate the scan line artifact.
5.31 Marginal histogram of the buoy image, with the estimated 6 component mixture density.
5.32 Initial segmentation of the buoy image by Ward's method, using 6 segments.
5.33 Segmentation of the buoy image into 6 segments after EM.
5.34 Segmentation of the buoy image into 6 segments after ICM.
5.35 Segmentation of the buoy image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


LIST OF TABLES

2.1 BIC results for the two part curvilinear minefield.
2.2 BIC results for simulated sine wave minefield.
2.3 BIC results for New Madrid seismic data.
4.1 Loglikelihood and BIC results for the data of figure 4.1B.
4.2 Loglikelihood and BIC results for the image of figure 4.2.
5.1 BIC_PL and BIC_IND results for the simulated two-segment image and the scrambled image.
5.2 Logpseudolikelihood and BIC_PL results for the simulated image. A missing value, noted with †, is discussed in the text.
5.3 EM-based parameter estimates for the simulated image.
5.4 Logpseudolikelihood and BIC_PL results for the ice floe image.
5.5 EM-based parameter estimates for the ice floe image.
5.6 Logpseudolikelihood and BIC_PL results for the dog lung image.
5.7 EM-based parameter estimates for the dog lung image.
5.8 Logpseudolikelihood and BIC_PL results for the Washington coast image. Missing values, noted with †, are discussed in the text.
5.9 EM-based parameter estimates for the Washington coast image.
5.10 Logpseudolikelihood and BIC_PL results for the buoy image.
5.11 EM-based parameter estimates for the buoy image.


DEDICATION

This work is dedicated to my family, to my friends, and to anyone who finds it useful. Caveat Emptor.


Chapter 1

INTRODUCTION

1.1 Motivation

Image segmentation is the process of classifying each pixel of an image into a set of classes, where the number of classes is much smaller than the number of unique pixel values. The goal of image segmentation is to separate features from each other and from background, where features are items of interest in an image. For example, we might want to separate different tissue types in a brain image: grey matter, white matter, bone, blood, and so on.

To illustrate this idea, a simple simulated image is shown in figure 1.1, along with a segmented version. In this simulation, it is visually clear that there are three segments. We want the computer to be able to detect these segments automatically; in this case, the underlying segments are reconstructed perfectly using the algorithm described in chapter 5. The details of the analysis of this simulation can be found in section 5.3.2.

Segmentation can be accomplished manually by a human expert who simply looks at an image, determines borders between regions, and classifies each region. This is perhaps the most reliable and accurate method of image segmentation, because the human visual system is immensely complex and well suited to the task. However, modern data acquisition methods create a huge amount of image data for which manual analysis would be prohibitively expensive and time-consuming.


Figure 1.1: (a) Simulated image with 3 underlying segments and Gaussian noise. (b) Result of automatic segmentation.

This leads to the current goal: to develop a general method for segmenting images quickly and entirely automatically. Furthermore, I use no training data; in applications where training data are available, this method could be an initial step in a more complex classification process. Image segmentation is needed in a wide variety of disciplines with many different imaging modalities. Some examples are multispectral satellite or aerial images for geoscience or military reconnaissance; medical imaging with PET, CAT, MRI, and ultrasound; and real-time image streams for quality control in manufacturing.

1.2 Other Methods for Image Segmentation

Image segmentation is a well studied problem for which many methods have been developed. Comprehensive surveys of the image segmentation literature are available in Haralick and Shapiro (1985) and more recently in Pal and Pal (1993). I restrict my attention to the automatic, unsupervised case; these methods work


without training data (unsupervised) and without the need for manual fine-tuning of parameters (automatic). Although many such methods exist, most assume that K, the number of segments, is known in advance. The basic approaches to image segmentation consist of three types of methods: region growing, edge finding, and pixel classification. Here, I give a brief description of these areas.

Region growing methods seek homogeneous regions, and then grow and merge these regions until the desired number of segments is reached. The growing or merging of regions is typically controlled by a homogeneity measure, such as an entropy criterion or a least squares measure. A well-known example of the latter is Hartigan's K-means clustering (Hartigan, 1975). In addition to the homogeneity measure, regions can be characterized by color, shape, size, and so on; these measurements can be incorporated into the region growing algorithm or used in a subsequent processing step. For example, Campbell et al. (1997) generate an initial segmentation using a simple K-means approach, and then use texture, color, and shape to refine and classify the regions. Although the classification step requires extensive training data, this approach achieves impressive results, correctly classifying over 90% of the pixels in a set of outdoor urban test images into 11 object classes.

Edge finding methods identify edges in a scene; after linking or extending these edges to form closed regions, edges can be removed until the desired number of segments is attained. A wide variety of edge finding and edge enhancing methods are available, with a corresponding range of computational complexity. On the simpler end of the spectrum, high-pass filtering methods can be implemented as convolution operations with simple kernels (for an introductory review of simple convolution methods, see Burdick, 1997); these approaches are similar to mathematical morphology. Usually, more complicated convolution approaches are used, such as Canny's edge detection (Canny, 1986). Most convolution methods are very


fast because they can be implemented in the Fourier domain. More complicated models usually include some sort of distributional assumptions about noise in the image; for example, Bovik and Munson (1986) show that when both Gaussian and impulse noise are present, an edge detector based on local median values is more robust than a similar algorithm using mean values. A similar distributional assumption is made by Kundu (1990), who develops a multi-stage approach to deal with the different noise types.

Pixel classification methods attempt to classify individual pixels using either the pixel value or the pixel value and the values of adjacent or nearby pixels (also known as a neighborhood). A simple pixel classification method is given in chapter 3; this assumes that the pixel values follow a Gaussian mixture distribution and ignores spatial information. The pixels are then classified into the component of the mixture from which they are most likely to have arisen. Methods which make use of neighborhood information include Markov random field models; an early example of segmentation with Markov random field models is given by Hansen and Elliott (1982), who achieve good results, especially considering the limited computing power available at the time.
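To make the marginal pixel-classification idea concrete, here is a small Python sketch that fits a Gaussian mixture to the pixel values of a greyscale image and labels each pixel with its most probable component. It is only an illustration of the general approach described above, not the dissertation's own software (which is the Splus and C code discussed in Appendix A); the toy image, the choice of three components, and all names are assumptions of the example.

```python
# Marginal (non-spatial) pixel classification: fit a K-component Gaussian
# mixture to the pixel values, then label each pixel with the component
# under which its value is most probable.
import numpy as np
from sklearn.mixture import GaussianMixture

def marginal_segmentation(img, n_segments=3, seed=0):
    """Classify pixels of a greyscale image by a Gaussian mixture fit to
    the marginal pixel values; spatial information is ignored."""
    values = img.reshape(-1, 1).astype(float)
    gmm = GaussianMixture(n_components=n_segments, random_state=seed)
    labels = gmm.fit_predict(values)            # most probable component per pixel
    return labels.reshape(img.shape), gmm

if __name__ == "__main__":
    # A toy 60x60 image with three noisy intensity levels (an assumption).
    rng = np.random.default_rng(0)
    img = rng.normal(loc=np.repeat([50, 120, 200], 1200), scale=15).reshape(60, 60)
    seg, model = marginal_segmentation(img, n_segments=3)
    print("estimated component means:", np.sort(model.means_.ravel()))
```

Chapter 3 develops this marginal mixture model formally, and chapter 5 adds spatial information through a Markov random field refinement.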


1.3 Other Methods for Image Segmentation with Choice of the Number of Segments

In this dissertation I present an automatic and unsupervised method for image segmentation including the choice of K, based on mixture models and the Bayesian Information Criterion (BIC). Other methods which address the problem of choosing K can be divided into two categories: ad hoc procedures, and procedures which require tuning parameters. Although the ad hoc procedures may give good results for some applications, it is doubtful that they would be applicable in general. Similarly, methods which use an arbitrarily chosen tuning parameter may not be generally applicable, though often one can set the tuning parameter to a value which is reasonable for a large class of images. Also, it is usually easier to choose a tuning parameter than to choose K directly; the tuning parameter may let K vary rather than forcing a single value of K. Below are a few recent papers which address the automatic choice of K.

Dingle and Morrison (1996) use local empirical density functions to characterize each segment. They begin with K=1, and then create "outlier" regions by choosing an arbitrary threshold on total variation to determine when two density functions are different. Outlier regions become segments (thereby increasing K) when their size is larger than another arbitrarily chosen threshold.

Chen and Kundu (1993) use a hidden Markov model (HMM) approach to texture segmentation. They define a distance between HMMs called the discrimination information (DI). A split-and-merge procedure is used in which an HMM is fit to each region, and regions are merged if their corresponding HMMs have a DI below a certain threshold. This threshold is chosen by a convoluted ad hoc procedure which depends on 3 arbitrarily chosen parameters.

Johnson (1994) defines a Gibbs distribution on region identifiers in order to allow inference. An arbitrary parameter in the potential function for the Gibbs distribution is used to penalize results with many segments.

Given some choice for K, there are several estimation methods which can be used to fit a mixture model to the data. The EM algorithm (Dempster et al., 1977) can be used to estimate parameters in a Gaussian or Poisson model (Hathaway, 1986). Many variations of this approach have been developed: CEM (Celeux and Govaert, 1992), SEM (Masson and Pieczinsky, 1993), and NEM (Ambroise et al., 1996). CEM is an adaptation of EM for hard classification; the other two methods take some account of spatial information. A similar but nonparametric segmentation method was developed by Letts (1978) and extended to the case of


multidimensional observations at each pixel.

1.4 Overview

I try to make only minimal assumptions about the imaging method, but some decisions are needed. For instance, raw data from PET images are modeled much better by Poisson mixtures than by Gaussian mixtures, and vice versa for aerial photography. In general, different features may be distinguished by color or texture. I do not consider texture differences here. The examples and discussion will focus on greyscale images, though these methods can be extended to color or multispectral images.

Chapter 2 presents a method for automatic detection of curves in spatial point processes; this method uses principal curves to model features and the BIC to choose the amount of smoothing. Chapter 3 discusses marginal segmentation methods, which I use to find an initial segmentation of the image. Chapter 4 presents two models for autoregressive dependence, the AR(1) model and the raster scan autoregression (RSA) model, and discusses how BIC can be adjusted to accommodate these models. In chapter 5, I discuss a Markov random field model for images, and I describe an algorithm for automatic, unsupervised image segmentation. Examples follow at the end of the chapter. Conclusions and further discussion are given in chapter 6. A discussion of the software developed with this dissertation is given in Appendix A.


Chapter 2

CLUSTERING ON OPEN PRINCIPAL CURVES

Clustering about principal curves combines parametric modeling of noise with nonparametric modeling of feature shape. This is useful for detecting curvilinear features in spatial point patterns, with or without background noise. Applications include the detection of curvilinear minefields from reconnaissance images, some of the points in which represent false detections, and the detection of seismic faults from earthquake catalogs.

Our algorithm for principal curve clustering is in two steps: the first is hierarchical and agglomerative (HPCC), and the second consists of iterative relocation based on the Classification EM algorithm (CEM-PCC). HPCC is used to combine potential feature clusters, while CEM-PCC refines the results and deals with background noise. It is important to have a good starting point for the algorithm: this can be found manually or automatically using, for example, nearest neighbor clutter removal or model-based clustering. We choose the number of features and the amount of smoothing simultaneously using approximate Bayes factors.

2.1 Introduction

We wish to detect curvilinear features in spatial point processes automatically. We must deal with two kinds of noise: background noise, in the form of observed points which are not part of the features, and feature noise, which is the deviation of observed feature points from an underlying "true" feature curve. One such problem is the detection of curvilinear minefields in aerial reconnaissance images.


Figure 2.1(a) is a simulation of such an image, and Figure 2.1(b) shows the features detected by our method.

Figure 2.1: (a) Simulated minefield with noise. (b) Final result.

In Section 2.2, we give some background on principal curves and introduce our probability model and clustering algorithm. Section 2.2.3 presents our method for clustering on open principal curves, and Section 2.2.4 describes our use of approximate Bayes factors to choose the number of features and the amount of smoothing simultaneously and automatically. Initialization methods, including our hierarchical principal curve clustering (HPCC) algorithm, are discussed in Section 2.3. Examples are presented in Section 2.4, and in Section 2.5 we discuss other approaches and areas of further work.


2.2 Model, Estimation and Inference

2.2.1 Principal Curves

A principal curve is a smooth, curvilinear summary of n-dimensional data; it is a nonlinear generalization of the first principal component line. Principal curves were introduced by Hastie and Stuetzle (1989) and discussed in the clustering context by Banfield and Raftery (1992). The curve f is a principal curve of h if

E(X | λ_f(X) = λ) = f(λ)    (2.1)

for almost all λ, where X is a random vector with density h in R^n and λ_f is the function which projects points in R^n orthogonally onto f. When this holds, f is also said to be self-consistent for h. A principal curve f is parametrized by λ, the arc length along the curve.

The algorithm for fitting a principal curve from data involves iteratively applying the definition (2.1), where the conditional expectation is replaced by a scatterplot smoother. The choice of the smoothing parameter is discussed in Section 2.2.4. Each data point, x_j, has an associated projection point f(λ_j) on the curve, which is the point on the curve closest to x_j (see Figure 2.2). The line segment from x_j to f(λ_j) is orthogonal to the curve at f(λ_j), unless f(λ_j) is an endpoint of the curve. Bias correction for closed principal curves (Banfield and Raftery, 1992) can also be extended to the open principal curves that we use here.
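To make the projection index concrete, the following Python sketch projects data points onto a fitted curve represented as an ordered polyline, returning each point's projection point f(λ), its arc length λ, and its orthogonal distance to the curve (points lying beyond an end of the open curve are projected to that endpoint). This is an illustration only, not the Splus principal.curve routine used in the dissertation; the polyline representation and all names are assumptions of the sketch.

```python
# Projection of data points onto a curve discretized as an ordered polyline.
import numpy as np

def project_onto_curve(points, curve):
    """points: (N, d) array; curve: (M, d) ordered vertices of the fitted curve.
    Returns (projection points, arc lengths lambda, orthogonal distances)."""
    seg_start, seg_end = curve[:-1], curve[1:]
    seg_vec = seg_end - seg_start
    seg_len2 = np.maximum((seg_vec ** 2).sum(axis=1), 1e-12)
    seg_len = np.sqrt(seg_len2)
    arc0 = np.concatenate([[0.0], np.cumsum(seg_len)])[:-1]   # arc length at each segment start

    proj_pts, lambdas, dists = [], [], []
    for x in points:
        t = ((x - seg_start) * seg_vec).sum(axis=1) / seg_len2
        t = np.clip(t, 0.0, 1.0)                               # clamp: open curve has endpoints
        cand = seg_start + t[:, None] * seg_vec                # candidate projection per segment
        d2 = ((x - cand) ** 2).sum(axis=1)
        k = int(np.argmin(d2))                                 # nearest segment wins
        proj_pts.append(cand[k])
        lambdas.append(arc0[k] + t[k] * seg_len[k])
        dists.append(np.sqrt(d2[k]))
    return np.array(proj_pts), np.array(lambdas), np.array(dists)
```

The orthogonal distances returned here are exactly the "distance about the curve" used in the probability model below.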


Figure 2.2: Principal curve example.

2.2.2 Probability Model

We model a noisy spatial point process by making distributional assumptions about the background noise and the feature noise. Suppose that X is a set of observations x_1, . . . , x_N, and C is a partition consisting of clusters C_0, C_1, . . . , C_K, where cluster C_j contains N_j points. The noise cluster is denoted by C_0; we assume that the background noise is uniformly distributed over the region of the image (this is equivalent to Poisson background noise). We assume that the feature points are distributed uniformly along the true underlying feature; that is, their projections onto the feature's principal curve are drawn randomly from a uniform distribution U(0, ν_j), where ν_j is the length of the j-th curve. We assume that the feature points are distributed normally about the true underlying feature, with mean zero and variance σ_j^2. Distance about the curve is the orthogonal distance from a point to the curve; if the point projects to an endpoint of the curve, it is simply the distance from the point to the curve endpoint. The (K + 1) clusters are combined in a mixture model, and we denote the unconditional probability of belonging to the j-th feature by π_j (j = 0, 1, . . . , K).

Let θ denote the entire set of parameters, {ν_j, σ_j^2, π_j : j = 1, . . . , K}, not including the curves themselves. Then the likelihood is L(X|θ) = ∏_{i=1}^N L(x_i|θ),


where L(x_i|θ) = ∑_{j=0}^K π_j L(x_i|θ, x_i ∈ C_j). For feature clusters,

L(x_i|θ, x_i ∈ C_j) = (1/ν_j) (1/(√(2π) σ_j)) exp( −||x_i − f(λ_ij)||^2 / (2σ_j^2) ),

where ||x_i − f(λ_ij)|| is the Euclidean distance from the point x_i to its projection point f(λ_ij) on curve j. For the noise cluster,

L(x_i|θ, x_i ∈ C_0) = 1/Area,

where Area is the area of the image.

2.2.3 Estimation: The CEM-PCC Algorithm

The CEM-PCC algorithm refines a given clustering by using the Classification EM algorithm (Celeux and Govaert, 1992), which is a version of the well known EM algorithm (Dempster et al., 1977), and the probability model of Section 2.2.2. We start with an initial clustering from the methods discussed in Section 2.3.

Overview of the CEM-PCC algorithm (a code sketch of this loop follows the list):

1. Begin with an initial clustering (features and noise).
2. (M-step) Conditional on the current clustering, fit a principal curve to each feature cluster and then compute estimates of the parameters (ν_j, σ_j^2, and π_j).
3. (E-step) Conditional on the current curves and parameter estimates, calculate the likelihood of each point being in each cluster.
4. (Classification step) Reclassify each point into its most likely cluster.
5. Check for convergence; end or return to step 2.
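The loop can be sketched in Python as follows. This is an illustration of one CEM-PCC iteration under the probability model of Section 2.2.2, not the dissertation's Splus implementation: fit_curve (a principal curve or spline fit) and project (for instance, the projection sketch at the end of Section 2.2.1) are assumed helpers, area is the image area, min_var is the lower bound on the variance discussed below, and each feature cluster is assumed non-empty.

```python
# One M/E/classification pass of CEM-PCC; label 0 is background noise,
# labels 1..K are feature clusters.
import numpy as np

def cem_pcc_iteration(points, labels, area, fit_curve, project, K, min_var=1e-6):
    N = len(points)
    curves, nu, sigma2, pi = [], [], [], []
    for j in range(1, K + 1):                          # M-step
        cluster = points[labels == j]
        curve = fit_curve(cluster)                     # assumed helper: ordered polyline
        _, lam, dist = project(cluster, curve)
        curves.append(curve)
        nu.append(max(lam.max() - lam.min(), 1e-12))   # rough estimate of curve length nu_j
        sigma2.append(max(float(np.mean(dist ** 2)), min_var))  # variance about the curve, bounded below
        pi.append(len(cluster) / N)
    pi0 = 1.0 - sum(pi)                                # mixing proportion of the noise cluster

    lik = np.empty((N, K + 1))                         # E-step: weighted likelihoods
    lik[:, 0] = pi0 / area                             # uniform background noise over the image
    for j in range(1, K + 1):
        _, _, dist = project(points, curves[j - 1])
        dens = (np.exp(-dist ** 2 / (2.0 * sigma2[j - 1]))
                / (nu[j - 1] * np.sqrt(2.0 * np.pi * sigma2[j - 1])))
        lik[:, j] = pi[j - 1] * dens

    new_labels = lik.argmax(axis=1)                    # classification step
    loglik = float(np.log(lik.sum(axis=1)).sum())      # overall likelihood L(X|theta)
    return new_labels, curves, loglik
```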


Once we have calculated the probability of each point being in each cluster based on the estimates of the parameters at the current iteration, we reclassify each point into the cluster for which it has the highest likelihood. At the end of each iteration, we compute the overall likelihood L(X|θ). This process is executed for a predetermined number of iterations, at which point we choose as the final result the clustering with the highest overall likelihood (CEM iterations sometimes decrease the likelihood).

We have found that it is useful to impose a lower bound on the estimate of the variance about the curve. If the variance is allowed to decrease without bound, the likelihood can grow without bound. This can be a problem when there are small clusters, since the smoothing is almost able to interpolate the data points. We impose a bound based on the assumption that the data are not known with absolute precision. For instance, if we assume the data are precise to three significant digits, then we can find a lower bound on the resolution of the data and translate this to a lower bound on the variance.

2.2.4 Inference: Choosing the Number of Features and Their Smoothness Simultaneously

Since the number of clusters affects the overall amount of smoothing, we select the smoothing parameter and the number of clusters simultaneously. The amount of smoothing in each feature cluster is measured by the degrees of freedom (DF) used in fitting the principal curve to that cluster. We use a cubic B-spline (Wold, 1974) in fitting the principal curves; specifically, we use the function principal.curve (obtained from Statlib) which calls the Splus function smooth.spline. The DF of a cubic B-spline is given by the trace of the implicit smoother matrix S; S is the matrix which yields the fitted values at the observed data points (Tibshirani and Hastie, 1987; Hastie and Tibshirani, 1990).


Each combination of number of features and degrees of freedom (i.e., smoothness of a feature) considered is viewed as specifying a possible model for the data, and the competing models are compared using Bayes factors (Kass and Raftery, 1995). We approximate the Bayes factor using the Bayesian Information Criterion (BIC; Schwarz, 1978); the difference between the BIC values for two models is approximately equal to twice the log Bayes factor when unit information priors for the model parameters are used (Kass and Wasserman, 1995). These are priors that contain about the same amount of information as a single typical observation. This approach has been found to work well for mixture models (Roeder and Wasserman, 1997).

The BIC for a model with K features and background noise is defined by

BIC = 2 log(L(X|θ)) − M · log(N),

where M = K(DF + 2) + K + 1 is the number of parameters. The number of feature clusters is K; for each feature cluster we estimate ν_j and σ_j, and we fit a curve using DF degrees of freedom. The mixing proportions add K parameters, and the estimate of the image area used in the noise density is one more parameter. The larger the BIC, the more the model is favored by the data. Conventionally, differences of 2–6 between BIC values for models represent positive evidence, differences of 6–10 correspond to strong evidence, while differences greater than 10 indicate very strong evidence (Kass and Raftery, 1995).
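In code, the criterion and the joint choice of K and DF look like this. The Python fragment below is only an illustration, not the dissertation's software; fit_model, standing in for a full HPCC/CEM-PCC fit at a given number of features and DF, is an assumed helper, and all names are hypothetical.

```python
# BIC for a principal-curve clustering with K features and background noise:
#   BIC = 2*loglik - M*log(N),  with  M = K*(DF + 2) + K + 1.
import numpy as np

def pcc_bic(loglik, n_points, n_features, df):
    n_params = n_features * (df + 2) + n_features + 1
    return 2.0 * loglik - n_params * np.log(n_points)

def select_model(points, area, fit_model, feature_range, df_range):
    """Scan a grid of (K, DF) pairs and keep the fit with the largest BIC.
    fit_model(points, area, K, df) -> (labels, loglik) is an assumed helper."""
    best = None
    for K in feature_range:
        for df in df_range:
            labels, loglik = fit_model(points, area, K, df)
            bic = pcc_bic(loglik, len(points), K, df)
            if best is None or bic > best[0]:
                best = (bic, K, df, labels)
    return best  # (BIC, number of features, DF, classification)
```

Tables 2.1, 2.2, and 2.3 later in this chapter report exactly such grids of BIC values.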


2.3 Initialization

2.3.1 Denoising and Initial Clustering

The performance of the CEM-PCC algorithm can be sensitive to the starting value, so it is important to have a good starting value. The initial clustering used to obtain this should accomplish two objectives: separate the feature points from background noise, and provide an initial clustering of the feature points. The first of these can be done by a human, or by various automatic methods such as nonparametric maximum likelihood using the Voronoi tessellation (Allard and Fraley, 1997), or Kth nearest neighbor clutter removal (Byers and Raftery, 1998). This step does not need to be perfect, since CEM-PCC will examine the noise points to determine if they should be included in the features, and vice versa.

Once the noise points have been removed, we need an initial clustering of the feature points so that a curve can be fit to each cluster. We recommend that there be at least seven points in each cluster; when there are fewer than seven, we fit a principal component line instead of a curve. The feature points can be clustered using model-based clustering as implemented in the MCLUST software (Banfield and Raftery, 1993; Dasgupta and Raftery, 1998; Fraley and Raftery, 1998). A simpler method is to fit a minimum spanning tree to the feature points and cut the longest edges, which will work well if the main clusters are well separated (Roeder and Wasserman, 1997; Zahn, 1971).

2.3.2 Hierarchical Principal Curve Clustering (HPCC)

Clustering on closed principal curves was introduced by Banfield and Raftery (1992). Their clustering criterion (V*) is based on a weighted sum of the squared distances about the curve and the squared distances along the curve, and they state that it is optimal when the data points are normally distributed about the curve (conditional on the estimated curves and assuming that α is chosen properly). It is defined by

V* = V_About + α V_Along,

where V_About = ∑_{j=1}^N (x_j − f(λ_j))^2, V_Along = (1/2) ∑_{j=1}^N (ε_j − ε̄)^2, and ε_j = f(λ_j) − f(λ_{j+1}).


The V_About term measures the spread of observations about the curve (in orthogonal distance to the curve), while the V_Along term measures the variance in arc length distances between projection points on the curve. Minimizing ∑ V* (where the sum is over all clusters) will lead to clusters with points regularly spaced along the curve and tightly grouped around it. Large values of α will cause the algorithm to avoid clusters with gaps, while small values will favor thinner clusters. Clustering stops when merging clusters would lead to an increase in ∑ V*.

We extend the method to open principal curves by changing V_Along so that the sum goes only to (N − 1) instead of to N. This is because the closed curves could wrap around, whereas the open curve stops at its end points.

Overview of HPCC:

1. Make a first estimate of the noise points and remove them.
2. Form an initial clustering with at least seven points in each cluster.
3. Fit a principal curve to each cluster.
4. Calculate ∑ V* for each possible merge.
5. Perform the merge which leads to the lowest ∑ V*.
6. Keep merging until the desired number of clusters is reached.

Deciding when to stop clustering is more difficult for open curves than for closed curves. In the closed curve case, clustering stops when any merge would lead to an increase in ∑ V* (Banfield and Raftery, 1993). For open curves, this method leads to an overfitting problem in which we end up with too many clusters. V* can be made arbitrarily close to zero by increasing the number of clusters. We overcame this problem by using approximate Bayes factors (Section 2.2.4).
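As an illustration of the criterion, the fragment below evaluates V_About, V_Along (open-curve form, with N − 1 spacing terms), and V* for a single cluster, given projections such as those produced by the sketch at the end of Section 2.2.1. It interprets ε_j as the spacing between successive projection points along the curve; that reading, the default α, and all names are assumptions of this sketch rather than part of the dissertation's Splus code.

```python
# V* = V_About + alpha * V_Along for one cluster, open-curve version.
import numpy as np

def v_star(points, proj_points, lambdas, alpha=1.0):
    """points, proj_points: (N, d) arrays; lambdas: arc lengths of the projections."""
    v_about = float(((points - proj_points) ** 2).sum())      # squared distances about the curve
    order = np.argsort(lambdas)                               # walk the projections along the curve
    steps = np.diff(proj_points[order], axis=0)               # eps_j, j = 1, ..., N-1
    gaps = np.linalg.norm(steps, axis=1)                      # spacing between successive projections
    v_along = 0.5 * float(((gaps - gaps.mean()) ** 2).sum())  # variability of the spacings
    return v_about, v_along, v_about + alpha * v_along
```

In HPCC, this quantity is summed over all clusters, and the merge yielding the smallest total ∑ V* is performed at each step.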


2.4 Examples

2.4.1 A Simulated Two-Part Curvilinear Minefield

The simulated minefield shown in Figure 2.1(a) contains 100 points in each of the two curves and 200 points of background noise (400 points total). This simulation was created using offset semicircles as the true underlying features, and background noise was generated uniformly over the image area. Note that some of the background noise points will fall inside the regions of feature points; these noise points will be indistinguishable from feature points.

The first step is to separate the features from the noise, which we did using 9th nearest neighbor denoising (Byers and Raftery, 1998); the resulting feature points are shown in Figure 2.3. We then used MCLUST (Banfield and Raftery, 1993) to provide an initial clustering into 9 clusters; this is shown in Figure 2.4. We used 9 clusters for the initial clustering because this is the largest number of clusters for which MCLUST returns a clustering in which each cluster has at least 7 points. HPCC was applied to obtain 2 clusters, shown in Figure 2.5. The noise points were then returned to the image with the HPCC clustering, and CEM-PCC was used to refine the clustering. The final result is shown in Figure 2.1(b).

Table 2.1 shows the BIC values for 1 to 3 features with a variety of DF values. The BIC is maximized for 2 features with 5 DF. The approximate Bayes factors identified the correct number of features quite decisively in this example.
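The denoising step can be illustrated as follows. Byers and Raftery (1998) classify points by fitting a two-component mixture model to each point's distance to its Kth nearest neighbor; for brevity, the Python sketch below uses the same Kth nearest neighbor distances but separates the two groups with a simple two-means split instead of the full EM fit, so it is only a rough stand-in for the published method, and the function names are illustrative.

```python
# Simplified Kth-nearest-neighbor denoising: feature points lie in dense
# regions, so their Kth-NN distances are small; clutter points have large ones.
import numpy as np

def knn_distances(points, k):
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)
    return d[:, k]                               # column 0 is the distance to the point itself

def denoise(points, k=9):
    dk = knn_distances(points, k)
    lo, hi = dk.min(), dk.max()
    for _ in range(100):                         # two-means split of the 1-D distances
        near_lo = np.abs(dk - lo) <= np.abs(dk - hi)
        if near_lo.all() or not near_lo.any():
            break
        new_lo, new_hi = dk[near_lo].mean(), dk[~near_lo].mean()
        if np.isclose(new_lo, lo) and np.isclose(new_hi, hi):
            break
        lo, hi = new_lo, new_hi
    near_lo = np.abs(dk - lo) <= np.abs(dk - hi)
    return points[near_lo], points[~near_lo]     # (feature points, clutter)
```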


Figure 2.3: Two part curvilinear minefield after denoising using nearest neighbor cleaning.

Figure 2.4: Initial clustering of two part curvilinear minefield using MCLUST. There are nine clusters.


Figure 2.5: HPCC applied to the two-part curvilinear minefield.


Table 2.1: BIC results for the two part curvilinear minefield.

DF   0 Features   1 Feature   2 Features   3 Features
 2      -1984       -1880       -1846        -1745
 3      -1984       -1850       -1748        -1721
 4      -1984       -1861       -1648        -1648
 5      -1984       -1845       -1628        -1658
 6      -1984       -1803       -1632        -1670
 7      -1984       -1761       -1641        -1685
 8      -1984       -1726       -1643        -1693
 9      -1984       -1702       -1648        -1703
10      -1984       -1692       -1660        -1718
11      -1984       -1689       -1671        -1735
12      -1984       -1689       -1688        -1749
13      -1984       -1680       -1703        -1770
14      -1984       -1721       -1718        -1792
15      -1984       -1748       -1727        -1810
16      -1984       -1777       -1733        -1807
17      -1984       -1755       -1739        -1827
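Programmatically, choosing the model from such a table is a single argmax over the (DF, number of features) grid. The BIC values in the small Python check below are copied from Table 2.1; everything else is illustrative.

```python
import numpy as np

dfs = np.arange(2, 18)
# Columns: 0, 1, 2, 3 features; rows follow dfs.  Values copied from Table 2.1.
bic = np.array([
    [-1984, -1880, -1846, -1745], [-1984, -1850, -1748, -1721],
    [-1984, -1861, -1648, -1648], [-1984, -1845, -1628, -1658],
    [-1984, -1803, -1632, -1670], [-1984, -1761, -1641, -1685],
    [-1984, -1726, -1643, -1693], [-1984, -1702, -1648, -1703],
    [-1984, -1692, -1660, -1718], [-1984, -1689, -1671, -1735],
    [-1984, -1689, -1688, -1749], [-1984, -1680, -1703, -1770],
    [-1984, -1721, -1718, -1792], [-1984, -1748, -1727, -1810],
    [-1984, -1777, -1733, -1807], [-1984, -1755, -1739, -1827],
])
row, col = np.unravel_index(bic.argmax(), bic.shape)
print(f"best model: {col} features with DF = {dfs[row]}")   # 2 features, DF = 5
```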


2.4.2 A Simulated Curvilinear Minefield

Table 2.2: BIC results for simulated sine wave minefield.

DF   0 Features   1 Feature   2 Features   3 Features
 2      -1031       -1034       -1018         -991
 3      -1031       -1016        -938         -941
 4      -1031        -975        -889         -908
 5      -1031        -911        -883         -904
 6      -1031        -884        -887         -901
 7      -1031        -873        -894         -901
 8      -1031        -868        -901         -907
 9      -1031        -869        -908         -921
10      -1031        -872        -913         -929
11      -1031        -873        -919         -940
12      -1031        -877        -926         -951

Figure 2.7: Simulated curvilinear minefield after denoising using nearest neighbor cleaning.


Figure 2.8: Initial clustering of denoised curvilinear minefield using MCLUST. There are seven clusters.

Figure 2.9: CEM-PCC applied to the curvilinear minefield.


2.4.3 New Madrid Seismic Region

Data on 219 earthquakes in the New Madrid seismic region were obtained from the Center for Earthquake Research and Information (CERI) World Wide Web site http://samwise.ceri.memphis.edu. We have included all earthquakes in the New Madrid catalog from 1974 to 1992 with a magnitude of 2.5 and above. This time period was chosen because the data collection methods were consistent; data prior to this period are available but become sparser and less reliable as one goes back in time. The New Madrid region extends from Illinois to Arkansas: latitude 35 to 38 and longitude -91 to -88. These data are displayed in Figure 2.10, and the BIC results are shown in Table 2.3. Figures 2.11 to 2.14 show each step of our process; the final result (Figure 2.14) corresponds to the parameters which yield the maximum BIC value (3 features, each with 10 degrees of freedom).

This example illustrates some strengths and limitations of our method. We can see in Figure 2.11 that the most striking features in this dataset are a combination of lines and blobs. While our method does a good job of picking out the curvilinear features, blobs are not very well modeled by curves, causing the rather awkward-looking result for the rightmost feature.


Figure 2.10: New Madrid earthquakes 1974-1992.

Figure 2.11: New Madrid data after denoising.


Table 2.3: BIC results for New Madrid seismic data.

DF   0 Features   1 Feature   2 Features   3 Features   4 Features
 2       -861        -548        -447         -334         -349
 3       -861        -544        -403         -336         -361
 4       -861        -524        -380         -335         -365
 5       -861        -492        -367         -328         -360
 6       -861        -458        -363         -330         -363
 7       -861        -426        -362         -337         -367
 8       -861        -401        -362         -337         -362
 9       -861        -386        -363         -332         -347
10       -861        -378        -364         -306         -365
11       -861        -375        -372         -320         -383
12       -861        -372        -377         -333         -400
13       -861        -369        -381         -345         -417
14       -861        -367        -387         -355         -431
15       -861        -364        -391         -366         -446
16       -861        -363        -404         -375         -458
17       -861        -360        -408         -383         -468
18       -861        -361        -410         -393         -482
19       -861        -362        -414         -404         -495
20       -861        -359        -420         -410         -507
21       -861        -361        -418         -416         -518
22       -861        -363        -424         -426         -532
23       -861        -365        -424         -430         -541
24       -861        -371        -426         -439         -554


Figure 2.12: Initial clustering of denoised New Madrid earthquake data using MCLUST. There are four clusters.

Figure 2.13: HPCC applied to the New Madrid data.


Figure 2.14: CEM-PCC applied to the New Madrid data.


2.5 Discussion

We have introduced a probability model for noisy spatial point process data with curvilinear features. We use the CEM algorithm to estimate it and classify the points, and we use approximate Bayes factors to find the number of features and the optimal amount of smoothing, simultaneously and automatically. The hierarchical principal curve clustering method of Banfield and Raftery (1992) is extended to open principal curves (HPCC), and we describe an iterative relocation method (CEM-PCC) for refining a principal curve clustering based on the Classification EM algorithm (Celeux and Govaert, 1992). In combination with the denoising method of Byers and Raftery (1998) and an initial clustering method such as MCLUST (Banfield and Raftery, 1993), we have an approach which takes noisy spatial point process data and automatically extracts curvilinear features. The method appears to work well in simulated and real examples.

One way that this kind of data may arise is from image processing. There are many methods for edge detection in images, but most of these methods yield noisy results. These edge detector results can be viewed as a point process and analyzed with principal curve clustering; this would enhance edge and boundary detection in images by reducing noise and looking for larger scale structures. Note that this requires an important additional step to convert the edge detector results into a binary image which can then be regarded as a point process. This is illustrated with ice floe images in Banfield and Raftery (1992).

The Hough transform is a well known and widely used method for fitting a parametric curve to a point set (Illingworth and Kittler, 1988; Hough, 1962). One limitation of the Hough transform is that it fits only parametric curves, so that the form of the curve must be specified in advance. We wish to address the situation in which little is known about the true shape of the features in a point


set, so we want to avoid assumptions about the parametric form of the features. In this paper, we use open principal curves to model underlying curves in the data. Principal curves provide a data-driven, nonparametric summary of feature shape; they are characterized by the number of degrees of freedom allowed. For example, a principal curve with 15 degrees of freedom could look like a line, an arc, a sinusoid, a spiral, or some combination of these. In order to give a parametric curve as much flexibility as a principal curve, many parameters would be needed, which would greatly increase the already large computational requirements of the Hough transform.

Preconditioning on the parameter domain is used in (Hansen and Toft, 1996) to improve the speed of the Radon transform, a generalization of the Hough transform. Although this approach can greatly reduce the size of the parameter space, it has two drawbacks. First, several sensitivity parameters must be specified by the user. This means that the use of this method needs to be interactive, with the user trying various parameter values until a good result is obtained. Second, only parametric curves are allowed. Our principal curve clustering method uses BIC to automatically choose the number of features and the amount of smoothing; the curve shape is estimated adaptively and nonparametrically, so there is no need to search over a large parameter space.

A curve detection method which makes no parametric assumptions about the curve shape is given in (Steger, 1998). Candidate points for the curves underlying the features are detected using local differential operators. These points are then linked into curves; a user-specified threshold is used to determine which candidates are used. The curves are subsequently modeled as two edges with an interior region. This allows determination of curve width, and a bias reduction step is used to improve the result. Because the curves detected by this method are nonparametric, they are much more general than curves which can be fitted using a Hough


transform type of approach. The examples in (Steger, 1998) show that the curves fit the data quite well, but it is still up to the user to interactively choose the sensitivity parameter. The method does not include a formal way to choose the number of features.

Spatial point process data arise in visual defect metrology, and the Hough transform has been used previously to detect linear features in these data (Cunningham and MacKinnon, 1998). In this application of the Hough transform, several parameters and thresholds must be specified in advance by the user. Although users experienced with this technique may well be able to find reasonable values for all of the needed parameters, it seems more satisfactory to estimate parameters from the data. Principal curve clustering allows automatic detection of both linear and nonlinear features without the need for ad hoc parameter specification.

The Kohonen self-organizing feature map (SOFM) is another data-driven approach to feature detection (Kohonen, 1982; Ambroise and Govaert, 1996; Murtagh, 1995). Neither principal curve clustering nor the SOFM approach requires prior specification of feature shape, and both algorithms are hierarchical in nature. Like principal curves, the SOFM can be combined with further clustering methods to produce a more powerful clustering algorithm (Murtagh, 1995). However, unlike our method, the SOFM approach does not provide an explicit estimate of feature shape.

Tibshirani (1992) proposes an alternate definition of principal curves based on mixture models and a new algorithm for fitting principal curves based on the EM algorithm. It is argued that this definition avoids the bias problems inherent in the approach of Hastie and Stuetzle (1989), and an example is presented showing that these principal curves can be different in practice from curves of the Hastie and Stuetzle type. It would be of interest to see what effect this alternate definition would have on our results.


The examples we have presented in this paper consist of two-dimensional point patterns, but our methods can be generalized to higher dimensions. Principal curves could be used in higher dimensions, or our approach could be modified to use a different model as the basis for features, such as principal surfaces (Hastie and Stuetzle, 1989) or adaptive principal surfaces (LeBlanc and Tibshirani, 1994).

Many variations of the EM algorithm are available. Green (1990) introduced the One Step Late (OSL) algorithm, a version of EM for use with penalized likelihoods. Silverman, Jones, Wilson and Nychka (1990) added a smoothing step to the EM algorithm, and similarities between this approach and maximum penalized likelihood were discussed by Nychka (1990). Lu (1995) replaced the M-step with a smoothing step. Theoretical properties of smoothing in the EM algorithm were discussed by Latham and Anderssen (1994) and Latham (1995). Our use of principal curves in CEM-PCC is similar to these ideas of smoothing. Although the curves themselves are not smoothed across CEM iterations, the curves can be viewed as smoothing the pointwise likelihoods, and thus indirectly smoothing the parameter estimates.

In addition to approximate Bayes factors, we explored cross-validation as a method for choosing the number of clusters and amount of smoothing. This involves iteratively leaving out one data point, recomputing the entire clustering, and then calculating the likelihood for the left-out point. We found that the results are quite similar to the Bayes factor results, but that the cross-validation approach involves much more computation.

Our model assumes that the principal curve underlying each feature has the same smoothness. This may not always be realistic, and it would be possible and worthwhile to relax this assumption. Furthermore, as seems to be the case in the New Madrid data set, one or more of the features might be circular (or hyperspherical in higher dimensions), concentrated about a point rather than a curve. Extending the method to accommodate this possibility explicitly would also be worthwhile.

Splus source files for HPCC and CEMPCC are available at http://www.stat.washington.edu/stanford/princlust.html. Statlib has Splus functions available for fitting principal curves, Kth nearest neighbor denoising, and model-based clustering (http://lib.stat.cmu.edu/S/principal.curve, http://lib.stat.cmu.edu/S/NNclean, and http://lib.stat.cmu.edu/S/emclust).


Chapter 3

MARGINAL SEGMENTATION

In this chapter I present a discussion of image segmentation based on the marginal (without spatial information) pixel values. I begin by presenting some background on the Bayesian Information Criterion (BIC), and then I introduce the mixture model. I discuss the use of BIC for model selection in this context, and I examine two schemes for classification of pixels after model selection has been done.

3.1 BIC

The Bayesian Information Criterion (BIC) was first given by Schwarz (1978) in the context of model selection with IID observations from a certain class of densities, and Haughton (1988) extended the class of densities to curved exponential families. The difference in the BIC value between two models is an approximation of twice the log of the Bayes factor comparing the two models; the BIC has the advantage of being relatively easy to compute. The basic idea of the BIC is to use Laplace's method to approximate the integrated likelihood in the Bayes factor, and then ignore terms which do not increase quickly with N. In this section I present a derivation of the BIC, which largely follows the discussions found in Kass and Raftery (1995) and Raftery (1995). The following sections show how the BIC can be adjusted for use with the AR(1) model and for the raster scan autoregression (RSA) model.

A common approach to comparing two models or hypotheses, say M_2 vs. M_1, is to use a likelihood ratio test (LRT). The test statistic is

\[ 2\log\!\left(\frac{p_{MLE}(X \mid M_2)}{p_{MLE}(X \mid M_1)}\right) \sim \chi^2_{D_2 - D_1} \tag{3.1} \]

where p_{MLE}(X|M) is the maximized likelihood of the data X given the model M, and (D_2 − D_1) is the difference in degrees of freedom between the two models. The LRT requires that M_1 must be nested in M_2; the Bayes factor approach does not have this restriction.

The Bayes factor B_{21} for comparing the same two models has a form similar to the LRT.

\[ B_{21} = \frac{p(X \mid M_2)}{p(X \mid M_1)} \tag{3.2} \]

Here, p(X|M) denotes the integrated likelihood rather than the maximized likelihood. If θ is the set of parameters in p(X|M) (so θ may be a vector), then the integrated likelihood is

\[ p(X \mid M_i) = \int p(X \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta \tag{3.3} \]

where p(θ|M_i) is the prior density of θ given M_i.

Denote log(p(X|θ, M_i) p(θ|M_i)) by g(θ|M_i), and rewrite the integrated likelihood as

\[ p(X \mid M_i) = \int \exp(g(\theta \mid M_i))\, d\theta \tag{3.4} \]

Suppose θ̃ is the posterior mode of θ, i.e. the value which is the mode of the posterior distribution of θ. The maximum likelihood estimate θ̂ and the posterior mode θ̃ converge to the same value as the sample size increases to infinity. Replace the inner term in equation 3.4 by the first few terms of a Taylor series expansion about θ̃; this is a good approximation as long as N is large enough that g(θ) is highly peaked.

\[ p(X \mid M_i) \approx \int \exp\!\Big(g(\tilde\theta \mid M_i) + (\theta - \tilde\theta)^T g'(\tilde\theta \mid M_i) + \tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.5} \]

Since θ̃ is a maximum, g′(θ̃) = 0. We now have

\[ p(X \mid M_i) \approx \int \exp\!\Big(g(\tilde\theta \mid M_i) + \tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.6} \]

\[ = \exp(g(\tilde\theta \mid M_i)) \int \exp\!\Big(\tfrac{1}{2}(\theta - \tilde\theta)^T g''(\tilde\theta \mid M_i)(\theta - \tilde\theta)\Big)\, d\theta \tag{3.7} \]

The integral has the form of a multivariate normal density with covariance equal to the inverse of −g″(θ̃|M_i).

\[ p(X \mid M_i) \approx \exp(g(\tilde\theta \mid M_i))\,(2\pi)^{D_i/2}\,|-g''(\tilde\theta)|^{-1/2} \tag{3.8} \]

Recall that g(θ|M_i) = log(p(X|θ, M_i) p(θ|M_i)).

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) + \log(p(\tilde\theta \mid M_i)) + (D_i/2)\log(2\pi) - \tfrac{1}{2}\log(|-g''(\tilde\theta)|) \tag{3.9} \]

If N is large, then −g″(θ̃|M_i) ≈ E[−g″(θ̃|M_i)]. This is the Fisher information for the data Y, which will be equal to N times the Fisher information for one observation. Let I denote the Fisher information matrix for a single observation; this will be a D_i by D_i matrix. Now we have |−g″(θ̃)| ≈ N^{D_i} |I|.

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) + \log(p(\tilde\theta \mid M_i)) + (D_i/2)\log(2\pi) - (D_i/2)\log(N) - \tfrac{1}{2}\log(|I|) \tag{3.10} \]


The error of the approximation in equation 3.10 is O(N^{-1/2}).

At this point, we could drop terms which do not increase with N from equation 3.10 and obtain equation 3.11. The error of our approximation would then be O(1). However, if we consider the prior on θ more closely, we find that the approximation is actually better for a certain prior.

Suppose p(θ|M_i) is multivariate normal with mean θ̃ and covariance matrix equal to I^{-1}. On average, this would give the prior about the same impact on log(p(X|M_i)) as a single observation. We can compute p(θ̃|M_i) by finding the density of this multivariate normal evaluated at its mean; this is (2π)^{-D_i/2} |I^{-1}|^{-1/2}. Recall that for any nonsingular matrix A, |A^{-1}| = |A|^{-1}. Substituting into equation 3.10, we see that several terms cancel.

\[ \log(p(X \mid M_i)) \approx \log(p(X \mid \tilde\theta, M_i)) - \frac{D_i}{2}\log(N) \tag{3.11} \]

In going from equation 3.10 to equation 3.11, we have simply chosen a certain prior which conveniently cancels a few other terms. Thus, the error of the approximation is still O(N^{-1/2}).

We now multiply equation 3.11 by 2 and substitute the MLE of θ for the posterior mode. Denote the maximized loglikelihood of the data given model i by L(X|M_i). We arrive at the usual formulation of BIC in equation 3.12.

\[ BIC(M_i) = 2L(X \mid M_i) - D_i\log(N) \tag{3.12} \]

Returning to the Bayes factor idea of equation 3.2, we now have a relatively easy way to compute an approximate Bayes factor.

\[ 2\log(B_{21}) = 2\log(p(X \mid M_2)) - 2\log(p(X \mid M_1)) \tag{3.13} \]

\[ \approx BIC(M_2) - BIC(M_1) \tag{3.14} \]
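As a concrete illustration of equations 3.12 to 3.14, the following sketch computes the BIC for two fitted models from their maximized loglikelihoods and parameter counts, and reads the difference as an approximation to twice the log Bayes factor. The loglikelihood values, parameter counts, and sample size below are placeholders chosen for illustration, not results from any data set analyzed in this dissertation.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """BIC as in equation 3.12: 2*L(X|M) - D*log(N)."""
    return 2.0 * loglik - n_params * np.log(n_obs)

# Placeholder maximized loglikelihoods and parameter counts for two models.
loglik_m1, d1 = -1540.2, 2   # e.g. a single Gaussian: mean and variance
loglik_m2, d2 = -1502.7, 5   # e.g. a two-component mixture: 2 means, 2 variances, 1 proportion
n = 1000                     # number of observations

bic1, bic2 = bic(loglik_m1, d1, n), bic(loglik_m2, d2, n)

# Equation 3.14: the BIC difference approximates 2*log(B21).
print(bic1, bic2, bic2 - bic1)
```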


3.2 Mixture Models

The mixture density provides a natural way of modeling the observed “mixture” of features in an image. We can model the marginal (without spatial information) distribution of greyscale values in an image with a mixture density, shown in equation 3.15. We use the Gaussian density for Φ when the data are given in an image format, but the Poisson density is more appropriate when raw data are available for gamma camera images (e.g. PET). In the Poisson case, we allow the Poisson parameter λ to take on the value zero, which gives a probability mass at zero; this allows easy modeling of zero-intensity background regions, a common and usually artifactual feature of medical images.

\[ f(Y_i \mid K, \theta) = \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \tag{3.15} \]

Here, Y_i is the greyscale value of pixel i, P_j is the mixture probability of component j, Φ is the single component density (e.g. Gaussian), θ_j is the vector of parameters for the jth density, and K is the number of components.

Suppose we have a given value of K. Let Z_i denote the true component which generated the observation for pixel Y_i, and suppose we have an initial classification of each pixel into one of the K components (i.e. an initial estimate Z̃_i). Consider the Z_i to be missing data; now we have cast the problem of estimating the θ_j and P_j parameters into a missing data problem which can be approached by the EM algorithm (Dempster et al., 1977). Details of the estimation are presented in chapter 5.2.

Once we have estimated θ_j and P_j, we can use equation 3.15 to compute a likelihood for each pixel. Under the assumption that all pixels are independent, the likelihood of the whole image is the product of the pixel likelihoods, which makes computation of the loglikelihood of the image relatively easy. The loglikelihood of the image with this independence assumption is given in equation 3.16.

\[ L(Y \mid K, \theta) = \sum_{i=1}^{N} \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \Big) \tag{3.16} \]

However, pixels are typically not independent. Chapter 4 explores the impact of autoregressive dependence on the BIC in both one-dimensional and two-dimensional data, and presents a combined mixture model and raster-scan autoregression (RSA) model which allows for autoregressive dependence with the mixture model. In chapter 5, I use a Markov random field model for the spatial dependence.

3.3 BIC with Mixture Models

We wish to use BIC for model selection with mixture models which have different numbers of components. BIC can be computed by using the likelihood from equation 3.16 in the BIC formula from equation 3.12. The resulting BIC formula is shown in equation 3.17.

\[ BIC(K) = 2L(Y \mid K) - D_K\log(N) \tag{3.17} \]

Recall from section 3.1 that the BIC approximation is based on the use of Laplace's method to approximate the integrated likelihood in a Bayes factor. A regularity condition for Laplace's method is that the parameters must be in the interior of the parameter space; as they approach the boundary of parameter space, the approximation breaks down. This presents a problem for the use of BIC in the mixture model context. Suppose we fit a model with K segments when the true number of segments K_0 is smaller than K. In this case, the true mixture proportion (P_j in equation 3.15) for each extra segment would be zero, which is at the boundary of parameter space. Also, there would be no information available for estimating the model parameters such as µ and σ for the extra segments; in fact, µ and σ for the extra segments would have no meaning.

Regardless of this, the BIC has been used previously for model selection in the mixture context with some success (Banfield and Raftery, 1993; Banfield and Raftery, 1992; Fraley and Raftery, 1998). One possible reason for the apparent success of BIC in these situations is the complexity of the data. Real data is often not truly a Gaussian mixture, so additional components in the mixture model beyond those which represent the main features of the data may be justified on a smaller scale. In this case, the BIC is not choosing a “true” model out of a set of possible models. Instead, the BIC is choosing a parsimonious model which captures the main features of the data, even though there may be smaller or more subtle features present. The goal then ceases to be one of selecting the correct model and becomes one of choosing a model which parsimoniously captures the important features in the data. The definition of an important feature is necessarily vague; this can vary between different applications. As a general method, the BIC seems to perform reasonably well.

3.4 Classification

Once the parameters of the mixture model have been estimated, we want to classify each pixel in the image into one of the K classes. This seemingly simple step requires careful consideration because of two issues. First, the classification which maximizes the mixture likelihood corresponds to a particular utility function, and this may not be the one we really want to use. Second, we have so far regarded components of the mixture as being synonymous with distinct features (or background) in the image, which may not be the case.


3.4.1 Mixture versus Componentwise Classification

The appropriate method for making our final classification depends on how we will evaluate the classification when it is done. For instance, we might count the number of pixels in a feature of interest which are correctly classified; clearly, a classification which places all pixels into that feature regardless of all the estimated parameters would be optimal, though rather unuseful. A more reasonable evaluation criterion would be to simply count the number of pixels correctly classified in the whole image; the optimal classification method for this case would lead to a very different result than the previous example. The point here is that we can think of many different evaluation methods; these may be driven by concerns of a particular application, or they may just be different common sense approaches. In this section I consider two sensible evaluation criteria, mixture and componentwise, and give the classification methods which are appropriate for each.

Mixture Classification

First let us consider the case mentioned above in which we want to maximize the number of pixels which are correctly classified. This is the mixture classification case.

Theorem 3.1: Optimality of Mixture Classification

To maximize the number of correctly classified pixels, the optimal classification rule is to assign each pixel to the segment which has the largest posterior probability, as shown in equation 3.18.

\[ C_i = \arg\max_m P_m\,\Phi(Y_i \mid \theta_m) \tag{3.18} \]

In equation 3.18, Y is the observed image and C is the estimated classification. The ith pixel of X or C is denoted by subscript. The parameter vector θ, the mixture proportion P, and the density Φ are defined in equation 3.15 of section 3.2.

Proof of Theorem 3.1

We define a utility function g(X, C) where X is the true image (i.e. the unobservable true classification). Equation 3.19 shows the utility function for mixture classification.

\[ g(C \mid X) = (1/N)\sum_i I(X_i, C_i) \tag{3.19} \]

Here, I(A, B) = 1 if A = B and 0 otherwise, and X_i is the true (unobserved) value underlying the observation Y_i. We wish to find the value of C to maximize the utility.

\[ \arg\max_C g(C \mid X) = \arg\max_C (1/N)\sum_i I(X_i, C_i) \tag{3.20} \]

Since we are not able to observe X, we must restate this in probabilistic terms. Let P(X_i = C_i | Y_i) denote the probability that X_i = C_i given that we observe Y_i.

\[ \arg\max_C g(C \mid Y) = \arg\max_C (1/N)\sum_i P(X_i = C_i \mid Y_i) \tag{3.21} \]

Inside the sum, each term depends only on a single pixel of C, so maximization will be accomplished by choosing as the value of C_i that value which maximizes the probability that X_i = C_i. Note that X_i and C_i can take on only the class values of 1...K, so the maximization becomes

\[ C_i = \arg\max_m P(X_i = m \mid Y_i) \tag{3.22} \]

Applying Bayes theorem,

\[ P(X_i = m \mid Y_i) \propto P(Y_i \mid X_i = m)\,P(X_i = m) \tag{3.23} \]

As shown in equation 3.15, we have modeled the distribution of Y_i, and we can estimate the prior P(X_i = m) by using the mixture proportion P_m. Substituting into equation 3.22,

\[ C_i = \arg\max_m P_m\,\Phi(Y_i \mid \theta_m) \tag{3.24} \]

In other words, pixel i is assigned to the component for which it has the highest likelihood, and we include the mixture proportions (P_m) in the likelihood. This seems intuitively reasonable since we are using the mixture density given by equation 3.15.

End of Proof.

Componentwise Classification

Although maximizing the number of correctly classified pixels is both reasonable and consistent with our presumed mixture model, a componentwise approach to this problem may be more useful in some circumstances. Consider a case in which we have a large background and only a few small features of interest. The mixture proportions will be dominated by the component describing the background, giving inordinate weight to classification of pixels as background. As the proportion of pixels in the background increases, classification by the mixture likelihood may lead to classifying all pixels as background even when Φ(Y_i|θ_j) for the feature is several orders of magnitude larger than Φ(Y_i|θ_j) for the background. In this case, we would want to use componentwise classification so that the feature components would receive a more reasonable share of influence on the classification.


In componentwise classification, our aim is to maximize the sum of the proportions of correctly classified pixels for each component. This gives equal weight to each component. I now derive the classification rule which corresponds to the componentwise approach.

Theorem 3.2: Optimality of Componentwise Classification

To maximize the sum of the proportions of correctly classified pixels for each component, the optimal classification rule is to assign each pixel to the segment which has the largest component likelihood, as shown in equation 3.25.

\[ C_i = \arg\max_m \Phi(Y_i \mid \theta_m) \tag{3.25} \]

Proof of Theorem 3.2

To find the appropriate classification rule for the componentwise approach, we begin by stating its utility function.

\[ g(C \mid X) = \sum_{j=1}^{K} \frac{\sum_i I(X_i, j)\,I(X_i, C_i)}{\sum_i I(X_i, j)} \tag{3.26} \]

As before, I(A, B) = 1 if A = B and 0 otherwise, and X_i is the true (unobserved) value underlying the observation Y_i. Note that the term in the denominator of equation 3.26 is constant with respect to C. The form of equation 3.26 is meant to be conceptually clear, but an equivalent and more computationally useful form is obtained by moving the denominator into the sum in the numerator and interchanging the order of summation.

\[ g(C \mid X) = \sum_i \sum_{j=1}^{K} \left( \frac{1}{\sum_q I(X_q, j)} \right) I(X_i, j)\,I(X_i, C_i) \tag{3.27} \]


In finding argmax_C g(C|X), we can now consider each pixel separately since each term in the sum over i only involves pixel i. Let N_j denote the number of pixels in class j. The maximization takes the following form.

\[ C_i = \arg\max_m \sum_{j=1}^{K} \left( \frac{1}{N_j} \right) I(X_i, j)\,I(X_i, m) \tag{3.28} \]

Because of the two indicator functions, terms in the sum will be nonzero only when j = m. Note that I(X_i, m)^2 = I(X_i, m).

\[ C_i = \arg\max_m \left( \frac{1}{N_m} \right) I(X_i, m) \tag{3.29} \]

We can now return to our modeling assumptions and multiply by N/N.

\[ C_i = \arg\max_m \left( \frac{1}{N} \right)\left( \frac{N}{N_m} \right) P(X_i = m \mid Y_i) \tag{3.30} \]

Note that N/N_m = 1/P(X_i = m), and apply Bayes' theorem as in equation 3.23.

\[ C_i = \arg\max_m \left( \frac{1}{N} \right)\left( \frac{1}{P(X_i = m)} \right) P(Y_i \mid X_i = m)\,P(X_i = m) \tag{3.31} \]

\[ C_i = \arg\max_m P(Y_i \mid X_i = m) \tag{3.32} \]

As shown in equation 3.15, we have modeled the distribution of Y_i, so we can substitute into equation 3.32 to obtain equation 3.33.

\[ C_i = \arg\max_m \Phi(Y_i \mid \theta_m) \tag{3.33} \]

End of Proof.

Application of equation 3.33 means that we should classify each pixel into its most likely component, without regard for the mixture proportion of each component. This classification procedure differs from the one suggested by equation 3.24, unless the mixture proportions for all components happen to be equal.

In this section, I have examined two different classification goals by expressing them as different utility functions and deriving the appropriate classification strategy. Choice of the classification goals is something which must be considered outside the context of an algorithm, since the appropriate algorithm must be chosen to fit the task. Of the two methods I have presented here, the second (componentwise classification) is more appropriate when there is interest in smaller features in the data, since these might be washed out by the dominance of a few large features if the mixture classification method is used. The componentwise classification method is used in my algorithm in chapter 5.2.

3.4.2 Correspondence of Components with Segments

We have so far been using the ideas of segments (in the image) and components (in the mixture model) almost interchangeably. However, treating the mixture model this way sometimes yields results which are difficult to interpret. For example, two components might have the same mean and different variances. If we are considering the mean to define a feature, then we should consider the two components to represent one segment. Similarly, when one component has the highest likelihood for two unconnected (in greyscale value) groups of pixels, it is unclear whether we should consider this to be one segment or two.

It is important to make a distinction between segments and components. For images in general, we need to recognize that a segment can be modeled by one or more components and a component can represent one or more segments. Although it might be difficult to account for this in an automatic method, our current approach of equating each component with exactly one segment could be seen as a specific case in this more general view. Particular applications might require fine tuning of this. For instance, if we know from the application that each feature should have a very consistent and unique grey value, then we should split a segment which contains two or more disjoint sets of grey values. This will not change the model fitting process; only the final pixel classification will be changed.
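To make the contrast between the two classification rules of section 3.4.1 concrete, the sketch below assumes a fitted two-component Gaussian mixture (the proportions, means, and standard deviations are invented for illustration) and applies the mixture rule of equation 3.24 and the componentwise rule of equation 3.33 to a few greyscale values. Pixels lying between a dominant background component and a small feature component are assigned differently by the two rules.

```python
import numpy as np
from scipy.stats import norm

# Invented mixture parameters: a dominant background and a small bright feature.
P     = np.array([0.95, 0.05])   # mixture proportions P_j
mu    = np.array([10.0, 14.0])   # component means
sigma = np.array([2.0, 1.0])     # component standard deviations

y = np.array([9.5, 12.0, 13.5, 15.0])                     # a few pixel greyscale values
dens = norm.pdf(y[:, None], mu[None, :], sigma[None, :])  # Phi(Y_i | theta_m) for each component

mixture_rule       = np.argmax(P[None, :] * dens, axis=1)  # equation 3.24
componentwise_rule = np.argmax(dens, axis=1)               # equation 3.33

print(mixture_rule)        # the large background proportion pulls borderline pixels to class 0
print(componentwise_rule)  # the feature component claims those pixels instead
```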


Chapter 4

ADJUSTING FOR AUTOREGRESSIVE DEPENDENCE

In this chapter I present a theoretical approach to model selection for data with autoregressive dependence. I show how the BIC should be modified to deal with certain kinds of dependence in the data. First, I address the simplest case, which is the AR(1) model with one-dimensional data. I then present the raster-scan autoregression (RSA) model, which is a special case of an AR(P) model. I show that for data with autoregressive dependence, a simple modification to the usual BIC formula is necessary, and the appropriate “boundary” data points must be excluded from analysis.

4.1 Adjusting BIC for the AR(1) Model

In the following sections, I use an AR(1) model to examine the effect of spatial (or temporal) dependence on the BIC (equation 3.12). I approach this in two parts, first considering the loglikelihood term and then the penalty term.

4.1.1 Loglikelihood Adjustment

Consider data from the AR(1) model

\[ Y_i = C + \beta Y_{i-1} + \epsilon_i \tag{4.1} \]

where |β| < 1 and the ε_i are IID N(0, σ²_ε). Note that it is important to distinguish between the variance of ε_i, which is σ²_ε, and the variance of Y_i, which is σ²_Y.

Independence Case

When β = 0, the Y_i are independent. In this case the following statements are true:

\[ E[Y_i] = C \tag{4.2} \]

\[ VAR[Y_i] = \sigma^2_Y = \sigma^2_\epsilon \tag{4.3} \]

\[ E[\bar Y] = E[Y_i] = C \tag{4.4} \]

\[ VAR[\bar Y] = \frac{\sigma^2_Y}{N} = \frac{\sigma^2_\epsilon}{N} \tag{4.5} \]

Furthermore, we can write down the loglikelihood by using the usual product of independent Gaussian densities.

\[ L(Y \mid M) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2_Y) - \frac{1}{2\sigma^2_Y}\sum_{i=1}^{N}(Y_i - C)^2 \tag{4.6} \]

Because we will later want to condition on the value of Y_1, we do this here also:

\[ L(Y_{-1} \mid M, Y_1) = -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_Y) - \frac{1}{2\sigma^2_Y}\sum_{i=2}^{N}(Y_i - C)^2 \tag{4.7} \]

As N increases, the contribution of Y_1 to the loglikelihood becomes proportionately smaller.

Equations 4.2 to 4.5 are actually special cases of equations 4.8 to 4.11, which are shown in the next section. When β = 0 is substituted into equations 4.8 to 4.11, they reduce to equations 4.2 to 4.5.


Dependence Case

When |β| < 1, then the following statements hold (Hamilton, 1994).

\[ E[Y_i] = C/(1-\beta) \tag{4.8} \]

\[ VAR[Y_i] = \sigma^2_Y = \sigma^2_\epsilon/(1-\beta^2) \tag{4.9} \]

\[ E[\bar Y] = E[Y_i] = C/(1-\beta) \tag{4.10} \]

\[ VAR(\bar Y) = \left(\frac{1+\beta}{1-\beta}\right)\frac{\sigma^2_\epsilon}{N} \tag{4.11} \]

We are able to observe the Y_{i-1} value preceding each Y_i for every observation except the first. Not surprisingly, the contribution of the first observation to the overall loglikelihood is different than that of the other observations, and this makes the loglikelihood equation more difficult to analyze. Since the proportion of the contribution of Y_1 to the loglikelihood becomes small as N increases, we choose to condition on its value, which has the effect of simplifying the formula for the loglikelihood. Furthermore, estimates based on this conditional loglikelihood are consistent and asymptotically equal to estimates based on the exact loglikelihood. The exact loglikelihood is

\[ L(Y \mid M) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2_\epsilon/(1-\beta^2)) - \frac{(Y_1 - C/(1-\beta))^2}{2\sigma^2_\epsilon/(1-\beta^2)} - \frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - C - \beta Y_{i-1})^2 \tag{4.12} \]

The first three terms in equation 4.12 can be thought of as the contribution of Y_1, with the remaining terms resulting from Y_2...Y_N. Conditioning on Y_1 we obtain

\[ L(Y_{-1} \mid M, Y_1) = -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - C - \beta Y_{i-1})^2 \tag{4.13} \]

As we would expect, equation 4.13 is equivalent to equation 4.7 when β = 0.

Effect on Computation

We wish to investigate the differences between equation 4.13 and equation 4.7. Since we are interested in the practical effects of these differences, I begin by restating these equations in their empirical form, that is, the equations which would be used to calculate the loglikelihoods from the data. I use the standard convention of denoting a maximum likelihood estimate with a hat.

If we were to assume independence (i.e. β = 0), then the loglikelihood would be computed as

\[ L(Y_{-1} \mid M, Y_1) \approx -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\hat\sigma^2_Y) - \frac{1}{2\hat\sigma^2_Y}\sum_{i=2}^{N}(Y_i - \bar Y)^2 \tag{4.14} \]

With the more general AR(1) model, the loglikelihood would be computed as

\[ L(Y_{-1} \mid M, Y_1) \approx -\frac{N-1}{2}\log(2\pi) - \frac{N-1}{2}\log(\hat\sigma^2_\epsilon) - \frac{1}{2\hat\sigma^2_\epsilon}\sum_{i=2}^{N}(Y_i - \hat C - \hat\beta Y_{i-1})^2 \tag{4.15} \]

The difference between equation 4.14 and equation 4.15 is the use in the second term of σ_Y as opposed to σ_ε. We can correct this by using equation 4.16.

\[ \hat\sigma^2_Y = \hat\sigma^2_\epsilon/(1 - R^2) \tag{4.16} \]

The R² value in equation 4.16 refers to the regression model suggested by equation 4.1. In general, the R² value gives the relationship between the variation in the response values and the variation in the residuals, so equation 4.16 can be


For practical purposes, equation 4.21 will allow fast estimation of the dependence loglikelihood as long as we can compute the value of R². For the AR(1) case, such an estimate can be obtained through an ordinary least squares regression.

For large datasets, an adequate estimate of R² might be obtained by subsampling the data. A random set of points would be sampled (forming the dependent variable), along with the points preceding each sampled point (forming the predictor variable). From these vectors, least squares regression can be performed to estimate the β coefficient and yield a value for R². This assumes that β is constant over the whole dataset.

4.1.2 Penalty Adjustment

In this section I derive the adjustment to N which follows from use of the BIC with the AR(1) model. In section 3.1, we saw how the loglikelihood and penalty term in equation 3.11 arise from equation 3.9. In the previous section I derived the adjustment to the loglikelihood term which we need with the AR(1) model, so our interest now lies in how the term log(|−g″(θ̃|M_i)|) should be computed for the AR(1) case (the other terms in equation 3.9 are constant with respect to N).

We again use the approximation that θ̃ ≈ θ̂ when N is large. Dropping the subscript on M since it is not of interest here, we wish to compute

\[ \log(|-g''(\hat\theta \mid M)|) \tag{4.22} \]

where g(θ) = log(p(Y|θ, M) p(θ|M)). Simplifying g, we have

\[ g(\theta) = \log(p(Y \mid \theta, M)) + \log(p(\theta \mid M)) \tag{4.23} \]

The second term and its derivatives are constant with respect to N; since we will later drop terms which do not increase with N we can exclude it from further analysis. We now consider g″(θ), the matrix of second partial derivatives of g(θ) with respect to θ. Note that θ = (µ, σ), so g″(θ) will be a 2x2 matrix. The (i, j)th element of this matrix is given by equation 4.24.

\[ g''(\theta)_{ij} \approx \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\theta_i\,\partial\theta_j} \tag{4.24} \]

We assume that N is large, so that −g″(θ|M) ≈ E[−g″(θ|M)]. This expected value is also known as the Fisher Information I(Y, θ).

Independence Case

To compute I(Y, θ) for a set Y = (Y_1...Y_N) of independent Normal(µ, σ²) observations, we need the second derivatives of the Gaussian loglikelihood.

\[ g(\theta) \approx \log(p(Y \mid \theta, M)) \tag{4.25} \]

\[ = \sum_{i=1}^{N}\left( -\frac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2\sigma^2}(Y_i - \mu)^2 \right) \tag{4.26} \]

The derivatives are shown below.

\[ \frac{\partial \log(p(Y \mid \theta, M))}{\partial\mu} = \sum_{i=1}^{N} -\frac{1}{\sigma^2}(-Y_i + \mu) \tag{4.27} \]

\[ \frac{\partial \log(p(Y \mid \theta, M))}{\partial\sigma} = \sum_{i=1}^{N} -\frac{1}{\sigma} + \frac{1}{\sigma^3}(Y_i - \mu)^2 \tag{4.28} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\mu\,\partial\sigma} = \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) \tag{4.29} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\sigma^2} = \sum_{i=1}^{N} \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(Y_i - \mu)^2 \tag{4.30} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\mu^2} = \sum_{i=1}^{N} -\frac{1}{\sigma^2} \tag{4.31} \]

This leads to

\[ g''(\theta) \approx \begin{pmatrix} \sum_{i=1}^{N} -\frac{1}{\sigma^2} & \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) \\[4pt] \sum_{i=1}^{N} \frac{2}{\sigma^3}(-Y_i + \mu) & \sum_{i=1}^{N} \left( \frac{1}{\sigma^2} - \frac{3}{\sigma^4}(Y_i - \mu)^2 \right) \end{pmatrix} \tag{4.32} \]

The Fisher information is E[−g″(θ|M)], which we can now compute.

\[ I(Y, \theta) \approx \begin{pmatrix} \dfrac{N}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{2N}{\sigma^2} \end{pmatrix} \tag{4.33} \]

Denote the dimension of the Fisher Information matrix by D (in this case D = 2). We can factor out N from the determinant of I(Y, θ).

\[ |I(Y, \theta)| = N^D |I(Y_i, \theta)| \tag{4.34} \]

We now return to equation 4.22 and drop terms which do not increase with N to obtain the usual BIC result.

\[ \log(|-g''(\hat\theta \mid M)|) \approx \log(N^D |I(Y_i, \theta)|) \tag{4.35} \]

\[ = D\log(N) + \log(|I(Y_i, \theta)|) \tag{4.36} \]

\[ \approx D\log(N) \tag{4.37} \]

Dependence Case

We now assume that the data Y are generated by the AR(1) model of equation 4.1. In this case, θ = (C, β, σ_ε) so the Fisher Information matrix will be 3x3 rather than 2x2 as it was in the independence case. For the more general AR(P) case, θ would contain C, all of the autoregressive coefficients, and σ_ε.

We begin by stating the loglikelihood for Y under the AR(1) model (conditioning on Y_1) and finding its first and second derivatives. Let N_0 denote the total number of observations minus the number on which we are conditioning; for the AR(1) case, N_0 = N − 1.


\[ g(\theta) \approx \log(p(Y \mid \theta, M)) \tag{4.38} \]

\[ = \sum_{i=2}^{N}\left( -\frac{1}{2}\log(2\pi) - \log(\sigma_\epsilon) - \frac{1}{2\sigma^2_\epsilon}(Y_i - C - \beta Y_{i-1})^2 \right) \tag{4.39} \]

The derivatives are shown below.

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C^2} = \sum_{i=2}^{N} \frac{-1}{\sigma^2_\epsilon} \tag{4.40} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C\,\partial\beta} = \sum_{i=2}^{N} \frac{-Y_{i-1}}{\sigma^2_\epsilon} \tag{4.41} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial C\,\partial\sigma_\epsilon} = \sum_{i=2}^{N} \frac{-2}{\sigma^3_\epsilon}(Y_i - C - \beta Y_{i-1}) \tag{4.42} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\beta^2} = \sum_{i=2}^{N} \frac{-Y^2_{i-1}}{\sigma^2_\epsilon} \tag{4.43} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\beta\,\partial\sigma_\epsilon} = \sum_{i=2}^{N} \frac{-2}{\sigma^3_\epsilon}(Y_{i-1})(Y_i - C - \beta Y_{i-1}) \tag{4.44} \]

\[ \frac{\partial^2 \log(p(Y \mid \theta, M))}{\partial\sigma^2_\epsilon} = \sum_{i=2}^{N} \left( \frac{1}{\sigma^2_\epsilon} - \frac{3}{\sigma^4_\epsilon}(Y_i - C - \beta Y_{i-1})^2 \right) \tag{4.45} \]

Following analogously to section 4.1.2, we find

\[ I(Y, \theta) \approx \begin{pmatrix} \dfrac{N_0}{\sigma^2_\epsilon} & \dfrac{N_0\,C}{(1-\beta)\sigma^2_\epsilon} & 0 \\[6pt] \dfrac{N_0\,C}{(1-\beta)\sigma^2_\epsilon} & \dfrac{N_0}{\sigma^2_\epsilon}\left( \dfrac{\sigma^2_\epsilon}{1-\beta^2} + \dfrac{C^2}{(1-\beta)^2} \right) & 0 \\[6pt] 0 & 0 & \dfrac{2N_0}{\sigma^2_\epsilon} \end{pmatrix} \tag{4.46} \]

As in the independence case, we can factor out N_0 and drop the terms which do not increase with N.

\[ \log(|-g''(\hat\theta \mid M)|) \approx \log(N_0^D |I(Y_i, \theta)|) \tag{4.47} \]

\[ = D\log(N_0) + \log(|I(Y_i, \theta)|) \tag{4.48} \]

\[ \approx D\log(N_0) \tag{4.49} \]


4.1.3 Computing BIC with the AR(1) Model

We can now construct the correct form of the BIC for the AR(1) model. Suppose M_i is the model under consideration, and M_i has D_i parameters (degrees of freedom). Let B denote the beginning of the sequence (or border of an image) on which we condition in order to estimate parameters in the dependence model. Y_{-B} is the data not including the set B, and N_0 is the number of data points in Y_{-B}. L(Y_{-B}|M_i) is the maximized loglikelihood of the data under model M_i. To avoid confusion, we must distinguish between the BIC with the dependence correction and the usual form of the BIC, which assumes independence among all observations. I will use BIC_IND and L_IND to denote the BIC and loglikelihood with the independence assumption, and BIC_ADJ and L_ADJ will be the BIC and loglikelihood with adjustment for dependence. The version of BIC which should be used for the AR(1) model is given by equation 4.52.

\[ L_{ADJ}(Y_{-1} \mid M, Y_1) \approx L_{IND}(Y_{-1} \mid M) - \frac{N-1}{2}\log(1 - R^2) \tag{4.50} \]

\[ BIC_{ADJ}(M_i) = 2L_{ADJ}(Y_{-B} \mid M_i) - D_i\log(N_0) \tag{4.51} \]

\[ BIC_{ADJ}(M_i) = 2\left( L_{IND}(Y_{-B} \mid M_i) - \frac{N-1}{2}\log(1 - R^2) \right) - D_i\log(N_0) \tag{4.52} \]

Note that the value of D_i needs to include the count of the autoregressive parameters. For example, in the AR(1) case described above we have D = 3 parameters in the model: the mean (or equivalently the C parameter from the autoregressive model), the variance, and the autoregressive coefficient.
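A minimal sketch of equation 4.52 for a univariate series follows. It assumes that R² is obtained from the ordinary least squares regression of Y_i on Y_{i−1} (conditioning on Y_1), as described in section 4.1.1; the function and variable names are illustrative rather than part of any existing software, and the simulated series in the usage example is a placeholder.

```python
import numpy as np

def ar1_adjusted_bic(y, loglik_ind, n_params):
    """Adjusted BIC of equation 4.52 for an AR(1) dependence model.

    loglik_ind is the loglikelihood computed under independence, conditioning
    on Y_1 (equation 4.14), and n_params counts the mean, variance, and
    autoregressive coefficient (D = 3 for the AR(1) model)."""
    resp, pred = y[1:], y[:-1]                       # regress Y_i on Y_{i-1}, i = 2..N
    X = np.column_stack([np.ones_like(pred), pred])
    beta_hat, *_ = np.linalg.lstsq(X, resp, rcond=None)
    resid = resp - X @ beta_hat
    r2 = 1.0 - resid.var() / resp.var()              # R^2 of the AR(1) regression

    n0 = len(resp)                                   # N - 1 points after conditioning on Y_1
    loglik_adj = loglik_ind - (n0 / 2.0) * np.log(1.0 - r2)   # equation 4.50
    return 2.0 * loglik_adj - n_params * np.log(n0)           # equation 4.52

# Example: simulate an AR(1) series and compute the adjusted BIC of a single-mean model.
rng = np.random.default_rng(0)
y = np.zeros(500)
for i in range(1, 500):
    y[i] = 0.5 + 0.7 * y[i - 1] + rng.normal()

y2 = y[1:]
loglik_ind = np.sum(-0.5 * np.log(2 * np.pi) - np.log(y2.std())
                    - 0.5 * ((y2 - y2.mean()) / y2.std()) ** 2)
print(ar1_adjusted_bic(y, loglik_ind, n_params=3))
```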


4.2 Adjusting BIC for the Raster Scan Autoregression (RSA) Model

I present the raster scan autoregression (RSA) model as a generalization of the AR(1) model which I investigated in the previous section. Instead of modeling spatial dependence with only the preceding (to the left) pixel, the RSA model uses 4 neighbors. These are the 4 pixels which both precede the current pixel in raster scan order and are adjacent to it (raster scan order is the ordering of pixels from left to right, then top to bottom of the image). In other words, these 4 neighbors are the adjacent neighbors to the left, above left, above, and above right in relation to the current pixel.

Suppose we have an image which is H pixels high and W pixels wide. The RSA model is given by equation 4.53.

\[ Y_i = C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1} + \epsilon_i \tag{4.53} \]

Here, the ε_i are IID Normal(0, σ²_ε), C is a constant, and the β values are the autoregressive coefficients which characterize the dependence structure. This is equivalent to an AR(W+1) model, with the additional constraint that only the autoregressive coefficients in equation 4.53 are nonzero. I consider only covariance stationary AR processes, so the autoregressive coefficients must satisfy certain regularity conditions (the interested reader is referred to (Priestley, 1981)).

As in section 4.1, I wish to find the appropriate formulation of BIC for this dependence model, and I will proceed by examining the loglikelihood and penalty terms of the BIC separately.

4.2.1 Loglikelihood Adjustment

Following analogously to section 4.1.1, I need to find the loglikelihood for the 4-neighbor RSA model. Due to the spatial nature of the data, I condition on the upper, left, and right borders of the image; I will refer to these borders as the boundary set B. The loglikelihood for one pixel is given in equation 4.54.

\[ L(Y_i \mid M, B) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\big(Y_i - (C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1})\big)^2 \tag{4.54} \]

Since this is an AR(P) model, we can write the loglikelihood of the whole image (conditional on B) as follows.

\[ L(Y_{-B} \mid M, B) = \sum_{i \in \bar B} L(Y_i \mid M, B) \tag{4.55} \]

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\sigma^2_\epsilon) - \frac{1}{2\sigma^2_\epsilon}\sum_{i \in \bar B}\big(Y_i - (C + \beta_{W+1} Y_{i-(W+1)} + \beta_W Y_{i-W} + \beta_{W-1} Y_{i-(W-1)} + \beta_1 Y_{i-1})\big)^2 \tag{4.56} \]

N_0 is the total number of data points (N) minus the number of points on which I condition (the number of points in the set B). The summation over B̄ means that we sum over all of the data points except those in B. M denotes a particular model (i.e. a set of coefficient values).

Rewriting equation 4.56 in terms of estimation, we obtain the following.

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_\epsilon) - \frac{1}{2\hat\sigma^2_\epsilon}\sum_{i \in \bar B}\big(Y_i - (\hat C + \hat\beta_{W+1} Y_{i-(W+1)} + \hat\beta_W Y_{i-W} + \hat\beta_{W-1} Y_{i-(W-1)} + \hat\beta_1 Y_{i-1})\big)^2 \tag{4.57} \]

Note that the last of the three terms in equation 4.57 can be greatly simplified because both the denominator and 1/(N_0 − 1) times the numerator are estimators of σ²_ε:


\[ \hat\sigma^2_\epsilon = \frac{1}{N_0 - 1}\sum_{i \in \bar B}\big(Y_i - (\hat C + \hat\beta_{W+1} Y_{i-(W+1)} + \hat\beta_W Y_{i-W} + \hat\beta_{W-1} Y_{i-(W-1)} + \hat\beta_1 Y_{i-1})\big)^2 \tag{4.58} \]

Substituting this into equation 4.57, we obtain

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_\epsilon) - \frac{N_0 - 1}{2} \tag{4.59} \]

Let L_IND denote the loglikelihood as it would be computed with the assumption that all pixels are independent. As shown in equation 4.60, the independence loglikelihood is quite similar to the dependence loglikelihood, except for the second term.

\[ L_{IND}(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\hat\sigma^2_Y) - \frac{N_0 - 1}{2} \tag{4.60} \]

We now need to examine the relation between σ²_ε and σ²_Y so that an adjustment term relating equations 4.59 and 4.60 can be found (analogous to equation 4.21). We saw in equations 4.9 and 4.16 that a simple relation exists between these two quantities for the AR(1) case. As mentioned in section 4.1.1, equation 4.16 is valid for any AR(P) model; in particular, it is valid for the RSA model. This means that with the RSA model we can correct the loglikelihood in the same way that it can be corrected with the AR(1) model. I restate equation 4.16 as equation 4.61.

\[ \hat\sigma^2_Y = \hat\sigma^2_\epsilon/(1 - R^2) \tag{4.61} \]

The R² value in equation 4.61 is from the RSA autoregression model, that is, the least squares regression with Y_{-B} as the response vector and lagged values of Y (corresponding to the four adjacent neighbors preceding each Y_i in raster scan order) as the predictors. Combining equations 4.59, 4.60, and 4.61, we arrive at equation 4.62.


\[ L(Y_{-B} \mid M, B) = L_{IND}(Y_{-B} \mid M) - \frac{N_0}{2}\log(1 - R^2) \tag{4.62} \]

Equation 4.62 shows that the loglikelihood with the RSA model can be computed by adjusting the independence loglikelihood with an additive term based on R².

I now proceed to show that σ²_Y can also be expressed as a function of the model coefficients (the β values) and σ²_ε. This is not needed for the BIC; it is presented only for completeness. The rest of this section can be skipped without loss of continuity.

Begin with the RSA model of equation 4.53. Take expectations and let µ denote the expected value of Y_i.

\[ E[Y_i] = \mu = \frac{C}{1 - \beta_{W+1} - \beta_W - \beta_{W-1} - \beta_1} \tag{4.63} \]

\[ C = \mu(1 - \beta_{W+1} - \beta_W - \beta_{W-1} - \beta_1) \tag{4.64} \]

Subtract µ from both sides of equation 4.53 and substitute in for C.

\[ (Y_i - \mu) = \beta_{W+1}(Y_{i-(W+1)} - \mu) + \beta_W(Y_{i-W} - \mu) + \beta_{W-1}(Y_{i-(W-1)} - \mu) + \beta_1(Y_{i-1} - \mu) + \epsilon_i \tag{4.65} \]

Multiply both sides by (Y_{i-j} − µ) and take expectations.


\[ E[(Y_i - \mu)(Y_{i-j} - \mu)] = E\big[\beta_{W+1}(Y_{i-(W+1)} - \mu)(Y_{i-j} - \mu) + \beta_W(Y_{i-W} - \mu)(Y_{i-j} - \mu) + \beta_{W-1}(Y_{i-(W-1)} - \mu)(Y_{i-j} - \mu) + \beta_1(Y_{i-1} - \mu)(Y_{i-j} - \mu) + \epsilon_i(Y_{i-j} - \mu)\big] \tag{4.66} \]

Let γ_j denote the covariance of Y_i and Y_{i-j}. Note that γ_j = γ_{-j}. Since E[Y_i − µ] = 0, equation 4.66 is actually equal to γ_j. We can now rewrite equation 4.66 in terms of γ.

\[ \gamma_j = \beta_{W+1}\gamma_{W+1-j} + \beta_W\gamma_{W-j} + \beta_{W-1}\gamma_{W-1-j} + \beta_1\gamma_{1-j} + I(j = 0)\,\sigma^2_\epsilon \tag{4.67} \]

where I(j = 0) is an indicator function equal to 1 when j = 0 and 0 otherwise.

We now have an equation for γ_j for arbitrary j in terms of β values, σ²_ε, and other γ_j values. The equations for γ_0 to γ_{W+1} form a system of W + 2 equations with W + 2 unknowns (treating the β and σ²_ε values as constant). This can be solved for γ_0; this will yield an equation of the form given in equation 4.68.

\[ \gamma_0 = h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)\,\sigma^2_\epsilon \tag{4.68} \]

Here, h(β_{W+1}, β_W, β_{W-1}, β_1) is a function which can be found algebraically for any given value of W. For the more general AR(P) model, equation 4.68 will continue to hold, except that h will be a function of β_1 to β_P. Since γ_0 is the variance of Y, we can rewrite equation 4.68 as equation 4.69.

\[ \sigma^2_\epsilon = \frac{\sigma^2_Y}{h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)} \tag{4.69} \]


We now return to equation 4.56 and substitute in equation 4.69.

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log\!\left(\frac{\sigma^2_Y}{h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)}\right) - \frac{N_0 - 1}{2} \tag{4.70} \]

\[ L(Y_{-B} \mid M, B) = -\frac{N_0}{2}\log(2\pi) - \frac{N_0}{2}\log(\sigma^2_Y) + \frac{N_0}{2}\log(h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)) - \frac{N_0 - 1}{2} \tag{4.71} \]

Let L_IND denote the loglikelihood as it would be calculated assuming independence among all the pixels. Equation 4.71 can be rewritten in terms of the independence loglikelihood in order to show that there is only an additive correction term between the independence loglikelihood and the correct (with dependence) loglikelihood of equation 4.56.

\[ L(Y_{-B} \mid M, B) = L_{IND}(Y_{-B} \mid M, B) + \frac{N_0}{2}\log(h(\beta_{W+1}, \beta_W, \beta_{W-1}, \beta_1)) \tag{4.72} \]

Equation 4.72 shows the adjustment needed to correct the independence loglikelihood for dependence. Since h(β_{W+1}, β_W, β_{W-1}, β_1) is usually difficult to find, I recommend using equation 4.62 instead.

4.2.2 Penalty Adjustment

Although section 4.1.2 focuses on the AR(1) model, the arguments are equally valid for other dependence models in which the loglikelihood can be expressed as a sum over the data points. In both the AR(P) model and the RSA model, the loglikelihood is given by such a sum (see equation 4.55).

As shown in section 4.1.2, the adjustments to the BIC penalty term when dependence is present consist of using values of the degrees of freedom D and the number of data points N which are consistent with the dependence model.


The dependence model uses additional parameters, so the count of these must be included in D. For the RSA model, 4 autoregressive coefficients are used, as well as a mean and a variance, so the overall number of degrees of freedom is D = 6. We exclude the boundary points from computation, so N is reduced to N_0, the number of data points excluding the boundary. Thus the penalty term for the adjusted BIC is D log(N_0), where D includes the autoregressive coefficients.

4.2.3 Computing BIC with the RSA Model

Let BIC_ADJ denote the BIC as it would be computed assuming dependence among the data. I have shown in section 4.2.1 that the independence loglikelihood L_IND can be adjusted for dependence using equation 4.62. This, combined with the simple penalty adjustment, means that we can write BIC_ADJ in terms of L_IND and a correction term based on R² from the RSA model. This is shown in equation 4.73.

\[ BIC_{ADJ}(M_i) = 2\left( L_{IND}(Y_{-B} \mid M_i) - \frac{N_0}{2}\log(1 - R^2) \right) - D_i\log(N_0) \tag{4.73} \]

In equation 4.73, Y_{-B} indicates the data Y excluding the boundary B. N_0 is the number of data points in Y_{-B}, and D_i is the number of parameters in model M_i, including autoregressive coefficients. The R² value is from the RSA autoregression, as described in section 4.2.1.

4.3 Mixture RSA Models

The mixture model and the RSA model can be combined by using the mixture model to estimate the mean structure and the RSA model to estimate the dependence structure. I begin by estimating the parameters of the mixture model and then computing a mean-corrected version of the image. This mean-corrected image is used as data for the RSA model.

Suppose we have an estimate of the classification Z̃ and parameter estimates for each segment θ. Recall from above that Z_i = j means that pixel i is classified into segment j, and that we have arranged the pixels in raster scan order. Now, the mean-corrected image M can be formed with the following equation.

\[ M_i = Y_i - \mu_{Z_i} \tag{4.74} \]

Here, µ_j is the mean for segment j (µ_j will be one of the elements of θ_j). For each pixel, I am removing the mean of the segment into which the pixel has been classified. I now fit the RSA model with the mean-corrected image. Recall that W is the width of the image.

\[ M_i = \beta_{W+1} M_{i-(W+1)} + \beta_W M_{i-W} + \beta_{W-1} M_{i-(W-1)} + \beta_1 M_{i-1} + \epsilon_i \tag{4.75} \]

Since the mean has been removed from every pixel, E(M_i) = 0 and so there is no constant term in equation 4.75. The β coefficients can be estimated by fitting a least-squares regression in which the M_i values (excluding the boundary) are the response and appropriately lagged M_i values are the predictors. This regression provides an R² value for use in equation 4.76; the results of this regression can be examined using the same methods and diagnostics as would be applied to any AR(P) model.

To the extent that BIC can be used with mixture models, it can also be used when autoregressive dependence is present. The formula derived in section 4.2.3 for BIC with an RSA model (equation 4.73) is restated here for the case where the model is a mixture RSA model.


\[ BIC(K) = 2\left( L_{IND}(Y_{-B} \mid \hat\theta_K, K) - \frac{N_0}{2}\log(1 - R^2_K) \right) - D_K\log(N_0) \tag{4.76} \]

In equation 4.76, Y_{-B} is the image Y excluding the boundary B. K is the number of segments, and θ̂_K is the vector of estimated parameters for the model with K segments. The 4 autoregressive coefficients are included in θ̂_K, as well as (K − 1) mixture proportions, K mean parameters, and, for the Gaussian case, K variance parameters. The R²_K value is from the autoregression described in section 4.4. N_0 is the number of pixels in Y_{-B}, and D_K is the number of parameters in θ̂_K.

4.4 Fitting the Raster Scan Autoregression Model

After performing a segmentation of the image into K segments, the estimated segmentation Z̃ can be used to create a mean-corrected version of the image M. This procedure is described in section 4.3. After finding M, we need to fit the model given in equation 4.77.

\[ M_i = \beta_{W+1} M_{i-(W+1)} + \beta_W M_{i-W} + \beta_{W-1} M_{i-(W-1)} + \beta_1 M_{i-1} + \epsilon_i \tag{4.77} \]

It is quite straightforward to compute this model with standard least squares regression software. The response variable is the vector of M_i values in raster scan order, excluding the observations on the image boundary. Since the RSA model involves the four neighbors which precede each observation in raster scan order, the four predictor vectors are these four lagged values for each pixel. The predictor vectors will each contain some of the boundary pixels. When a least squares regression is computed from this model, the coefficients of the 4 predictors are exactly the four estimated β values for the RSA model. The R² value from this regression, which can be written as R²_K to emphasize the fact that it corresponds to a segmentation with K segments, can then be used in equation 4.80.

4.5 Choosing the Number of Segments with BIC

After running the EM algorithm, we have estimates of the model parameters and a final value of the loglikelihood of the data with the estimated mixture model. This loglikelihood L is calculated under the assumption that all pixels are independent, but excluding the pixels on the edge of the image. Denote the interior (non-edge) pixels by I.

\[ \log(L(Y \mid K, \hat\theta)) = \sum_{i \in I} \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(Y_i \mid \theta_j) \Big) \tag{4.78} \]

As with the parameter estimation equations, this loglikelihood computation can be made faster by only computing each term once for each unique data value. Equation 4.79 is equivalent to equation 4.78, but equation 4.79 only requires iteration over the unique data values instead of iteration over the entire data set (here V_i denotes the ith of the C unique data values and H_i is the number of pixels taking that value).

\[ \log(L(Y \mid K, \hat\theta)) = \sum_{i=1}^{C} H_i \log\!\Big( \sum_{j=1}^{K} P_j\,\Phi(V_i \mid \theta_j) \Big) \tag{4.79} \]

As shown in previous chapters, we can use the BIC based on this loglikelihood with an adjustment. Computation of the BIC at this point is straightforward, as shown in equation 4.80. I use the notation BIC(K) to emphasize that this BIC value is computed for a particular value of K and all other parameters are estimated automatically in the segmentation algorithm.

\[ BIC(K) = 2\log(L(Y \mid K, \hat\theta)) - N_0\log(1 - R^2_K) - D_K\log(N_0) \tag{4.80} \]

The R²_K value is from the autoregression with K segments described in section 4.4, and N_0 is the number of pixels in the interior of the image (i.e., the total number of pixels minus the number of boundary pixels). The number of degrees of freedom D_K in this equation is equal to the number of parameters estimated; in other words, D_K is equal to the number of elements in θ, which contains the 4 autoregressive parameters, K − 1 mixture proportions, and the density parameters. For a Gaussian mixture, D_K = 3K + 3; for a Poisson mixture, D_K = 2K + 3.

To choose the number of segments, we compute BIC(K) for several values of K, starting with K = 1. We then increase K until we find a local maximum in BIC(K); the value of K at this first local maximum is chosen as the optimal number of segments. This scheme for choosing K is reasonable when one expects a small number of segments; it avoids problems with the model failing to hold when large numbers of segments are fitted.

Unfortunately, although the results presented in this chapter for using BIC with autoregressive data are valid, the raster scan autoregression model is not good for image segmentation, as shown in the next section.
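The following sketch puts sections 4.4 and 4.5 together under simplifying assumptions: given an image, a current classification, and the segment means, it forms the mean-corrected image of equation 4.74, fits the four-neighbour raster-scan regression of equation 4.77 by least squares to obtain R²_K, and evaluates BIC(K) as in equation 4.80. The mixture loglikelihood over the interior pixels is assumed to be supplied by the EM fit, the boundary handling (top row plus left and right columns) is a simplification, and all function names are illustrative.

```python
import numpy as np

def rsa_r2(image, means, labels):
    """R^2 of the raster-scan autoregression (equation 4.77) fitted to the
    mean-corrected image M_i = Y_i - mu_{Z_i} of equation 4.74.
    Conditions on the top row and the left and right columns (a simplified
    boundary set); returns R^2 and the number of response pixels."""
    m = image - means[labels]
    h, w = m.shape
    resp = m[1:, 1:w - 1].ravel()
    preds = np.column_stack([
        m[:-1, :w - 2].ravel(),   # above-left neighbour
        m[:-1, 1:w - 1].ravel(),  # above neighbour
        m[:-1, 2:].ravel(),       # above-right neighbour
        m[1:, :w - 2].ravel(),    # left neighbour
    ])
    beta, *_ = np.linalg.lstsq(preds, resp, rcond=None)  # no intercept, since E(M_i) = 0
    resid = resp - preds @ beta
    return 1.0 - resid.var() / resp.var(), resp.size

def bic_k(loglik_interior, r2_k, n0, k, gaussian=True):
    """Equation 4.80: adjusted BIC for a K-segment mixture RSA model."""
    d_k = 3 * k + 3 if gaussian else 2 * k + 3
    return 2.0 * loglik_interior - n0 * np.log(1.0 - r2_k) - d_k * np.log(n0)
```

In use, BIC(K) would be evaluated for K = 1, 2, ... from successive segmentations, stopping at the first local maximum as described above.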


68resulting R 2 values are 0.93 for the true AR(1) data and 0.92 for the change pointdata. For the first signal, it makes sense to try fitting an AR(1) model. For thesecond signal, the AR(1) model is not a good model because the change point hastoo much influence on the fit.Value0 5 10 15 20•• •••••••• • ••••• •• • • •• ••• •••• • • •• • •• ••• •••• •• •• • • ••• • • • •• •••• • • ••••• ••• • • ••• ••••• ••• • ••• • •• • • • •• • • •• •• •• •••• • •• • •• • •••••• • • • •• •• ••••• • • •••••• • •••• • • •• ••• ••• ••• ••• •• • •• •• •• • • ••• ••• •• •••• • •Value0 5 10 15 20•• ••••••• • • •• • •• • •••• • •••• •••••• •••• •• ••••• ••••••• • ••• •••• •• • • •••• • ••• • •••••• • •• ••• • •• • • •••••••• •• •• • ••• • •• • ••• •• • ••• • • •• • • ••• • • • • ••• •• ••• • •• •••• •• ••• •• • • •• ••• •••••• • • • ••• •• ••• • •• •• • ••••••• • •0 50 100 150 200Time0 50 100 150 200Time(a)(b)Figure 4.1: (a) Signal generated by an AR(1) process (R 2 = 0.93). (b) Signal consisting <strong>of</strong> twosequences <strong>of</strong> independent Gaussian noise (R 2 = 0.92).Table 4.1: Loglikelihood and BIC results for the data <strong>of</strong> figure 4.1B.Number <strong>of</strong> Mixture Unadjusted AdjustedSegments Loglikelihood BIC BIC1 -607.2 -1230.2 -727.62 -273.1 -572.8 -566.7When we compute the adjusted BIC shown in equation 4.80, the R 2 adjustmentterm does not distinguish between true autoregressive dependence (as in figure4.1A) and apparent dependence due to fitting a model with too few segments (as


69in figure 4.1B). This problem causes the adjusted BIC to tend to underestimatethe number <strong>of</strong> segments because smaller models yield large R 2 values due to thelack <strong>of</strong> fit <strong>of</strong> the model.The BIC and adjusted BIC values for figure 4.1B are shown in table 4.1. Becausethe difference in loglikelihood is so large in this case, both the BIC and theadjusted BIC make the correct choice between one and two segments, though theunadjusted BIC is more decisive.An example <strong>of</strong> this problem with image data is given by figure 4.2 and table4.2. The image clearly has two segments, but the adjusted BIC (equation 4.80)incorrectly chooses one segment. An unadjusted BIC, i.e. equation 4.80 withoutthe R 2 term, correctly selects two segments in this relatively easy example. Themixture loglikelihood shows little change as more segments are added beyond two,which is consistent with the fact that only two segments are needed for this image.Table 4.2: Loglikelihood and BIC results for the image <strong>of</strong> figure 4.2.Number <strong>of</strong> Mixture Unadjusted AdjustedSegments Loglikelihood BIC BIC1 -2095 -4203 -32672 -1771 -3572 -36163 -1770 -3589 -36474 -1770 -3606 -3679


Figure 4.2: Simulated two segment image.
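The behaviour behind figure 4.1(b) and table 4.1 is easy to reproduce numerically. The sketch below simulates two runs of independent Gaussian noise with different means, fits the AR(1) regression of Y_i on Y_{i−1} by least squares, and prints the resulting R², which comes out large even though there is no autoregressive dependence within either segment. The simulated values are illustrative and will not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two segments of independent Gaussian noise (means 5 and 15, sd 1), as in figure 4.1(b).
y = np.concatenate([rng.normal(5.0, 1.0, 100), rng.normal(15.0, 1.0, 100)])

# Least squares AR(1) fit: regress Y_i on Y_{i-1}.
resp, pred = y[1:], y[:-1]
X = np.column_stack([np.ones_like(pred), pred])
beta, *_ = np.linalg.lstsq(X, resp, rcond=None)
resid = resp - X @ beta

r2 = 1.0 - resid.var() / resp.var()
print(r2)   # large, even though the data have no within-segment dependence
```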


Chapter 5AUTOMATIC IMAGE SEGMENTATION VIA BIC5.1 Pseudolikelihood for Image Models5.1.1 Potts ModelWe can model the spatial dependence in an image by using a Markov randomfield to model the true state <strong>of</strong> each pixel. We assume that each pixel has a truehidden state X i , where X i is an integer denoting one <strong>of</strong> the K states, and that thetrue state <strong>of</strong> a pixel is likely to be similar to the states <strong>of</strong> its neighbors. DefineI(X i , X j ) as an indicator function equal to 1 when X i = X j and zero otherwise.Let N(X i ) be the neighbors <strong>of</strong> X i (that is, the 8 pixels adjacent to pixel X i ),and let U(N(X i ), k) denote the number <strong>of</strong> points in N(X i ) which have state k(so U(N(X i ), X i ) is the number <strong>of</strong> neighbors <strong>of</strong> pixel i which have the same stateas pixel i). The Potts model is characterized by the joint distribution given inequation 5.1, in which the sum is over all neighbor pairs.p(X) ∝ exp(φ ∑ i jI(X i , X j )) (5.1)Equation 5.1 leads to the conditional distribution in equation 5.2.p(X i = j|N(X i ), φ) =exp(φU(N(X i), j))∑k exp(φU(N(X i ), k))(5.2)The parameter φ expresses the amount <strong>of</strong> spatial homogeneity in the model.A positive value <strong>of</strong> φ means that neighboring pixels tend to be similar, while a


Note that pixels on the boundary of the image will not have a full set of observed neighbors. For simplicity, and because the boundary is only a small fraction of the data, the remainder of this chapter assumes that boundary pixels are excluded from analysis except in their use as neighbors of interior pixels. That is, any pixel of interest can be assumed to be an interior pixel.

5.1.2 ICM

The Iterated Conditional Modes (ICM) algorithm was introduced by Besag (1986) as a method of image reconstruction when local characteristics of the true image can be modeled as a Markov random field. In particular, this can be used with the Potts model described in equation 5.1. The algorithm begins with an initial estimate of the true scene X, and proceeds iteratively to estimate all necessary parameters, as well as estimating X.

Recall that we do not observe the X values, but we do observe Y_i for each pixel. We assume that the density of Y_i conditional on its true state X_i = j is Gaussian with mean µ_j and variance σ_j², and it follows that the Y values are conditionally independent given the X values, as shown in equations 5.3 and 5.4. Let θ_j denote the vector of parameters (µ_j, σ_j²) for state j.

    f(Y \mid X) = \prod_i f(Y_i \mid X_i)    (5.3)

    f(Y_i \mid X_i) = f(Y_i \mid \theta_{X_i})    (5.4)

To initialize the ICM algorithm, we use marginal segmentation to find a first estimate of the scene X̂.


The algorithm proceeds by first updating the estimate of the Gaussian parameters θ̂, which is done by maximizing the likelihood in equation 5.3 given the current X̂. In other words, the usual maximum likelihood estimators of µ and σ² are computed, using X̂ as the assignment of each pixel to one of the true states.

The next step is to estimate φ using maximum pseudolikelihood. Again, we use the current estimate X̂ for the true scene. The function to be maximized is the product over all pixels of equation 5.2.

    PL(\hat X \mid \phi) = \prod_i p(\hat X_i \mid N(\hat X_i), \phi)    (5.5)

    \hat\phi = \arg\max_\phi PL(\hat X \mid \phi)    (5.6)

We have one φ parameter for the image, which means we are assuming that each segment has the same amount of spatial cohesion. In estimating the value of φ, we are examining segments which are estimated correctly (high cohesion), incorrectly combined (high cohesion), or incorrectly subdivided (low cohesion). This means that when the number of groups K is less than or equal to the true number of groups K_True, then φ is estimated correctly. As K grows larger than K_True, we expect to underestimate φ. The most extreme example of this occurs when K_True = 1 and we fit models with K > 1. In these cases, the unconstrained estimate of φ could be negative, even if the true scene has positive φ. Because of this, we constrain φ to be nonnegative; if φ̂ is negative, then we reset it to zero before continuing.

At this point, θ̂ and φ̂ have been updated, and we now update X̂. This is done by considering each pixel in turn and replacing X̂_i with the state which maximizes equation 5.7.

    \hat X_i = \arg\max_j f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i), \hat\phi)    (5.7)
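One full pass of the procedure just described can be sketched as follows, assuming a greyscale image Y and a current label image Xhat stored as NumPy arrays. This is a synchronous variant (all labels are updated from the previous estimate at once) and it estimates φ̂ by a simple grid search over the pseudolikelihood of equation 5.5; the actual implementations discussed in this chapter update pixels in turn, so the sketch is illustrative rather than a reproduction of them.

    import numpy as np

    def gaussian_logpdf(y, mu, sigma2):
        return -0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

    def neighbor_counts(X, K):
        """U(N(X_i), k) for every pixel and state k, using the 8-neighborhood.
        Edges are wrapped, so only interior pixels should be used downstream."""
        U = np.zeros((K,) + X.shape)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di or dj:
                    shifted = np.roll(np.roll(X, di, axis=0), dj, axis=1)
                    for k in range(K):
                        U[k] += (shifted == k)
        return U

    def icm_sweep(Y, Xhat, K, phi_grid=np.linspace(0.0, 3.0, 31)):
        # 1. Update the Gaussian parameters by maximum likelihood given Xhat
        #    (assumes every state is present in the current labeling).
        mu = np.array([Y[Xhat == k].mean() for k in range(K)])
        sigma2 = np.array([Y[Xhat == k].var() for k in range(K)])
        # 2. Update phi by maximizing the pseudolikelihood (5.5), constrained to phi >= 0.
        U = neighbor_counts(Xhat, K)
        same = np.take_along_axis(U, Xhat[None], axis=0)[0]        # U(N(X_i), X_i)
        logZ = lambda phi: np.log(np.exp(phi * U).sum(axis=0))
        pl = [(phi * same - logZ(phi))[1:-1, 1:-1].sum() for phi in phi_grid]
        phi_hat = phi_grid[int(np.argmax(pl))]
        # 3. Update each interior label to the state maximizing equation 5.7.
        loglik = np.stack([gaussian_logpdf(Y, mu[k], sigma2[k]) for k in range(K)])
        Xnew = np.argmax(loglik + phi_hat * U, axis=0)
        Xnew[0, :], Xnew[-1, :] = Xhat[0, :], Xhat[-1, :]           # keep boundary labels fixed
        Xnew[:, 0], Xnew[:, -1] = Xhat[:, 0], Xhat[:, -1]
        return Xnew, mu, sigma2, phi_hat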


Once each pixel has been updated, we have a new X̂. We can now check for convergence of X̂, or stop after a predetermined number of iterations. At the end of ICM, we also have θ̂ and φ̂. I now comment on the posterior distribution of X, which might be used for inference in some applications, before I continue on to a discussion of inference for the number of segments K.

5.1.3 Pseudoposterior Distribution of the True Scene

This section presents a pseudolikelihood-based expression for the posterior distribution of the true segmentation X, which is the distribution of X conditional on the observed Y. Recall that we do not observe the X values, but we do observe Y_i for each pixel. We assume that the density of Y_i conditional on its true state X_i = j is Gaussian with mean µ_j and variance σ_j², and it follows that the Y values are conditionally independent given the X values.

    f(X \mid Y) = f(Y \mid X) f(X) / f(Y) \propto f(Y \mid X) f(X)    (5.8)

The first term is easy to compute, since f(Y|X) = \prod_i f(Y_i|X_i), and this is just a Gaussian density. Here we replace the second term by the pseudolikelihood PL(X). Dependence on the parameter φ is made explicit in the equations which follow.

    PL(X, \phi) = \prod_i p(X_i \mid N(X_i), \phi)    (5.9)

    p(X_i \mid N(X_i), \phi) = \frac{\exp(\phi\, U(N(X_i), X_i))}{\sum_k \exp(\phi\, U(N(X_i), k))}    (5.10)

Rewriting this in computational terms, we obtain the following expression for the pseudoposterior distribution of X, based on the pseudolikelihood.


    f(\hat X \mid Y, \hat\phi, \hat\theta) \propto \prod_i f(Y_i \mid \hat X_i, \hat\theta)\, p(\hat X_i \mid N(\hat X_i), \hat\phi)    (5.11)

In some situations, one might wish to conduct inference based on this pseudoposterior. However, we use the ICM reconstruction of X, so our only remaining goal is to conduct inference for K, the number of segments in the image.

5.1.4 Pseudolikelihood and BIC

We wish to conduct inference for K, the number of segments in the image. An initial thought would be to use BIC, as discussed in section 3.1. However, this would require evaluation of the likelihood of the observed data, L(Y|K), which is shown in equation 5.12.

    L(Y \mid K) = \sum_x f(Y \mid X = x, K)\, p(X = x \mid K)    (5.12)

The sum in equation 5.12 involves all possible configurations of the hidden states. With N pixels and K states, there are K^N possible configurations, making this approach intractable. Instead, we replace the likelihood term with a pseudolikelihood which maintains computational feasibility.

An alternative to this pseudolikelihood method is to explore the space of all possible configurations of hidden states. This can be done using reversible jump Markov chain Monte Carlo (Green, 1995), which would yield an estimate of the posterior probability of each K value. The main drawback of reversible jump MCMC is its large computational demand. In most cases, we will not really be interested in the posterior probabilities of values of K; instead, we simply want a single best K value. In the context of the consistency result shown below, we would expect that as the amount of data increases, the choice of the single best K should be the same for both the reversible jump MCMC method and the pseudolikelihood-based BIC method.


The basic idea of this pseudolikelihood approach is that instead of summing over all possible configurations of X we will consider only configurations which are close to the ICM estimate of X, denoted by X̂. Specifically, we consider each pixel Y_i in turn and condition on X̂_{−i}, which is X̂ excluding the value at X_i. The likelihood of the ith pixel observation is L(Y_i|K), shown in equation 5.13.

    L(Y_i \mid K) = \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j)    (5.13)

The sum in equation 5.13 is now over the K possible values of X_i. Conditioning on X̂_{−i}, we obtain the conditional likelihood shown in equation 5.14, in which N(X̂_i) denotes the neighbors of X̂_i.

    L(Y_i \mid \hat X_{-i}, K) = \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i))    (5.14)

The first term in the sum, f(Y_i|X_i = j), simply requires evaluation of a Gaussian density; the second term, p(X_i = j|N(X̂_i)), is evaluated using equation 5.2. The conditional likelihoods from equation 5.14 are combined to form the pseudolikelihood of the image, L_X̂(Y|K), shown in equation 5.15. Forming the product in this way makes intuitive sense because the Y_i values are independent conditional on the underlying hidden states.

    L_{\hat X}(Y \mid K) = \prod_i f(Y_i \mid \hat X_{-i}, \hat\phi) = \prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p(X_i = j \mid N(\hat X_i), \hat\phi)    (5.15)

Recall that Y_i is the ith observed pixel, X_i is the hidden state of pixel i, φ̂ is the estimate of φ (from the ICM algorithm), and N(X̂_i) is the estimate of the hidden state of each neighbor of pixel i. I use L_X̂(Y|K) to denote the quantity in equation 5.15, since it is a likelihood integrated over the approximate posterior distribution of a set of models near the MAP estimate X̂.


After running ICM, we have an estimate of φ, as well as estimates of the µ and σ parameters for each segment. We can compute the log of the quantity in equation 5.15, and, since this is an approximation of the intractable L(Y|K), we use it in place of the loglikelihood in the BIC, as shown in equation 5.16. I use the notation BIC_PL(K) to differentiate this equation from the usual BIC.

    BIC_{PL}(K) = 2 \log(L_{\hat X}(Y \mid K)) - D_K \log(N)    (5.16)

Ideally, one could compute BIC_PL(K) for a large range of K values and then choose K to maximize BIC_PL(K). However, this would require an excessive amount of computation, and we do not expect the model assumptions to hold for values of K very far from the true value. Because of this, we adopt a sequential testing approach. We begin by computing BIC_PL(K) for K = 1, and then incrementally increase the value of K. At each step, we compare BIC_PL(K) with BIC_PL(K − 1), and stop the process when the larger model is rejected. In other words, as we increase K incrementally from K = 1, we take the first local maximum of BIC_PL(K) to be our choice of the number of segments K.
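Given the ICM output (the label image X̂ and the estimates θ̂ and φ̂), the pseudolikelihood BIC of equation 5.16 can be computed directly from equations 5.14 and 5.15. The sketch below is a self-contained, illustrative Python version, evaluated over interior pixels only and with D_K = 3K as discussed in section 5.2.4; it is not the Splus or XV implementation referred to elsewhere in this chapter.

    import numpy as np

    def bic_pl(Y, Xhat, mu, sigma2, phi, K):
        """Pseudolikelihood BIC of equation 5.16, using the 8-neighborhood Potts model."""
        # U(N(X_i), k): number of neighbors of each pixel in each state.
        U = np.zeros((K,) + Xhat.shape)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di or dj:
                    shifted = np.roll(np.roll(Xhat, di, axis=0), dj, axis=1)
                    for k in range(K):
                        U[k] += (shifted == k)
        # Potts conditional probabilities p(X_i = j | N(Xhat_i), phi), equation 5.2.
        w = np.exp(phi * U)
        p = w / w.sum(axis=0, keepdims=True)
        # Gaussian densities f(Y_i | X_i = j) for every pixel and state.
        dens = np.exp(-0.5 * (Y - mu[:, None, None]) ** 2 / sigma2[:, None, None]) \
               / np.sqrt(2 * np.pi * sigma2[:, None, None])
        # Conditional likelihood of each pixel (equation 5.14) and the log pseudolikelihood (5.15).
        loglik = np.log((dens * p).sum(axis=0)[1:-1, 1:-1]).sum()
        N = (Y.shape[0] - 2) * (Y.shape[1] - 2)      # interior pixels only
        return 2.0 * loglik - 3 * K * np.log(N)       # D_K = 3K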


5.1.5 Consistency of BIC_PL

Equation 5.16 gives the formula for BIC_PL(K) which I use for model selection. In this section I present a consistency result for BIC_PL(K). First, I refine the notation. Let K_T denote the true number of segments. Let X̂^K be the estimated X given that there are K segments, and similarly let θ̂^K and φ̂^K be the parameter estimates given K segments. N(X_i) is the neighborhood of X_i, which consists of the 8 pixels adjacent to X_i, and B_i is the union of X_i and its neighborhood. In other words, B_i is a three by three block of pixels centered at pixel i; omission of the subscript will indicate an arbitrary three by three block of pixels. I assume that f(Y_i|X_i) is a Gaussian distribution.

The consistency result presented here is shown for a limited case, but it is hoped that future work might extend the result to more general cases. I first state the theorem, and then two lemmas precede the proof.

Theorem 5.1: Consistency of Choice of K

We observe an image Y consisting of pixels Y_i which we assume are each generated from a Gaussian distribution which depends on the true state X_i of each pixel. We define i to index only the interior pixels of the image. We assume that the local characteristics of the true state image X can be modeled as a Markov random field; in particular, we assume the Potts model given by equation 5.1. Let K_T denote the true number of segments in the image, and let K denote a hypothesized number of segments. Suppose that one of the following cases holds.

Case 1: K_T = 1 and K > 1.

Case 2: K_T = 2, K = 1, and condition A: log(σ_K) − log(σ_1) − 8φ > 0, where σ_K is the standard deviation from the K = 1 fit and σ_1 is the larger of the two standard deviations from the K_T = 2 fit.

In case 1 or case 2, BIC_PL(K) is consistent for K; that is, as N → ∞ in such a way that the size of the image increases in both dimensions,

    P_{K_T}\big( BIC_{PL}(K) < BIC_{PL}(K_T) \big) \to 1    (5.17)

Condition A

This condition is relevant only when K_T = 2 and K = 1. Denote the true density parameters in θ^{K_T} by µ_1, µ_2, σ_1², and σ_2².


Similarly, let the parameters in θ^K be denoted by µ_K and σ_K²; equations 5.18 and 5.19 give formulas for µ_K and σ_K² in terms of the θ^{K_T} parameters. Let P_1 be the proportion of pixels for which the true (unobservable) state X_i is 1, and similarly let P_2 be the proportion of pixels in state 2.

    \mu_K = P_1 \mu_1 + P_2 \mu_2    (5.18)

    \sigma_K^2 = P_1(\sigma_1^2 + \mu_1^2 - 2P_1\mu_1^2 - 2P_2\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2) + P_2(\sigma_2^2 + \mu_2^2 - 2P_2\mu_2^2 - 2P_1\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2)    (5.19)

Suppose, without loss of generality, that σ_1 > σ_2. Condition A is given by equation 5.20.

    \log(\sigma_K) - \log(\sigma_1) - 8\phi > 0    (5.20)

Note from equation 5.19 that σ_K becomes larger as the two true mixture components become more separated; that is, σ_K can be made arbitrarily large by moving µ_1 and µ_2 farther apart. Thus, condition A can be thought of as a regularity condition which requires a certain amount of separability between the two true segments.

When φ = 0 (the spatial independence case), condition A reduces to log(σ_K) − log(σ_1) > 0, assuming σ_1 > σ_2. If, in addition, we have σ_1 = σ_2, then condition A is guaranteed to hold as long as µ_1 ≠ µ_2.
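Equations 5.18 and 5.19 are simply the mean and variance of the two-component mixture that the single-segment fit converges to; as an informal check (not part of the dissertation's argument), they can be verified numerically against the usual mixture-moment formulas E[Y] = P_1 µ_1 + P_2 µ_2 and Var[Y] = P_1(σ_1² + µ_1²) + P_2(σ_2² + µ_2²) − µ_K² for arbitrary test values:

    import numpy as np

    P1, P2, mu1, mu2, s1, s2 = 0.3, 0.7, 2.0, 5.0, 1.0, 1.5      # arbitrary test values

    muK = P1 * mu1 + P2 * mu2                                     # equation 5.18
    cross = P1**2 * mu1**2 + P2**2 * mu2**2 + 2 * P1 * P2 * mu1 * mu2
    sK2 = (P1 * (s1**2 + mu1**2 - 2 * P1 * mu1**2 - 2 * P2 * mu1 * mu2 + cross)
           + P2 * (s2**2 + mu2**2 - 2 * P2 * mu2**2 - 2 * P1 * mu1 * mu2 + cross))   # equation 5.19

    sK2_check = P1 * (s1**2 + mu1**2) + P2 * (s2**2 + mu2**2) - muK**2
    print(np.isclose(sK2, sK2_check))                              # True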


Lemma 1: Integrability

Suppose we define g_i, a function of Y_i, as shown in equation 5.21.

    g_i(Y_i) = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p(X_i = j \mid N_1(X_i), \phi_1)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p(X_i = j \mid N_2(X_i), \phi_2)} \right)    (5.21)

Here, θ_1, θ_2, φ_1, and φ_2 are fixed parameter values; N_1(X_i) and N_2(X_i) are fixed neighborhoods of X_i. Let us further assume that f(Y_i) denotes a Gaussian or Gaussian mixture density. Then g_i ∈ L¹, that is, \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i < \infty.

Proof of Lemma 1.

Since g_i is a function of Y_i, I need to show that \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i < \infty. Now N_1(X_i) and N_2(X_i) are fixed, so p(X_i = j|N_1(X_i), φ_1) and p(X_i = j|N_2(X_i), φ_2) are also fixed for a given i, and I denote them by p_j^1 and p_j^2. I can now write g_i as follows.

    g_i(Y_i) = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2} \right)    (5.22)

Note that both the numerator and denominator inside the logarithm in equation 5.22 are Gaussian mixtures. I assume that all variances are nonzero. Thus g_i(Y_i) is finite over any finite interval, and its integral is finite when integrated over any finite interval.

Let S_1 denote the state with largest variance in the numerator of equation 5.22, and similarly S_2 for the denominator. These terms will dominate in the tails; in other words, there exists a value Y_A such that for all Y_i > Y_A (and also for all Y_i < −Y_A), the following two inequalities will hold.

    \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 < f(Y_A \mid X_i = S_1, \theta_1)    (5.23)

    \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 < f(Y_A \mid X_i = S_2, \theta_2)    (5.24)

We can now write the following, in which C denotes a constant.


    \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i = \int_{-\infty}^{-Y_A} |g_i(Y_i)| f(Y_i)\, dY_i + \int_{Y_A}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i + C    (5.25)

Now note that g_i(Y_i) is the log of a fraction; making use of equations 5.23 and 5.24, the inequality of equation 5.26 will hold when Y_i > Y_A or Y_i < −Y_A.

    |g_i(Y_i)| = \left| \log\Big( \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 \Big) - \log\Big( \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 \Big) \right|
               \le \left| \log\Big( \sum_{j=1}^{K} f(Y_i \mid X_i = j, \theta_1)\, p_j^1 \Big) \right| + \left| \log\Big( \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \theta_2)\, p_j^2 \Big) \right|
               \le |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))|    (5.26)

Combining the inequality of equation 5.26 with equation 5.25, we obtain equation 5.27, in which C_1 and C_2 are irrelevant constants.

    \int_{-\infty}^{\infty} |g_i(Y_i)| f(Y_i)\, dY_i
      \le \int_{-\infty}^{-Y_A} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i
        + \int_{Y_A}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i + C_1
      \le \int_{-\infty}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i + C_2    (5.27)

At this point, showing that g_i ∈ L¹ reduces to showing that the inequality in equation 5.28 holds.

    \int_{-\infty}^{\infty} \big( |\log(f(Y_i \mid X_i = S_1, \theta_1))| + |\log(f(Y_i \mid X_i = S_2, \theta_2))| \big) f(Y_i)\, dY_i < \infty    (5.28)

The log terms in the integral can be simplified further, where C_1, C_2, and C_3 are irrelevant constants and S is either S_1 or S_2.


    |\log(f(Y_i \mid X_i = S, \theta_1))| \le C_1 + C_2 Y_i + C_3 Y_i^2    (5.29)

Thus, equation 5.28 becomes 5.30; again, C_1, C_2, and C_3 are irrelevant constants.

    \int_{-\infty}^{\infty} (C_1 + C_2 Y_i + C_3 Y_i^2) f(Y_i)\, dY_i < \infty    (5.30)

The inequality in equation 5.30 holds if f(Y_i) is Gaussian or a Gaussian mixture; therefore g_i ∈ L¹.

End of Proof of Lemma 1.

My second lemma is theorem 3.1.1 from Guyon (1995), which I state below without proof. In the lemma, X is a process defined over Z^d. X has some distribution f; for our purposes, we need only consider Z², so f would be a distribution function for the possible realizations of X on an infinite plane. X is assumed to be stationary, which means that its distribution is invariant under translations τ_i, i ∈ Z^d. It is further assumed to be in L^p, meaning that the expected value of |X_i|^p is finite. The I in the lemma is the σ-algebra of invariant sets, which is defined by A ∈ I if and only if τ_i(A) = A for all i. In other words, if there are non-invariant sets (e.g. initial conditions or boundary conditions), then their influence is excluded from the expectation in the lemma. This makes intuitive sense from the viewpoint that, for a stationary process, dependence on initial conditions should die out as the size of the data goes to infinity. Let (D_M) be a sequence of bounded convex sets; d(D_M) denotes the interior diameter of D_M, which is the diameter of the largest ball around a point in D_M which is entirely contained in D_M.


Lemma 2: Ergodicity

Let X = {X_i, i ∈ Z^d} be a stationary process in L^p, 1 ≤ p < ∞. If (D_M) is a sequence of bounded convex sets such that d(D_M) → ∞, and if \bar X_M = |D_M|^{-1} \sum_{D_M} X_i, then it follows that \lim_{M \to \infty} \bar X_M = E(X_0 \mid I) in L^p.

I now describe a sequence of bounded convex sets (D_M) which satisfy the requirements of the lemma. Consider a sequence of rectangular subsets of an infinitely large image, with lower left hand corner at the origin and each side of length M. For the sequence of rectangular sets I have defined (and in fact for any sequence which increases in size in both dimensions as M increases), it is clear that d(D_M) → ∞ as M → ∞. Furthermore, |D_M| = M², so \bar X_M defined in the lemma is the usual sample average.

Proof of Theorem 5.1

I will examine each case in turn.

Case 1: K_T = 1

The inequality in equation 5.17 can be rewritten as follows.

    2\log(L_{\hat X}(Y \mid K)) - D_K \log(N) < 2\log(L_{\hat X}(Y \mid K_T)) - D_{K_T} \log(N)    (5.31)

    \Rightarrow \log(L_{\hat X}(Y \mid K)) - (D_K - D_{K_T})\log(N)/2 < \log(L_{\hat X}(Y \mid K_T))    (5.32)

    \Rightarrow L_{\hat X}(Y \mid K)\, \exp\big( -(D_K - D_{K_T})\log(N)/2 \big) < L_{\hat X}(Y \mid K_T)    (5.33)


    \Rightarrow \frac{L_{\hat X}(Y \mid K)\, \exp(-(D_K - D_{K_T})\log(N)/2)}{L_{\hat X}(Y \mid K_T)} < 1    (5.34)

    \Rightarrow \frac{L_{\hat X}(Y \mid K)}{L_{\hat X}(Y \mid K_T)} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.35)

    \Rightarrow \frac{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.36)

    \Rightarrow \frac{1}{N} \sum_i \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right) < \frac{(D_K - D_{K_T})\log(N)/2}{N}    (5.37)

Define h_i as shown in equation 5.38.

    h_i = \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right)    (5.38)

Let Z_Y² be the subset of Z² on which Y is defined, and define a process H = {h_i, i ∈ Z_Y²}. Consider a translation in Z² denoted by τ, such that the distribution of H does not change under τ. The terms in h_i are Gaussian densities and local characteristics of a Markov random field process; we are using the Potts model of equation 5.2 to model the local characteristics of this process. The Gaussian densities model the spatially independent noise at each pixel, while the Markov random field terms capture the spatial dependence of the image. Now, these local characteristics do not change for any interior pixel; they differ only at the boundaries, which (as noted previously) have been excluded from this analysis.


In excluding the boundaries (as well as in letting the image size increase to infinity in both dimensions) we are asymptotically dealing with an image with infinite size. For such an infinite image, there is no translation which will move H to a boundary, and so τ can be any translation in Z². In other words, the distribution of H is invariant under translations τ ∈ Z²; since this is the definition of a stationary process, it follows that H is stationary.

I will now show that h_i ∈ L¹; in other words, I need to show that \int |h_i| f(Y)\, dY < \infty, where f(Y) is the true joint distribution of Y. Now h_i depends on values in Y other than Y_i only through the parameter estimates θ̂^K, φ̂^K, N(X̂^K), θ̂^{K_T}, φ̂^{K_T}, and N(X̂^{K_T}). Conditional on these parameter estimates, regardless of their values, h_i will have the form of g_i in lemma 1; furthermore, it will depend on Y only through the marginal distribution of Y_i. This marginal distribution is a Gaussian mixture, since Y_i depends on other Y values through the dependence between X_i and X. Whatever the probability of X_i taking on each state, Y_i will be distributed as a Gaussian mixture. Applying lemma 1, h_i ∈ L¹.

I now apply lemma 2, using the sequence of sets (D_M) described above. Since h_i satisfies the requirements for X_i in the lemma, we obtain the result that, as N → ∞ in such a way that the size of the image is increasing in both dimensions,

    \frac{1}{N} \sum_{i=1}^{N} h_i = E(h)    (5.39)

We can now conclude that the limit of the left hand side of equation 5.37, as N → ∞ in such a way that the size of the image is increasing in both dimensions, is equal to equation 5.40.

    E_{K_T}\left[ \log\left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} \right) \right]    (5.40)


86the object <strong>of</strong> the expectation is constant.⎛ ⎛log ⎝E KT⎝∑ Kj=1f(Y i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞⎞K )j=1 f(Y i |X i = j, ˆθ K T )p(Xi = j|N( ˆX K Ti ), ˆφ⎠⎠ (5.41)K T )∑ KTIf the inequality in equation 5.42 holds, then consistency is implied.⎛E KT⎝∑ Kj=1f(Y i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞K )j=1 f(Y i |X i = j, ˆθ K T )p(Xi = j|N( ˆX K Ti ), ˆφ⎠K T )( )((DK − D KT ) log(N)/2)≤ expN∑ KT(5.42)When K T = 1, there is only one possible configuration for X; in other words,X is constant. Equation 5.43 follows.f(Y ) = f(Y |X) = ∏ if(Y i |X i ) = ∏ if(Y i ) (5.43)Note that f(Y i |ˆθ K T , Xi ) is equal to f(Y i |ˆθ K T ) when KT = 1, and this in turn isasymptotically equal to f(Y i |θ K T ), since in this case ˆθ is just the usual maximumlikelihood estimate with independent data. The expected value in equation 5.42can be rewritten as equation 5.44.⎛∑ Kj=1f(YE KT⎝i |X i = j, ˆθ K )p(X i = j|N( ˆX i K ), ˆφ⎞K )⎠ (5.44)f(Y i |ˆθ K T )I now show that equation 5.44 simplifies to equation 5.45. Consider the value <strong>of</strong>ˆφ K when K T = 1 and K > K T . The estimate ˆφ K is a maximum pseudolikelihoodestimate. Maximum pseudolikelihood estimates were shown to be consistent byGeman and Graffigne (1986), so we know that ˆφ K → φ K Twhen K T = 1 all points are independent, so φ K Tas N → ∞. However,= 0. When φ = 0, the probability<strong>of</strong> any X i is (1/K), regardless <strong>of</strong> its neighbors. It follows that the expectation inequation 5.44 is equal to the expectation in equation 5.45.


    E_{K_T}\left[ \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)}{f(Y_i \mid \hat\theta^{K_T})} \right]    (5.45)

This expectation can be found by integrating over Y_i with respect to the true density of Y_i.

    \int_{Y_i} \left( \frac{\sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)}{f(Y_i \mid \hat\theta^{K_T})} \right) f(Y_i \mid \theta^{K_T})\, dY_i    (5.46)

As noted previously, f(Y_i|θ^{K_T}) cancels with the term in the denominator, so equation 5.46 becomes equation 5.47.

    \int_{Y_i} \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)\, dY_i = 1    (5.47)

Thus I have shown that the expectation on the left hand side of equation 5.42 is equal to 1. The right hand side of the inequality in equation 5.42 is equal to the following.

    \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right)    (5.48)

Since D_K > D_{K_T}, we know that the following inequalities hold.

    \frac{(D_K - D_{K_T})\log(N)/2}{N} > 0    (5.49)

    \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right) > 1    (5.50)

In the limit as N → ∞, the inequality in equation 5.50 becomes an equality. Combining equations 5.47 and 5.50, we see the following.

    \int_{Y_i} \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\,(1/K)\, dY_i = 1 < \exp\left( \frac{(D_K - D_{K_T})\log(N)/2}{N} \right)    (5.51)


Thus, equation 5.42 holds, so BIC_PL(K) is consistent for K in case 1. Note that in the limit as N → ∞, the inequalities in equations 5.50 and 5.51 become equalities; equation 5.42 still holds since its inequality is not strict.

A comment about the proof for case 1 is needed. In equation 5.45, the numerator corresponds to a certain mixture density with K components, while the denominator has K_T = 1 component. I emphasize that this consistency result does not hold for the general Gaussian mixture model in which mixture proportions are estimated along with θ, since the mixture implied by the numerator of equation 5.45 has the mixture proportions held constant at 1/K.

Case 2: K_T = 2, K = 1, and condition A

Begin as in case 1, up to equation 5.36, which is rewritten here as equation 5.52.

    \frac{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)}{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})} < \exp\big( (D_K - D_{K_T})\log(N)/2 \big)    (5.52)

Inverting the fraction, we obtain equation 5.53.

    \frac{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{\prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j, \hat\theta^K)\, p(X_i = j \mid N(\hat X_i^K), \hat\phi^K)} > \exp\big( (D_{K_T} - D_K)\log(N)/2 \big)    (5.53)

Since K = 1, this simplifies to equation 5.54.

    \frac{\prod_i \sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{\prod_i f(Y_i \mid \hat\theta^K)} > \exp\big( (D_{K_T} - D_K)\log(N)/2 \big)    (5.54)

The inequality in equation 5.54 is equivalent to the inequality in equation 5.55.


    \sum_i \log\left( \frac{\sum_{j=1}^{K_T} f(Y_i \mid X_i = j, \hat\theta^{K_T})\, p(X_i = j \mid N(\hat X_i^{K_T}), \hat\phi^{K_T})}{f(Y_i \mid \hat\theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.55)

Recall from case 1 that as N → ∞, it follows that θ̂^{K_T} → θ^{K_T} and φ̂^{K_T} → φ^{K_T}. Some additional notation is needed at this point. Denote the true density parameters in θ^{K_T} by µ_1, µ_2, σ_1², and σ_2². Similarly, let the parameters in θ^K (which in this case consists of only one component) be denoted by µ_K and σ_K².

We can deduce the asymptotic values of µ_K and σ_K² in terms of the true parameters. Let P_1 be the proportion of pixels for which the true (unobservable) state X_i is 1, and similarly let P_2 be the proportion of pixels in state 2. Then µ_K and σ_K² will be given by equations 5.56 and 5.57.

    \mu_K = P_1 \mu_1 + P_2 \mu_2    (5.56)

    \sigma_K^2 = P_1(\sigma_1^2 + \mu_1^2 - 2P_1\mu_1^2 - 2P_2\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2) + P_2(\sigma_2^2 + \mu_2^2 - 2P_2\mu_2^2 - 2P_1\mu_1\mu_2 + P_1^2\mu_1^2 + P_2^2\mu_2^2 + 2P_1P_2\mu_1\mu_2)    (5.57)

The inequality in equation 5.55 holds if the inequality of equation 5.58 holds; this is true for any set of values of m_i, since the left hand side of equation 5.58 is less than the left hand side of equation 5.55.

    \sum_i \log\left( \frac{f(Y_i \mid X_i = m_i, \theta^{K_T})\, p(X_i = m_i \mid N(\hat X_i^{K_T}), \phi^{K_T})}{f(Y_i \mid \theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.58)

Now construct two sets, S_1 and S_2, as follows. The set S_1 consists of all i such that (Y_i − µ_1)²/2σ_1² < (Y_i − µ_2)²/2σ_2², and similarly S_2 is the set of i such that (Y_i − µ_1)²/2σ_1² > (Y_i − µ_2)²/2σ_2².


Combining this with equation 5.58, we find that the inequality in equation 5.55 holds if the inequality in equation 5.59 holds. From this point onward I will use p(X_i = m) to denote p(X_i = m \mid N(\hat X_i^{K_T}), \hat\phi^{K_T}).

    \sum_{i \in S_1} \log\left( \frac{f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1)}{f(Y_i \mid \theta^K)} \right) + \sum_{i \in S_2} \log\left( \frac{f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2)}{f(Y_i \mid \theta^K)} \right) > (D_{K_T} - D_K)\log(N)/2    (5.59)

I will now examine the left hand side of equation 5.59 in order to show that under condition A the inequality in equation 5.59 is guaranteed to hold, thus showing consistency. The left hand side of equation 5.59 can be written as shown in equation 5.60.

    \sum_{i \in S_1} \log\big( f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1) \big) + \sum_{i \in S_2} \log\big( f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2) \big) - \sum_i \log\big( f(Y_i \mid \theta^K) \big)    (5.60)

Let |S_1| and |S_2| denote the sizes of the sets S_1 and S_2; note that |S_1| + |S_2| = N, where N is the number of pixels under consideration. The following three identities hold asymptotically.

    \sum_{i \in S_1} \log\big( f(Y_i \mid X_i = 1, \theta^{K_T})\, p(X_i = 1) \big) = -|S_1|\log(\sqrt{2\pi}) - |S_1|\log(\sigma_1) + \sum_{i \in S_1} \log(p(X_i = 1)) - \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2}    (5.61)

    \sum_{i \in S_2} \log\big( f(Y_i \mid X_i = 2, \theta^{K_T})\, p(X_i = 2) \big) = -|S_2|\log(\sqrt{2\pi}) - |S_2|\log(\sigma_2) + \sum_{i \in S_2} \log(p(X_i = 2)) - \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2}    (5.62)


    \sum_i \log\big( f(Y_i \mid \theta^K) \big) = -N\log(\sqrt{2\pi}) - N\log(\sigma_K) - N/2    (5.63)

Combining equations 5.61 to 5.63, we find that equation 5.60 is equal to equation 5.64.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2)) - \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2} - \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2} + N/2    (5.64)

From the construction of the sets S_1 and S_2, it is clear that the inequality in equation 5.65 holds.

    \sum_{i \in S_1} \frac{(Y_i - \mu_1)^2}{2\sigma_1^2} + \sum_{i \in S_2} \frac{(Y_i - \mu_2)^2}{2\sigma_2^2} < \sum_i \frac{(Y_i - \mu_1)^2}{2\sigma_1^2}    (5.65)

Thus, the quantity in equation 5.66 is less than the quantity in equation 5.64.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2))    (5.66)

Recalling equation 5.60, consistency is implied if the inequality in equation 5.67 holds.

    N\log(\sigma_K) - |S_1|\log(\sigma_1) - |S_2|\log(\sigma_2) + \sum_{i \in S_1} \log(p(X_i = 1)) + \sum_{i \in S_2} \log(p(X_i = 2)) > (D_{K_T} - D_K)\log(N)/2    (5.67)

Note that the minimum value of log(p(X_i = 1)) or log(p(X_i = 2)) is −log(e^{8φ} + 1), which is approximately equal to −8φ. Suppose, without loss of generality, that σ_1 > σ_2. The inequality in equation 5.67 will be assured when equation 5.68 holds.


    N\log(\sigma_K) - N\log(\sigma_1) - N(8\phi) > (D_{K_T} - D_K)\log(N)/2    (5.68)

Equation 5.68 is equivalent to equation 5.69.

    N\big( \log(\sigma_K) - \log(\sigma_1) - 8\phi \big) > (D_{K_T} - D_K)\log(N)/2    (5.69)

The left hand side of equation 5.69 increases with order O(N) as long as log(σ_K) − log(σ_1) − 8φ > 0. This inequality is condition A. Since the right hand side of equation 5.69 only increases with order O(log(N)), it is clear that as N → ∞ the inequality of equation 5.69 will hold, thus implying consistency.

End of Proof.

Case 1 is the case in which we are comparing models with K > 1 components to the true model, K_T = 1. No restriction is imposed on K in this case. Case 2 assumes that K_T = 2 and imposes the restriction that the hypothesized K must be 1. Although this at first seems a rather extreme restriction, it is similar to a nested model restriction. Suppose the estimated segmentation with K + 1 segments is nested in the segmentation with K segments, in the sense that both segmentations are the same except for one of the K segments which has been subdivided in the segmentation with K + 1 segments. In this case, the main difference in BIC_PL between the two models will be attributable to the subdivided segment. Consideration of this segment alone becomes a comparison of a 2 segment model with a 1 segment model. Thus, the consistency result for case 2 addresses a basic comparison (2 segments versus 1 segment) which may be the main driving force in many other comparisons.


5.2 An Automatic Unsupervised Segmentation Method

5.2.1 Overview

This automatic unsupervised image segmentation method consists of several steps:

1. Initialize.
2. Use EM to fit a mixture model and find a marginal segmentation through maximum likelihood classification.
3. Use ICM to refine the segmentation and estimate parameters.
4. Choose the number of segments (K) using pseudolikelihood BIC.
5. (Optional) Morphological smoothing.

For the first and last steps I present several possible methods, though these are not critical and could be replaced with methods tailored to particular applications. Similarly, additional steps could be added following the segmentation to allow use of training data or other application-specific knowledge.

5.2.2 Initialization

An initial segmentation of the image is needed before we can use EM. I have explored several methods of initialization: Ward's method (Ward, 1963), random parameter estimates, and a method based on histogram equalization.

Ward's method is an agglomerative, hierarchical clustering method. It begins by considering each unique greyscale level observed in the image as a single cluster.


94final number <strong>of</strong> clusters. The choice <strong>of</strong> which clusters to merge is based on minimizinga sum <strong>of</strong> squares criterion at each step. When this process is complete,we have divided the greyscale levels into K groups, and we use this as the initialsegmentation. Because <strong>of</strong> the sum <strong>of</strong> squares criterion used in Ward’s method, itis appropriate when one intends to fit a Gaussian mixture model.One disadvantage <strong>of</strong> Ward’s method is the amount <strong>of</strong> computing it requires; asan alternative to Ward’s method, the initial parameter estimates can be randomlygenerated. This is quite fast. Since the EM algorithm typically moves very quicklytoward good parameter values, it is reasonably robust to the initial parameterestimates. However, random estimates may sometimes lead EM into undesiredlocal maxima in the likelihood. This can be alleviated by starting from severaldifferent sets <strong>of</strong> random parameter values, though this solution has the drawbackthat it once again increases computation time. Also, the use <strong>of</strong> random values isintuitively somewhat unsatisfying.Histogram equalization provides the basis for a fast and robust method <strong>of</strong>finding an initial segmentation. Consider a histogram <strong>of</strong> the greyscale values in animage (without regard to spatial information). The idea <strong>of</strong> histogram equalizationis to divide the greyscale levels into K bins which contain roughly equal numbers<strong>of</strong> pixels. Graphically, this means that we adjust the bin sizes (bins are not allthe same size) until the histogram is flat. I compute the histogram equalizationwith an iterative algorithm. First, I divide the number <strong>of</strong> data points N by thedesired number <strong>of</strong> bins K; the result N/K is the number <strong>of</strong> data points we wouldlike to have in each bin. Let T 0 denote the threshold number <strong>of</strong> pixels to allocateto each bin, so we set T 0 = N/K. Beginning with greyscale level 0, I allocate allpixels with that greyscale level into the first bin. Continuing with greyscale levels1, 2, and so on, I allocate each into the first bin until I have at least T 0 pixels inthe bin. I then allocate the following greyscale levels into the second bin until it


I then allocate the following greyscale levels into the second bin until it has at least T_0 pixels. The process continues until all greyscale levels have been allocated. Note that there may not be enough pixels to fill all of the bins. If empty bins remain at the end of the process, then it is computed again except that T_0 is replaced by T_1 = C·T_0, where C is a fraction between 0 and 1. I usually use a value of C = 2/3. The process is iterated, replacing the threshold each time with T_i = C·T_{i−1}, until all K bins have pixels. Even when the data are extremely skewed or concentrated on just a couple of grey values, this algorithm is guaranteed to converge as long as K is less than or equal to the number of distinct grey values in the data.

A common use of histogram equalization is in gamma correction. Gamma correction refers to adjusting the brightness with which each greyscale value is displayed. For instance, consider a greyscale image which contains a few pixels with grey value 250 and all the rest of the pixels with grey values less than 30. At first glance, the image might appear mostly black, even though there may be features in the dimmer pixels. The problem is that the sensitivity of the human eye is not linear with brightness. We can tell light grey from dark grey, but not light black from dark black. By changing the mapping from pixel values into displayed brightness, we can move or stretch the region of pixel values which is displayed at an appropriate brightness for human viewing. For the example I just mentioned, it would be appropriate to map the values from 1 to 30 to the range of dark to light so that the features would be visible. Clearly, this depends entirely on the image under consideration. Histogram equalization is an automatic way of selecting this gamma correction; it attempts to map equal numbers of pixels into the darker and lighter parts of the display spectrum.
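The iterative binning described above translates into a short routine. The sketch below is an illustrative Python version that returns a mapping from each observed grey level to one of K initial segments; the function name and return format are my own choices, and it stands in for, rather than reproduces, the implementations discussed in this chapter.

    import numpy as np

    def equalization_bins(image, K, C=2.0 / 3.0):
        """Assign each observed grey level to one of K bins holding at least T pixels,
        shrinking T by the factor C until no bin is left empty (section 5.2.2)."""
        levels, counts = np.unique(image, return_counts=True)
        assert K <= len(levels), "K may not exceed the number of distinct grey values"
        T = image.size / K                       # T_0 = N / K
        while True:
            labels = np.zeros(len(levels), dtype=int)
            bin_id, filled = 0, 0
            for i, c in enumerate(counts):       # sweep grey levels in increasing order
                if filled >= T and bin_id < K - 1:
                    bin_id += 1
                    filled = 0
                labels[i] = bin_id
                filled += c
            if len(np.unique(labels)) == K:      # every bin received some pixels
                return dict(zip(levels.tolist(), labels.tolist()))
            T *= C                               # T_i = C * T_{i-1}; try again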


Histogram equalization and Ward's method are both good at picking out high density regions in the greyscale histogram, but histogram equalization is much faster. My implementation of automatic segmentation in XV uses histogram equalization to initialize (see section A.1), while my Splus implementation uses Ward's method.

5.2.3 Marginal Segmentation via Mixture Models

Parameter Estimation by EM

I use the EM algorithm to estimate the parameters of the mixture density. Let Q denote the pixel probabilities, where Q_ij is the probability that pixel i is from component j. The initial segmentation is viewed as providing an initial estimate of Q; specifically, this initial estimate will consist of Q_ij = 1 if pixel i is initially classified in component j, and zero otherwise. After initialization, the algorithm iterates between the M-step, computing maximum likelihood estimates of the mixture parameters θ conditional on Q, and the E-step, estimating Q conditional on θ (the name of this step comes from the fact that Q_ij is the expected value of I(Z_i = j), which is an indicator function equal to 1 if pixel i is generated by component j and zero otherwise).

At each iteration, I compute the overall loglikelihood of the data Y given the parameters θ, using the assumption of independent pixels. In general, the EM process is repeated until the loglikelihood converges. My Splus implementation allows a user-specified limit on the number of iterations. The parameters θ which must be estimated are the density parameters from each of the K components of the mixture distribution and K − 1 mixture proportions (the mixture proportions sum to 1, so one of the K proportions is fixed given the other K − 1 proportions).


97is the number <strong>of</strong> components (this is the same as equation 3.15 <strong>of</strong> section 3.2).K∑f(Y i |K, θ) = P j Φ(Y i |θ j ) (5.70)j=1Under the assumption that all pixels are independent, the likelihood for thewhole image is given by equation 5.71, where N is the number <strong>of</strong> data points.f(Y |K, θ) =M-Step⎛⎞N∏ K∑⎝ P j Φ(Y i |θ j ) ⎠ (5.71)i=1 j=1The M-step consists <strong>of</strong> finding the maximum likelihood estimate <strong>of</strong> θ conditionalon Q. Think <strong>of</strong> Q as a matrix with one row for each pixel and one column foreach component. The mixture proportions are easily obtainable from Q as shownin equation 5.72.ˆP j = 1 NN∑ˆQ ij (5.72)Since Q ij is the probability <strong>of</strong> pixel i being in component j, we know that eachrow <strong>of</strong> Q must sum to 1. The sum <strong>of</strong> all the elements <strong>of</strong> Q will be equal to N.This agrees with the fact that ∑ j P j = 1.We now find estimates for the density parameters. For a Gaussian mixture,i=1we need estimates for the mean µ j and variance σ 2 jfor each component j. For aPoisson mixture, only the mean is needed. Formulas for the mean and varianceestimates are gven in equations 5.73 and 5.74. In essence, these are simply weightedversions <strong>of</strong> the usual maximum likelihood estimators, with the weights given byQ.ˆµ j =∑ Ni=1 ˆQij Y i∑i ˆQ ij(5.73)


    \hat\sigma_j^2 = \frac{\sum_{i=1}^{N} \hat Q_{ij} (Y_i - \hat\mu_j)^2}{\sum_i \hat Q_{ij}}    (5.74)

Significant computational savings can be achieved by calculating these estimates in a slightly different way. With equations 5.72, 5.73, and 5.74, an iteration over all N pixels is needed. This will make the algorithm take time proportional to the number of pixels. However, this time can be reduced to a time proportional to the number of unique values in the data. This is done by creating a list of unique data values and counting the number of pixels with each value (like a histogram). The parameter estimates can be computed by iterating over the unique data values instead of the pixels. This makes the algorithm operate in constant time with respect to the number of pixels; the time is linear in the number of unique data values. The time saved by this change is enormous. For instance, with a 256-level greyscale image, we will have at most 256 unique values, regardless of whether there are thousands or millions of pixels. The equations for computing estimates in this way are shown in equations 5.75, 5.76, and 5.77. My Splus implementation and my XV implementation both use this time-saving approach.

Suppose there are C unique data values (e.g. C = 256 grey levels). Let V denote the unique data values, so V is a vector of length C. Let H_i be the number of pixels with the ith unique data value, so H is like a histogram of the data. Define R the same way as Q, except that each row is for a unique data value rather than a particular pixel. In other words, R_ij is the probability that the ith unique data value was generated by the jth component. As with Q, each row of R must sum to 1. The sum of all elements of R will be equal to C.

    \hat P_j = \frac{1}{N} \sum_{i=1}^{C} H_i \hat R_{ij}    (5.75)


    \hat\mu_j = \frac{\sum_{i=1}^{C} H_i \hat R_{ij} V_i}{\sum_{i=1}^{C} H_i \hat R_{ij}}    (5.76)

    \hat\sigma_j^2 = \frac{\sum_{i=1}^{C} H_i \hat R_{ij} (V_i - \hat\mu_j)^2}{\sum_{i=1}^{C} H_i \hat R_{ij}}    (5.77)

E-Step

In the E-step, we update the estimate of Q (or R) conditional on the current estimate of θ. Equations 5.78 and 5.79 show how to compute Q̂ or R̂. In a particular implementation of the algorithm, only one approach is needed; I recommend the use of R (equations 5.75, 5.76, 5.77, and 5.79) instead of Q (equations 5.72, 5.73, 5.74, and 5.78). The two approaches are numerically equivalent, but the R method is much faster.

    \hat Q_{ij} = \frac{\hat P_j\, \Phi(Y_i \mid \hat\mu_j, \hat\sigma_j^2)}{\sum_j \hat P_j\, \Phi(Y_i \mid \hat\mu_j, \hat\sigma_j^2)}    (5.78)

    \hat R_{ij} = \frac{\hat P_j\, \Phi(V_i \mid \hat\mu_j, \hat\sigma_j^2)}{\sum_j \hat P_j\, \Phi(V_i \mid \hat\mu_j, \hat\sigma_j^2)}    (5.79)
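The E- and M-steps in their histogram form (equations 5.75 to 5.77 and 5.79) can be written compactly, as in the Python sketch below. The crude equally spaced initialization and the fixed iteration count are illustrative stand-ins for the initialization methods of section 5.2.2 and the convergence rules discussed under "Practical Issues"; the lower bound on σ anticipates the constraint described there.

    import numpy as np

    def em_greyscale_mixture(image, K, n_iter=20, min_sigma=0.5):
        """Fit a K-component Gaussian mixture to the grey levels of an image,
        iterating over unique grey values rather than pixels."""
        V, H = np.unique(image, return_counts=True)      # unique values and their counts
        V = V.astype(float)
        N = image.size
        P = np.full(K, 1.0 / K)                          # mixture proportions
        mu = np.linspace(V.min(), V.max(), K)            # crude initial means
        sigma2 = np.full(K, max(V.var() / K, min_sigma**2))
        for _ in range(n_iter):
            # E-step (equation 5.79): responsibilities R_ij for each grey value.
            dens = np.exp(-0.5 * (V[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
            R = P * dens
            R /= R.sum(axis=1, keepdims=True)
            # M-step (equations 5.75-5.77): histogram-weighted maximum likelihood estimates.
            W = H[:, None] * R                           # H_i * R_ij
            P = W.sum(axis=0) / N
            mu = (W * V[:, None]).sum(axis=0) / W.sum(axis=0)
            sigma2 = (W * (V[:, None] - mu) ** 2).sum(axis=0) / W.sum(axis=0)
            sigma2 = np.maximum(sigma2, min_sigma**2)    # lower bound of 0.5 on sigma
        return P, mu, sigma2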


Practical Issues

There are several practical problems which arise in the use of this algorithm. In this section I discuss these problems and the solutions I have implemented.

When Gaussian mixtures are used, we must keep in mind that the data are in fact discrete. The discrepancy between the continuous Gaussian distribution and the discrete data becomes more pronounced when there are fewer unique data values or when the number of components K is increased. Trouble occurs when the variance of a component gets too small. This can easily happen if there are many pixels with one particular data value, which causes a spike in the histogram of data values. Since the spike consists of a single discrete data value, the variance of a Gaussian component might shrink as the component tries to model this single value. If the variance is allowed to approach zero, then the likelihood will approach infinity. This problem is solved by imposing a lower bound on the variance of a component. If we imagine that the observed data are a discretized version of a true Gaussian mixture, then an observation Y_i can be thought of as having a round-off error which is unobserved. This means that an observed value of Y_i could arise from a true value in the range (Y_i − 0.5) to (Y_i + 0.5). To capture this variability, I impose a lower bound of 0.5 on the estimate of σ for each component.

With Poisson mixtures, there is only a mean parameter µ, so the variance constraint is not needed. However, Poisson mixtures sometimes have an identifiability problem. If the µ_j values for two components become too similar, then they are modeling the same feature of the data and their mixture proportions become arbitrary. That is, if components A and B have the same mean, then P_A and P_B are not uniquely defined.

There is a milder problem with means in Gaussian mixtures as well. If two components have means which are close, then the component with the larger variance will be split. That is, points close to the common mean of the two segments will be classified into the segment with smaller variance, since it has a higher likelihood. Points farther from the mean, both high and low, will be classified into the component with larger variance; this component will then contain sets of points which are disjoint in grey level. This problem was discussed in section 3.4.2.

There is some question of how to determine when the EM algorithm has converged. Typically, one looks for convergence in the loglikelihood, so the algorithm is stopped when the change in loglikelihood from one iteration to the next is below a certain threshold. It is not always clear how this threshold should be chosen.


However, experience with the EM algorithm suggests that it makes large steps at first, and then takes longer to converge once it is in the vicinity of a solution. For classification purposes, we do not really need extremely accurate estimates of the parameters of each component (especially in light of the inherent uncertainty in the data due to discretization, as discussed above). An adaptive threshold can be found by considering the contribution of each pixel to the loglikelihood. For instance, if the change in loglikelihood (from iteration i to iteration i + 1) for each pixel was less than 0.00001, then the overall change in loglikelihood would be less than 0.00001N, where N is the number of pixels. Of course, some pixels might have a larger or smaller change than 0.00001. My XV implementation uses this approach, so the convergence criterion for EM is a change in the loglikelihood of less than 0.00001N. If 0.00001N is larger than 1, then 1 is used as the criterion instead. My Splus implementation runs much more slowly than XV, so I simply allow a user-definable limit on the number of iterations for each execution of the EM algorithm. Inspection of output reveals that 20 iterations is usually sufficient for the parameter estimates to stabilize, so this is the value that I typically use when running the algorithm in Splus.

Final Marginal Segmentation

Once we have used EM to estimate the mixture model with K components, all that remains is to classify each pixel into one of the K segments. In section 3.4, I discussed two different methods for performing this classification: mixture classification and componentwise classification. In either method, we consider each pixel (or each unique data value) in turn, and classify it into the segment for which it has the highest likelihood (using the parameter estimates from the last iteration of EM).


The difference between the two methods is that in mixture classification the mixture proportions are included in the likelihood, while in componentwise classification they are not. Each of these approaches to classification is optimal for a particular utility function. I use componentwise classification for reasons discussed in section 3.4.

Let Ẑ_i denote the estimate of the true classification for pixel i. The classification is characterized by equation 5.80.

    \hat Z_i = \arg\max_j \Phi(Y_i \mid \hat\theta_j)    (5.80)

In equation 5.80, j indexes the components (segments), Ẑ_i is the estimated class for pixel i, Y_i is the observed value of pixel i, and θ̂_j is the vector of estimated parameters for component j.

5.2.4 ICM and Pseudolikelihood BIC

The final marginal segmentation is used as a starting point for the ICM algorithm. The ICM algorithm has two goals: to produce a final segmentation assuming a particular value for K, and to estimate the parameters needed for inference about K. Section 5.1.2 describes the ICM algorithm in detail; in particular, it shows how to obtain estimates of the density parameters θ̂, the Markov random field parameter φ̂, and the hidden Markov random field X̂.

To perform inference for K, I use BIC_PL(K), which is the BIC with the likelihood integrated over models near the posterior mode of X. Equations 5.81 to 5.83 show how to compute this quantity for a given K. Recall that f(Y_i|X_i = g) is simply a Gaussian density with parameters µ_g and σ_g², and p(X_i = g|N(X̂_i), φ̂) is given by the Potts model in equation 5.2.

    \log(f(Y_i \mid \hat X_{-i}, \hat\phi)) = \log\left( \sum_{g=1}^{K} f(Y_i \mid X_i = g)\, p(X_i = g \mid N(\hat X_i), \hat\phi) \right)    (5.81)


    \log L_{\hat X}(Y \mid K) = \sum_i \log(f(Y_i \mid \hat X_{-i}, \hat\phi))    (5.82)

    BIC_{PL}(K) = 2 \log(L_{\hat X}(Y \mid K)) - D_K \log(N)    (5.83)

In equation 5.83, N is the number of pixels (excluding the boundary of the image), and D_K is the number of parameters in the K segment model. The parameters are φ, a mean and a variance for each segment, and a mixture proportion for all but one segment; this results in D_K = 3K.

5.2.5 Determining the Number of Components

The number of components is determined by simply comparing the BIC values for several choices of K. In some cases we are interested in whether there are any features at all (in addition to the background), so a reasonable starting value for K is 1. An obvious upper bound for K is the number of unique values in the data; however, the segmentation which results when K takes on this value would not be useful, since it would be identical to the original image. It is also doubtful that the model would hold with very large K; for instance, a Gaussian mixture is not a good model of discrete data when the number of components in the mixture is close to the number of unique data values. In general, interest usually lies in finding just a few salient features in the data, so there is no need to explore the entire parameter space of K. Since models with smaller values of K can be fitted more quickly than those with large values of K, it makes sense to start with small K and then increase it. I begin with K = 1 and then increase K only as long as the BIC value increases. In other words, I am sequentially comparing a model with K components to a model with K + 1 components, until the K + 1 component model fails to outperform the K component model, as judged by the BIC.
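The stopping rule of this section amounts to a short driver loop. In the sketch below, fit_segmentation is a hypothetical callable standing in for the full EM-then-ICM fit of sections 5.2.2 to 5.2.4; it is assumed to take the image and a candidate K and return the corresponding BIC_PL value of equation 5.83.

    def choose_number_of_segments(image, fit_segmentation, K_max=10):
        """Increase K from 1 and stop at the first local maximum of BIC_PL."""
        best_K, best_bic = 1, fit_segmentation(image, 1)
        for K in range(2, K_max + 1):
            bic = fit_segmentation(image, K)
            if bic <= best_bic:      # the larger model fails to outperform the smaller one
                return best_K, best_bic
            best_K, best_bic = K, bic
        return best_K, best_bic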


5.2.6 Morphological Smoothing (Optional)

Application of smoothing based on mathematical morphology can enhance the spatial contiguity of regions in the final segmentation. However, one must be wary when looking for small features, since overuse of smoothing may remove the very features one is seeking. The smoothing step, although often extremely useful, is nonetheless something which must be carefully considered before application.

Mathematical morphology is a local smoothing method which lends itself to fast computation. In morphology, each pixel is examined and an operation is performed on it based on the contents of a small neighborhood of pixels around it. In this section I will discuss the basic operations which can be used and the specification of the pixel neighborhood.

A structuring element defines the size and shape of the neighborhood of each pixel. The structuring element is simply a specification of neighboring pixels in relation to the pixel under consideration. For example, a commonly used structuring element is a 3 pixel by 3 pixel square, with the pixel under consideration at the center. With this structuring element, the neighborhood of a pixel consists of the pixel itself and its 8 adjacent neighbors. In my XV implementation of mathematical morphology, I use this 3 by 3 structuring element by default. Changing the shape of the structuring element can enhance certain features in the data; for instance, use of a structuring element consisting of a short vertical line of pixels would accentuate vertical features in the image.

The two basic morphology operations are erosion and dilation. Consider pixel i and its neighbors N(i). The neighborhood is defined by an arbitrary structuring element; note that the neighborhood can include pixel i. Erosion consists of setting the value of pixel i to the minimum of the pixels in N(i), while dilation (a dual of erosion) is accomplished by setting pixel i to the maximum of the pixels in N(i).


Let Y denote the original image, and let A be the image after the morphology operation. Then erosion can be expressed by equation 5.84, while dilation is given by equation 5.85.

    A_i = \min(Y_{N(i)})    (5.84)

    A_i = \max(Y_{N(i)})    (5.85)

These two basic morphology operations are usually not performed singly. If an image is eroded (or dilated) repeatedly, it will eventually become a solid color equal to the minimum (or maximum) pixel value in the image. To retain the major features in the image while smoothing the noise, the two operations can be performed in sequence. An erosion followed by a dilation is called an opening, while a dilation followed by an erosion is called a closing. The opening and closing operations are idempotent; that is, repeating an opening (or closing) operation will have no effect (Serra, 1982).

Other operations can be substituted in place of the minimum and maximum in equations 5.84 and 5.85. Two other common operations are median (the median filter) and mean (blurring).

The use of smoothing operations is something which requires application-specific knowledge. In some cases, no smoothing at all may be the best way to keep features of interest, while in other cases the structuring element can be tailored to help detect certain feature shapes. In the examples I consider in section 5.3, the morphological smoothing step consists of an opening followed by a closing, using a 3 pixel by 3 pixel structuring element.
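These operations reduce to taking a minimum or maximum over each pixel's neighborhood, and the opening and closing used in section 5.3 are just compositions of the two. The Python sketch below uses the default 3 by 3 structuring element; the edge-replication boundary handling and the helper name are assumptions of mine, since the text does not specify how borders are treated.

    import numpy as np

    def _neighborhood_reduce(X, op, size=3):
        """Apply op (e.g. np.min or np.max) over a size-by-size window around each pixel."""
        pad = size // 2
        P = np.pad(X, pad, mode="edge")
        windows = [P[di:di + X.shape[0], dj:dj + X.shape[1]]
                   for di in range(size) for dj in range(size)]
        return op(np.stack(windows), axis=0)

    def erode(X):   return _neighborhood_reduce(X, np.min)   # equation 5.84
    def dilate(X):  return _neighborhood_reduce(X, np.max)   # equation 5.85
    def opening(X): return dilate(erode(X))                  # erosion followed by dilation
    def closing(X): return erode(dilate(X))                  # dilation followed by erosion

    # The smoothing step used in section 5.3 corresponds to closing(opening(Xhat)).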


5.3 Image Segmentation Examples

5.3.1 Simulated Two-Segment Image

This simulation illustrates the use of spatial information by BIC_PL, compared to the use of only marginal information by BIC_IND, where BIC_IND denotes the usual BIC value: twice the loglikelihood (assuming spatial independence) minus the penalty term, which is the number of degrees of freedom in the model multiplied by the log of the number of data points. A more complete description of the segmentation algorithm is given in section 5.3.2.

The simulated two-segment image shown in figure 5.1 is composed of two solid bands, with mean greyscale values of 120 and 140. Independent Gaussian noise (mean = 0, variance = 100) is added to each pixel, and then the values are rounded to integers. Compare this to the scrambled image in figure 5.2, which is a random reordering of the pixels from figure 5.1. Since these two images contain exactly the same pixels (just in a different order), their marginal information is the same. For instance, a histogram of the values in one image will be the same as that of the other. Such a histogram is shown in figure 5.3.

Although it is visually clear that there are two segments in figure 5.1, there is enough noise in the image that the histogram of greyscale values, shown in figure 5.3, is unimodal. Since the marginal information of the histogram is the basis of BIC_IND, it is not surprising that it selects only one segment, as shown by the values in table 5.1. Note that the BIC_IND results are identical for the two-segment image and the scrambled image, since they have the same marginal information. However, the BIC_PL values favor two segments for the two-segment image and one segment for the scrambled image. The estimated φ values used in computing BIC_PL are also shown in the table.


The large φ for the two-segment fit of the two-segment image indicates that a large degree of spatial homogeneity is found; this results in a much higher value of BIC_PL than is obtained for either one or three segments.

Table 5.1: BIC_PL and BIC_IND results for the simulated two-segment image and the scrambled image.

                         Two-Segment Image                    Scrambled Image
Segments      Est. φ    BIC_PL       BIC_IND      Est. φ    BIC_PL       BIC_IND
    1           –       -11731.28    -12998.79      –       -11756.49    -12998.79
    2          2.06     -10784.69    -13002.17     0.05     -11843.80    -13002.17
    3          0.09     -11115.64    -13021.64     0.00     -11932.16    -13021.64
    4          0.00     -10989.76    -13062.93     0.00     -11660.82    -13062.93
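For reference, the BIC_IND criterion described above can be written out explicitly. Under the independence assumption the marginal model is just the Gaussian mixture fitted by EM, so, writing π_j, μ_j, and σ_j² for the mixing proportions, means, and variances, N for the number of pixels, and D_K for the number of free parameters (as in the BIC_PL penalty), the criterion implied by that description is

$$\mathrm{BIC}_{IND}(K) = 2 \sum_{i=1}^{N} \log\Bigl( \sum_{j=1}^{K} \pi_j \, f(Y_i \mid \mu_j, \sigma_j^2) \Bigr) - D_K \log N,$$

where f(· | μ, σ²) denotes the univariate Gaussian density. This is a sketch of the formula implied by the prose description above, not a restatement of the exact definition given earlier in the chapter.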


Figure 5.1: Simulated two-segment image.


Figure 5.2: Scrambled version of figure 5.1.


Figure 5.3: Marginal histogram of the simulated image.


5.3.2 Simulated Three-Segment Image

The simulated image shown in figure 5.4 is composed of three solid bands, with greyscale values of 70, 140, and 210. Independent Gaussian noise (mean = 0, variance = 225) is added to each pixel, and then the values are rounded to integers. This simulation is meant to provide a simple illustration of the algorithm. It is visually clear that there are three segments; examination of the marginal histogram (figure 5.5) reinforces this.

From table 5.2, we see that BIC_PL correctly chooses three segments for this image. The BIC penalty term plays an important role in this example; the logpseudolikelihood is maximized at four segments, but the penalty term changes the choice to three segments. One reason that the logpseudolikelihood is so similar between the three- and four-segment cases is that the final segmentation with four segments is extremely similar to the three-segment one; it simply has the middle segment subdivided.

Values of BIC_IND are also shown in table 5.2. BIC_IND is computed from the marginal (without spatial information) greyscale values of the image, using the parameters estimated by EM. For this simulation, BIC_IND chooses 3 segments. One entry in the table, denoted by †, is missing because the final EM segmentation converges to a classification with only 4 segments present, which is not a valid 5-segment result.

The parameter estimates in table 5.3 show that the true parameter values are estimated quite accurately in the three-segment solution.

The initial segmentation by Ward's method is shown in figure 5.6; after EM, the marginal segmentation is shown in figure 5.7. It is clear from the marginal histogram in figure 5.5 that it would be very difficult for any other marginal method to improve on these results. Significant improvement can only be found by taking account of spatial information.


After refining the segmentation with ICM, only three pixels remain incorrectly classified. One of these is on the border between two segments, where spatial information is less useful, and the other two are on edges of the image. The morphological smoothing step corrects these three pixels, resulting in perfect restoration of the true image.

Table 5.2: Logpseudolikelihood and BIC_PL results for the simulated image. A missing value, noted with †, is discussed in the text.

Segments   Est. φ   Logpseudolikelihood      BIC_PL       BIC_IND
    1        –          -18474.83          -36974.22     -39608.7
    2       2.92        -15979.14          -32007.40     -38537.4
    3       1.86        -13853.21          -27780.11     -37421.7
    4       3.81        -13847.57          -27793.41     -37480.2
    5       3.26        -13848.28          -27819.38         †
    6       0.21        -13970.14          -28087.67     -37532.8
    7       0.00        -14086.51          -28344.99     -37565.4
    8       0.00        -14062.93          -28322.39     -37590.0
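As a rough check on how the penalty flips the choice from four segments to three, the penalty can be backed out of Table 5.2. Assuming the simulated image is 60 by 60, so that N = 3600 and log N ≈ 8.19, the three- and four-segment rows give

$$2(-13853.21) - (-27780.11) \approx 73.7 \approx 9 \log N, \qquad 2(-13847.57) - (-27793.41) \approx 98.3 \approx 12 \log N.$$

Moving from three to four segments therefore gains only about 11.3 in twice the logpseudolikelihood but costs about 24.6 in penalty, so BIC_PL prefers three segments. The implied counts of 9 and 12 are consistent with three additional free parameters per added segment, although the precise definition of D_K is the one given earlier in the chapter.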


Figure 5.4: Simulation of a three-segment image, before processing.


Table 5.3: EM-based parameter estimates for the simulated image.

8 segments
  Means: 211.59 196.48 157.62 223.93 138.95 125.97 62.24 79.02
  SDs:   8.39 8.68 11.49 10.62 11.73 14.91 12.38 10.73
  Probs: 0.119 0.106 0.069 0.104 0.201 0.073 0.187 0.14

7 segments
  Means: 215.42 197.42 157.37 138.89 125.95 62.25 79.02
  SDs:   12.83 9.84 11.05 11.69 14.89 12.38 10.73
  Probs: 0.24 0.091 0.068 0.201 0.073 0.187 0.14

6 segments
  Means: 215.4 197.61 141.13 125.15 62.13 78.69
  SDs:   12.87 10.12 14.89 24 12.33 10.43
  Probs: 0.239 0.092 0.302 0.045 0.187 0.135

5 segments
  Means: 215.42 197.58 142.77 128.43 69.95
  SDs:   12.85 9.98 14.88 11.34 14.86
  Probs: 0.239 0.092 0.283 0.053 0.333

4 segments
  Means: 210.17 142.77 127.3 69.97
  SDs:   14.86 14.38 11.18 14.87
  Probs: 0.333 0.28 0.054 0.333

3 segments
  Means: 210.2 140.14 69.85
  SDs:   14.85 15.26 14.78
  Probs: 0.333 0.335 0.332

2 segments
  Means: 212.02 110
  SDs:   13.58 42.57
  Probs: 0.296 0.704

1 segment
  Means: 140.17
  SDs:   59.15
  Probs: 1


Figure 5.5: Marginal histogram of the simulated image, with the estimated 3-component mixture density (axis labels: Greyscale, Percent).


Figure 5.6: Initial segmentation of the simulated image by Ward's method, using 3 segments.


Figure 5.7: Segmentation of the simulated image into 3 segments after EM.


Figure 5.8: Segmentation of the simulated image into 3 segments after ICM.


Figure 5.9: Segmentation of the simulated image into 3 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.3 Ice Floes

Figure 5.10 shows an aerial image of ice floes. This is a 256-level greyscale image. From table 5.4, we see that the first maximum in BIC_PL occurs at K = 2 segments. Parameter estimates for 1 to 8 segments are shown in table 5.5. For comparison, values of BIC_IND are also given in table 5.4. These are based on the marginal segmentation from the EM step, assuming spatial independence. A choice of 4 segments is given by BIC_IND.

The marginal histogram of this image is shown in figure 5.11, along with the estimated mixture density for K = 2. The histogram is clearly bimodal, so the 2-segment model makes sense intuitively. However, a large proportion of the data values occur between the modes, rather than outside the modes. In other words, each mode is skewed; this explains why the two fitted components appear not to be centered on the modes.

From the marginal histogram, it may appear that three segments should fit better than two. This may well be the case for the marginal values, but spatial information plays a large role in BIC_PL in this example. The large changes in BIC_PL values are attributable in part to the large changes in φ. For example, compare figures 5.14 and 5.15. The two-segment version consists almost entirely of large patches of solid color, whether white or black; the three-segment version still has some large patches of white and black, but there is more clutter and the grey segment lacks spatial contiguity, resulting in a large decrease in the overall φ value.

The segmentation process is illustrated in figures 5.12 to 5.16. The initial segmentation by Ward's method (figure 5.12) contains a bit of clutter in the water, with very little clutter in the ice floe interiors. After using the EM algorithm to fit a Gaussian mixture (figure 5.13), there is a small increase in clutter in the floe interiors, but there is much less clutter in the water.


The ICM refinement (figure 5.14) does a very good job of eliminating clutter in both the water and the ice; only one possible melt pool is evident in the main floe in the image. The melt pool is smoothed out by the morphological smoothing step (figure 5.16), along with some of the smaller bits of ice in the water; however, this step unfortunately links the main floe with the floe on the right side of the image. Depending on the goals of a particular analysis, one might end the processing before the morphological smoothing step.

Table 5.4: Logpseudolikelihood and BIC_PL results for the ice floe image.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -55649.19          -111326.33     -115800.6
    2       1.28        -46342.76           -92741.44     -110028.8
    3       0.67        -59865.64          -119815.16     -110002.0
    4       0.45        -42259.93           -84631.71     -109388.2
    5       0.33        -41835.65           -83811.10     -109394.3
    6       0.28        -40094.29           -80356.34     -109330.8
    7       0.22        -39428.42           -79052.56     -109345.3
    8       0.13        -39537.04           -79297.76     -109375.4


Figure 5.10: Aerial image of ice floes.


Table 5.5: EM-based parameter estimates for the ice floe image.

8 segments
  Means: 83.88 38.15 52.85 110.16 131.16 146.59 67.44 163.08
  SDs:   8.63 8.28 8.42 9.91 9.23 7.09 7.4 6.96
  Probs: 0.097 0.083 0.123 0.098 0.131 0.329 0.06 0.079

7 segments
  Means: 79.63 38.27 53.69 111.91 131.43 146.61 163.08
  SDs:   13.44 8.35 9.12 9.38 9.14 7.1 6.96
  Probs: 0.165 0.082 0.129 0.087 0.129 0.329 0.079

6 segments
  Means: 79.89 48.05 112.07 131.45 146.61 163.08
  SDs:   13.69 11.94 9.24 9.1 7.09 6.96
  Probs: 0.164 0.214 0.085 0.129 0.329 0.079

5 segments
  Means: 79.99 48.07 112.39 133.78 147.61
  SDs:   13.69 11.95 9.13 12.7 11.13
  Probs: 0.165 0.214 0.082 0.094 0.445

4 segments
  Means: 77.1 47.61 119.82 147.72
  SDs:   13.26 11.76 17.94 10.9
  Probs: 0.148 0.207 0.2 0.445

3 segments
  Means: 55.53 112.67 147.28
  SDs:   16.47 23.02 10.97
  Probs: 0.302 0.248 0.45

2 segments
  Means: 67.82 143.73
  SDs:   24.93 13.77
  Probs: 0.431 0.569

1 segment
  Means: 110.99
  SDs:   42.3
  Probs: 1


Figure 5.11: Marginal histogram of the ice floe image, with the estimated 2-component mixture density (axis labels: Greyscale, Percent).


Figure 5.12: Initial segmentation of the ice floe image by Ward's method, using 2 segments.


Figure 5.13: Segmentation of the ice floe image into 2 segments after EM.


Figure 5.14: Segmentation of the ice floe image into 2 segments after refinement by ICM.


Figure 5.15: Segmentation of the ice floe image into 3 segments after refinement by ICM.


Figure 5.16: Segmentation of the ice floe image into 2 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.4 Dog Lung

Figure 5.17 presents a PET image of a dog lung. This image was obtained from Dr. H. T. Robertson at the University of Washington Division of Pulmonary and Critical Care. Dr. Robertson also provided an expert examination of these results, finding that the final segmentation was sufficient for separating out the lung as a region of interest for further analysis.

It is clear in figure 5.17 that the actual image area is circular, with the corners of the image filled in with a constant grey value. This sort of artifact can be removed quite easily with a mixture model; one of the components converges to a spike at that grey level, effectively separating it from the rest of the data. This can be seen clearly in figure 5.18, and it is also apparent in the parameter estimates in table 5.7.

The results in table 5.6 show that BIC_PL chooses four segments for this image. In the context of PET imagery, the choice of four segments is quite reasonable. Two segments are needed for the background in order to model the spike (due to the corner artifact) and the general background. Since the image is constructed from radioactive emissions from gas in the lung, it is not at all surprising to see two segments for the lung itself, accounting for the high gas density in the interior of the lung and the somewhat lower gas density around the periphery. For this image, BIC_IND also chooses 4 segments using only the marginal (without spatial information) greyscale values.

The initial segmentation by Ward's method is shown in figure 5.19, and the segmentation after the EM algorithm is given in figure 5.20. In this case, Ward's method does a reasonable job of separating the lung from the background; we would expect this, since the lung is visually very bright compared to the relatively low-level background.


The EM step fills in some of the voids in the lung apparent in the initial segmentation, but at the cost of incorporating more of the background artifacts as well.

The ICM refinement, shown in figure 5.21, does a good job of reducing clutter in the image, but it leaves an erroneous section sticking out of the top of the lung. Morphological smoothing, shown in figure 5.22, removes the worst of this artifact, as well as most of the other clutter in the image. The small spot separate from and below the lung is easily removed by simply taking the lung to be the largest connected component. The small void in the center of the lung is not artifactual; it is real.

This final segmentation shows that the method is sufficient for the purpose of processing a database of lung images to separate the region of interest (the lung) from the background. Currently, the most widely used approach is for a human expert to manually outline the lung with an interactive computer program, a process which is tedious and can take a long time for a large database of images. The automatic segmentation algorithm can obviate the need for this manual process, requiring only human inspection of the results.


Table 5.6: Logpseudolikelihood and BIC_PL results for the dog lung image.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -80641.02          -161311.13     -166007.1
    2       1.14        -61081.56          -122221.33     -137193.0
    3       1.09        -59536.76          -119160.82     -137019.8
    4       0.52        -42147.64           -84411.69     -128746.2
    5       0.71        -48914.07           -97973.64     -135323.5
    6       0.69        -48508.12           -97190.83     -135333.6
    7       0.47        -36602.26           -73408.23     -128800.6
    8       0.42        -36254.37           -72741.54     -128825.8

Figure 5.17: PET image of a dog lung, before processing.


Table 5.7: EM-based parameter estimates for the dog lung image.

8 segments
  Means: 41 36.24 46.28 78.28 120.06 147.19 174.8 214.25
  SDs:   0.5 9.14 9.25 18.22 15.01 9.15 13.93 15.13
  Probs: 0.241 0.285 0.336 0.04 0.028 0.019 0.036 0.016

7 segments
  Means: 41 36.26 46.27 78.25 135.39 175.21 216.28
  SDs:   0.5 9.13 9.26 20.39 23.58 16.97 14.58
  Probs: 0.241 0.285 0.335 0.042 0.05 0.034 0.014

6 segments
  Means: 40.86 42.67 90.68 137.66 175.45 216.31
  SDs:   3.2 12.29 16.57 22.71 17.05 14.59
  Probs: 0.398 0.479 0.029 0.048 0.033 0.013

5 segments
  Means: 40.86 42.65 90.23 138.2 182.43
  SDs:   3.2 12.27 17.38 25.81 28.69
  Probs: 0.398 0.478 0.029 0.046 0.05

4 segments
  Means: 41 41.88 104.15 177.47
  SDs:   0.5 10.56 38.01 29.58
  Probs: 0.24 0.622 0.079 0.059

3 segments
  Means: 41.3 67.95 166.35
  SDs:   7.99 30.79 32.89
  Probs: 0.816 0.098 0.086

2 segments
  Means: 41.49 127.11
  SDs:   8.47 54.57
  Probs: 0.846 0.154

1 segment
  Means: 54.64
  SDs:   38.35
  Probs: 1


Figure 5.18: Marginal histogram of the dog lung image, with the estimated 4-component mixture density (axis labels: Greyscale, Percent).


Figure 5.19: Initial segmentation of the dog lung image by Ward's method, using 4 segments.


Figure 5.20: Segmentation of the dog lung image into 4 segments after EM.


Figure 5.21: Segmentation of the dog lung image into 4 segments after ICM.


Figure 5.22: Segmentation of the dog lung image into 4 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.5 Washington Coast

Figure 5.23 is a satellite image of a section of the Pacific coast of Washington state's Olympic peninsula; this image, provided by the USGS as part of the National Aerial Photography Program (NAPP), was obtained from the Terraserver web page, terraserver.microsoft.com. The resolution is 8 meters per pixel, and it is a 256-level greyscale image.

Table 5.8 shows that the first maximum in BIC_PL occurs at K = 6 segments, so we regard this as the optimal choice of K for this model. Parameter estimates are given in table 5.9. The marginal histogram of this image is shown in figure 5.24, along with the estimated mixture density for K = 6. The large spike is due to the small variance in the greyscale value of the water.

For comparison, BIC_IND values are also shown in the table; the first local maximum occurs at 4 segments. The BIC_IND values are computed based only on marginal (without spatial information) greyscale data from the image. Two of the table entries, denoted by †, are missing. The entry for 7 segments is missing because the EM algorithm converges to a classification which contains fewer segments than the number of components in the mixture; in other words, we do not obtain a valid segmentation into 7 segments from the final EM classification. The entry for 8 segments is missing due to slow convergence, which is not surprising in light of the result for 7 segments.

Figures 5.25 to 5.28 illustrate the steps of the segmentation algorithm for this image. An initial segmentation is created by using Ward's method to cluster the marginal greyscale values of the image; this is shown in figure 5.25. The water is quite well classified as a single segment, while the land is mostly contained in two segments. The tideline accounts for most of the variability in the image. Starting from the initial segmentation, the EM algorithm is used to fit a Gaussian mixture; the resulting segmentation is displayed in figure 5.26.


Most of the pixels interior to the land which had initially been classified into the water segment have now been properly classified into one of the land segments. Figure 5.27 gives the ICM refinement of the segmentation; we can see that much of the noise in the land interior has been smoothed out, while the tideline is still quite clear. The morphological smoothing step, shown in figure 5.28, smooths the land further, but the tideline becomes more obscure.

In both the ICM classification (figure 5.27) and the post-morphology classification (figure 5.28), the water is well characterized by the darkest segment. The second-darkest segment corresponds to shallow tidewater near the bright beach, and it combines with the third-darkest segment to characterize most of the dry land. Although the second-darkest segment comprises both dry land and shallow tidewater, these two land types are spatially separated by the beach. The two brightest segments correspond to the beach, which is quite reflective. The third-brightest segment is transitional; it fills in between dry land and water in regions where the beach is not evident, as well as capturing the small hill in the upper right hand corner of the image.

Comparison of the classifications in figures 5.27 and 5.28 provides a good example of the fact that one needs to know the goals of the analysis in order to know how much smoothing is reasonable. Figure 5.27 yields good spatial separation of the water and land, along with detail in the tidal area; however, the interior of the land and the water near the tideline are both a bit cluttered. Figure 5.28 smooths out much of the clutter in the image while preserving the location of the tideline, but detail in the tideline is lost.


Table 5.8: Logpseudolikelihood and BIC_PL results for the Washington coast image. Missing values, noted with †, are discussed in the text.

Segments   Est. φ   Logpseudolikelihood      BIC_PL        BIC_IND
    1        –          -64521.78          -129072.20     -133475.0
    2       1.23        -56721.70          -113500.70     -125273.0
    3       0.74        -45882.79           -91851.53     -117607.8
    4       0.58        -45303.55           -90721.70     -117528.6
    5       0.45        -42517.74           -85178.71     -117564.9
    6       0.39        -42312.05           -84795.99     -117537.5
    7       0.34        -55201.01          -110602.56         †
    8       0.30        -51279.68          -102788.55         †

Figure 5.23: Satellite image of the Washington coast, before processing.


Table 5.9: EM-based parameter estimates for the Washington coast image.

8 segments
  Means: 105.09 87.29 154.81 184.37 223.56 78.79 55.71 67.53
  SDs:   25.61 6.6 14.62 14.65 17.39 5.52 2.25 4.63
  Probs: 0.125 0.173 0.018 0.014 0.006 0.187 0.354 0.123

7 segments
  Means: 112.1 87.56 153.29 141.34 78.98 55.71 67.52
  SDs:   15.71 7.07 21.44 46.16 5.62 2.27 4.65
  Probs: 0.052 0.188 0.015 0.065 0.197 0.356 0.128

6 segments
  Means: 114.08 85.85 152.2 182.73 74.36 55.66
  SDs:   15.8 7.65 16.96 31.42 10.23 2.17
  Probs: 0.069 0.186 0.025 0.026 0.361 0.334

5 segments
  Means: 113.01 86.11 179.14 74.1 55.67
  SDs:   26.91 7.26 31 9.61 2.18
  Probs: 0.114 0.181 0.03 0.338 0.336

4 segments
  Means: 120.1 78.8 178.59 55.68
  SDs:   25.04 11.08 31.64 2.19
  Probs: 0.089 0.542 0.03 0.338

3 segments
  Means: 132.05 79.07 55.69
  SDs:   39.48 11.09 2.19
  Probs: 0.123 0.539 0.338

2 segments
  Means: 130.82 70.04
  SDs:   39.83 14.44
  Probs: 0.126 0.874

1 segment
  Means: 77.69
  SDs:   28.08
  Probs: 1


Figure 5.24: Marginal histogram of the Washington coast image, with the estimated 6-component mixture density (axis labels: Greyscale, Percent).


Figure 5.25: Initial segmentation of the Washington coast image by Ward's method, using 6 segments.


Figure 5.26: Segmentation of the Washington coast image into 6 segments after EM.


Figure 5.27: Segmentation of the Washington coast image into 6 segments after refinement by ICM.


Figure 5.28: Segmentation of the Washington coast image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


5.3.6 Buoy

Figure 5.29 is an aerial image of a buoy against a background of dark water. The horizontal scan lines from the imaging process form a quite visible artifact. A simple detrending process is used to remove most of the scan line artifact; it consists of renormalizing each row of pixels to smooth the row means (a sketch of one possible implementation is given at the end of this discussion). Figure 5.30 shows the image after this simple detrending, and further analysis begins with the detrended image. The need for an ad hoc preprocessing step in this example is illustrative of the common situation in image analysis in which the data contain known artifacts which must be removed prior to processing by more general methods.

Table 5.10 shows that BIC_PL chooses 6 segments; the parameter estimates for 1 to 8 segments are displayed in table 5.11. Figure 5.31 gives the marginal histogram of greyscale values in the buoy image (figure 5.30), along with the fitted mixture density. At a large scale, it is clear that there are two main groups of values in the histogram, at grey values of approximately 90 and 210. These two groups correspond to the buoy and the water background in the image. It is important to note that since we are using a Gaussian mixture, more than one component in the mixture might be needed to model a single non-Gaussian segment; this is discussed in section 3.4.2. Additional components may also be needed due to subtle features or artifacts. A mixture of more flexible distributions would be more appropriate than the Gaussian mixture when we are interested in features which do not have a Gaussian distribution.

Results using BIC_IND, shown in table 5.10, give a choice of 3 segments. This is based on the marginal segmentation of the image, before ICM.


Proceeding through the steps of the segmentation into 6 segments, we see that the buoy is quite clearly separated from the water in the initial segmentation by Ward's method (figure 5.32); some influence from the scan line artifact is still visible, both in jagged horizontal bits and in the darker background area running down the middle of the image. There is some slight improvement after the EM step, shown in figure 5.33. The ICM refinement does a good job of smoothing the edges of the buoy; the two brightest segments at this point would give a reasonable segmentation of the buoy. However, the water background appears quite noisy, when in fact it does not contain any real features. The morphological smoothing step seems unwarranted here, if not misleading; by reducing the amount of noise in the water segments, it may make the remaining artifacts appear to be features.

This example shows some of the limitations of the method. With or without morphological smoothing, the final segmentation is quite noisy and contains more segments than are really needed to separate the buoy feature from the water background. This image has several difficult aspects. First, the water background is quite noisy, due largely to artifacts of the imaging process which are difficult to remove without extensive knowledge of the particular application at hand. Second, the level of spatial homogeneity is quite different between the feature and the background. That is, the buoy is a single contiguous region, which on its own might have a very high φ value due to its high level of spatial homogeneity. The background, on the other hand, would have a low φ value because it does not contain any large single-segment regions. In combination, it is difficult for the model to fit properly, since the model contains only one φ parameter. Third, the buoy itself is a relatively small proportion of the image, so it is not surprising that even subtle variations in the background have a great deal of influence on the model.
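The following is a minimal sketch of one plausible reading of the row-wise detrending described at the start of this subsection: compute the mean of each row, smooth the resulting profile of row means (a simple moving average is used here purely for illustration), and shift each row so that its mean matches the smoothed value. The window length, the clamping to the 0-255 range, and the function name are assumptions made for this sketch, not details of the original preprocessing.

    #include <stdlib.h>

    /* Row-wise detrending of a width x height greyscale image: each row is
     * shifted so that its mean follows a smoothed version of the row-mean
     * profile, which suppresses horizontal scan-line artifacts. */
    static void detrend_rows(unsigned char *img, int width, int height,
                             int half_window)
    {
        double *rowmean = malloc(sizeof(double) * height);
        double *smooth  = malloc(sizeof(double) * height);
        if (rowmean == NULL || smooth == NULL) { free(rowmean); free(smooth); return; }

        /* Mean of each row. */
        for (int y = 0; y < height; y++) {
            double s = 0.0;
            for (int x = 0; x < width; x++)
                s += img[y * width + x];
            rowmean[y] = s / width;
        }

        /* Moving average of the row means (window clipped at the ends). */
        for (int y = 0; y < height; y++) {
            double s = 0.0;
            int n = 0;
            for (int k = y - half_window; k <= y + half_window; k++) {
                if (k < 0 || k >= height) continue;
                s += rowmean[k];
                n++;
            }
            smooth[y] = s / n;
        }

        /* Shift each row so its mean matches the smoothed profile. */
        for (int y = 0; y < height; y++) {
            double shift = smooth[y] - rowmean[y];
            for (int x = 0; x < width; x++) {
                double v = img[y * width + x] + shift;
                if (v < 0.0)   v = 0.0;     /* clamp to the greyscale range */
                if (v > 255.0) v = 255.0;
                img[y * width + x] = (unsigned char)(v + 0.5);
            }
        }
        free(rowmean);
        free(smooth);
    }

A larger half_window smooths the row-mean profile more aggressively and therefore removes more of the scan-line variation, at the risk of also flattening genuine vertical trends in the scene.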


Table 5.10: Logpseudolikelihood and BIC_PL results for the buoy image.

Segments   Est. φ   Logpseudolikelihood     BIC_PL       BIC_IND
    1        –          -47504.28          -95036.51    -98338.82
    2       1.05        -32799.86          -65655.61    -72682.93
    3       1.04        -31999.44          -64082.71    -72286.60
    4       0.63        -30331.81          -60775.39    -72335.39
    5       0.53        -29886.69          -59913.11    -72276.06
    6       0.25        -29332.52          -58832.69    -72347.41
    7       0.19        -29404.65          -59004.90    -72359.49
    8       0.00        -29887.55          -59998.65    -72385.19

Figure 5.29: Aerial image of a buoy, before processing.


Table 5.11: EM-based parameter estimates for the buoy image.

8 segments
  Means: 90.97 87.58 83.6 96.73 114.42 144.02 183.64 211.52
  SDs:   3.37 2.47 2.92 6.3 10.34 14.88 18.12 11.72
  Probs: 0.491 0.246 0.144 0.063 0.018 0.008 0.012 0.017

7 segments
  Means: 90.64 86.69 97.16 115.51 144.58 183.74 211.53
  SDs:   3.55 3.72 6.46 10.13 14.74 18.09 11.72
  Probs: 0.482 0.404 0.06 0.017 0.008 0.012 0.017

6 segments
  Means: 90.65 86.69 96.96 114.44 169.52 210.18
  SDs:   3.54 3.71 6.51 14 23.07 12.29
  Probs: 0.482 0.403 0.058 0.023 0.015 0.019

5 segments
  Means: 90.48 86.99 102.18 151.5 208.75
  SDs:   3.96 3.79 10.64 28.15 13.07
  Probs: 0.532 0.379 0.048 0.02 0.021

4 segments
  Means: 90.5 87.11 106.12 192.83
  SDs:   4.09 3.76 13.84 25.96
  Probs: 0.535 0.382 0.048 0.034

3 segments
  Means: 89.08 106.11 192.43
  SDs:   4.29 13.43 26.29
  Probs: 0.917 0.048 0.034

2 segments
  Means: 89.21 152.97
  SDs:   4.44 45.97
  Probs: 0.933 0.067

1 segment
  Means: 93.45
  SDs:   20.29
  Probs: 1


Figure 5.30: Buoy image after initial smoothing to mitigate the scan line artifact.


Figure 5.31: Marginal histogram of the buoy image, with the estimated 6-component mixture density (axis labels: Greyscale, Percent).


Figure 5.32: Initial segmentation of the buoy image by Ward's method, using 6 segments.


Figure 5.33: Segmentation of the buoy image into 6 segments after EM.


Figure 5.34: Segmentation of the buoy image into 6 segments after ICM.


Figure 5.35: Segmentation of the buoy image into 6 segments after morphological smoothing (opening and closing, conditional on the edge pixels).


Chapter 6

CONCLUSIONS

This dissertation presents an automatic and unsupervised method for segmenting images which attempts to be quite general and computationally fast. The automatic choice of the number of segments based on BIC_PL is an advance over most other automatic segmentation methods, and this procedure is supported by a theoretical consistency result. Another contribution of this dissertation is the presentation of a clustering procedure for spatial point processes with nonlinear features.

Clustering with open principal curves can be extended by considering other distributions for both the background noise and the distribution of feature points along and about the curve. Although the BIC is used with some success for model selection with principal curve clustering, theoretical results would provide a more solid basis for its use. The principal curve clustering examples presented are all 2-dimensional; the method can be extended to higher dimensions, though bias problems in fitting the principal curves might increase in higher dimensions. For instance, a bias correction step was considered in two dimensions, which involved smoothing the residuals from the principal curve fit; however, this requires determining which side of the curve the residuals are on, which is not defined in higher dimensions. Note that this bias problem is due to principal curves in general, and is separate from the clustering method. Another way of characterizing principal curves is given by Delicado (1998); his approach generalizes more easily to higher dimensions.


The image segmentation examples in this dissertation all involve greyscale images. However, the methods presented here can be extended to color (3-band) images, or more generally to multiband images. This should yield improved sensitivity and specificity with regard to image features. Perhaps more importantly, it would allow the computer to "view" all bands of an image simultaneously; with more than 3 bands, a human is forced to resort to flipping back and forth between many images.

This extension would require some modifications to the pseudolikelihood calculation, which I present here. Recall equation 5.16, the pseudolikelihood-based BIC, and equation 5.15, the pseudolikelihood of the image. These are rewritten here as equations 6.1 and 6.2.

$$\mathrm{BIC}_{PL}(K) = 2 \log\bigl(L_{\hat{X}}(Y \mid K)\bigr) - D_K \log(N) \qquad (6.1)$$

$$L_{\hat{X}}(Y \mid K) = \prod_i \sum_{j=1}^{K} f(Y_i \mid X_i = j)\, p\bigl(X_i = j \mid N(\hat{X}_i), \hat{\phi}\bigr) \qquad (6.2)$$

In order to compute BIC_PL(K) with multiband images, the computation of the pseudolikelihood in equation 6.2 must reflect the multivariate nature of the image Y. Since the hidden states X are still univariate, the Markov random field term in equation 6.2 is unchanged. The Gaussian likelihood term f(Y_i | X_i = j) now becomes a multivariate Gaussian likelihood. On a practical level, this means that the EM algorithm used in the initial marginal segmentation would need to use multivariate Gaussian densities. Similarly, computation of the pseudolikelihood in the ICM algorithm would require evaluation of a multivariate Gaussian density for each pixel. For an Splus implementation, it may be possible to use the latest version of the "mclust" function (available on the web at www.stat.washington.edu/fraley/mclust/home.html) to do EM, though some sort of subsampling would have to be used for all but the smallest images.
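To make the required change explicit: in the greyscale case f(Y_i | X_i = j) is a univariate Gaussian density with mean μ_j and variance σ_j²; for a B-band image it would become the multivariate Gaussian density (the mean-vector and covariance-matrix notation here is introduced only for illustration),

$$f(Y_i \mid X_i = j) = (2\pi)^{-B/2}\, |\Sigma_j|^{-1/2} \exp\Bigl\{ -\tfrac{1}{2} (Y_i - \mu_j)^{\top} \Sigma_j^{-1} (Y_i - \mu_j) \Bigr\},$$

where Y_i is now the B-vector of band values at pixel i, and μ_j and Σ_j are the mean vector and covariance matrix of segment j. The parameter count D_K in equation 6.1 would grow accordingly, since each segment then contributes a mean vector and a covariance matrix rather than a single mean and variance.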


Once the automatic unsupervised segmentation is accomplished using the methods presented here, other methods which make use of prior knowledge could be applied. My method would then play the role of reducing noise, which would assist shape-based feature finding methods, such as deformable templates. For example, in the dog lung example (section 5.3.4), a shape-based method could start from an initial estimate of the lung boundary given by the union of the two brightest segments. An alternate approach would be to find a correlation peak between the image and a template, which can be done very quickly with the Fourier transform; a discussion of the Fourier transform can be found in most elementary engineering texts, such as Oppenheim et al. (1983).

Shape-based methods would likely require extraction of the spatially connected components in the final segmentation; each connected component could then be compared to predefined templates (perhaps with parameters for rotation and scaling) to determine the presence or absence of features of interest in the image.

If training data are available, the results of this segmentation method could be used to automatically search an image database for features of interest. The segmentation would be performed on the training data first, perhaps with interactive choice of the number of segments. Characteristics of the feature of interest would be noted; these could include model parameters, such as the mean and variance of a particular component in the Gaussian mixture, or nonmodel observations, such as the size and shape of one or more connected components representing the feature of interest. Then, new data would be segmented in an automatic mode. The model parameters of the resulting segmentation could be compared to the training data results; connected components could also be extracted for analysis and comparison to the training data.
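As a hedged sketch of the connected-component extraction mentioned above, the following labels the 4-connected components of a final segmentation, using an explicit stack rather than recursion; the 4-neighbor convention and the function name are assumptions made for illustration, and an 8-neighbor version would differ only in the list of neighbor offsets.

    #include <stdlib.h>

    /* Label the 4-connected components of a segment map.  labels[] holds the
     * segment index of each pixel; comp[] receives a component id (0, 1, 2, ...)
     * such that two pixels share an id exactly when they are joined by a
     * 4-connected path of pixels with the same segment label.
     * Returns the number of components found, or -1 on allocation failure. */
    static int connected_components(const int *labels, int *comp,
                                    int width, int height)
    {
        int npix = width * height;
        int *stack = malloc(sizeof(int) * npix);
        if (stack == NULL)
            return -1;

        for (int i = 0; i < npix; i++)
            comp[i] = -1;                      /* -1 means "not yet visited" */

        int ncomp = 0;
        for (int seed = 0; seed < npix; seed++) {
            if (comp[seed] != -1)
                continue;
            /* Flood fill from the seed pixel. */
            int top = 0;
            stack[top++] = seed;
            comp[seed] = ncomp;
            while (top > 0) {
                int p = stack[--top];
                int x = p % width, y = p / width;
                const int nx[4] = { x - 1, x + 1, x,     x     };
                const int ny[4] = { y,     y,     y - 1, y + 1 };
                for (int k = 0; k < 4; k++) {
                    if (nx[k] < 0 || nx[k] >= width || ny[k] < 0 || ny[k] >= height)
                        continue;
                    int q = ny[k] * width + nx[k];
                    if (comp[q] == -1 && labels[q] == labels[p]) {
                        comp[q] = ncomp;
                        stack[top++] = q;
                    }
                }
            }
            ncomp++;
        }
        free(stack);
        return ncomp;
    }

The size, mean brightness, or bounding box of each component could then be computed in one further pass and compared to templates, or used as inputs to the predictive models discussed next.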


With sufficient training data, one could fit a CART or logistic regression model to provide predictive inference for the presence of features of interest. For example, if there are many bright connected components in the training data, then a model can be formed to predict the probability that a single connected component is a feature of interest, based on size, shape, intensity, and so on. This model can then be used with new data to assign a probability to the presence of a feature of interest or to determine confidence limits on such predictions.

If a feature of interest can be isolated into one segment by the segmentation method, then the rest of the image can be ignored in further analysis. This in itself can be a purpose for segmentation, or it can be part of the analyses described above. For example, an analysis of the connected components present in a segmentation might be restricted to the connected components of only the brightest segment. This can result in a considerable increase in speed when connected components are being analyzed.

The morphological smoothing step can be tailored to specific applications. As shown in the example of the Washington coast image, in which the morphology step obscures the tideline, sometimes it is not appropriate to do any morphological smoothing, since it can smooth out features of interest. The size and shape of the structuring element can be adjusted to emphasize different types of features; for example, thin vertical or horizontal structuring elements can emphasize thin vertical or horizontal features. In the examples of section 5.3, I used a sequence of an opening followed by a closing, but other steps could be used, such as inserting a median smooth between the erosions and dilations. Customization of the morphological smooth is something of an art, and might be done interactively. Alternatively, Forbes and Raftery (1999) present a morphological smoothing method cast in the context of ICM.


REFERENCES

Allard, D., and Fraley, C. (1997), "Nonparametric Maximum Likelihood Estimation of Features in Spatial Point Processes Using Voronoi Tessellation," Journal of the American Statistical Association, 92, 1485-1493.

Ambroise, C., Dang, M., and Govaert, G. (1996), "Clustering of Spatial Data by the EM Algorithm," unpublished manuscript.

Ambroise, C., and Govaert, G. (1996), "Constrained Clustering and Kohonen Self-Organizing Maps," Journal of Classification, 13, 299-313.

Banfield, J.D., and Raftery, A.E. (1992), "Ice Floe Identification in Satellite Images Using Mathematical Morphology and Clustering about Principal Curves," Journal of the American Statistical Association, 87, 7-16.

Banfield, J.D., and Raftery, A.E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.

Besag, J. (1974), "Spatial Interaction and the Statistical Analysis of Lattice Systems," Journal of the Royal Statistical Society, Series B, 36, 192-236.

Besag, J. (1986), "Statistical Analysis of Dirty Pictures," Journal of the Royal Statistical Society, Series B, 48, 259-302.

Bovik, A.C., and Munson, D.C. (1986), "Edge Detection Using Median Comparisons," Computer Vision, Graphics, and Image Processing, 33, 377-389.

Burdick, H. (1997), Digital Imaging, McGraw-Hill: New York.

Byers, S.D., and Raftery, A.E. (1998), "Nearest Neighbor Clutter Removal for Estimating Features in Spatial Point Processes," Journal of the American Statistical Association, 93, 577-584.


Campbell, N.W., Mackeown, W.P.J., Thomas, B.T., and Troscianko, T. (1997), "Interpreting Image Databases by Region Classification," Pattern Recognition, 30, 555-563.

Canny, J. (1986), "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, 679-698.

Carstensen, J. (1992), Description and Simulation of Visual Texture, PhD Thesis, Imsor: Denmark.

Celeux, G., and Govaert, G. (1992), "A Classification EM Algorithm and Two Stochastic Versions," Computational Statistics and Data Analysis, 14, 315-332.

Chickering, D.M., and Heckerman, D. (1996), "Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables," Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, 158-168.

Cunningham, S., and MacKinnon, S. (1998), "Statistical Methods for Visual Defect Metrology," IEEE Transactions on Semiconductor Manufacturing, 11, 48-53.

Dasgupta, A., and Raftery, A.E. (1998), "Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering," Journal of the American Statistical Association, 93, 294-302.

Delicado, P. (1998), "Another Look at Principal Curves and Surfaces," Working Paper 309, Departament d'Economia i Empresa, Universitat Pompeu Fabra.

Dempster, A., Laird, N., and Rubin, D. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion)," Journal of the Royal Statistical Society, Series B, 39, 1-38.

Forbes, F., and Raftery, A.E. (1999), "Bayesian Morphology: Fast Unsupervised Bayesian Image Analysis," Journal of the American Statistical Association, 94, 555-568.


Fraley, C., and Raftery, A.E. (1998), "How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis," Computer Journal, 41, 578-588.

Fraley, C., and Raftery, A.E. (1999), "MCLUST: Software for Model-Based Clustering and Discriminant Analysis," Journal of Classification, to appear.

Geman, S., and Graffigne, C. (1986), "Markov Random Field Image Models and Their Applications to Computer Vision," Proceedings of the International Congress of Mathematicians, 1496-1517.

Gray, R.M. (1988), Probability, Random Processes, and Ergodic Properties, Springer-Verlag: New York.

Green, P. (1990), "On the Use of the EM Algorithm for Penalized Likelihood Estimation," Journal of the Royal Statistical Society, Series B, 52, 443-452.

Green, P. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711-732.

Guyon, X. (1995), Random Fields on a Network: Modeling, Statistics, and Applications, Springer-Verlag: New York.

Hamilton, J.D. (1994), Time Series Analysis, Princeton University Press: Princeton.

Hansen, F.R., and Elliott, H. (1982), "Image Segmentation Using Simple Markov Field Models," Computer Graphics and Image Processing, 20, 101-132.

Hansen, K.V., and Toft, P.A. (1996), "Fast Curve Estimation Using Preconditioned Generalized Radon Transform," IEEE Transactions on Image Processing, 5, 1651-1661.

Haralick, R.M., and Shapiro, L.G. (1985), "Survey: Image Segmentation Techniques," Computer Vision, Graphics, and Image Processing, 29, 100-132.


Hartigan, J.A. (1975), Clustering Algorithms, Wiley: New York.

Hastie, T., and Stuetzle, W. (1989), "Principal Curves," Journal of the American Statistical Association, 84, 502-516.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, Chapman & Hall: New York.

Hathaway, R. (1986), "Another Interpretation of the EM Algorithm for Mixture Distributions," Statistics and Probability Letters, 4, 53-56.

Heijmans, H.J.A.M. (1994), "Mathematical Morphology: A Modern Approach in Image Processing Based on Algebra and Geometry," SIAM Review, 37, 1-36.

Hough, P.V.C. (1962), "A Method and Means for Recognizing Complex Patterns," U.S. Patent 3,069,654.

Hsiao, K. (1997), "Approximate Bayes Factors When a Mode Occurs on the Boundary," Journal of the American Statistical Association, 92, 656-663.

Illingworth, J., and Kittler, J. (1988), "A Survey of the Hough Transform," Computer Vision, Graphics, and Image Processing, 44, 87-116.

Ji, C., and Seymour, L. (1996), "A Consistent Model Selection Procedure for Markov Random Fields Based on Penalized Pseudolikelihood," Annals of Applied Probability, 6, 423-443.

Johnson, V. (1994), "A Model for Segmentation and Analysis of Noisy Images," Journal of the American Statistical Association, 89, 230-241.

Kass, R.E., and Raftery, A.E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.

Kass, R.E., and Wasserman, L. (1995), "A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion," Journal of the American Statistical Association, 90, 928-934.

Kohonen, T. (1982), "Self-organized Formation of Topologically Correct Feature Maps," Biological Cybernetics, 43, 59-69.


Kundu, A. (1990), "Robust Edge Detection," Pattern Recognition, 23, 423-440.

Latham, G., and Anderssen, R. (1994), "Assessing Quantification for the EM Algorithm," Linear Algebra and its Applications, 210, 89-122.

Latham, G. (1995), "Existence of EMS Solutions and A-Priori Estimates," SIAM Journal on Matrix Analysis and Applications, 16, 943-953.

LeBlanc, M., and Tibshirani, R. (1994), "Adaptive Principal Surfaces," Journal of the American Statistical Association, 89, 53-64.

Letts, P.A. (1978), "Unsupervised Classification in the Aries Image Analysis System," Proceedings of the 5th Canadian Symposium on Remote Sensing, 61-71.

Lu, W. (1995), "The Expectation-Smoothing Approach for Indirect Curve Estimation," unpublished manuscript.

Masson, P., and Pieczynski, W. (1993), "SEM Algorithm and Unsupervised Statistical Segmentation of Satellite Images," IEEE Transactions on Geoscience and Remote Sensing, 31, 618-633.

Murtagh, F. (1995), "Interpreting the Kohonen Self-Organization Feature Map Using Contiguity Constrained Clustering," Pattern Recognition Letters, 16, 399-408.

Nychka, D. (1990), "Some Properties of Adding a Smoothing Step to the EM Algorithm," Statistics and Probability Letters, 9, 187-193.

Oppenheim, A.V., Willsky, A.S., and Young, I.T. (1983), Signals and Systems, Prentice Hall: Englewood Cliffs, New Jersey.

Pal, N., and Pal, S. (1993), "A Review on Image Segmentation Techniques," Pattern Recognition, 26, 1277-1294.

Priestley, M.B. (1981), Spectral Analysis and Time Series, Academic Press: New York.


Prim, R. (1957), "Shortest Connection Networks and Some Generalizations," Bell System Technical Journal, 1389-1401.

Raftery, A.E. (1995), "Bayesian Model Selection in Social Research," in Sociological Methodology 1995, ed. P.V. Marsden, Blackwells: Cambridge, 111-163.

Richards, J.A. (1986), Remote Sensing Digital Image Analysis, Springer-Verlag: New York.

Roeder, K., and Wasserman, L. (1997), "Practical Bayesian Density Estimation Using Mixtures of Normals," Journal of the American Statistical Association, 92, 894-902.

Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.

Serra, J. (1982), Image Analysis and Mathematical Morphology, Academic Press: New York.

Silverman, B., Jones, M., Wilson, J., and Nychka, D. (1990), "A Smoothed EM Approach to Indirect Estimation Problems, with Particular Reference to Stereology and Emission Tomography (with Discussion)," Journal of the Royal Statistical Society, Series B, 52, 271-324.

Steger, C. (1998), "An Unbiased Detector of Curvilinear Structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 113-125.

Tibshirani, R. (1992), "Principal Curves Revisited," Statistics and Computing, 2, 183-190.

Tibshirani, R., and Hastie, T. (1987), "Local Likelihood Estimation," Journal of the American Statistical Association, 82, 559-568.

Ward, J. (1963), "Hierarchical Groupings to Optimize an Objective Function," Journal of the American Statistical Association, 58, 234-244.

Wold, S. (1974), "Spline Functions in Data Analysis," Technometrics, 16, 1-11.


Zahn, C. (1971), "Graph-Theoretical Methods for Detecting and Describing Gestalt Structures," IEEE Transactions on Computers, C-20, 68-86.


Appendix A

SOFTWARE DISCUSSION

A.1 XV

The main image segmentation methods discussed in this dissertation have been added to a modified version of XV, a popular and widely used UNIX image viewing program. XV provides an ideal platform for implementing image processing algorithms, because code for new algorithms can be added in a relatively modular way. XV has an easy-to-understand graphical user interface, with controls for most aspects of the image display (size, colormap, cropping, and so on). It reads most common image formats. The internal representation of images in XV is the same regardless of the image file type, which alleviates the need to convert between formats. A few image processing algorithms are built in to XV; these can be accessed using the "algorithms" button. I have added several new algorithms under this button: threshold, erode, dilate, segment, and autosegment.

The threshold algorithm is a simple thresholding of the image based on a user-specified threshold value.

Erode and dilate carry out the erosion and dilation operations of mathematical morphology using a user-specified structuring element. The structuring element is specified by supplying a file which defines it. The most common structuring element is probably the simple 3x3 square; a sample file specifying this structuring element is included in the same directory as the executable, and the source code includes instructions on how to create a structuring element file.


Segment and autosegment are implementations of automatic unsupervised image segmentation. They use a method based on histogram equalization to generate an initial segmentation. Then, EM is used to fit a Gaussian mixture with K components. For the segment algorithm, K is specified by the user; for autosegment, the user specifies a maximum value of K and the program chooses the best K based on a modified version of BIC (note that this does not use the pseudolikelihood).

All of these methods are designed for greyscale images; if the image is color, then the red color band is used as the image. These methods are quite fast; autosegment can examine values of K from 2 to 12 for a typical 256x256 image in under a minute. There are some restrictions on image size due to memory constraints. These limits are hard-coded into the modified XV version, but they can be altered by editing the source code and recompiling.

A.2 C code

greyhist.c - This program finds the marginal histogram of an image. Input files: greyhistaux.txt (width, height), greyhistin.asc (ASCII greyscale integers only). Output file: greyhistout.txt.

covascii.c - This program computes the covariance matrix of the image and 4 lags (the 4 adjacent neighbors preceding each pixel in raster scan order). Input files: covinput.txt (width, height), covimagein.txt (ASCII greyscale integers only). Output file: covoutput.txt.

emheqpoisson.c - This program uses EM to fit a Poisson mixture model for segmentation, using histogram equalization as the initial segmentation. Final classification is done without mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output files: emoutput.txt, emimageout.pgm.


emnoinitpoisson.c - This program uses EM to fit a Poisson mixture model for segmentation; a user-supplied initial segmentation is required. Final classification is done without mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only), eminitin.asc (integers indexed from zero). Output files: emoutput.txt, emimageout.pgm.

emnoinit.c - This program uses EM to fit a Gaussian mixture model for segmentation; a user-supplied initial segmentation is required. Final classification is done without mixture probabilities. The minimum SD constraint is set to 0.25; emnoinit2.c has the constraint set to 0.5. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only), eminitin.asc (integers indexed from zero). Output files: emoutput.txt, emimageout.pgm.

emseg.c - This program uses EM to fit a Gaussian mixture model for segmentation, using histogram equalization as the initial segmentation. Final classification is done with mixture probabilities. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output files: emoutput.txt, emimageout.pgm.

emsegloop.c - This program uses EM to fit several Gaussian mixture models for segmentation; iteration is done over values of K (the number of segments) from 2 to 15. Histogram equalization is used to find the initial segmentation. Input files: eminput.txt (width, height, # of segments), emimagein.asc (ASCII greyscale integers only). Output file: emoutput.txt.

mcmcpseudologlik.c - This program performs MCMC to compute the pseudolikelihood. Input file: mcmcinput.txt. Output file: mcmcoutput.txt.


A.3 Splus code

hpcc - Hierarchical clustering on open principal curves is performed on point process data; some plotting capability is also provided.

clust.var.spline - This function is called by hpcc and is not intended for direct use.

plot.pclust - This function is called by hpcc and is not intended for direct use.

penalty - This function is called by hpcc and is not intended for direct use.

vdist - This function is called by hpcc and is not intended for direct use.

cempcc - This function refines a clustering on open principal curves by using the CEM algorithm; some plotting capability is also provided.

calc.likelihood - This function is called by cempcc and is not intended for direct use.

autoseg8 - This function fits greyscale image segmentation models for 1-8 segments. It calls segment.marginal, estimate.icm.phi0, and pseudol.unorder. It returns the initial segmentation (by Ward's method), the EM segmentation, and the ICM segmentation, as well as all parameter estimates.

segment.marginal - This function finds an initial segmentation by using Ward's method (calling ward.initclass) and then refines this marginal segmentation with the EM algorithm (calling emfast.alg).

ward.initclass - This function takes a matrix and a specified number of segments as input and returns a classification with the requisite number of segments based on Ward's method.

emfast.alg - This function uses the EM algorithm to fit a Gaussian mixture. An initial classification of the data is required. The run time is linear in the number of unique data values, so it can handle large images in reasonable time as long as the number of unique data values is not large (as is the case with 256-level greyscale images); a sketch of this weighting idea is given at the end of this appendix.


estimate.icm.phi0 - This function refines a segmentation by using ICM with parameter estimation. The phi parameter is constrained to be non-negative. This function calls the icm function.

estimate.icm - The same as estimate.icm.phi0, but without the constraint on phi.

icm - This function performs ICM without parameter estimation.

pseudol.unorder - This function computes a pseudolikelihood for the unordered colors model, conditional on parameter estimates.

linehist - This function produces a histogram using vertical lines, which is useful for image data.

mixplot - This function plots a Gaussian mixture density.

flip - This function flips a matrix (not a transpose) for plotting.

morphit - This function performs a greyscale opening and closing on an integer matrix using a hard-coded 3x3 structuring element.
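To illustrate why a run time linear in the number of unique grey values is possible (as noted for emfast.alg above), here is a minimal sketch of one EM iteration for a univariate Gaussian mixture driven entirely by the 256-bin greyscale histogram. It is written in C to match the style of appendix A.2, but the function name, the fixed component limit, and the handling of degenerate cases are assumptions for this sketch rather than a transcription of the code described above; the variance floor mirrors the minimum-SD constraint of 0.25 mentioned for emnoinit.c.

    #include <math.h>

    #define NLEVELS 256
    #define KMAX    16

    /* One EM iteration for a K-component univariate Gaussian mixture, using
     * the image only through its greyscale histogram count[v], v = 0..255.
     * The cost is O(K * NLEVELS) regardless of how many pixels the image has.
     * The sd[] values must be positive on entry. */
    static void em_step_hist(const long count[NLEVELS], int K,
                             double mean[], double sd[], double prob[])
    {
        double sw[KMAX], swx[KMAX], swxx[KMAX], resp[KMAX];
        long ntot = 0;

        if (K < 1 || K > KMAX)
            return;
        for (int j = 0; j < K; j++)
            sw[j] = swx[j] = swxx[j] = 0.0;
        for (int v = 0; v < NLEVELS; v++)
            ntot += count[v];
        if (ntot == 0)
            return;

        for (int v = 0; v < NLEVELS; v++) {
            if (count[v] == 0)
                continue;
            /* E-step: (unnormalized) responsibilities for grey level v;
             * the constant 1/sqrt(2*pi) cancels in the normalization. */
            double tot = 0.0;
            for (int j = 0; j < K; j++) {
                double z = (v - mean[j]) / sd[j];
                resp[j] = prob[j] * exp(-0.5 * z * z) / sd[j];
                tot += resp[j];
            }
            if (tot <= 0.0)
                continue;
            /* Accumulate M-step sums, weighted by the histogram count. */
            for (int j = 0; j < K; j++) {
                double w = count[v] * resp[j] / tot;
                sw[j]   += w;
                swx[j]  += w * v;
                swxx[j] += w * (double)v * v;
            }
        }

        /* M-step: update mixing proportions, means, and standard deviations. */
        for (int j = 0; j < K; j++) {
            if (sw[j] <= 0.0)
                continue;
            prob[j] = sw[j] / ntot;
            mean[j] = swx[j] / sw[j];
            double var = swxx[j] / sw[j] - mean[j] * mean[j];
            if (var < 0.0625)          /* floor: SD >= 0.25 */
                var = 0.0625;
            sd[j] = sqrt(var);
        }
    }

One full EM pass therefore costs on the order of K times 256 arithmetic operations, after a single pass over the image to build the histogram.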


VITA

1993  B.S. Mathematics, Harvey Mudd College
1993  M.S. Mathematics, Harvey Mudd College and Claremont Graduate University
