11.07.2015 Views

2DkcTXceO

2DkcTXceO

2DkcTXceO

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

L. Billard 325bounce around, so that an apparent rate of 64 (say) may really be 64 ± 2, i.e.,the interval [62, 66]. There are numerous examples.Detailed descriptions and examples can be found in Bock and Diday (2000)and Billard and Diday (2006). A recent review of current methodologies isavailable in Noirhomme-Fraiture and Brito (2011), with a non-technical introductionin Billard (2011). The original concept of symbolic data was introducedin Diday (1987). Note that symbolic data are not the same as fuzzy data;however, while they are generally different from the coarsening and groupingconcepts of, e.g., Heitjan and Rubin (1991), there are some similarities.The major issue then is how do we analyse these intervals (or, histograms,...)? Taking classical surrogates, such as the sample mean of aggregatedvalues for each category and variable, results in a loss of information.For example, the intervals [10, 20] and [14, 16] both have the same midpoint;yet they are clearly differently valued observations. Therefore, it is importantthat analytical methods be developed to analyse symbolic data directly so asto capture these differences. There are other underlying issues that pertainsuch as the need to develop associated logical dependency rules to maintainthe integrity of the overall data structure; we will not consider this aspectherein however.29.3 IllustrationsExample 29.1 Table 29.1 displays (in two parts) a data set of histogramvalued observations, extracted from Falduti and Taibaly (2004), obtained byaggregating by airline approximately 50,000 classical observations for flightsarriving at and departing from a major airport. For illustrative simplicity,we take three random variables Y 1 = flight time, Y 2 = arrival-delay-time,and Y 3 = departure-delay-time for 10 airlines only. Thus, for airline u =1,...,10 and variable j =1, 2, 3, we denote the histogram valued observationby Y uj = {[a ujk ,b ujk ),p ujk : k =1,...,s uj } where the histogram sub-interval[a ujk ,b ujk )hasrelativefrequencyp ujk with ∑ k p ujk = 1. The number ofsubintervals s uj can vary across observations (u) and across variables (j).Figure 29.1 shows the tree that results when clustering the data by aWard’s method agglomerative hierarchy algorithm applied to these histogramdata when the Euclidean extended Ichino–Yaguchi distance measure is used;see Ichino and Yaguchi (1994) and Kim and Billard (2011, 2013), for details.Since there are too many classical observations to be able to build an equivalenttree on the original observations themselves, we resort to using classicalsurrogates. In particular, we calculate the sample means for each variableand airline. The resulting Ward’s method agglomerative tree using Euclideandistances between the means is shown in Figure 29.2.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!