12.07.2015 Views

Visualization of Diversity in Large Multivariate Data Sets

Visualization of Diversity in Large Multivariate Data Sets

Visualization of Diversity in Large Multivariate Data Sets

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

1056 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 16, NO. 6, NOVEMBER/DECEMBER 2010A(0)B(0)C(1)D(0)E(0)A(0)B(0)C(1)D(0)E(1)A(0)B(0)C(5)D(0)E(1)A(0)B(2)C(5)D(0)E(1)A(5)B(2)C(16)D(25)E(7)Start: 1 “C”Add 1 “E” Add 4 “C” Add 2 “B”Add 5 “A”, 11 “C”, 25 “D”, 6 “E”Fig. 2. The process <strong>of</strong> visualiz<strong>in</strong>g a s<strong>in</strong>gle attribute us<strong>in</strong>g the <strong>Diversity</strong>Map. The depicted attribute has five possible values (A, B, C, D, and E).The visualization beg<strong>in</strong>s with a s<strong>in</strong>gle object with attribute value “C,” andobjects with other attribute values are added <strong>in</strong> subsequent steps. Ateach step, the number <strong>of</strong> objects <strong>in</strong> each bucket is shown <strong>in</strong> parenthesesnext to the bucket’s label and the opacity (α-value) <strong>of</strong> each bucket iscalculated as described <strong>in</strong> the text. Note that, while it is <strong>in</strong>structive toillustrate the process step-wise, as above, our implementation simplyaggregates object counts and computes opacity values <strong>in</strong> a s<strong>in</strong>gle step.Also, note that for a multivariate data set, every object would contributeto each <strong>of</strong> the parallel attribute axes <strong>in</strong> the same way as depicted above,result<strong>in</strong>g <strong>in</strong> a visualization as depicted <strong>in</strong> Fig. 1(b).4.1 Design ConsiderationsAs <strong>in</strong>dicated earlier, our primary goal <strong>in</strong> design<strong>in</strong>g the <strong>Diversity</strong> Mapwas to make easily apparent the richness <strong>of</strong> variety and the evenness<strong>of</strong> abundance <strong>of</strong> the attribute values exhibited <strong>in</strong> the data set be<strong>in</strong>gvisualized. While we do not explicitly calculate or assign values forrichness and evenness, we consider them to be quantitative features <strong>of</strong>the data, <strong>in</strong> that we can th<strong>in</strong>k <strong>of</strong> one data set as be<strong>in</strong>g more or less richor even than another. For this reason, we have chosen visual encod<strong>in</strong>gsthat are known to be effective for convey<strong>in</strong>g quantitative <strong>in</strong>formation.Specifically, we encode variety us<strong>in</strong>g spatial position by assign<strong>in</strong>g adist<strong>in</strong>ct 2D location, or bucket, to each <strong>of</strong> the possible discrete attributevalues that can be taken by objects <strong>in</strong> the visualized data set, and weencode abundance us<strong>in</strong>g opacity, with each semi-transparent rectanglerepresentation <strong>of</strong> an object’s attribute value contribut<strong>in</strong>g a constant,fractional amount <strong>of</strong> opacity to the bucket <strong>in</strong> which it is placed. Underthis encod<strong>in</strong>g, more abundant regions <strong>of</strong> attribute space are <strong>in</strong>dicatedby visual regions <strong>of</strong> higher opacity.Because spatial position ranks <strong>in</strong> the literature as the most effectiveencod<strong>in</strong>g for quantitative <strong>in</strong>formation [22, 5], it is easy to justify itsuse <strong>in</strong> our design. However, several other quantitative encod<strong>in</strong>gs, suchas length, angle, slope, and area, rank higher than opacity <strong>in</strong> [22] and[5]. Unfortunately, these encod<strong>in</strong>gs appear to conflict with our chosenspatial encod<strong>in</strong>g. In contrast, we found that opacity serves as anatural complement to the spatial encod<strong>in</strong>g and allows us to elegantlyconvey both the richness <strong>of</strong> variety and the evenness <strong>of</strong> abundance <strong>of</strong>the visualized data. In particular, under this comb<strong>in</strong>ation <strong>of</strong> encod<strong>in</strong>gs,“occlusions” <strong>in</strong> the 2D visual plane that result from one or moreobjects shar<strong>in</strong>g a certa<strong>in</strong> attribute value serve simply to <strong>in</strong>crease theopacity <strong>of</strong> that visual region, thereby <strong>in</strong>dicat<strong>in</strong>g <strong>in</strong>creased abundance.In the <strong>Diversity</strong> Map representation, richness <strong>of</strong> variety is expressedby the number <strong>of</strong> buckets with non-zero opacity, and evenness <strong>of</strong> abundanceis expressed by the uniformity <strong>of</strong> the color distribution acrossthe buckets <strong>of</strong> a s<strong>in</strong>gle attribute, as well as over the entire visualization.In other words, the more rich is the variety <strong>of</strong> a given data set,the more non-transparent buckets it will yield, and the more even isthe abundance across the data set, the more uniform will be the colors<strong>of</strong> the buckets.The overall diversity <strong>of</strong> a given data set—that is, the comb<strong>in</strong>ed diversity<strong>of</strong> all its attributes—is communicated by the <strong>Diversity</strong> Map asthe overall color density <strong>of</strong> the entire visual region: as the visualizeddata set becomes more richly various and more evenly abundant, morebuckets will exhibit a similar non-transparent color. In the limit <strong>of</strong>“perfect” diversity, where all possible values <strong>of</strong> each attribute are representedequally, the entire visual region will be a solid, completelyopaque color. Conversely, a set with little diversity will produce a visualizationwith regions <strong>of</strong> very high contrast. As examples <strong>of</strong> thesephenomena, consider the synthetic data sets with zero and near-perfectdiversity visualized us<strong>in</strong>g a <strong>Diversity</strong> Map <strong>in</strong> Figs. 1 (a) and (c).F<strong>in</strong>ally, we note that the <strong>Diversity</strong> Map is specifically designed toprovide a holistic overview <strong>of</strong> the population sample <strong>of</strong> <strong>in</strong>terest. AsShneiderman notes, [32], provid<strong>in</strong>g an overview <strong>of</strong> the data is an importantpart <strong>of</strong> a visualization system, as overviews help the user builda mental model <strong>of</strong> how the data covers the attribute space. This model<strong>in</strong> turn helps the user formulate actions such as queries [28]. Indeed,a good overview representation should serve as a gateway by allow<strong>in</strong>gthe user to <strong>in</strong>teract with the visualization <strong>in</strong> order to <strong>in</strong>vestigatethe data based on the mental model he or she has formed. While wereserve deeper <strong>in</strong>vestigation <strong>of</strong> this matter for future work, we simplynote that the <strong>Diversity</strong> Map is designed to serve as just such a gateway.5 USER STUDY DESIGNIn this section, we describe a formal user study designed to measure agiven visualization’s ability to communicate diversity <strong>in</strong>formation. Inparticular, the study is a controlled user study <strong>in</strong>tended to be conducted<strong>in</strong> a laboratory sett<strong>in</strong>g, and it is designed to compare the visualization<strong>of</strong> <strong>in</strong>terest aga<strong>in</strong>st a given basel<strong>in</strong>e visualization. There are two importantcomponents to this design: 1) a method for generat<strong>in</strong>g syntheticdata sets with controllable, vary<strong>in</strong>g levels <strong>of</strong> diversity and 2) a set <strong>of</strong>questions, each <strong>of</strong> which is meant to assess a study participant’s abilityto comprehend a particular aspect <strong>of</strong> diversity us<strong>in</strong>g each <strong>of</strong> thevisualizations under comparison. We describe these components next.5.1 Synthetic <strong>Data</strong> GenerationWhile, ideally, we would use a real data set to evaluate a visualization,we require data with specific distributions <strong>of</strong> values over attributes.S<strong>in</strong>ce it is difficult to f<strong>in</strong>d data sets that can accommodate this requirement,we developed a technique for creat<strong>in</strong>g synthetic data sets<strong>of</strong> controllable, vary<strong>in</strong>g diversity over a set <strong>of</strong> <strong>in</strong>dependent attributes.In particular, our procedure generates synthetic sets <strong>of</strong> objects overa manually def<strong>in</strong>ed set <strong>of</strong> attributes and attribute values, where therichness <strong>of</strong> variety and evenness <strong>of</strong> abundance over each attribute iscontrolled and measured.Our data generation procedure is based on the Shannon <strong>in</strong>dex, orShannon entropy, a measure <strong>of</strong> diversity that is widely used <strong>in</strong> ecology[30, 37, 18]. Shannon entropy is also used <strong>in</strong> other fields, such as<strong>in</strong>formation theory, where it is used to measure the amount <strong>of</strong> <strong>in</strong>formationconta<strong>in</strong>ed <strong>in</strong> a coded message. In its general form, the entropy <strong>of</strong>a s<strong>in</strong>gle random variable, X (<strong>in</strong> biodiversity, X corresponds to species;<strong>in</strong> the more general case, it could correspond to any s<strong>in</strong>gle attribute) isSH(X) =− ∑ p(x i )log p(x i ), (1)i=1where {x 1 ,...,x S } is the set <strong>of</strong> possible values <strong>of</strong> X and p(x i ) isthe probability that X takes value x i . In biodiversity, for example,x 1 ,...,x S represent the possible species and p(x i ) represents the probability<strong>of</strong> observ<strong>in</strong>g one particular species x i . In practice, we computep(x i ) as the ratio <strong>of</strong> the number n i <strong>of</strong> <strong>in</strong>stances <strong>of</strong> value x i to the totalnumber N <strong>of</strong> <strong>in</strong>dividuals <strong>in</strong> the set, i.e. p(x i )= n iN . In other words,p(x i ) represents the relative abundance <strong>of</strong> value x i <strong>in</strong> the total set.H(X) is directly proportional to the level <strong>of</strong> diversity with<strong>in</strong> a s<strong>in</strong>gleattribute, <strong>in</strong> that higher values <strong>of</strong> H(X) correspond to richer variety andmore even abundances. Unfortunately, it is difficult to compare values<strong>of</strong> H(X) across attributes, s<strong>in</strong>ce it is scaled to the number <strong>of</strong> possiblevalues <strong>of</strong> the attribute be<strong>in</strong>g measured. This implies that an attributewith many possible attribute values (e.g. the home state <strong>of</strong> a student)may be considered more diverse under entropy than an attribute withfew possible values (e.g. the gender <strong>of</strong> a student), even if it is not.In order to meet our requirement from Section 2 that all attributesare considered as equal, we have adapted a variant <strong>of</strong> the Shannon<strong>in</strong>dex known as the evenness measure [27], which normalizes the value<strong>of</strong> H(X) by its maximum possible value:S 1H max (X) =− ∑i=1 S log 1 = logS. (2)S

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!