
Visualization of Diversity in Large Multivariate Data Sets


1054 IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 16, NO. 6, NOVEMBER/DECEMBER 2010

design for a formal user study intended to understand the effectiveness of a visual representation in communicating diversity information. We evaluate the Diversity Map by using this study design to compare it to Pearlman et al.’s visualization [25], the only other representation specifically designed to visualize diversity. In comparing user performance between Pearlman et al.’s representation and the Diversity Map, we show that users can consistently and as or more accurately judge elements of diversity using the Diversity Map.

The rest of this paper is organized as follows. We begin in Section 2 with a precise definition of diversity derived from the definition of species diversity used by ecologists, and we lay out a set of requirements for diversity visualizations based on this definition. We then discuss related work in Section 3 in the context of the presented definition and requirements. In Section 4, we present and discuss the Diversity Map representation in the context of what is known about human perception. In Section 5, we describe a formal study designed to understand the effectiveness of a visual representation in communicating diversity information, and in Section 6, we evaluate the Diversity Map representation using this study design.
Finally, we discuss the merits and shortcomings of the Diversity Map representation, suggest directions for future work, and draw our conclusions.

2 DEFINING DIVERSITY

Before discussing its visualization further, we must first establish a more thorough definition of diversity. With this in place, the requirements for a successful diversity visualization will become more clear.

The data sets in which we are interested represent samples of populations of objects (e.g. students, moths, stocks, etc.) that are described by multiple variables, or attributes (e.g. GPA, ethnicity, gender, etc.). To define the diversity of such a set, we borrow from the established field of Ecology, where biological diversity is defined as “the variety and abundance of species in a defined unit of study” [23].

Two measures of diversity are used in Ecology: richness, which is simply the number of species in the unit of study represented out of all possible species; and evenness, which describes the variability in species abundances [23]. Generalizing from Ecology, we say that a population sample is diverse with respect to a specific attribute if it exhibits a rich variety of values of that attribute and if each of those values is evenly abundant.
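
The two measures can be made concrete with a minimal sketch for a single attribute. The text does not prescribe a formula for evenness, so the normalized Shannon entropy used here (often called Pielou's evenness) is an assumption chosen for illustration. The function names, the income bracket labels, and the sample data are likewise hypothetical.

```python
import math
from collections import Counter


def richness(values, possible_values):
    """Fraction of the possible attribute values that occur at least once."""
    possible = set(possible_values)
    if not possible:
        raise ValueError("possible_values must not be empty")
    observed = set(values) & possible
    return len(observed) / len(possible)


def evenness(values):
    """Normalized Shannon entropy of the observed value counts.

    Assumption: this index is one common choice, not one taken from the text.
    It is 1.0 when every observed value is equally abundant. By convention
    here it is 0.0 when fewer than two distinct values are observed.
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical attribute with four possible values, for illustration only.
brackets = ["low", "lower-middle", "upper-middle", "high"]
even_sample = [b for b in brackets for _ in range(25)]
skewed_sample = ["low"] * 97 + ["lower-middle", "upper-middle", "high"]
single_bracket = ["low"] * 100

for name, sample in [("even", even_sample),
                     ("skewed", skewed_sample),
                     ("single", single_bracket)]:
    print(name, richness(sample, brackets), round(evenness(sample), 3))
```

Under these assumptions, the even sample scores high on both measures, the skewed sample is rich but uneven, and the single-bracket sample is neither rich nor even.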
In other words, high diversity corresponds to a uniform distribution of objects across all possible values of an attribute. We extend the definition of diversity to sets of arbitrary objects described by many different attributes by simply defining overall diversity as the aggregated diversity over all attributes being considered.

As an example of how this definition is applied, consider analyzing the diversity of a university’s potential incoming freshman class. In particular, if we are considering the diversity of different populations of applicants with respect to their income levels, then a very diverse population will contain a similar number of applicants (i.e. even abundances) in each of many possible income brackets (i.e. a rich variety). In contrast, a very non-diverse population might contain applicants in only a single income bracket (i.e. no variety) or mostly applicants in a single income bracket with very few applicants in each of the others (i.e. very uneven abundances). The diversity of other attributes, such as GPA, ethnicity, gender, etc., would also contribute to the overall diversity of a particular population of applicants.

Beyond our definition of diversity, we also borrow several conventions from the study of biodiversity.
Specifically, we adopt individual objects as our unit of measure, and, as in the study of biodiversity, we treat all possible values of an attribute and all individuals in a population sample as equal. Additionally, since we have extended the definition to account for diversity over many attributes, we adopt the added convention that all attributes are treated as equal.

In order to adequately convey diversity as defined above, a visualization should possess the following properties:

• Communicates the attributes of interest, the richness in variety of the values of each attribute, and the evenness of abundance of the population sample of interest over the values of each attribute while considering all attributes and objects equally.
• Scales well to large multivariate data sets, i.e. ones containing many objects (> 1000) and many attributes (> 5).
• Enables users to make judgments about diversity with little effort through an efficient perceptual encoding (while ideally, the visualization should be designed so that the user perceives diversity preattentively, i.e. without focused attention [35], we understand that this is difficult for large attribute spaces).

3 RELATED WORK

In this section, we review a subset of existing multivariate visualization techniques, emphasizing those that apply to the problem of exploring the diversity of a set of objects, as defined earlier.
We focus only on representation methods and organize our review based on the taxonomy proposed by Keim et al. [16].

3.1 Standard 2D/3D Displays

Techniques such as scatter plots, box plots, bar charts, and histograms effectively support tasks such as finding outliers, gaps, clusters, and correlations over a small number of attributes [29]. However, while the box plot is well suited to displaying evenness of abundance, it fails in communicating richness of variety and is not applicable to categorical data. Likewise, without additional encoding, the scatter plot may lead to ambiguous communication of evenness of abundance due to occlusions caused by data overlap. A rectangular heatmap can be viewed as a special case of the scatter plot where a value is plotted for every combination of the two mapped attribute values and a point is replaced by a colored square. Like the scatter plot, heatmaps are limited to displaying diversity over only the two attributes being mapped. However, occlusion is no longer a problem. The histogram, in particular, is well suited to showing richness in variety and the evenness of distribution of objects over a single attribute. As noted, all of these approaches typically display only one or two attributes of interest.

The use of small multiples may solve some of these problems.
For example, scatter plot matrices may provide useful representations of diversity, especially for high and low diversity cases, but intermediate values may be difficult to disambiguate due to data overlap. While jittering techniques may help alleviate this problem, they may give the misleading appearance of evenness when it is not actually present. A matrix of heatmaps would avoid the data overlap issue and could be an interesting approach to viewing diversity (both richness and evenness). Small multiples in matrix form, however, require screen space that grows with the square of the number of attributes. Small multiples of histograms could be a powerful method for diversity visualization, since these appear capable of conveying both richness of variety and evenness of abundance. However, it is not clear how well overall diversity is communicated by multiple spatially separated histograms. The Diversity Map representation, described in Section 4, is in fact a small multiple histogram representation with an alternative encoding that facilitates communication of overall diversity.

Alternatively, rank/abundance (or Whittaker) plots [37] are commonly used by ecologists to visualize species abundance distribution. The representation is a variation of the scatter plot in which species are ranked from most to least abundant and then plotted along the x axis, while the y axis shows the relative abundance of species.
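
The ranking step behind such a plot can be sketched in a few lines. This is a minimal illustration, not code from the cited work: the function name `rank_abundance` and the species counts are hypothetical, and only the data for the plot is computed, with the drawing left to any plotting library.

```python
def rank_abundance(counts):
    """Return (rank, relative_abundance) pairs, most abundant species first."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    ranked = sorted(counts.values(), reverse=True)
    # Rank 1 is the most abundant species; ranks go on the x axis and
    # relative abundances on the y axis.
    return [(rank, count / total) for rank, count in enumerate(ranked, start=1)]


# Hypothetical species counts, for illustration only.
sample = {"species_a": 60, "species_b": 30, "species_c": 10}
curve = rank_abundance(sample)

for rank, share in curve:
    print(rank, round(share, 2))
```

The y axis of such a plot is often drawn on a logarithmic scale, which is a plotting choice left out of this sketch.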
The shape of the resulting curve provides insight into species evenness (or dominance). Although this approach is specific to species abundance, it and the other standard approaches serve as a starting point for exploring techniques for visualizing distributions of data over many dimensions.

3.2 Geometrically Transformed Displays

Geometrically transformed displays map one object to a set of points and lines in 2D or 3D space [16]. This category includes graph visualizations and coordinate-based visualizations. While graph-based visualizations are important in many domains, we do not discuss them because we assume that limited (or no) explicit relationship information is present in the data sets we consider.

Coordinate-based visualizations extend standard 2D/3D displays by performing geometric transformations and projections of data onto coordinate axes. Data attributes are typically preserved and treated
