02.07.2014 Views

Lecture Notes on Compositional Data Analysis - Sedimentology ...


Chapter 5. Exploratory data analysis

5.2 Centre, total variance and variation matrix

Standard descriptive statistics are not very informative in the case of compositional data. In particular, the arithmetic mean and the variance or standard deviation of individual components do not fit with the Aitchison geometry as measures of central tendency and dispersion. The sceptical reader might convince himself/herself by doing exercise 5.1 immediately. These statistics were defined as such in the framework of Euclidean geometry in real space, which is not a sensible geometry for compositional data. Therefore, it is necessary to introduce alternatives, which we find in the concepts of centre (Aitchison, 1997), variation matrix, and total variance (Aitchison, 1986).

Definition 5.1. A measure of central tendency for compositional data is the closed geometric mean. For a data set of size $n$ it is called centre and is defined as

$$
\mathbf{g} = \mathcal{C}\,[g_1, g_2, \ldots, g_D],
$$

with $g_i = \left( \prod_{j=1}^{n} x_{ij} \right)^{1/n}$, $i = 1, 2, \ldots, D$.

Note that in the definition of centre of a data set the geometric mean is considered column-wise (i.e. by parts), while in the clr transformation, given in equation (4.3), the geometric mean is considered row-wise (i.e. by samples).
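As a minimal sketch of Definition 5.1, the centre can be computed part by part in log space and then closed. The function names `closure` and `centre`, the closure constant `kappa` (set to 1 here) and the three-part data set are assumptions made for illustration; they do not come from the notes.

```python
import math

def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa (assumed constant)."""
    total = sum(parts)
    return [kappa * p / total for p in parts]

def centre(data, kappa=1.0):
    """Closed geometric mean of a data set stored as a list of samples (rows).

    data[j][i] holds part i of sample j, i.e. x_ij in the notation of
    Definition 5.1. The geometric mean is taken column-wise (by parts).
    """
    n = len(data)      # number of samples
    d = len(data[0])   # number of parts D
    # Geometric mean of each part: exp of the mean of the logs.
    g = [math.exp(sum(math.log(row[i]) for row in data) / n) for i in range(d)]
    return closure(g, kappa)

# Made-up three-part compositions, for illustration only.
samples = [
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
    [0.4, 0.2, 0.4],
]
g = centre(samples)
print(g)
```

Working with logarithms avoids forming the product of all $n$ values explicitly, which could underflow for large data sets. The loop runs over rows inside each column, matching the column-wise (by parts) reading of the definition.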

Definition 5.2. Dispersion in a compositional data set can be described either by the variation matrix, originally defined by Aitchison (1986) as

$$
\mathbf{T} =
\begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1D} \\
t_{21} & t_{22} & \cdots & t_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
t_{D1} & t_{D2} & \cdots & t_{DD}
\end{pmatrix},
\qquad
t_{ij} = \operatorname{var}\left( \ln \frac{x_i}{x_j} \right),
$$

or by the normalised variation matrix

$$
\mathbf{T}^* =
\begin{pmatrix}
t^*_{11} & t^*_{12} & \cdots & t^*_{1D} \\
t^*_{21} & t^*_{22} & \cdots & t^*_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
t^*_{D1} & t^*_{D2} & \cdots & t^*_{DD}
\end{pmatrix},
\qquad
t^*_{ij} = \operatorname{var}\left( \frac{1}{\sqrt{2}} \ln \frac{x_i}{x_j} \right).
$$

Thus, $t_{ij}$ stands for the usual experimental variance of the log-ratio of parts $i$ and $j$, while $t^*_{ij}$ stands for the usual experimental variance of the normalised log-ratio of parts $i$ and $j$, so that the log-ratio is a balance.

Note that

$$
t^*_{ij} = \operatorname{var}\left( \frac{1}{\sqrt{2}} \ln \frac{x_i}{x_j} \right) = \frac{1}{2}\, t_{ij},
$$

and thus $\mathbf{T}^* = \frac{1}{2}\mathbf{T}$. Normalised variations have squared Aitchison distance units (see Figure 3.3).
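The two matrices of Definition 5.2 can be sketched as follows. The helper names `sample_variance` and `variation_matrix` and the data set are hypothetical, and the divisor $n - 1$ in the variance is an assumption about what "usual experimental variance" means. The relation $t^*_{ij} = \frac{1}{2} t_{ij}$ does not depend on that choice, because scaling a variable by a constant $c$ scales its variance by $c^2$.

```python
import math

def sample_variance(values):
    """Variance with divisor n - 1 (assumed convention, not stated in the notes)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def variation_matrix(data, normalised=False):
    """Variation matrix of a data set stored as a list of samples (rows).

    Entry [i][j] is the variance of ln(x_i / x_j) over all samples, with the
    log-ratio multiplied by 1/sqrt(2) first when normalised=True.
    """
    d = len(data[0])
    scale = 1.0 / math.sqrt(2.0) if normalised else 1.0
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            log_ratios = [scale * math.log(row[i] / row[j]) for row in data]
            t[i][j] = sample_variance(log_ratios)
    return t

# Made-up three-part compositions, for illustration only.
samples = [
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
    [0.4, 0.2, 0.4],
    [0.3, 0.5, 0.2],
]
T = variation_matrix(samples)
T_star = variation_matrix(samples, normalised=True)
```

In this sketch the diagonal of `T` is zero, since $\ln(x_i / x_i) = 0$ for every sample, and `T` is symmetric up to rounding, since $\ln(x_j / x_i) = -\ln(x_i / x_j)$. Each entry of `T_star` should equal half the corresponding entry of `T`, again up to rounding.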
