Lecture Notes on Compositional Data Analysis - Sedimentology ...
Chapter 5. Exploratory data analysis
5.2 Centre, total variance and variation matrix
Standard descriptive statistics are not very informative for compositional data. In particular, the arithmetic mean and the variance or standard deviation of individual components do not fit with the Aitchison geometry as measures of central tendency and dispersion. The sceptical reader can verify this by doing exercise 5.1 right away. These statistics were defined within the framework of Euclidean geometry in real space, which is not a sensible geometry for compositional data. It is therefore necessary to introduce alternatives, which we find in the concepts of centre (Aitchison, 1997), variation matrix, and total variance (Aitchison, 1986).
Definition 5.1. A measure of central tendency for compositional data is the closed geometric mean. For a data set of size $n$ it is called the centre and is defined as
\[
\mathbf{g} = \mathcal{C}\,[g_1, g_2, \ldots, g_D],
\]
with $g_i = \left( \prod_{j=1}^{n} x_{ij} \right)^{1/n}$, $i = 1, 2, \ldots, D$.
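Definition 5.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `closure` and `centre`, the closure constant `kappa`, the layout of the array (one sample per row, one part per column), and the numerical values are all assumptions made for the example.

```python
import numpy as np

def closure(x, kappa=1.0):
    """Rescale a vector of positive parts so that it sums to kappa."""
    x = np.asarray(x, dtype=float)
    return kappa * x / x.sum()

def centre(X, kappa=1.0):
    """Closed geometric mean of a data set X (rows = samples, columns = parts)."""
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("all parts must be strictly positive")
    # Geometric mean of each part over the n samples, computed through logs.
    g = np.exp(np.log(X).mean(axis=0))
    return closure(g, kappa)

# Hypothetical three-part data set with four samples (illustrative values only).
X = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.25, 0.15],
    [0.50, 0.30, 0.20],
    [0.65, 0.25, 0.10],
])
print(centre(X))
```

Because the result is closed, rescaling every sample by a constant (for instance, switching from proportions to percentages) leaves the centre unchanged.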
Note that in the definition of the centre of a data set the geometric mean is taken column-wise (i.e. by parts), while in the clr transformation, given in equation (4.3), the geometric mean is taken row-wise (i.e. by samples).
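The distinction between the two geometric means can be made concrete with array axes. The sketch below assumes a data matrix with one sample per row and one part per column, assumes the usual form of the clr (the logarithm of each part divided by the geometric mean of its own sample), and uses made-up values.

```python
import numpy as np

# Hypothetical data matrix: 4 samples (rows) by 3 parts (columns).
X = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.25, 0.15],
    [0.50, 0.30, 0.20],
    [0.65, 0.25, 0.10],
])
logX = np.log(X)

# Column-wise (by parts): one geometric mean per part, as used for the centre.
g_parts = np.exp(logX.mean(axis=0))      # shape (D,) = (3,)

# Row-wise (by samples): one geometric mean per sample, as used in the clr.
g_samples = np.exp(logX.mean(axis=1))    # shape (n,) = (4,)

# clr coefficients: log of each part over the geometric mean of its own sample.
clr = logX - logX.mean(axis=1, keepdims=True)

print(g_parts.shape, g_samples.shape, clr.shape)
```

With this layout, `axis=0` averages over samples and `axis=1` averages over parts, so the clr coefficients of every sample sum to zero.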
Definition 5.2. Dispersion in a compositional data set can be described either by the variation matrix, originally defined by Aitchison (1986) as
\[
\mathbf{T} =
\begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1D} \\
t_{21} & t_{22} & \cdots & t_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
t_{D1} & t_{D2} & \cdots & t_{DD}
\end{pmatrix},
\qquad
t_{ij} = \operatorname{var}\left( \ln \frac{x_i}{x_j} \right),
\]
or by the normalised variation matrix
\[
\mathbf{T}^* =
\begin{pmatrix}
t^*_{11} & t^*_{12} & \cdots & t^*_{1D} \\
t^*_{21} & t^*_{22} & \cdots & t^*_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
t^*_{D1} & t^*_{D2} & \cdots & t^*_{DD}
\end{pmatrix},
\qquad
t^*_{ij} = \operatorname{var}\left( \frac{1}{\sqrt{2}} \ln \frac{x_i}{x_j} \right).
\]
Thus, $t_{ij}$ stands for the usual experimental variance of the log-ratio of parts $i$ and $j$, while $t^*_{ij}$ stands for the usual experimental variance of the normalised log-ratio of parts $i$ and $j$, normalised so that the log-ratio is a balance.
Note that, since $\operatorname{var}(aY) = a^2 \operatorname{var}(Y)$ for any constant $a$,
\[
t^*_{ij} = \operatorname{var}\left( \frac{1}{\sqrt{2}} \ln \frac{x_i}{x_j} \right) = \frac{1}{2}\, t_{ij},
\]
and thus $\mathbf{T}^* = \frac{1}{2} \mathbf{T}$. Normalised variations have squared Aitchison distance units (see Figure 3.3).
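The two matrices of Definition 5.2 can be computed directly from the pairwise log-ratios. The following is a hedged sketch: the function name `variation_matrix`, its parameters, the sample-per-row layout, and the data are assumptions for illustration, and `ddof=1` (divisor $n - 1$) is one possible reading of "usual experimental variance", which the text does not pin down.

```python
import numpy as np

def variation_matrix(X, normalised=False, ddof=1):
    """Variation matrix of X (rows = samples, columns = parts).

    Entry (i, j) is the variance over samples of ln(x_i / x_j), or of
    (1 / sqrt(2)) * ln(x_i / x_j) when normalised=True.
    ddof=1 is an assumption about the variance estimator.
    """
    X = np.asarray(X, dtype=float)
    if np.any(X <= 0):
        raise ValueError("all parts must be strictly positive")
    logX = np.log(X)
    # Pairwise log-ratios: L[k, i, j] = ln(x_ki / x_kj) for sample k.
    L = logX[:, :, None] - logX[:, None, :]
    if normalised:
        L = L / np.sqrt(2.0)
    return L.var(axis=0, ddof=ddof)

# Hypothetical three-part data set with four samples (illustrative values only).
X = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.25, 0.15],
    [0.50, 0.30, 0.20],
    [0.65, 0.25, 0.10],
])
T = variation_matrix(X)
T_star = variation_matrix(X, normalised=True)
print(T)
print(T_star)
```

On any strictly positive data set the result should be symmetric with a zero diagonal, and `T_star` should equal `T / 2` up to rounding, in line with the relation between the two matrices stated above.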