18.12.2012 Views

Myeloid Leukemia



230 Kohlmann et al.

predictions for each sample (so that each sample is classified once in the n iterations).

3.4.2.2. 10-FOLD CV

Ten-fold CV is another method used to estimate the apparent accuracy, i.e., the overall rate of correct predictions for the complete data set. For this classification task, the data set is divided into 10 equally sized subsets, balanced for the respective subclasses of the data. Differentially expressed genes are then identified in the training set (9 subsets), and a model is trained on the top genes that demonstrate differential expression between the respective subclasses in the training set. This model is used to generate predictions for the remaining subset. The training and prediction process is repeated 10 times, holding out a different subset each time, so that each sample is classified exactly once in the 10 iterations.
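The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the text does not name the gene-ranking statistic or the classifier used for CV, so the sketch assumes a simple ranking by the spread of class means and a nearest-centroid classifier as stand-ins. All function names (`stratified_folds`, `select_top_genes`, `cross_validate`) and the `n_top` parameter are hypothetical. The point it shows is that gene selection is repeated inside every iteration using the 9 training subsets only.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, balanced for each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the samples of this class round-robin across the folds.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


def class_means(X, y, rows, genes):
    """Mean expression of each gene in `genes` per class, using only `rows`."""
    grouped = defaultdict(list)
    for i in rows:
        grouped[y[i]].append(X[i])
    return {c: [mean(s[g] for s in samples) for g in genes]
            for c, samples in grouped.items()}


def select_top_genes(X, y, rows, n_top):
    """Rank genes by the spread of their class means (assumed stand-in score)."""
    all_genes = list(range(len(X[0])))
    means = class_means(X, y, rows, all_genes)
    spread = [max(m[g] for m in means.values()) - min(m[g] for m in means.values())
              for g in all_genes]
    return sorted(all_genes, key=lambda g: spread[g], reverse=True)[:n_top]


def predict(centroids, genes, sample):
    """Nearest-centroid prediction on the selected genes (assumed stand-in model)."""
    def sq_dist(c):
        return sum((sample[g] - m) ** 2 for g, m in zip(genes, centroids[c]))
    return min(sorted(centroids), key=sq_dist)


def cross_validate(X, y, k=10, n_top=2, seed=0):
    """Return one held-out prediction per sample and the overall accuracy."""
    folds = stratified_folds(y, k, seed)
    predictions = [None] * len(y)
    for test_rows in folds:
        held_out = set(test_rows)
        train_rows = [i for i in range(len(y)) if i not in held_out]
        # Gene selection and training see the training subsets only.
        genes = select_top_genes(X, y, train_rows, n_top)
        centroids = class_means(X, y, train_rows, genes)
        for i in test_rows:
            predictions[i] = predict(centroids, genes, X[i])
    accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
    return predictions, accuracy
```

Here `X` is assumed to be a list of samples, each a list of expression values, and `y` the matching list of subclass labels. Selecting genes on the full data set before splitting would let information from the held-out subset leak into the model, which is why the selection step sits inside the loop.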

3.4.2.3. RESAMPLING ANALYSIS

A resampling approach can be used to assess the robustness of class predictions. Here again, the data set is split randomly, but balanced for the respective subtypes, into a training set consisting of two-thirds of the samples and an independent test set containing the remaining third. Differentially expressed genes are identified in the training set, a support vector machine (SVM) model is built from the training set, and predictions are made for the test set. This complete process is repeated 100 times. By this means, 95% confidence intervals for accuracy, sensitivity, and specificity can also be estimated. Sensitivity and specificity are calculated as follows:

Sensitivity = (number of positive samples correctly predicted as positive)/(number of truly positive samples)

Specificity = (number of negative samples correctly predicted as negative)/(number of truly negative samples)
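The two ratios, the balanced two-thirds split, and the interval estimate can be sketched as follows. This is an illustrative sketch under stated assumptions: the text does not say how the 95% confidence intervals are derived from the 100 repetitions, so an empirical percentile interval is assumed here, and the names `stratified_split`, `sensitivity_specificity`, and `percentile_interval` are hypothetical. Gene selection and the SVM itself are omitted; the functions only cover the splitting and scoring steps.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split sample indices into training and test sets, balanced per subtype."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def sensitivity_specificity(true_labels, predicted_labels, positive):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def percentile_interval(values, level=0.95):
    """Empirical interval covering `level` of the resampled values (assumed method)."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    low = ordered[math.floor(tail * (len(ordered) - 1))]
    high = ordered[math.ceil((1.0 - tail) * (len(ordered) - 1))]
    return low, high
```

In use, one would call `stratified_split` with a different `seed` for each of the 100 repetitions, train on the training indices, score the test predictions with `sensitivity_specificity`, and pass the 100 resulting values of each metric to `percentile_interval`.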

3.4.3. Hierarchical Clustering

Two-dimensional hierarchical cluster analysis is a popular method of organizing expression data, i.e., arranging genes and patients according to similarity in their patterns of gene expression (Fig. 3). This method organizes, but does not alter, the tables containing the primary expression data. The output is a graphic display that conveys the clustering and the underlying expression data to biologists in an intuitive form (14). Given a mathematical description of similarity, the object of this algorithm is to compute a dendrogram that assembles all elements into a single tree. For any set of n genes, an upper-diagonal similarity matrix, which contains similarity scores for all pairs of genes, is computed using the Euclidean distance metric. The matrix
