18.12.2012 Views

2012 EDUCATIONAL BOOK - American Society of Clinical Oncology


GENOMIC AND PROTEOMIC TECHNOLOGIES AND BIOMARKERS

Fig 1. Analysis steps. Genomics and proteomics analyses from all measurement platforms should follow steps (a)–(f) to obtain clinically relevant genomic signatures. (a) Sample experimental design accounting for technical artifacts in batches (e.g., processing date, lab technician, etc.). (b) Sample raw output from Affymetrix microarray. (c) Comparing the distribution of genomic data in each sample before and after normalization, at which point measurements for each sample should be on the same scale. (d) Dendrogram for clustering of all data colored by batch to identify artifacts in the data. (e) Heatmap of genes that the analysis identified as distinguishing tumor and normal samples (green represents low measurement values and red high). (f) State predicted based on gene signature from the analysis in (e) for a new test set of tumor and normal samples.

Approaches to Genomic and Proteomic Data Analyses

All genomic and proteomic analyses generally follow the stages outlined in Figure 1. We discuss these stages using genomic analyses as the example, because the methods are better developed for genomic data, although the same approaches apply to proteomic datasets as well. First, careful experimental design is required to ensure that the measurements collected can quantify the hypothesized genomic state or differences in tumors and that the number of samples is large enough to provide sufficient statistical power. Moreover, technical elements of the experimental design, including processing date, lab technician, and processing center, can introduce significant artifacts known as batch effects into the data [11]. These batch effects are present in all genomics data, including data from sequencing technologies. Therefore, the experimental design should ensure that samples for each of the biologic conditions are processed in each of the technical conditions (Fig. 1A). If such representation is impossible because of the size or scope of the experiment, the experiment should adopt a randomized block design. Once collected, the raw data (Fig. 1B) must be preprocessed to convert the raw output of the platforms into identifiable genomic or proteomic measurements that are comparable across experiments (Fig. 1C), including statistical removal of, or control for, batch effects (Fig. 1D). Only after that can computational algorithms be applied to infer gene-expression signatures pertinent to cancer (Fig. 1E) [6]. Finally, as with any statistical analysis, genomic analyses require careful cross-validation to ensure the relevance of the inferred signatures to future datasets collected from different biologic samples under similar experimental conditions (Fig. 1F). Further, directed experimentation is required to prove the functional mechanisms suggested by the inferred signatures.
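The design principle described in this paragraph, that every biologic condition should be represented in every technical batch, can be illustrated with a minimal sketch. It is not taken from the source: the function name `assign_batches`, the sample IDs, and the choice of three batches are all hypothetical. It shows one simple way to approximate a randomized block design, treating each batch as a block and randomizing which sample of a given condition goes into which block.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, seed=0):
    """Spread each biologic condition as evenly as possible over batches.

    samples: dict mapping sample ID -> biologic condition (e.g. "tumor").
    Returns a dict mapping sample ID -> batch index in range(n_batches).
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample_id, condition in samples.items():
        by_condition[condition].append(sample_id)

    assignment = {}
    offset = 0
    for condition in sorted(by_condition):
        ids = sorted(by_condition[condition])
        rng.shuffle(ids)  # randomize which sample lands in which batch
        for i, sample_id in enumerate(ids):
            # Deal samples round-robin so no batch holds only one condition.
            assignment[sample_id] = (i + offset) % n_batches
        offset += len(ids)  # keep batch sizes balanced across conditions
    return assignment


# Toy example (assumed data): 6 tumor and 6 normal samples, 3 batches.
samples = {f"T{i}": "tumor" for i in range(6)}
samples.update({f"N{i}": "normal" for i in range(6)})
assignment = assign_batches(samples, n_batches=3)
for batch in range(3):
    members = sorted(s for s in samples if assignment[s] == batch)
    print(batch, members)
```

With equal group sizes, each batch receives the same number of tumor and normal samples, so a batch effect cannot be mistaken for a tumor-versus-normal difference. A real design would usually also balance other covariates, which this sketch ignores.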

In general, there are three main analysis approaches: unsupervised, supervised, and class-prediction algorithms. Unsupervised (or class-discovery) algorithms seek gene- or protein-expression signatures that distinguish samples without regard for the biologic conditions in the experimental design. Hierarchical clustering has become perhaps the most widely adopted algorithm for unsupervised genomic analysis [12]. These algorithms compute distances between sets of genes or samples. Visualizing the resulting clusters across samples in dendrograms (Fig. 2A) makes it possible to compare biologic samples measured under different technical conditions, which makes dendrograms particularly useful for identifying batch effects and for judging the success of algorithms that remove those effects (Fig. 1D).
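To make the use of hierarchical clustering for spotting batch effects concrete, the following is a small sketch under stated assumptions. The source does not specify a distance measure or linkage rule. Here we assume a distance of one minus the Pearson correlation and average linkage, and the function names and toy expression values are invented for illustration. The code clusters samples and then reports which batches appear in each cluster.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of samples; returns a list of name sets."""
    names = list(profiles)
    dist = {
        frozenset((a, b)): pearson_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [{name} for name in names]

    def linkage(c1, c2):
        # Average of all pairwise sample distances between two clusters.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters, as a dendrogram does bottom-up.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy data (assumed): tumor (T) and normal (N) samples from batches A and B.
# Batch B carries a strong artificial shift in the last three genes.
profiles = {
    "T1_A": [6, 7, 7, 4, 5, 6],
    "N1_A": [5, 6, 7, 4, 5, 6],
    "T2_B": [6, 7, 7, 8, 9, 10],
    "N2_B": [5, 6, 7, 8, 9, 10],
}
batch = {name: name[-1] for name in profiles}

clusters = average_linkage(profiles, n_clusters=2)
for members in clusters:
    print(sorted(members), "batches:", sorted({batch[m] for m in members}))
```

The toy values are constructed so that the samples group by batch rather than by tumor or normal status, which is the pattern that would flag a batch effect in a dendrogram colored by batch.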

Once batch effects are removed, the clustering will ideally classify samples so as to distinguish patients with different clinical outcomes. Likewise, genes that cluster together can be hypothesized to function similarly. Similar utility can be gained from another form of clustering, k-means clustering. This algorithm divides genes or samples from the data into a predetermined
