Modeling Transcription Factor Target Promoters 137

class representation in that group. The resulting model is a highly interpretable decision tree, which helps design further experiments. Some of the principal limitations of CART are low accuracy (because of the use of piece-wise constant approximations) and high variance or instability. In particular, when the number of variables (TFBSs) is much larger than the number of observations (promoters), CART would fail to give a robust classification model (see Note 3).

In order to limit the number of variables for CART analysis, one can use the Random Forest program (65) to preselect the most discriminative variables from a large number of input variables. Random Forest is an ensemble of many decision trees, such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. To classify a new object from an input vector, the algorithm applies the input vector to each tree of the forest. Each tree is a separate classification model, and each tree "votes" for a class. The forest then chooses the classification having the most votes over all of the trees in the forest. The forest error rate depends on the correlation between any two trees in the forest (increasing the correlation increases the forest error rate) and the strength of each individual tree in the forest (a tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate).

Random Forest can handle thousands of input variables without variable selection and gives estimates of which variables are important in the classification. Although Random Forest is a robust classifier, the black-box nature of the algorithm makes it impracticable to infer the decision rules from thousands of trees. In the present case, it is critical to understand the interaction of variables (TFs) that provides the predictive accuracy.
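The majority-vote scheme described above can be sketched in a few lines of Python. The toy "forest" below consists of three hand-written decision stumps over binary TFBS indicators; it is purely illustrative and is not the Random Forest implementation cited in the text.

```python
from collections import Counter

# Hypothetical toy trees: each maps a feature vector of binary TFBS
# indicators (1 = site present, 0 = absent) to a class label.
def tree_a(x): return "target" if x[0] == 1 else "non-target"
def tree_b(x): return "target" if x[1] == 1 else "non-target"
def tree_c(x): return "target" if x[0] == 1 and x[2] == 1 else "non-target"

def forest_predict(trees, x):
    """Apply the input vector to every tree and return the majority vote."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

forest = [tree_a, tree_b, tree_c]
print(forest_predict(forest, [1, 0, 1]))  # two of three trees vote "target"
```

In a real forest each tree is grown on a bootstrap sample with random feature subsets, but the final prediction step is exactly this vote over all trees.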
Hence, the use of Random Forest for variable selection followed by application of the CART algorithm is recommended. The commercially available CART program (66) is perhaps the best and most user-friendly, and the authors have used it in their earlier studies (3,24). If the commercial program is not available, the user may use rpart, a free implementation of CART in the R statistical package. Similarly, the freely available implementation of Random Forest in R can be used for variable selection. The author suggests the "Gini" method as the splitting method for growing the tree and 10-fold cross-validation to obtain the optimal minimal tree. TFBSs predicted by MATCH and conserved in the human and mouse orthologous promoters can be used as predictor variables, wherein each binding site may be considered a binary variable, such that it is either 1 or 0, depending on its presence or absence within a specified region.

3.2. Worked Example

Various methods to predict TFBSs in a given promoter, as well as decision tree classification methods, were discussed in the previous sections. Now will be
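To make the recommended setup concrete, the Gini splitting criterion applied to binary presence/absence variables can be sketched as follows. This is a minimal stand-alone sketch with made-up promoter data, not the rpart or CART implementation; tree growing, pruning, and 10-fold cross-validation are omitted.

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return the index of the binary variable whose 0/1 split
    minimizes the weighted Gini impurity of the two child nodes."""
    n = len(y)
    return min(
        range(len(X[0])),
        key=lambda j: sum(
            (len(side) / n) * gini(side)
            for side in (
                [y[i] for i in range(n) if X[i][j] == 0],
                [y[i] for i in range(n) if X[i][j] == 1],
            )
        ),
    )

# Hypothetical data. Rows: promoters; columns: presence (1) or
# absence (0) of each conserved TFBS within the specified region.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = ["target", "target", "non-target", "non-target"]
print(best_split(X, y))  # variable 0 separates the two classes perfectly
```

CART grows the tree by applying this choice recursively to each child node, which is what makes the final model readable as a set of TFBS presence/absence rules.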
