Computational Analysis and Paleogenomics of Interspersed ...
the biological termini of repeats (Bao and Eddy, 2002), this remains an inherently difficult problem for the downstream steps of genome annotation, and in particular for the automated classification of the repeats. In theory, short seed extension approaches such as those used by RepeatScout or ReaS should be less sensitive to this problem and generate more faithfully defined consensus sequences. Nevertheless, these programs still appear to suffer from the problem of repeat fragmentation, in particular when low-coverage whole-genome shotgun sequences are used as input (N. Ranganathan and C. F., unpublished observation).

Classification of the repeats

Recent advances in the development of de novo repeat identification tools have been fueled by the pressing need to automate the construction of repeat libraries to assist in the assembly of the multitude of genome sequencing projects. All of the programs described above, in particular when used in combination (Quesneville et al., 2005), have the potential to greatly accelerate the construction of consensus repeat libraries, and as such they should largely fulfill the needs for raw repeat masking and assisting in genome assembly. However, the output produced by programs such as RECON (Bao and Eddy, 2002) or RepeatScout (Price et al., 2005) provides little if any biological information on the repeats themselves and therefore cannot be used to annotate and classify the repeats.
At present, there is no program to completely automate this tedious yet biologically important phase of genome annotation.

Developing such software represents a very challenging issue in computational biology. Not only are there currently only a few experts whose knowledge of the elements can cover the extreme diversity of TEs, but also new groups of elements are continuously being discovered and existing groups being reclassified, merged, or subdivided. Consequently, the body of TE literature is enormous and rapidly expanding. As of September 2005 a single keyword search in Medline using the combination “(retro)transposable OR (retro)transposon” retrieves over 21 000 publications, including ~1500 review articles! Thus, synthesizing the diversity of TEs into a rational and comprehensive classification upon which one could develop an automatic system of repeat annotation is extremely difficult.

As mentioned earlier, TEs are divided into two classes depending on their transposition intermediate. Class 1 elements transpose via reverse-transcription of a transcribed RNA intermediate, while class 2 elements move directly through a DNA intermediate (Finnegan, 1989).
The two classes are further divided into distinct subclasses and a plethora of clades and superfamilies based on structural features related to other aspects of their transposition mechanism. For example, non-LTR retrotransposons use a target-primed mechanism of transposition in which reverse-transcription and integration are coupled, while LTR retrotransposons are reverse transcribed within a virus-like particle and the resulting cDNA is integrated into the genome as a DNA intermediate.
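
The two-level scheme described above (a top-level class set by the transposition intermediate, then groups distinguished by mechanism) can be sketched as a small data structure. This is only an illustrative sketch: the names `TEClass`, `TEGroup`, `GROUPS`, `class_from_intermediate`, and `groups_in` are hypothetical and do not come from any existing annotation tool, and only the groups named in the text are listed.

```python
from dataclasses import dataclass
from enum import Enum


class TEClass(Enum):
    """Top-level TE classes, split by transposition intermediate (Finnegan, 1989)."""
    CLASS_1 = "RNA intermediate, copied by reverse transcription"
    CLASS_2 = "DNA intermediate"


@dataclass(frozen=True)
class TEGroup:
    """One group below the class level (hypothetical record layout)."""
    name: str
    te_class: TEClass
    mechanism: str


# Only the two groups mentioned in the text; a real library would hold many more.
GROUPS = [
    TEGroup(
        "non-LTR retrotransposon",
        TEClass.CLASS_1,
        "target-primed; reverse transcription and integration are coupled",
    ),
    TEGroup(
        "LTR retrotransposon",
        TEClass.CLASS_1,
        "reverse transcribed within a virus-like particle; cDNA then integrated",
    ),
]


def class_from_intermediate(intermediate: str) -> TEClass:
    """Map a transposition intermediate ('RNA' or 'DNA') to its TE class."""
    key = intermediate.strip().upper()
    if key == "RNA":
        return TEClass.CLASS_1
    if key == "DNA":
        return TEClass.CLASS_2
    raise ValueError(f"unknown transposition intermediate: {intermediate!r}")


def groups_in(te_class: TEClass) -> list[str]:
    """Return the names of the listed groups that belong to the given class."""
    return [g.name for g in GROUPS if g.te_class is te_class]
```

For instance, `groups_in(class_from_intermediate("RNA"))` returns both retrotransposon groups, while the class 2 list is empty here simply because the text names no class 2 group at this point. A lookup table like this captures only the bookkeeping; assigning a raw consensus sequence to a group is the hard, still unautomated step the text describes.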