13.07.2015 Views

The Genom of Homo sapiens.pdf




Table 2. Clustering of Genes and TUs Involved in Unconventional Gene Pairs at 5q31

Interval                             [DAMS-TRP7]   (TRP7-KLHL3)   [KLHL3-SIL1]   (SIL1-25P15.TU1)   [25P15.TU1-HARSL]
Size (kb)                            225           1260           1580           1200               300
Transcript models                    6             6              32             42                 16
Transcript models per 100 kb         2.7           0.5            2.0            3.5                5.0
Transcript models involved in UGPs   4             0              18             4                  11
% of transcript models in UGPs       67%           0%             56%            10%                69%

The term "transcript models" refers to both genes and TUs.

...randomly distributed along the genomic sequence in a way that is independent from gene density. Fourth, relative to genes, TUs are enriched in expressed repetitive elements, including primate-specific Alu and Mer1 repeats.

An analysis of UGP distribution in the 5q31 region exclusive of the PCDH clusters demonstrated that UGPs cluster in well-defined genomic intervals (Table 2). The intervals differ in the proportion of expressed features participating in UGPs, which represent the majority of features in some intervals and a very small minority in others. The UGP-enriched genomic intervals contain multiple types of genomic complexity. For example, the interval containing 38I10.TU1 also contained four consecutive features, each of which was oriented opposite to its neighbors and which formed three antisense pairs (Fig. 1B), a rare arrangement analogous to, but even more complex than, that seen in the human Surfeit locus (Duhig et al. 1998).

Table 3 indicates differences between known genes and novel TUs in ~4.5 Mb of 5q31. The biological reality of TUs is suggested by canonical transcript processing and the presence of multiple ESTs.
Nonetheless, TUs represent a radically different fraction of the transcriptome. For example, BLASTN and TBLASTX analysis of all TUs mapping to this 5q31 region, performed against the NT, EST, GSS, and HTGS databases, found nonhuman homology for most of the genes but for less than half of the TUs.

LARGE-SCALE VERIFICATION OF TRENDS FROM CHR. 5q31 ON CHR. 22

Perl-based High-throughput TU Discovery and UGP Analysis Pipeline

We hypothesized that these four trends are global and not region-specific. Therefore, we utilized 5q31 as a training set for chromosome-scale analysis of TUs and UGPs. We codified criteria developed for the 5q31 annotation into a three-stage Perl-based high-throughput automated annotation pipeline (Fig. 2). We modified the bioperl.org open-source BLAST parsers (Stajich et al. 2002) to make them aware of the transcriptional orientation of cDNAs and ESTs matching genomic sequences.

The first stage utilized these parsers to analyze all cDNA and EST BLAST matches against the query genomic sequence and determined which nongenic EST matches were TU-worthy (i.e., completely and precisely satisfied our operational definition of a TU). Only primary EST and cDNA evidence was used. We did not use third-party annotations or any curated reference transcripts bearing NM and XM designations.

In the second stage, all cDNA and TU-worthy EST matches were subjected to BLASTN against the entire human genomic sequence in the NR and HTGS databases. Matches with homologies to genomic regions other than the region in which they were originally identified, and with equal or higher BLAST scores associated with homologies to those other regions, were automatically eliminated due to their putatively segmentally duplicated or pseudogenic nature.
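
The second-stage elimination rule can be illustrated with a minimal Python sketch. This is not the authors' Perl code: the `Hit` record, the region labels, the accession-like identifiers, and the function name `filter_unique_matches` are all hypothetical, and the sketch assumes each BLAST hit has already been reduced to a query identifier, a region label, and a score. It keeps a query only if its best score in the region where it was originally identified is strictly higher than its best score anywhere else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit, reduced to the fields the rule needs (hypothetical layout)."""
    query_id: str  # identifier of the cDNA or EST
    region: str    # label of the genomic region that was hit
    score: float   # BLAST score of the hit


def filter_unique_matches(hits, home_region):
    """Return sorted query IDs whose best home-region score beats every other region.

    A query is dropped when any hit outside home_region has an equal or
    higher score, mirroring the elimination rule described in the text.
    """
    best_home = {}
    best_other = {}
    for hit in hits:
        table = best_home if hit.region == home_region else best_other
        previous = table.get(hit.query_id, float("-inf"))
        table[hit.query_id] = max(previous, hit.score)

    kept = []
    for query_id, home_score in best_home.items():
        other_score = best_other.get(query_id)
        if other_score is not None and other_score >= home_score:
            continue  # putatively duplicated or pseudogenic: eliminate
        kept.append(query_id)
    return sorted(kept)


# Illustrative data only; identifiers and scores are invented.
example_hits = [
    Hit("EST_A", "chr22", 950.0),
    Hit("EST_A", "chr14", 400.0),  # weaker hit elsewhere: EST_A is kept
    Hit("EST_B", "chr22", 800.0),
    Hit("EST_B", "chr2", 800.0),   # equal score elsewhere: EST_B is dropped
    Hit("EST_C", "chr22", 700.0),  # no hit elsewhere: EST_C is kept
]
kept_queries = filter_unique_matches(example_hits, "chr22")
print(kept_queries)
```

Treating a tie as grounds for elimination follows the "equal or higher" wording above; a real pipeline would also have to decide how region boundaries are defined, which the text does not specify.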
The third stage automatically compiled complete exon–intron structures for every gene and TU, quoting exact coordinates of every element of the structure on the genomic sequence, accession numbers of cDNAs or ESTs supporting each exon of each gene and TU, and the extent of apparent involvement of the gene or TU in UGPs in a table suitable for manual curation.

We selected human chr. 22 for chromosome-scale validation of 5q31 trends because chr. 22 is small and thoroughly annotated, which facilitates comparisons with other algorithms (Collins et al. 2003), and is representative of other chromosomes in terms of segmental duplications (Bailey et al. 2002), low-copy repeats (McDermid and Morrow 2002), and gene family expansions (Coggan et al. 1998; Jarmuz et al. 2002).

Table 3. Differences between Known Genes and Novel TUs at 5q31

                                                   Known genes   Novel TUs    P-value for difference of genes and TUs
N                                                  54            47
With homology to nonhuman DNA                      52            17           0.0001
Length (amino acid) of longest sense-strand ORF    496 ± 47.3    64 ± 4.4     0.0001
% of reference transcript in expressed repeats     5.3 ± 1.6     21.4 ± 3.8   0.0011
ESTs per gene or TU                                201 ± 26.7    6 ± 1.0      0.0001

Standard errors (calculated by SPSS v10, GLM parameter estimates module) are given after ±.
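
As a quick arithmetic check, the nonhuman-homology counts in Table 3 can be converted to fractions. The short Python sketch below uses only the counts printed in the table; the function name `homology_fraction` is an illustrative choice, and the calculation does not attempt to reproduce the reported P-values.

```python
# Counts copied from Table 3: (with nonhuman homology, N).
table3_counts = {
    "known genes": (52, 54),
    "novel TUs": (17, 47),
}


def homology_fraction(with_homology, total):
    """Fraction of transcript models with homology to nonhuman DNA."""
    return with_homology / total


for label, (with_homology, total) in table3_counts.items():
    fraction = homology_fraction(with_homology, total)
    print(f"{label}: {with_homology}/{total} = {fraction:.1%}")
```

The resulting fractions, roughly 96% of known genes and 36% of novel TUs, agree with the statement that nonhuman homology was found for most of the genes but for less than half of the TUs.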
