PDF - White Rose Etheses Online

Clustering Large Raw DNA Sequencing Datasets by Species of Origin using Signature Features of Genomic Sequence Composition
Tobias Hodges
PhD, University of York, Biology
July 2012

  • Page 3: Abstract The establishment of high-
  • Page 6 and 7: Table of Contents simLC............
  • Page 8 and 9: Table of Contents Chapter 5: A comp
  • Page 10 and 11: List of Tables and Figures Chapter
  • Page 12 and 13: Figure 3.10 Proportion of sequencin
  • Page 14 and 15: Acknowledgements I would like to ex
  • Page 17 and 18: 1 Abstract Introduction The size an
  • Page 19 and 20: Chapter 1 - Context research that h
  • Page 21 and 22: increases the throughput accordingl
  • Page 23 and 24: The SOLiD system uses sets of label
  • Page 25 and 26: in intensity of emitted fluorescenc
  • Page 27 and 28: species. Reads are aligned to the r
  • Page 29 and 30: of fragments to be determined as qu
  • Page 31 and 32: SNP analysis The large-scale identi
  • Page 33 and 34: common reference point between geno
  • Page 35 and 36: Example Metagenomic Projects and Da
  • Page 37 and 38: Many other studies have been carrie
  • Page 39 and 40: One of the major challenges associa
  • Page 41 and 42: elieve to be a novel bacterial and
  • Page 43 and 44: for Biotechnology Information (NCBI
  • Page 45 and 46: for comparison is limited by the av
  • Page 47 and 48: these supervised methods suffer fro
  • Page 49: Grouping of sequencing datasets usi
  • Page 52 and 53: Introduction The major theme of thi
  • Page 54 and 55: GC content shows little variation t
  • Page 56: 1st letter Chapter 2 - Introduction
  • Page 59 and 60: ways to maximise the information pr
  • Page 61 and 62: epeat x r Trim sequence to length C
  • Page 64 and 65: egions in DNA sequences. A binary i
  • Page 66 and 67: First, a simplistic dataset was use
  • Page 68: The mean length of sequencing reads
  • Page 71 and 72: Materials and Methods Preparation o
  • Page 73 and 74: Precision (Pr) and recall (Rc) of c
  • Page 75 and 76: Results Clustering of Arabidopsis t
  • Page 77 and 78: Sequences in cluster 15000 11250 75
  • Page 79 and 80: Table 2.2 Mean precision values of
  • Page 81 and 82: Effect of increasing sequence size
  • Page 83 and 84: Clustering of simLC Clustering of r
  • Page 85 and 86: Clustering of randomised sequences
  • Page 87 and 88: Table 2.5 Mean precision values of
  • Page 89 and 90: This clustering analysis applied to
  • Page 91 and 92: taxonomic grouping throughout the d
  • Page 93 and 94: This hybrid classification was used
  • Page 95 and 96: GC Cluster 1 Cluster 2 Cluster 4 Cl
  • Page 97 and 98: OFDEG Cluster 1 Cluster 4 Cluster 2
  • Page 99 and 100: GC + IND Cluster 1 Cluster 4 Cluste
  • Page 101 and 102: GC + TNF Cluster 1 Cluster 4 Cluste
  • Page 103 and 104: IND + TNF Cluster 1 Cluster 4 Clust
  • Page 105 and 106: GC + IND + OFDEG Cluster 1 Cluster
  • Page 107 and 108: GC + OFDEG + TNF Cluster 1 Cluster
  • Page 109 and 110: GC + IND + OFDEG + TNF Cluster 1 Cl
  • Page 111 and 112: clusters produced (or three, in the
  • Page 113 and 114: product of the large size of this c
  • Page 115 and 116: Discussion Dataset 1 At this early
  • Page 117 and 118: vectors. simLC The levels of succes
  • Page 119 and 120: separately, of non-proteobacterial
  • Page 121 and 122: Limitations of the sequence feature
  • Page 123: Taking these factors into considera
  • Page 126 and 127: Introduction In the previous chapte
  • Page 128 and 129: species including A. thaliana (Sosn
  • Page 132 and 133: TaqMan assay for each pathogen, and
  • Page 134 and 135: exonuclease activity of the Taq pol
  • Page 136 and 137: Table 3.1 Reagents and volumes used
  • Page 138 and 139: Preparation of cDNA for sequencing
  • Page 140 and 141: SSAHA2 is a sequence alignment prog
  • Page 142 and 143: those from samples inoculated by ru
  • Page 144 and 145: Table 3.7 Mean threshold fluorescen
  • Page 146 and 147: qRT-PCR Analysis of RNA Extracted w
  • Page 148 and 149: The COX assay-derived Ct values obs
  • Page 150 and 151: Two conclusions were drawn from the
  • Page 152 and 153: acterial/viral DNA or RNA in the ex
  • Page 154 and 155: qRT-PCR analysis of RNA extracts in
  • Page 156 and 157: quantify these differences between
  • Page 158 and 159: qRT-PCR Analysis of Untreated Plant
  • Page 160 and 161: qRT-PCR Analysis of CMV Inoculated
  • Page 162 and 163: Table 3.10 Number of sequencing rea
  • Page 164 and 165: Untreated Plant DNA Sequence Breakd
  • Page 166 and 167: Proportion of sequence reads assign
  • Page 168 and 169: Table 3.11 Number of sequencing rea
  • Page 170 and 171: Untreated Plant cDNA Sequence Break
  • Page 172 and 173: Proportion of sequence reads assign
  • Page 174 and 175: material present in relatively smal
  • Page 176 and 177: Conclusion and future work The aim
  • Page 179 and 180: 4 A comparison of genomic signature
  • Page 181 and 182: eads, as similar as possible to tha
  • Page 183 and 184: 0.8237 (0.2116). The Phred30 score
  • Page 185 and 186: Materials and Methods Dataset prepa
  • Page 187 and 188: Generation of feature vectors GC, I
  • Page 189 and 190: If perfect clustering were to be ac
  • Page 191 and 192: IND OFDEG Cluster 1 UT (A. thaliana
  • Page 193 and 194: GC + OFDEG GC + TNF Cluster 1 UT (A
  • Page 195 and 196: OFDEG + TNF Cluster 1 GC + IND + OF
  • Page 197 and 198: IND + OFDEG + TNF Cluster 1 UT (A.
  • Page 199 and 200: grouped into a cluster with the maj
  • Page 201 and 202: Figure 4.3(i) - 4.3(xv) Comparative
  • Page 203 and 204: IND + TNF Cluster 1 Cluster 4* Clus
  • Page 205 and 206: GC + IND + OFDEG (xi) Cluster 1 Clu
  • Page 207 and 208: GC + OFDEG + TNF (xiii) Cluster 1 C
  • Page 209 and 210: GC + IND + OFDEG + TNF Cluster 1 Cl
  • Page 211 and 212: clusters was almost identical: the
  • Page 213 and 214: protection against local variabilit
  • Page 216 and 217: 5 A comparison of clustering method
  • Page 218 and 219: k-Means and other partitioning clus
  • Page 220 and 221: clustering where the number of grou
  • Page 222 and 223: Hierarchical clustering Unlike part
  • Page 224 and 225: The use of distance between one poi
  • Page 226 and 227: grid ‘learns’ the data, with di
  • Page 228 and 229: As mentioned previously, TNF featur
  • Page 230 and 231: Materials and Methods Generation of
  • Page 232 and 233: Results Parameter selection for FCM
  • Page 234 and 235: Table 5.2 Results of FCM clustering
  • Page 236 and 237: The Pr and Rc values plotted in Fig
  • Page 238 and 239: Figure 5.2 Results of KASP spectral
  • Page 240 and 241: The plots of these statistics in Fi
  • Page 242 and 243: Figure 5.3 The number of sequencing
  • Page 244 and 245: Figure 5.4 The number of sequencing
  • Page 246 and 247: Figure 5.5 The number of sequencing
  • Page 248 and 249: From these results it was clear tha
  • Page 250 and 251: Table 5.3 Precision and recall stat
  • Page 252 and 253: Table 5.5 Precision and recall stat
  • Page 254 and 255: The results indicated that very sim
  • Page 256 and 257: Pseudomonas reads was associated wi
  • Page 258: and increases in computing power al
  • Page 261 and 262: Introduction The previous work in t
  • Page 263 and 264: et al. 2002), as are any sequences
  • Page 265 and 266: Table 6.1 Summary statistics of dat
  • Page 267 and 268: Materials and Methods Clustering of
  • Page 269 and 270: Table 6.2 Details of contigs produc
  • Page 271 and 272: Where the UT+Psp2126 dataset was as
  • Page 273 and 274: the number of reads, the read and g
  • Page 275 and 276: Figure 6.1 The effect of increasing
  • Page 277 and 278: Figure 6.2 The effect of increasing
  • Page 279 and 280: Across the range of size ratios, wi
  • Page 281 and 282: The marked decrease in combined len
  • Page 283 and 284: Table 6.3 Details of contigs produc
  • Page 285 and 286: Table 6.4 Details of contigs produc
  • Page 287 and 288: Table 6.5 Details of contigs and is
  • Page 289 and 290: Table 6.6 CPU time in seconds taken
  • Page 291 and 292: the TNF/k-means approach, compared
  • Page 293: Effect of clustering on speed of as
  • Page 296: Once again, the existence of a true
  • Page 299 and 300: Discussion The development of new t
  • Page 301 and 302: metagenomic studies, which require
  • Page 303 and 304: Future directions Sequence features
  • Page 305 and 306: clustered using SOMs previously tra
  • Page 307 and 308: cluster variance), prior knowledge
  • Page 309 and 310: Appendix A Annotated reproductions
  • Page 311 and 312: Appendix A-2 shortSeqCutter.pl #! /
  • Page 313 and 314: } my $tcount = (($seq->seq) =~ tr/T
  • Page 315 and 316: } close OUTPUTFH; #subroutine for w
  • Page 317 and 318: #check that no unknown feature type
  • Page 319 and 320: } else { } $tetrahash{$tetraseq} =
  • Page 321 and 322: #if sequence is longer than the sho
  • Page 323 and 324: $subtetrahash{$tetraseq}; $revsubte
  • Page 325 and 326: my @distVector; my $numSeqs = 0; my
  • Page 327 and 328: } } $distCheck++; #calculate observ
  • Page 329 and 330: } } } } print STATSFILE ("\n"); #..
  • Page 331 and 332: my @values; #read files and prepare
  • Page 333 and 334: Appendix A-6 claraAnalysisMulti.pl
  • Page 335 and 336: } print "\nCluster Sizes:\n"; my $c
  • Page 337 and 338: my $i = 1; my %clusMaxima; my %spec
  • Page 339 and 340: #calculate total number of reads in
  • Page 341 and 342: } #calculate mean precision and rec
  • Page 343 and 344: #read lines from SAM file my ($samL
  • Page 345 and 346: Appendix A-10 partClustering.pl #!
  • Page 347 and 348: if ($method eq "fuzzyk" || $method
  • Page 349 and 350: } my $junkHeading = shift(@idLines)
  • Page 351 and 352: } } elsif ($reads{$id} eq "Singleto
  • Page 353 and 354: {$contigID}{"Length"}; {$contigID}{
  • Page 355 and 356: Appendix A-12 randomSeqFetcher.pl #
  • Page 357 and 358: my $c2count = 0; #randomly assign a
  • Page 359 and 360: Appendix B • Appendix B-1: a tabl
  • Page 361 and 362: Taxon Genome size Reads used Total
  • Page 363 and 364: Taxon Genome size Reads used Total
  • Page 365 and 366: Taxon Genome size Reads used Total
  • Page 367 and 368: Taxon Genome size Reads used Total
  • Page 369 and 370: Taxon Genome size Reads used Total
  • Page 371 and 372: Taxon Genome size Reads used Total
  • Page 373 and 374: Appendix B-2 A table detailing the
  • Page 375 and 376: Species Genus Family Order Class Ph
  • Page 377 and 378: Species Genus Family Order Class Ph
  • Page 379 and 380: Species Genus Family Order Class Ph
  • Page 381 and 382: Species Genus Family Order Class Ph
  • Page 383 and 384: Species Genus Family Order Class Ph
  • Page 385 and 386: Species Genus Family Order Class Ph
  • Page 387 and 388: Species Genus Family Order Class Ph
  • Page 389 and 390: Species Genus Family Order Class Ph
  • Page 391 and 392: Abbreviation Term Definition FCM Fu
  • Page 393 and 394: Bibliography Abe, T., S. Kanaya, et
  • Page 395 and 396: Computational Molecular Biology, Pr
  • Page 397 and 398: Halkidi, M., Y. Batistakis, et al.
  • Page 399 and 400: Le, S. Q. and R. Durbin (2011). "SN
  • Page 401 and 402: Ng, A. Y., M. I. Jordan, et al. (20
  • Page 403 and 404: Schloss, P. D. and J. Handelsman (2
  • Page 405 and 406: Microbial Genomics: Bioinformatics
  • Page 407: Zhang, R. and C.-T. Zhang (2004). "
