PDF - White Rose Etheses Online

Chapter 1 - Project summary

The aim of this project is to investigate the capability of composition-based methods of sequence comparison for the grouping and separation of reads from massively parallel high-throughput sequencing of multi-species environmental samples, according to the genome from which they originate. The types of dataset of particular interest in this project are those produced from an environment containing a few different species. Such a dataset could be obtained from a sample of infected tissue, where sequencing reads would be expected to originate from the genomes of the host (e.g. a plant or insect) and the pathogen(s) (e.g. a bacterium, virus, or fungus). These datasets will generally be referred to as ‘multi-species’, rather than metagenomic, because metagenomics is typically associated with an environmental system or community on a much larger scale.

The motivation for this research is to establish the possible benefits of such a phylogenetic grouping of reads. An effective clustering may allow the species present in a sequenced sample to be identified (for example, a particular pathogen) and the sequences belonging to the genome of a particular species in the sample to be isolated. The isolation of pathogen reads from a dataset may facilitate the study of the genome of the pathogen where more conventional laboratory methods have failed. Grouping reads originating from a single genome may also improve the performance of sequence assembly methods applied to the dataset.

By using supervised methods of grouping to provide a reference comparison with known pathogens, the nature or specific identity of the infectious agent(s) in the sample can be predicted. As in metagenomics, this approach of sequencing genetic material sampled directly from the tissue removes the requirement for the pathogen to be isolated and cultured prior to analysis.

The grouping of reads by genome can also remove contaminants within the dataset, allowing the genome of the host or pathogen species to be studied more easily. The alignment-free approach to grouping of sequences does not rely on the prior availability of a full genome for either species in order to predict the reads that belong to each group. This allows the study of potentially novel pathogens in host species that are themselves not well-characterised.
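
To make the idea of a composition-based, alignment-free comparison concrete, the following is a minimal sketch of how a read might be turned into a numeric feature vector without reference to any genome. It is an illustration only, not the implementation used in this project: the function names (gc_content, kmer_frequencies, feature_vector) and the choice of word length k = 4 are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA word of length k, in a fixed order.

    Words containing ambiguous bases (e.g. N) are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def feature_vector(seq, k=4):
    """Hypothetical combined feature: GC content followed by k-mer frequencies."""
    return [gc_content(seq)] + kmer_frequencies(seq, k)
```

Because each read is described only by its own composition, two reads can be compared by the distance between their vectors, with no alignment and no reference genome required.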

Grouping of sequencing datasets using genomic signature features usually takes place after the reads have undergone assembly into longer contigs (Chen and Pachter 2005). The use of longer sequences improves the accuracy of the grouping obtained but, as has been discussed already, the increased burden of larger datasets adversely affects the performance and runtime of the assembly process. Where a dataset contains a large number of individual reads originating from multiple genomes - as is the case for an environmental sample sequenced directly - it may be beneficial in terms of assembly speed and quality to separate these reads according to a prediction of shared origin prior to assembly. Grouping reads according to the genome from which they originate and assembling the groups of reads separately reduces the time taken to assemble these reads into longer stretches of sequence, relative to considering all reads in the dataset at once. It should also result in fewer erroneous, chimeric sequences being assembled from reads obtained from multiple genomes with regions of homology.

The following work describes the investigation of a range of published sequence features and the capacity for these features to group and separate sequencing reads in a multi-species sequencing dataset. The features and their combinations are compared to determine the set that best differentiates between reads based on their species of origin. This optimal set of features is used as a basis for the comparison of a range of clustering methods in order to find the optimal combination of feature and method to group a dataset. The extent of grouping and separation of reads that was achieved is then discussed in the context of the datasets studied. The effect of such a grouping on the performance of a sequence assembly algorithm is then investigated, to determine the benefit of such an approach.
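
The workflow of separating reads by predicted origin before assembly can be sketched as follows. This is a deliberately simplified illustration under stated assumptions, not the method evaluated in this work: it uses a single feature (GC content) and a plain k-means procedure written for the example, and the names kmeans_1d and group_reads and the four toy reads are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a read."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / counted if counted else 0.0


def kmeans_1d(values, k=2, iterations=20):
    """Plain k-means on scalar values; returns one cluster index per value."""
    lo, hi = min(values), max(values)
    # Deterministic start (an assumption for reproducibility):
    # centroids are spread evenly between the smallest and largest value.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def group_reads(reads, k=2):
    """Map cluster index -> list of (read_id, sequence) pairs."""
    labels = kmeans_1d([gc_content(seq) for _, seq in reads], k)
    groups = {}
    for (read_id, seq), label in zip(reads, labels):
        groups.setdefault(label, []).append((read_id, seq))
    return groups


# Toy reads: two AT-rich and two GC-rich, standing in for two genomes.
reads = [
    ("r1", "ATATATTTAAATATAT"),
    ("r2", "GCGCGGCCGCGGGCCC"),
    ("r3", "ATTTAATATGATATAA"),
    ("r4", "GGCCGCGCGACGCGGC"),
]
groups = group_reads(reads, k=2)
```

Each resulting group could then be written to its own file and passed to an assembler separately, so that the assembler only compares reads predicted to share an origin.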

