Data integration in microbial genomics ... - Jacobs University


3. CDinFusion

The Genomic Standards Consortium (GSC) promotes checklists and standards to better describe our sequence data collection and to promote the capture, exchange and integration of sequence data with contextual data. In a recent community effort, the GSC developed a series of recommendations for the contextual data that should be submitted along with sequence data. Specialized software tools are needed to help the scientific community significantly enhance the quality and quantity of contextual data in the public sequence data repositories. In this work we present CDinFusion, a web-based tool to integrate contextual and sequence data in (Multi)FASTA format prior to submission. The tool is open source and available under the GNU Lesser General Public License, version 3. A public installation is hosted and maintained at the Max Planck Institute for Marine Microbiology at http://www.megx.net/cdinfusion. The tool may also be installed locally using the open source code available at http://code.google.com/p/cdinfusion.

3.2 Introduction

The introduction of the first deoxyribonucleic acid (DNA) sequencing methods in 1977 marked a major breakthrough in the life sciences [Gilbert and Maxam, 1973, Sanger et al., 1977]. Subsequent developments in these technologies allow the routine sequencing of organismal genomes, metagenomes and marker genes from all domains of life. Genomic information can be seen as the 'blueprint' of life, and being able to decode and interpret it grants insight into life's fundamental mechanisms [Moxon and Higgins, 1997, Henry et al., 2010]. However, microbes pose a challenge to genomic description, as the vast majority of microbial life cannot readily be isolated in pure culture [Amann et al., 1995, Curtis et al., 2002]. The rise of cultivation-independent approaches such as metagenomics and the sequencing of marker genes addresses this limitation [Handelsman, 2004].
In these approaches, bulk DNA is extracted from an environmental sample, and either specific genes are amplified and sequenced or random sequencing is performed. The result is a fragmented, but cultivation-independent, overview of an environment's biological diversity and functional potential [Pruesse et al., 2007, Ratnasingham and Hebert, 2007].

Early on, scientists recognized the necessity of sharing sequence data to facilitate reuse, reproducibility and comparisons. Data sharing has since become an integral part of the research and publication process. In the 'Bermuda Principles', drawn up at the first international strategy meeting on human genome sequencing in 1996, it was agreed that all human genomic sequence information generated by centers funded for large-scale human sequencing should be freely available in the public domain, to encourage research and to maximize its benefits to society. At the Fort Lauderdale meeting in 2003, organized by the Wellcome Trust, it was finally agreed that all kinds of sequencing data analyzed in scientific publications should be deposited in public databases. Over the past two decades, the amount of sequence data submitted to the world's largest public nucleotide sequence data repository, the INSDC (International Nucleotide Sequence Database Collaboration, comprising DDBJ (DNA Data Bank of Japan), ENA (European Nucleotide Archive), and GenBank), has grown exponentially [Stratton et al., 2009]. More recently, Next Generation Sequencing (NGS) technologies [Mardis, 2008] have allowed even faster and more economical sequence generation, resulting in an unprecedented accumulation of sequences.

Despite the impressive magnitude of sequence data generation, numerous life science studies have shown that contextual (meta)data (CD) are crucial for interpreting these sequences [DeLong et al., 2006, Fuhrman et al., 2006, Schriml et al., 2010]. CD are metadata about features such as the environmental origin of the sequences and the processing steps that were applied to obtain them. They range from geographic location (latitude, longitude), sampling time and habitat, through the experimental procedures used to obtain the sequences, up to video data recorded during sampling. However, latitude and longitude (INSDC: lat_lon) and time (INSDC: collection_date), which submitters have been able to deposit in the public repositories for years, have so far been reported in only 7.3% and 7.2% of all submissions, respectively [Hankeln et al., 2010]. This strongly suggests that depositing these data is hampered. Common reasons are: 1) no clear descriptors exist to guide submitters on which metadata should be deposited and 2) no appropriate tools exist that support the combined submission of sequence data and CD. These concerns have recently prompted the Genomic Standards Con-
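To make the idea of combined submission concrete, the following minimal Python sketch attaches contextual data to every header of a (Multi)FASTA file. It is an illustration only and does not reproduce CDinFusion's actual implementation or output format: the bracketed `[key=value]` header convention is an assumption, the function names `read_fasta`, `annotate_header` and `fuse` are hypothetical, and the coordinates and date are made-up example values. Only the qualifier names `lat_lon` and `collection_date` are taken from the text above.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate_header(header: str, contextual: Dict[str, str]) -> str:
    """Append contextual data as [key=value] tags (assumed convention)."""
    tags = " ".join(f"[{key}={value}]" for key, value in contextual.items())
    return f"{header} {tags}" if tags else header


def fuse(lines: Iterable[str], contextual: Dict[str, str]) -> str:
    """Return FASTA text in which every header carries the contextual data."""
    out = []
    for header, sequence in read_fasta(lines):
        out.append(">" + annotate_header(header, contextual))
        out.append(sequence)
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    contextual_data = {
        "lat_lon": "54.18 N 7.90 E",
        "collection_date": "2010-03-15",
    }
    fasta_lines = [">seq1 sample A", "ACGT", "TTGA", ">seq2", "GGCC"]
    print(fuse(fasta_lines, contextual_data), end="")
```

Applying the same contextual data to every record keeps the sketch short. A real submission would likely need per-sequence values and validation of each field against the relevant checklist.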
