Molecular Biology Databases - CNB - Protein Design Group

MOLECULAR BIOLOGY DATABASES
Juan Carlos Sánchez Ferrero
Centro Nacional de Biotecnología, CSIC
July 2008


GROWING NUMBER OF DATA
• Molecular biology data explosion in the “omics” era: genome sequencing, high-throughput proteomics, structural genomics, functional genomics
[Figure: nucleotide sequences (GenBank) and protein structures (PDB) by year, 1990 to 2006]
• Bioinformatics: to uncover relevant biological information hidden in the huge amount of biological data
• All that information (primary data and derived data) is stored in databases


Ouellette, 2000


PRIMARY BIOLOGICAL DATA SOURCES
• Genome sequencing projects: genome sequences
• Functional genomics: gene expression data
• Proteomics: protein catalogues, post-translational modifications
• Structural genomics: protein structures
• Functional interactions between cellular components: gene regulatory networks
• Physical interactions between proteins: protein interaction networks
• Diverse experimental data: unstructured information => publications


GROWING NUMBER OF DATABASES
The number of molecular biology databases is also increasing rapidly
• To a lesser extent, because of the huge amount of data and new types of biological data
• For the most part, because of database specialization, e.g. by taxonomic group


Environmental genome shotgun sequencing of the Sargasso Sea. Venter et al., Science, March 2004.
● 1.045 billion nucleotides
● 1.2 million new genes
● 782 new rhodopsin-like photoreceptors
● 1800 genomic species


HOW TO ACCESS DATABASES
Databases are organized:
• As flat files with all the information in a single text file
• As relational databases (based on MySQL, Oracle, etc.)
Users can:
• Query a database through a web form and get the information on the screen (HTML) or download it as text (e.g. sequences) or in other formats such as XML
• Download all or the most relevant information of a database as text or XML


CLASSIFICATION OF MOLECULAR BIOLOGY DATABASES
First classification scheme: based on the source of the database content (core data)
• Primary DBs: the content consists of experimentally obtained data
• Secondary DBs: the content is the result of analyses of data in primary DBs


CONTENT OF PRIMARY DATABASES
Experimentally derived information:
• Nucleic acid sequences: complete genomes, cloned genome fragments, cDNAs, ESTs, SNPs, small RNAs
• Protein or nucleic acid structures: atomic coordinates obtained by NMR and X-ray crystallography
• Transcript or protein expression data: obtained in microarray experiments and by proteomic approaches, respectively
• Cellular processes, such as experimentally determined metabolic or regulatory pathways


CONTENT OF SECONDARY DATABASES
Predictions or interpretations based on information contained in primary databases:
• Protein sequences, deduced from nucleotide sequences
• Alignments of protein or nucleic acid sequences
• Protein families, inferred by sequence similarity or by the presence of common motifs or domains
• Protein families, inferred by structural similarity
• Reconstructed (predicted) cellular processes, such as metabolic pathways


CLASSIFICATION OF MOLECULAR BIOLOGY DATABASES
Second classification scheme: follows the well-known layout used to describe the levels of organization of protein structures.
● Primary DBs: the content consists of SEQUENCES (primary structure of nucleic acids or proteins).
● Secondary DBs: the content consists of PATTERNS (regions of local regularity, for example, conserved motifs or domains).
● Structure DBs: the content consists of sets of ATOMIC COORDINATES (three-dimensional packing of secondary structure elements).
This scheme applies only to biological molecules (for example, a database of results from microarray expression experiments would not fit).


DATABASE CONTENT & ANNOTATIONS
Annotations are additional information that complements the DB content.
Annotations can refer to:
• Authorship of the entry
• Experimental conditions
• Source of the biological material
• Subcellular location
• Molecular function or cellular process
• Bibliographic references
• Cross-references: entry attributes that refer to related entries in other databases
Some annotations are information of the secondary type, since they are predictions, and some of them have been transferred from other databases.


DATABASE SEARCHES: QUERY TYPES
TEXT QUERIES
This type of fixed-form query is a search performed against the ANNOTATIONS, which, almost by definition, consist of text.
It is usual to allow the combination of words with Boolean operators (and, or, not), the use of wildcards (*), and also the specification of the search field, for example:
ponB and ayala* [AUTH]
This would result in a search for the word ponB in any field, combined with a search for the word ayala* in the author field.
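
The query above can be sketched in a few lines of Python. This is only a toy illustration, not how Entrez or SRS evaluate queries: the records, the field name TITL and the helper functions are hypothetical, and only the field tag AUTH comes from the slide.

```python
from fnmatch import fnmatchcase

# Hypothetical annotation records; the entries and the TITL field are made up.
ENTRIES = [
    {"id": "E1", "AUTH": "Ayala J", "TITL": "Cloning of the ponB gene"},
    {"id": "E2", "AUTH": "Smith A", "TITL": "The ponB promoter region"},
    {"id": "E3", "AUTH": "Ayala-Serrano J", "TITL": "Penicillin-binding proteins"},
]

def matches(entry, pattern, field=None):
    """True if any word in the given field (or in any field) matches the wildcard pattern."""
    fields = [field] if field else [key for key in entry if key != "id"]
    words = " ".join(entry.get(f, "") for f in fields).lower().split()
    return any(fnmatchcase(word, pattern.lower()) for word in words)

def search(entries):
    """Evaluate: ponB and ayala* [AUTH]"""
    return [
        entry["id"]
        for entry in entries
        if matches(entry, "ponB") and matches(entry, "ayala*", field="AUTH")
    ]

print(search(ENTRIES))
```

Only the first record satisfies both conditions: the second lacks a matching author and the third never mentions ponB.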


DATABASE SEARCHES: QUERY TYPES
QUERIES by CONTENT
This type of fixed-form query is a search performed against the CONTENT or CORE DATA, which, almost by definition, consists of abstract representations:
● Strings of characters that represent nucleotide or protein sequences
● Tables of atomic coordinates that represent three-dimensional objects
● Bitmap files that represent 2D gel images


DATABASE SEARCHES: QUERY TYPES
QUERIES by CONTENT
For example, in the case of a sequence database, we may be asking:
Does a sequence exactly like:
LLLIHRLH
or similar to it, exist in the database?
Because biological sequences change over time, or between individuals, this type of search is not a matter of finding exact matches in strings of characters. On the contrary, it must consider the principles of molecular evolution.
A number of algorithms have been developed to cope with that task, and BLAST is the most popular.
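
The difference between exact and similar matches can be sketched with a deliberately naive search. This is not BLAST: it only counts identical positions in ungapped windows, whereas real tools use substitution matrices, gaps and heuristics. The database sequences, the function names and the 0.7 identity threshold are all assumptions made for illustration; only the query LLLIHRLH comes from the slide.

```python
def best_ungapped_match(query, subject):
    """Slide the query along the subject; return (identities, offset) of the best window."""
    best = (0, -1)
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        identities = sum(a == b for a, b in zip(query, window))
        if identities > best[0]:
            best = (identities, offset)
    return best

# Hypothetical protein database (made-up sequences).
DATABASE = {
    "seq_exact": "MKTLLLIHRLHAAG",
    "seq_similar": "MKTLLVIHRMHAAG",
    "seq_unrelated": "GGSPQWEDNTTACY",
}

def search(query, database, min_identity=0.7):
    """Return (name, offset, identity fraction) for sequences above the threshold."""
    hits = []
    for name, sequence in database.items():
        identities, offset = best_ungapped_match(query, sequence)
        fraction = identities / len(query)
        if fraction >= min_identity:
            hits.append((name, offset, fraction))
    return sorted(hits, key=lambda hit: -hit[2])

for hit in search("LLLIHRLH", DATABASE):
    print(hit)
```

An exact string search would report only seq_exact; the identity-based search also recovers seq_similar, which differs from the query at two positions.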


NUCLEOTIDE DATABASES
• International collaboration among the three main nucleotide sequence databases: NCBI GenBank, EMBL Nucleotide Sequence Database and DNA Data Bank of Japan (DDBJ)
• Every sequence that is cited in a publication must be submitted to one of these databases and made publicly available
• Submissions from individual laboratories and batch submissions from large-scale sequencing projects
• Daily exchange of information between databases


NUCLEOTIDE DATABASES
GenBank
• Maintained at the National Center for Biotechnology Information (NCBI)
• More than 61 million nucleotide sequences from ~240,000 organisms
• Diverse sources: expressed sequence tag (EST), high throughput genomic (HTG), environmental sample (ENV), whole genome shotgun (WGS), ...
• Features with biological significance such as coding regions and their translations, transcription units, repeat regions, and sites of mutations
• Complete bimonthly releases and daily updates
• Accessible through Entrez (NCBI's search and retrieval system) and FTP
• RefSeq: comprehensive, integrated, non-redundant set of sequences including genomic DNA, transcripts and proteins


NUCLEOTIDE DATABASES
EMBL Nucleotide Sequence Database
• Maintained at the European Bioinformatics Institute (EBI)
• 80.5 million entries from ~260,000 organisms (as of Sept. 2006)
• Entry types include standard (STD), constructed (CON), third party annotation (TPA), whole genome shotgun (WGS), annotated constructed (ANN) and mass genome annotation library (MGA)
• Complete releases every three months
• Accessible through the EBI Sequence Retrieval System (SRS), other web services and FTP


GENOME DATABASES
Ensembl, EBI: mostly vertebrates
Entrez Genome, NCBI: eukaryotes, prokaryotes and viruses
UCSC Genome Browser: mostly animals
The Institute for Genomic Research (TIGR): bacteria, fungi, parasites, plants


GENOME DATABASES
Ensembl
• Joint project between EMBL-EBI and the Sanger Institute
• Comprehensive source of annotation of (mostly) chordate genome sequences
• Automated annotation system of genes from unannotated species
• Currently 37 genomes (Feb. 2008)
• Features include:
  • Chromosome maps
  • Contig views
  • Gene predictions
  • Annotations
  • Gene structures (exons-introns)
  • Regulatory regions
  • SNPs
  • mRNAs
  • Peptides
  • Comparative genomics and evolutionary trees
Hubbard et al. (2007) Nucl. Acids Res. 35:D610-D617


GENOME DATABASES
Entrez Genome
• Includes complete chromosomes, organelles and plasmids as well as draft genome assemblies
• Chromosome views, contig maps, sequence maps (NCBI Map Viewer)
• Integrated with Entrez Nucleotide and Entrez Protein
• Entrez Genome Project: 22 eukaryotic and 646 prokaryotic complete genomes (Feb. 2008)


GENOME DATABASES
OTHER SPECIALIZED DATABASES
• Rat Genome Database: laboratory rat, Rattus norvegicus
• VectorBase: invertebrate vectors of human pathogens (e.g. Anopheles)
• FlyBase: genus Drosophila
• BeetleBase: genus Tribolium
• WormBase: C. elegans and C. briggsae
• PlasmoDB: Plasmodium falciparum
• TAIR: Arabidopsis thaliana
• Candida Genome Database: Candida albicans
• Saccharomyces Genome Database, CYGD: Saccharomyces cerevisiae
• EcoCyc: Escherichia coli
• PBRC: family Poxviridae


GENE-CENTERED DATABASES
Complete information about a gene: from genomic location to protein function
Entrez Gene: integrated with NCBI Entrez, contains genes from multiple genomes
GeneCards: only human genes, maintained by the Weizmann Institute of Science


GENE-CENTERED DATABASES
• Genomic context
• Transcripts and links to Entrez Nucleotide
• Proteins and links to Entrez Protein
• Functional information extracted from PubMed
• Protein interactions
• Gene Ontology terms


PROTEIN DATABASES
Non-redundant data from SwissProt, PIR, PDB, and translations of all coding sequences present in EMBL/GenBank/DDBJ
Sources: protein databases including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq


PROTEIN DATABASES
• Consortium formed by the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB)
• The world's most comprehensive catalog of information on proteins


PROTEIN DATABASES
UniProtKB/Swiss-Prot
• Manually curated protein sequences from all organisms
• Functional information with bibliographic references
• Domains and sites
• Secondary and tertiary structures
• Post-translational modifications
• 356,194 entries in UniProtKB/Swiss-Prot (Feb. 2008)


PROTEIN DATABASES
UniProtKB/TrEMBL
• Translations of all coding sequences present in EMBL/GenBank/DDBJ
• Automatic annotation and classification
• Proteins that get manually annotated go to UniProtKB/Swiss-Prot
• 5,395,414 entries in UniProtKB/TrEMBL (Feb. 2008)


STRUCTURE DATABASES
Protein Data Bank
• RCSB PDB (USA), MSD-EBI (Europe), PDBj (Japan) and BMRB (USA)
• To maintain a single Protein Data Bank Archive of macromolecular structural data that is freely and publicly available
• Protein and nucleic acid structures
• X-ray crystallography, NMR and electron microscopy
• More than 49,000 structures (Feb 2008)


STRUCTURE DATABASES
Protein structure classification databases
• Structural Classification of Proteins (SCOP)
  • Folds, superfamilies and families of structural domains
  • Classified by manual inspection
  • Currently more than 34,000 PDB structures (Feb 2008)
• CATH: hierarchical classification of protein domain structures
  • Class (C), Architecture (A), Topology (T) and Homologous (H) superfamily
  • Combination of automated and manual procedures
  • More than 30,000 classified PDB structures (Feb 2008)


PROTEIN DOMAIN DATABASES
Protein domains
• Elements that usually fold independently of the rest of the protein chain
• They often play an important role in the biological function of the protein
Databases differ in how domains are defined:
• Domains defined with sequence patterns and profiles
• Domains defined with hidden Markov models (HMM)
• Domains defined with HMMs, and Pfam domains
• Domain definitions from multiple databases
• Domain definitions from Pfam, SMART and COG


PROTEIN DOMAIN DATABASES
• Around 9,300 protein families/domains
• Covers 73% of proteins in UniProtKB
• Tool for searching domains in protein sequences
• Domains are defined with HMM profiles
• Multiple sequence alignments
• Protein domain architectures
• Species distribution
• Cross-references with protein sequence and structure databases
• Maintained at the Sanger Institute


GENE EXPRESSION DATABASES
Gene expression profiling studies, array comparative genomic hybridization, chromatin immunoprecipitation on arrays (ChIP-chip) studies, etc.
Experiment-centered and gene-centered views
Gene Expression Omnibus (GEO)
• Maintained by NCBI
• More than 8,000 experiments
ArrayExpress
• Maintained by the EBI
• Around 3,300 experiments
• Link to Expression Profiler, an online data analysis tool


PROTEIN INTERACTION DATABASES
Information on interaction partners, interaction type and experimental evidence

• Experimental protein-protein interactions (literature)
• Direct (physical) and indirect (functional) associations
• More than 62,000 proteins and 162,000 interactions
• Maintained at the EBI

• Known and predicted protein-protein interactions
• Direct and indirect associations
• More than 1.5 million proteins from 373 organisms
• At EMBL and UniZH


PATHWAY DATABASES
Display metabolic and signaling pathways and how genes are functionally connected
• Comprehensive catalogue of genes and pathways at Kyoto University
• Human signaling pathways database by Nature and the National Cancer Institute
• Metabolic pathways, mostly in microorganisms and plants


BIBLIOGRAPHIC DATABASES
Published experimental information
• Over 16 million citations from MEDLINE and other life science journals
• Links to full text articles
• Links to related articles
• Links to gene information and other NCBI databases
• US National Library of Medicine


ONTOLOGIES IN MOLECULAR BIOLOGY
Controlled and structured vocabularies, constructed with two purposes:
• Proposing standard collections of terms for annotation
• Organizing the knowledge of a given field around its language
Examples of ontologies:
• Enzyme Commission Nomenclature: EC numbers for enzymes
• MeSH (Medical Subject Headings) terms: NLM controlled vocabulary
• Gene Ontology: controlled vocabulary to describe gene and gene product attributes in any organism


ONTOLOGIES IN MOLECULAR BIOLOGY
● Enzyme Commission Nomenclature.
EC 1.-.-.-   Oxidoreductases.
EC 1.1.-.-   Acting on the CH-OH group of donors.
EC 1.1.1.-   With NAD(+) or NADP(+) as acceptor.
EC 1.1.2.-   With a cytochrome as acceptor.
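
The hierarchy encoded in EC numbers can be made explicit with a short sketch. The function name and the EC_NAMES lookup table are hypothetical; the four class descriptions are the ones listed on the slide, and the code assumes the usual four-level notation in which "-" marks an unspecified level.

```python
# Class descriptions taken from the slide; a real table would be far larger.
EC_NAMES = {
    "1.-.-.-": "Oxidoreductases",
    "1.1.-.-": "Acting on the CH-OH group of donors",
    "1.1.1.-": "With NAD(+) or NADP(+) as acceptor",
    "1.1.2.-": "With a cytochrome as acceptor",
}

def ec_ancestors(ec_number):
    """Return the EC classes that contain ec_number, from most general to most specific."""
    levels = ec_number.replace("EC", "").strip().split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four levels, got {ec_number!r}")
    ancestors = []
    for depth in range(1, 5):
        if levels[depth - 1] == "-":
            break  # remaining levels are unspecified
        ancestors.append(".".join(levels[:depth] + ["-"] * (4 - depth)))
    return ancestors

for code in ec_ancestors("EC 1.1.1.1"):
    print(code, EC_NAMES.get(code, "(not in this small table)"))
```

Each enzyme number therefore carries its own classification path, which is what makes EC numbers usable as a controlled vocabulary for annotation.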


GENE ONTOLOGY
Developed and maintained by the GO Consortium of laboratories and institutions involved in molecular biology database management
Gene Ontology terms have been grouped in three ontologies:
• Molecular functions
• Biological processes
• Cellular components
Most molecular biology databases have joined this initiative, and have included annotations following this standard


GENE ONTOLOGY
Annotators
• Database curators
• Gene Ontology Annotation (GOA) database: central repository, applies this controlled vocabulary to a non-redundant set of proteins
GO browsers
• Search GO terms for a gene/protein
• Retrieve genes/proteins associated with a GO term


GenBank and EMBL
● GenBank and the EMBL nucleotide databases were founded in 1986.
● Contain ALL DNA sequences ever published.
● Submissions are made by the laboratories or centers that obtain them.
● Depositing sequences in GenBank / EMBL is a requirement of most journals for accepting manuscripts in which new sequences are reported.
● In August 2003 they contained:
  ● 18,197,000 sequence files
  ● 22,617,000,000 nucleotides
● Every two months a new complete version of the database is published.


Organismal Divisions in GenBank
PRI  Primate
ROD  Rodent
MAM  Mammalian
VRT  Vertebrate
INV  Invertebrate
PLN  Plant
BCT  Bacterial
RNA  Structural RNA
VRL  Viral
PHG  Phage
SYN  Synthetic
UNA  Unannotated

Functional Divisions in GenBank
PAT  Patent
EST  Expressed Sequence Tags
STS  Sequence Tagged Sites
GSS  Genome Survey Sequences
HTG  High Throughput Genome


• Entrez is the indexing and data retrieval system developed by the NCBI
• The Entrez Global Query page searches NCBI Entrez databases, either individually or globally


MAIN DATABASES ACCESSIBLE WITH ENTREZ
PubMed: Bibliographic references in Molecular Biology and Medicine.
Nucleotide: Composite database of DNA sequences from GenBank, EMBL and DDBJ, plus other databases or projects such as RefSeq. RefSeq contains nucleotide sequences from the Nucleotide database that have been curated or re-annotated by the NCBI.
Protein: Protein sequences derived from translation of DNA sequences in GenBank, EMBL and DDBJ, plus sequences from PIR, SWISSPROT and Protein Data Bank (PDB).
Genome: Complete genomes, chromosomes, contig maps, physical maps.


MAIN DATABASES ACCESSIBLE WITH ENTREZ
Entrez Gene: Locus-centered database that integrates information from other databases (replaces LocusLink).
Structure: (Molecular Modeling Database, MMDB) Experimentally obtained structures from Protein Data Bank (PDB).
Taxonomy: Names and taxonomy of organisms that have at least one sequence in the NCBI databases.
OMIM: Online Mendelian Inheritance in Man, catalog of human mutations and associated diseases.


SEQUENCE FILES
FLAT FILE FORMATS
For sequence databases, the main formats are:
• FASTA
• GenBank
• EMBL and SwissProt


FASTA format
The leading ">" is essential. In this example the header line contains, in order: GI, Accession.version, Locus, and additional information.
>gi|1941915|emb|X93081.1|BSBOFCGEN B.subtilis bofC gene
CTGCAGCGGCTGACAATAGCAGGCCGACAACGGTTGAGGTGTCAACAGCTGATTTTGTGA
TGAAGGATAAACCGCATTTCTTTTTCCTTGAACGCTATAAGGATTCATATGAGGAGGAGA
TTCTCCGTTTTGCAGAAGCGATCGGCACAAACCAGGAGACTCCCTGCACCGGCAATGACG
GTTTACAGGCCGGGAGGATCGCCAGAGCAGCACAGCAATCGCTTGCTTTTGGCATGCCTG
TTAGCATTGAGCACACTGAAAAAATCGCTTTTTAATCTAACAGGATTACAATTCAGCAAG
CTTGGGTATATACTCCATTGATACTTTAAGTAGGCGGTGGAGAAAATGAATACAGTACAT
GCTAAAGGAAATGTTTTGAACAAAATCGGAATTCCTTCTCACATGGTTTGGGGTTATATT
GGCGTTGTCATCTTTATGGTTGGAGACGGCCTCGAACAAGGCTGGCTGTCTCCTTTTCTC
GTTGATCATGGTCTCAGTATGCAGCAATCCGCATCGTTATTTACCATGTACGGCATTGCT
GTCACCATCTCAGCTTGGCTTTCAGGAACGTTTGTGGAAACTTGGGGGCCGAGAAAAACG
ATGACTGTCGGATTGCTTGCATTTATCCTC
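
A minimal sketch of reading this format in Python follows. The function names are hypothetical, and parse_header assumes the five pipe-separated identifier fields of the header shown on the slide; headers from other sources are laid out differently and would need their own handling.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def parse_header(header):
    """Split a header of the form gi|GI|db|ACCESSION.VERSION|LOCUS description."""
    identifiers, _, description = header.partition(" ")
    _, gi, database, accession_version, locus = identifiers.split("|")
    return {
        "gi": gi,
        "database": database,
        "accession_version": accession_version,
        "locus": locus,
        "description": description,
    }

EXAMPLE = """>gi|1941915|emb|X93081.1|BSBOFCGEN B.subtilis bofC gene
CTGCAGCGGCTGACAATAGCAGGCCGACAACGGTTGAGGTGTCAACAGCTGATTTTGTGA
TGAAGGATAAACCGCATTTCTTTTTCCTTGAACGCTATAAGGATTCATATGAGGAGGAGA
""".splitlines()

for header, sequence in read_fasta(EXAMPLE):
    print(parse_header(header)["accession_version"], len(sequence))
```

Sequence lines are simply concatenated, so the line width used in the file does not matter to the reader.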


GBFF: HEADER
LOCUS       BSBOFCGEN    2664 bp    DNA    linear    BCT    15-APR-1997
DEFINITION  B.subtilis bofC, orf1, csbX, and orf4 genes.
ACCESSION   X93081
VERSION     X93081.1  GI:1941915
KEYWORDS    bofC gene; csbX gene; ORF1; ORF4.
SOURCE      Bacillus subtilis
  ORGANISM  Bacillus subtilis
            Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
REFERENCE   1
  AUTHORS   Gomez,M. and Cutting,S.M.
  TITLE     BofC encodes a putative forespore regulator of the Bacillus
            subtilis sigma K checkpoint
  JOURNAL   Microbiology 143 (Pt 1), 157-170 (1997)
  MEDLINE   97177783
  PUBMED    9025289
REFERENCE   2  (bases 1 to 2664)
  AUTHORS   Cutting,S.M.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-NOV-1995) S.M. Cutting, Dept. of Microbiology,
            University of Pennsylvania School of Medicine, 346 Johnson
            Pavillon, 3610 Hamilton Walk, Philadelphia, PA 19104-6076, USA


GBFF: FEATURES
FEATURES             Location/Qualifiers
     source          1..2664
                     /organism="Bacillus subtilis"
                     /strain="PY79"
                     /isolate="168"
                     /db_xref="taxon:1423"
                     /germline
     gene            1..275
                     /gene="orf1"
     CDS


GBFF: SEQUENCE
BASE COUNT      670 a    518 c    690 g    786 t
ORIGIN
        1 ctgcagcggc tgacaatagc aggccgacaa cggttgaggt gtcaacagct gattttgtga
       61 tgaaggataa accgcatttc tttttccttg aacgctataa ggattcatat gaggaggaga
      121 ttctccgttt tgcagaagcg atcggcacaa accaggagac tccctgcacc ggcaatgacg
      181 gtttacaggc cgggaggatc gccagagcag cacagcaatc gcttgctttt ggcatgcctg
      241 ttagcattga gcacactgaa aaaatcgctt tttaatctaa caggattaca attcagcaag
      301 cttgggtata tactccattg atactttaag taggcggtgg agaaaatgaa tacagtacat
      361 gctaaaggaa atgttttgaa caaaatcgga attccttctc acatggtttg gggttatatt
      421 ggcgttgtca tctttatggt tggagacggc ctcgaacaag gctggctgtc tccttttctc
      481 gttgatcatg gtctcagtat gcagcaatcc gcatcgttat ttaccatgta cggcattgct
      541 gtcaccatct cagcttggct ttcaggaacg tttgtggaaa cttgggggcc gagaaaaacg
      601 atgactgtcg gattgcttgc atttatcctc ggttcggccg cttttatcgg ctgggcgatt
      661 cctcatatgt attatccggc tctcttgggc agctatgctc ttagaggctt gggatatccg
      721 ctgtttgcat actcttttct cgtatgggtg tcatacagca cctctcaaaa tattcttgga
      781 aaagccgtcg gctggttttg gtttatgttt acgtgcggcc ttaacgtgct cggtccgttc
      841 tattccagct atgcagttcc ggcctttgga gaaatcaata cgctttggag cgctttactg
      901 tttgtggcgg caggcggaat tcttgcctta ttttttaaca aagataaatt tactccgata
      961 caaaaacaag atcagccgaa atggaaagaa ctgtcgaagg catttacgat tatgtttgaa
     1021 aaccctaagg taggcatcgg cggagtggtc aagacgatta atgcgatagg acaatttgga
     1081 tttgccatct ttcttcctac ttatttagca cgatacgggt attcggtttc ggaatggctg
     1141 caaatatggg ggactctgtt ttttgtgaat
//
LOCUS       BSOTHERGENE  4356 bp    DNA    linear    BCT    15-APR-1997
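
Extracting the plain sequence from the ORIGIN block is mostly a matter of discarding the position numbers and the spaces between ten-base blocks. The sketch below is one way to do it; the function name is hypothetical and it assumes a single record whose sequence section starts at the ORIGIN line and ends at the "//" terminator, as on the slide.

```python
from collections import Counter

def origin_sequence(record_lines):
    """Return the sequence stored between ORIGIN and // in one GenBank flat file record."""
    in_origin = False
    blocks = []
    for line in record_lines:
        if line.startswith("//"):
            break  # end of record
        if in_origin:
            # First token is the position number, the rest are ten-base blocks.
            blocks.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
    return "".join(blocks)

EXAMPLE = """ORIGIN
        1 ctgcagcggc tgacaatagc aggccgacaa cggttgaggt gtcaacagct gattttgtga
       61 tgaaggataa accgcatttc tttttccttg aacgctataa ggattcatat gaggaggaga
//
""".splitlines()

sequence = origin_sequence(EXAMPLE)
print(len(sequence))
print(Counter(sequence))
```

On a complete record, the Counter result could be compared with the BASE COUNT line as a consistency check.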


EMBL AND SwissProt FORMAT
The field name is specified by a TWO-CHARACTER keyword at the beginning of each line, which makes it easier to parse the file and extract specific information.
ID   BSBOFCGEN standard; genomic DNA; PRO; 2664 BP.
XX
AC   X93081;
XX
SV   X93081.1
XX
DT   15-APR-1997 (Rel. 51, Created)
DT   15-APR-1997 (Rel. 51, Last updated, Version 12)
XX
DE   B.subtilis bofC, orf1, csbX, and orf4 genes
XX
KW   bofC gene; csbX gene; ORF1; ORF4.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
XX
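
The two-character line codes make a simple parser possible, as the following sketch suggests. The function name is hypothetical, and the code assumes only what the slide shows: a line code in the first two columns, "XX" as an empty separator line, and "//" as the end of the entry. It does not interpret feature tables or sequence lines.

```python
from collections import defaultdict

def parse_embl_lines(lines):
    """Group the lines of one EMBL-style entry by their two-character line code."""
    fields = defaultdict(list)
    for line in lines:
        code, value = line[:2], line[2:].strip()
        if code == "//":
            break  # end of entry
        if code == "XX" or not code.strip():
            continue  # separator or continuation without a code
        fields[code].append(value)
    return dict(fields)

EXAMPLE = """ID   BSBOFCGEN standard; genomic DNA; PRO; 2664 BP.
XX
AC   X93081;
XX
DT   15-APR-1997 (Rel. 51, Created)
DT   15-APR-1997 (Rel. 51, Last updated, Version 12)
XX
OS   Bacillus subtilis
//
""".splitlines()

entry = parse_embl_lines(EXAMPLE)
print(entry["AC"])
print(entry["OS"])
```

Repeated codes such as DT are kept as lists, so multi-line fields are preserved in order.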


EMBL AND SwissProt FORMAT
RN   [1]
RX   MEDLINE; 97177783.
RX   PUBMED; 9025289.
RA   Gomez M., Cutting S.M.;
RT   "BofC encodes a putative forespore regulator
RT   of the Bacillus subtilis sigma K checkpoint";
RL   Microbiology 143:157-170(1997).
XX
XX
DR   GOA; O05389.
DR   GOA; O05390.
DR   GOA; O05391.
DR   GOA; O05392.
DR   SWISS-PROT; O05389; YRBE_BACSU.
DR   SWISS-PROT; O05390; CSBX_BACSU.
DR   SWISS-PROT; O05391; BOFC_BACSU.
DR   SWISS-PROT; O05392; RUVA_BACSU.


EMBL AND SwissProt FORMAT
FT   source          1..2664
FT                   /db_xref="taxon:1423"
FT                   /germline
FT                   /mol_type="genomic DNA"
FT                   /organism="Bacillus subtilis"
FT                   /strain="PY79"
FT                   /isolate="168"
FT   CDS


EMBL AND SwissProt FORMAT
XX
SQ   Sequence 2664 BP; 670 A; 518 C; 690 G; 786 T; 0 other;
     ctgcagcggc tgacaatagc aggccgacaa cggttgaggt gtcaacagct gattttgtga        60
     tgaaggataa accgcatttc tttttccttg aacgctataa ggattcatat gaggaggaga       120
     ttctccgttt tgcagaagcg atcggcacaa accaggagac tccctgcacc ggcaatgacg       180
     gtttacaggc cgggaggatc gccagagcag cacagcaatc gcttgctttt ggcatgcctg       240
     ttagcattga gcacactgaa aaaatcgctt tttaatctaa caggattaca attcagcaag       300
     cttgggtata tactccattg atactttaag taggcggtgg agaaaatgaa tacagtacat       360
     gctaaaggaa atgttttgaa caaaatcgga attccttctc acatggtttg gggttatatt       420
     ggcgttgtca tctttatggt tggagacggc ctcgaacaag gctggctgtc tccttttctc       480
//


• Sequence Retrieval System is a data warehouse developed by Lion and EBI
• It allows the connection of related information from many databases


SEQUENCE RETRIEVAL SYSTEM
• Developed successively at different institutions:
  • 1990 - Started by Thure Etzold at the EMBL
  • 1997 - EBI (Cambridge), financed in part by EMBnet
  • 1998 - Lion Biosciences
• It is not a database. It is a data warehouse. SRS uses flat text file versions of many databases (EMBL, Swiss-Prot, MEDLINE, etc.), which are copied and indexed to allow the connection of related information from several databases.
• There are many mirrors installed. The most popular is probably the one at the EBI.
• SRS offers access to more than 700 databases.


ACKNOWLEDGMENTS
This presentation contains material from previous presentations by:
- Manuel J. Gómez, Centro de Astrobiología, CSIC-INTA
- Rodrigo López, European Bioinformatics Institute
