Appendix 2 – Data Submission Checklist

Appendix 2 – Submission to public databases and compliance with appropriate data standards

1. Submission of biological data to public repositories

2. Primary Sequence Data

2.1 EMBL

2.2 GenBank

2.3 Expressed Sequence Tag (EST) sequence

3. Transcriptomics Data

3.1 Microarray Experiments

4. Proteomic data

4.1 Protein Sequence Data

5. References

1. Submission of biological data to public repositories 

This section reviews specific information for the handling and submission of 

common biological data types to public repositories including any standards 

or formats that must be adhered to. Web links for submission sites and 

programs are given where applicable. This list is based on data types 

currently being generated by EG Awardees and will be extended as 

necessary over time. 

Most public databases allow the data submitter to specify the date of public

release of the data and will provide accession numbers. Please make the

public release date available to us, along with the accession numbers of

your submission, for inclusion in the data catalogue.
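As an illustration only, the information requested above could be collected in a small record before it is sent to the data centre. The sketch below is not an EGTDC format: the class name, field names, and accession values are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SubmissionRecord:
    """One repository submission to report to the data centre (hypothetical layout)."""
    repository: str               # e.g. "EMBL", "GenBank", "ArrayExpress"
    accession_numbers: list[str]  # as issued by the repository
    public_release_date: date     # the release date specified at submission

    def summary(self) -> str:
        """Return a one-line, plain-text summary of the submission."""
        if not self.accession_numbers:
            raise ValueError("at least one accession number is required")
        accessions = ", ".join(self.accession_numbers)
        return (f"{self.repository}: {accessions} "
                f"(public release {self.public_release_date.isoformat()})")


# Placeholder values only; these are not real accession numbers.
record = SubmissionRecord("EMBL", ["XX000001", "XX000002"], date(2006, 1, 31))
print(record.summary())
```

A summary line of this kind can be pasted into an email to the helpdesk, so that the accession numbers and the release date arrive together.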

EGTDC staff are available to help with data submissions. If you have further

queries, please contact helpdesk@envgen.nox.ac.uk.

2. Primary Sequence Data 

Primary nucleotide sequence data is generally submitted directly to one of the 

three international repositories, EMBL, GenBank and DDBJ. Submission of 

EST sequences to the dbEST database is discussed below. 
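Before uploading to any of these repositories, it can be useful to run a local sanity check on the sequences. The following is a minimal sketch, assuming the sequences are held locally as FASTA text. None of the repositories requires this script, and the function names are hypothetical. It flags empty records and any characters outside the IUPAC nucleotide codes.

```python
# IUPAC nucleotide codes, including ambiguity codes and U for RNA.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_problems(text):
    """Return human-readable problems found in the FASTA text."""
    problems = []
    for header, sequence in parse_fasta(text):
        if not sequence:
            problems.append(f"{header}: empty sequence")
            continue
        bad = sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)
        if bad:
            problems.append(f"{header}: unexpected characters {''.join(bad)}")
    return problems


example = ">seq1\nACGTNNACGT\n>seq2\nACGT-XACGT\n>seq3\n"
for problem in find_problems(example):
    print(problem)
```

Note that this check treats gap characters such as `-` as problems. That suits unaligned primary sequence but would need relaxing for alignments.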

2.1 EMBL 

Information for submitters to EMBL: 

http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html 

Webin (http://www.ebi.ac.uk/embl/Submission/webin.html) is the preferred 

system for submitting nucleotide sequence and biological annotation to 

EMBL (Kanz C. et al. 2005). Single, multiple, or large numbers of 

sequences can be submitted through this interface. 


Please note that EMBL stopped accepting email submissions on January 

1, 2003. 

If you will produce a large volume of genome sequence over an extended 

period of time, please contact the EMBL database administrators at 

datasubs@ebi.ac.uk 

2.2 GenBank 

Submissions to GenBank can be done using the BankIt web submission 

tool (http://www.ncbi.nlm.nih.gov/BankIt/) or the Sequin tool 

(http://www.ncbi.nlm.nih.gov/Sequin/index.html). For simple submissions, 

BankIt is recommended (Benson D.A. et al. 2005). Sequin is available on

Bio-Linux. 

2.3 Expressed Sequence Tag (EST) sequence 

EST sequences can be submitted to the public EST repository dbEST 

(http://www.ncbi.nlm.nih.gov/dbEST/). The trace2dbEST software 

developed by the EGTDC and available on Bio-Linux can be used for EST 

processing and direct submission to dbEST 

(http://envgen.nox.ac.uk/est.html) 

3. Transcriptomics Data 

3.1 Microarray Experiments 

Microarray experiment descriptions and results should be annotated to

the MIAME standard and submitted to a public repository such as ArrayExpress

(http://www.ebi.ac.uk/arrayexpress/). Further details on the MIAME standard

can be found at http://envgen.nox.ac.uk/miame/index.html and on the 

MIAME/Env data standard at 

http://envgen.nox.ac.uk/miame/miame_env.html. 

We recommend the maxdLoad2 software, developed by the EGTDC and

installed on Bio-Linux, for annotation and preparation of a file in MAGE-ML

format suitable for submission to ArrayExpress.

The EGTDC works closely with ArrayExpress. As of March 2005, the EGTDC

recommends that microarray data be submitted via the EGTDC, so that a copy of

the annotated data is held at the data centre as well as in ArrayExpress.

Reasons for this include: 

• Functions for data retrieval and searching across datasets held in 

ArrayExpress are still under development and holding the data locally 

enables us to provide accessibility and functionality not currently 

supported by the public repository 

• Partial datasets can be submitted and held 

• Potential for searching across datasets of other types held in 

compatible databases being developed by the EGTDC 


Hence, the recommended process for submission of microarray expression 

data is as follows:

1. Use maxdLoad2 to export your experiment as maxdML 

2. Submit the maxdML file to the EGTDC 

3. The EGTDC stores the data in a central maxd database and arranges 

submission to ArrayExpress with all communications between the EBI and 

the EGTDC open to the researcher providing the dataset 

You will be issued an ArrayExpress accession number for your dataset. By

default, ArrayExpress holds back your data from public release until you have

published the data.
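Before sending an exported file to the EGTDC (step 2 of the process above), a quick local check that the file parses can save a round trip. The sketch below assumes the maxdML export is an XML document and tests well-formedness only. It does not validate against any maxdML or MAGE-ML schema, and the element names in the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, root tag) if the text parses as XML, else (False, error message)."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return False, str(error)
    return True, root.tag


# Tiny stand-in documents; the element names are invented, not real maxdML.
good = "<export><experiment name='demo'/></export>"
bad = "<export><experiment name='demo'></export>"

print(check_well_formed(good))
print(check_well_formed(bad)[0])
```

For a real export, the file contents would be read from disk and passed to `check_well_formed`. A failure here usually points to a truncated or corrupted export rather than an annotation problem.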

If you have made other arrangements to store your data and submit it

to the EBI, please advise us of the accession number so that we can complete

your data catalogue entry.

For further information please see the EGTDC MIAME-Compliance Guide 

(http://envgen.nox.ac.uk/envgen/software/archives/000527.html) 

4. Proteomic data 

4.1 Protein Sequence Data 

Directly sequenced protein/peptide sequences can be submitted to UniProt

(Universal Protein Resource). The recommended method for direct

submission to UniProt is via the SPIN website

(http://www.ebi.ac.uk/swissprot/Submissions/submissions.html).

The EGTDC is currently evaluating solutions for proteomics data. If you have

data to submit, please contact the EGTDC (helpdesk@envgen.nox.ac.uk).

5. References 

Carola Kanz, Philippe Aldebert, Nicola Althorpe, et al. The EMBL Nucleotide

Sequence Database. Nucl. Acids Res. 2005. Full text at

http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D29

Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and

David L. Wheeler. GenBank. Nucl. Acids Res. 2005. Full text at

http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D34
