Appendix 2 – Data Submission Checklist
Appendix 2 – Submission to public databases and compliance with<br />
appropriate data standards<br />
1. Submission of biological data to public repositories<br />
2. Primary Sequence Data<br />
2.1 EMBL<br />
2.2 GenBank<br />
2.3 Expressed Sequence Tag (EST) sequence<br />
3. Transcriptomics Data<br />
3.1 Microarray Experiments<br />
4. Proteomic data<br />
4.1 Protein Sequence Data<br />
5. References<br />
1. Submission of biological data to public repositories<br />
This section reviews specific information for the handling and submission of<br />
common biological data types to public repositories including any standards<br />
or formats that must be adhered to. Web links for submission sites and<br />
programs are given where applicable. This list is based on data types<br />
currently being generated by EG Awardees and will be extended as<br />
necessary over time.<br />
Most public databases allow the data submitter to specify the date of public<br />
release of the data and will provide accession numbers. Please send us the<br />
public release date along with the accession numbers of your submission<br />
for inclusion in the data catalogue.<br />
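As an illustration only, the release date and accession numbers requested<br />
above could be kept together in a small record before being sent to the<br />
EGTDC. The Python sketch below is an assumption about how a submitter might<br />
organise this information; the class and field names (CatalogueEntry,<br />
repository, accessions, release_date) are hypothetical and do not reflect<br />
any EGTDC format.<br />

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CatalogueEntry:
    """Hypothetical record of one submission for the data catalogue."""
    repository: str                       # e.g. "EMBL", "GenBank", "ArrayExpress"
    accessions: List[str] = field(default_factory=list)
    release_date: Optional[date] = None   # date chosen for public release

    def missing_items(self) -> List[str]:
        """List what still needs to be collected for this entry."""
        missing = []
        if not self.accessions:
            missing.append("accession numbers")
        if self.release_date is None:
            missing.append("public release date")
        return missing


def summary(entry: CatalogueEntry) -> str:
    """Format the entry as a single line of text, e.g. for an email."""
    released = entry.release_date.isoformat() if entry.release_date else "not set"
    accs = ", ".join(entry.accessions) if entry.accessions else "none yet"
    return f"{entry.repository}: accessions {accs}; public release {released}"
```

Calling missing_items() before contacting the helpdesk gives a quick check<br />
that neither of the two requested details has been left out.<br />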
EGTDC staff are available to help with data submissions. If you have further<br />
queries, please contact helpdesk@envgen.nox.ac.uk.<br />
2. Primary Sequence Data<br />
Primary nucleotide sequence data is generally submitted directly to one of the<br />
three international repositories: EMBL, GenBank or DDBJ. Submission of<br />
EST sequences to the dbEST database is discussed in section 2.3.<br />
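Before using any of the submission tools described in this section, it can<br />
be useful to check sequence files locally. The sketch below is not part of<br />
any repository's tooling. It assumes the sequences are held in FASTA format<br />
and simply reports records that are empty or contain characters outside the<br />
IUPAC nucleotide codes. The function names are illustrative.<br />

```python
from typing import Dict, Iterable, List

# IUPAC nucleotide codes, including ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA text into {record identifier: sequence}."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word of the header line.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = ""
        elif current is not None:
            records[current] += line.upper()
    return records


def invalid_records(records: Dict[str, str]) -> List[str]:
    """Return identifiers of empty records or records with non-IUPAC characters."""
    bad = []
    for name, seq in records.items():
        if not seq or set(seq) - IUPAC_NUCLEOTIDES:
            bad.append(name)
    return bad
```

Because parse_fasta accepts any iterable of lines, an open file handle can<br />
be passed to it directly. An empty result from invalid_records does not<br />
guarantee that a repository will accept the file; it only rules out the two<br />
problems checked here.<br />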
2.1 EMBL<br />
Information for submitters to EMBL:<br />
http://www.ebi.ac.uk/embl/Documentation/information_for_submitters.html<br />
Webin (http://www.ebi.ac.uk/embl/Submission/webin.html) is the preferred<br />
system for submitting nucleotide sequence and biological annotation to<br />
EMBL (Kanz C. et al. 2005). Single, multiple, or large numbers of<br />
sequences can be submitted through this interface.<br />
Please note that EMBL stopped accepting email submissions on January<br />
1, 2003.<br />
If you expect to produce a large volume of genome sequence over an extended<br />
period of time, please contact the EMBL database administrators at<br />
datasubs@ebi.ac.uk.<br />
2.2 GenBank<br />
Submissions to GenBank can be made using the BankIt web submission<br />
tool (http://www.ncbi.nlm.nih.gov/BankIt/) or the Sequin tool<br />
(http://www.ncbi.nlm.nih.gov/Sequin/index.html). For simple submissions,<br />
BankIt is recommended (Benson D.A. et al. 2005). Sequin is available on<br />
Bio-Linux.<br />
2.3 Expressed Sequence Tag (EST) sequence<br />
EST sequences can be submitted to the public EST repository dbEST<br />
(http://www.ncbi.nlm.nih.gov/dbEST/). The trace2dbEST software,<br />
developed by the EGTDC and available on Bio-Linux, can be used for EST<br />
processing and direct submission to dbEST<br />
(http://envgen.nox.ac.uk/est.html).<br />
3. Transcriptomics Data<br />
3.1 Microarray Experiments<br />
Microarray experiment descriptions and results should be annotated to the<br />
MIAME standard and submitted to a public repository such as ArrayExpress<br />
(http://www.ebi.ac.uk/arrayexpress/). Further details on the MIAME standard<br />
can be found at http://envgen.nox.ac.uk/miame/index.html and on the<br />
MIAME/Env data standard at<br />
http://envgen.nox.ac.uk/miame/miame_env.html.<br />
We recommend the maxdLoad2 software, developed by the EGTDC and<br />
installed on Bio-Linux, for annotation and preparation of a file in MAGE-ML<br />
format suitable for submission to ArrayExpress.<br />
The EGTDC works closely with ArrayExpress. As of March 2005, the EGTDC<br />
recommends that microarray data be submitted via the EGTDC, so that a copy of<br />
the annotated data is held at the data centre as well as in ArrayExpress.<br />
Reasons for this include:<br />
• Functions for data retrieval and searching across datasets held in<br />
ArrayExpress are still under development and holding the data locally<br />
enables us to provide accessibility and functionality not currently<br />
supported by the public repository<br />
• Partial datasets can be submitted and held<br />
• Potential for searching across datasets of other types held in<br />
compatible databases being developed by the EGTDC<br />
Hence, the recommended process for submission of microarray expression<br />
data is as follows:<br />
1. Use maxdLoad2 to export your experiment as maxdML.<br />
2. Submit the maxdML file to the EGTDC.<br />
3. The EGTDC stores the data in a central maxd database and arranges<br />
submission to ArrayExpress, with all communications between the EBI and<br />
the EGTDC open to the researcher providing the dataset.<br />
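For groups handling several experiments at once, the three steps above can<br />
be tracked as an ordered sequence of stages. The following Python sketch is<br />
purely illustrative: the Stage names and helper functions are hypothetical<br />
and are not part of maxdLoad2 or of any EGTDC or ArrayExpress interface.<br />

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of the recommended route, in order (names are illustrative)."""
    NOT_STARTED = 0
    EXPORTED_MAXDML = 1        # step 1: experiment exported from maxdLoad2
    SENT_TO_EGTDC = 2          # step 2: maxdML file submitted to the EGTDC
    SENT_TO_ARRAYEXPRESS = 3   # step 3: EGTDC has arranged the submission


def advance(current: Stage) -> Stage:
    """Move to the next stage; stages cannot be skipped."""
    if current is Stage.SENT_TO_ARRAYEXPRESS:
        raise ValueError("already at the final stage")
    return Stage(current + 1)


def remaining(current: Stage) -> list:
    """Names of the stages still to be completed."""
    return [s.name for s in Stage if s > current]
```

Using an ordered enumeration keeps the steps in the sequence given above<br />
and makes it easy to report which ones are still outstanding for a dataset.<br />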
You will be issued an ArrayExpress accession number for your dataset. By<br />
default ArrayExpress holds back your data from public release until you have<br />
published the data.<br />
If you have made other arrangements to store your data and submit it<br />
to the EBI, please advise us of the accession number so that we can complete<br />
your data catalogue entry.<br />
For further information please see the EGTDC MIAME-Compliance Guide<br />
(http://envgen.nox.ac.uk/envgen/software/archives/000527.html)<br />
4. Proteomic data<br />
4.1 Protein Sequence Data<br />
Directly sequenced protein/peptide sequences can be submitted to UniProt<br />
(Universal Protein Resource). The recommended method for direct<br />
submission to UniProt is via the SPIN website<br />
(http://www.ebi.ac.uk/swissprot/Submissions/submissions.html).<br />
The EGTDC is currently evaluating solutions for proteomics data. If you have<br />
data to submit, please contact the EGTDC (helpdesk@envgen.nox.ac.uk).<br />
5. References<br />
Carola Kanz, Philippe Aldebert, Nicola Althorpe, et al. The EMBL Nucleotide<br />
Sequence Database. Nucl. Acids Res. 2005. Full text at<br />
http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D29<br />
Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and<br />
David L. Wheeler. GenBank. Nucl. Acids Res. 2005. Full text at<br />
http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D34<br />