Corona Lite Introduction
Section Outline – <strong>Corona</strong> <strong>Lite</strong> <strong>Introduction</strong><br />
• Workflow and Setup<br />
• Matching pipeline<br />
• Pairing pipeline<br />
• Variation pipeline<br />
2 © 2009 Applied Biosystems
<strong>Corona</strong> <strong>Lite</strong> Overview<br />
Global SETS Versus Corona Lite<br />
Category SOLiD Global SETS v3.0 <strong>Corona</strong>_<strong>Lite</strong> v4.2<br />
Mapping Algorithm MapReads MapReads<br />
Mapping scheme -<br />
progressive<br />
-full length with fixed<br />
number of mismatches<br />
Yes, for max throughput.<br />
(default)<br />
Yes<br />
Repeat Classifier Yes, new in v3.0 No<br />
MatchingRepeat, Random,<br />
and Consolidate<br />
Yes, new in v3.0<br />
No<br />
Yes<br />
SNP algorithm diBayes SNP caller<br />
Multiple run combination<br />
analysis<br />
Integrated small indel<br />
analysis<br />
No<br />
Jun-09<br />
No<br />
Yes.<br />
Yes<br />
Global SETS Versus Corona Lite<br />
Category | SOLiD Global SETS v3.0 | Corona_Lite v4.2<br />
Matching: FASTA-like .ma files | Yes | Yes<br />
Matching: gff v.2 | Default | Optional using MaToGff.sh<br />
Pairing: .mates | Yes | Yes<br />
Pairing: gff v.2 | Default | Optional using MatesToGff.sh<br />
SNP pipe: SNP summary | Gff v.3 | SNP list text file<br />
SNP pipe: Consensus base sequence | Yes | Yes<br />
Stats files | New format | Old stats file<br />
<strong>Corona</strong> <strong>Lite</strong> Setup<br />
• Before you start<br />
• Set the correct environment<br />
• Make cmap file<br />
• Validate reference<br />
• Generate double encode reference<br />
<strong>Corona</strong> <strong>Lite</strong> Setup – Environment Variables<br />
• Set up environment variables<br />
• for csh/tcsh:<br />
• setenv CORONAROOT /share/apps/corona_lite<br />
• source $CORONAROOT/etc/profile.d/corona.csh<br />
• for sh/ksh/bash:<br />
• export CORONAROOT=/share/apps/corona_lite<br />
• source $CORONAROOT/etc/profile.d/corona.sh<br />
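The sh/ksh/bash lines above can be wrapped in a small guard so that a wrong install path fails early instead of silently leaving the environment unset. This is a minimal sketch: `setup_corona` is a hypothetical helper name, not something shipped with Corona Lite, and the default path is simply the one shown on the slide.

```shell
# Hypothetical helper (not part of Corona Lite): set CORONAROOT and
# source the profile script only if it is actually there.
setup_corona() {
    root="${1:-/share/apps/corona_lite}"      # default path taken from the slide
    profile="$root/etc/profile.d/corona.sh"
    if [ ! -f "$profile" ]; then
        echo "profile script not found: $profile" >&2
        return 1
    fi
    CORONAROOT="$root"
    export CORONAROOT
    . "$profile"                              # same effect as 'source' in bash
}
```

Calling `setup_corona` with no argument uses the slide's path; pass a different install directory as the first argument if yours differs.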
Corona Lite Setup – Chromosome Map (cmap) File<br />
• Prepare the chromosome map file (tab-delimited):<br />
• Chromosome ID<br />
• Chromosome Name<br />
• FASTA Reference<br />
• Double-Encoded Reference<br />
• For example<br />
1 chr1 /path/to/file/chr1.fa /path/to/file/de_chr1.fa<br />
2 chr2 /path/to/file/chr2.fa /path/to/file/de_chr2.fa<br />
3 chr3 /path/to/file/chr3.fa /path/to/file/de_chr3.fa<br />
4 chr4 /path/to/file/chr4.fa /path/to/file/de_chr4.fa<br />
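A cmap file with the four columns listed above can be generated rather than typed by hand. The sketch below assumes one FASTA file per chromosome named `chrN.fa` in a single directory and assumes the double-encoded files will be named `de_chrN.fa` next to them, as in the example rows. `make_cmap` is a hypothetical helper name. IDs follow shell glob order (chr1, chr10, chr2, ...), so renumber afterwards if a different order is needed.

```shell
# Hypothetical helper: print tab-delimited cmap rows for every chr*.fa
# in a directory. Columns: ID, name, FASTA path, double-encoded path.
make_cmap() {
    ref_dir="$1"
    id=0
    for fa in "$ref_dir"/chr*.fa; do
        [ -f "$fa" ] || continue              # skip if the glob matched nothing
        name=$(basename "$fa" .fa)
        id=$((id + 1))
        # The de_ path is only a name here; the file is created in a later step.
        printf '%s\t%s\t%s\t%s\n' "$id" "$name" "$fa" "$ref_dir/de_$name.fa"
    done
}
```

For example, `make_cmap /path/to/file > genome.cmap` writes the rows to a file named `genome.cmap` (the file name is arbitrary).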
<strong>Corona</strong> <strong>Lite</strong> Setup – Validate and Double Encode Ref<br />
• Validate reference<br />
• reference_validation.pl -r chr1.fa -s 9999999999 -o chr1_validated.fa<br />
• Generate double-encoded sequence<br />
• encodeFasta.py -n -l sequence.fasta > de_sequence.fasta<br />
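The two reference-preparation commands can be chained per chromosome. The sketch below only prints the command lines so they can be inspected first. The flags are copied from the slide as-is. Feeding the validated file into `encodeFasta.py` and naming the result `de_<name>.fa` are assumptions made here to line up with the cmap example, since the slide uses unrelated file names for the two steps. `print_ref_prep` is a hypothetical helper name.

```shell
# Hypothetical helper: print (do not run) the validation and
# double-encoding commands for one chromosome FASTA file.
print_ref_prep() {
    fa="$1"
    base=$(basename "$fa" .fa)
    # Flags -r, -s, -o and the -s value are taken verbatim from the slide.
    echo "reference_validation.pl -r $fa -s 9999999999 -o ${base}_validated.fa"
    # Assumption: the validated file is the input to the encoding step.
    echo "encodeFasta.py -n -l ${base}_validated.fa > de_${base}.fa"
}
```

Once the printed lines look right, they can be piped to `sh` or pasted into a job script.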
Section Outline – <strong>Corona</strong> <strong>Lite</strong> <strong>Introduction</strong><br />
• Workflow and Setup<br />
• Matching pipeline<br />
• Pairing pipeline<br />
• Variation pipeline<br />
Things To Consider<br />
• Number of hits to report (-z)<br />
• Default is 10 per chromosome<br />
• What does it mean if a read hits 10 times?<br />
• Recommended mismatches<br />
• 2 for 25bp reads<br />
• 3-4 for 35bp reads<br />
• 4-6 for 50bp reads<br />
• Consider counting valid adjacent mismatches as 1 (-a=1)<br />
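The recommended mismatch settings above can be kept in a small lookup so that scripts pick a value consistently. This is only a sketch of the table on the slide; `recommended_mismatches` is a hypothetical helper name, and read lengths not listed on the slide are rejected rather than guessed.

```shell
# Hypothetical helper: print the slide's recommended mismatch range
# for a given read length in bp.
recommended_mismatches() {
    case "$1" in
        25) echo "2" ;;
        35) echo "3-4" ;;
        50) echo "4-6" ;;
        *)  echo "no recommendation on the slide for ${1}bp reads" >&2
            return 1 ;;
    esac
}
```

The printed value is a range for 35bp and 50bp reads, so a single number still has to be chosen for the -e option.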
Matching Parameters (Required)<br />
• matching_large_genomes_cmap_save.pl<br />
• -csfasta – F3 or R3 reads<br />
• -dir – Output directory<br />
• -cmap – Chromosome map file<br />
• -t – Tag length<br />
• -e – Number of errors allowed<br />
Matching Parameters (Optional)<br />
• matching_large_genomes_cmap_save.pl<br />
• -p – Pattern mask for reads<br />
• -a – Count adjacent errors as one: 0 = no; 1 = valid adjacent errors; 2 = all adjacent errors (default: 0)<br />
• -z – Maximum number of hits per chromosome (default: 10)<br />
• -incremental – Remove reads that have already mapped<br />
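Putting the required and optional parameters together, a matching command line could be assembled as follows. The sketch prints the command instead of running it. The slides write `-a=1`, so the same `-flag=value` form is used here for every option; that is an assumption, and the script's own usage message should be checked. `matching_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: print a matching command line.
# Args: csfasta file, output dir, cmap file, tag length, errors allowed,
#       optional -a value (default 0), optional -z value (default 10).
matching_cmd() {
    # Assumption: every option takes the -flag=value form seen in "-a=1".
    echo "matching_large_genomes_cmap_save.pl -csfasta=$1 -dir=$2 -cmap=$3 -t=$4 -e=$5 -a=${6:-0} -z=${7:-10}"
}
```

For example, `matching_cmd reads_F3.csfasta out ref.cmap 25 2` prints a command for 25bp tags with 2 errors allowed and the default -a and -z values; the file names are placeholders.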
Submitting Jobs<br />
• For PBS, use submit_scripts_to_PBS.pl<br />
• Submission scripts exist for LSF, SGE and SMP machines<br />
• Required<br />
• -j – Job list file<br />
• Optional<br />
• -h – Usage description<br />
• -q – Specify a queue<br />
• -i – Interactive queue<br />
Section Outline – <strong>Corona</strong> <strong>Lite</strong> <strong>Introduction</strong><br />
• Workflow and Setup<br />
• Matching pipeline<br />
• Pairing pipeline<br />
• Variation pipeline<br />
Find Insert Size<br />
• pairing_by_group.pl<br />
• -F3 – F3 match file (.csfasta.ma)<br />
• -R3 – R3 match file (.csfasta.ma)<br />
• -e – Total errors allowed during mapping<br />
• -output_dir – Output directory<br />
• -find_pairing_dist – Flag for finding the distance distribution<br />
• Look at the pairingDist.freq.binned file<br />
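One way to turn the binned distribution into candidate insert-size bounds is to keep the bins whose count reaches some fraction of the peak. The slides do not describe the layout of pairingDist.freq.binned, so this sketch assumes two whitespace-separated columns (bin position, count) per line; adjust the column handling if the real file differs. `insert_size_range` and the 0.1 default fraction are hypothetical choices, not Corona Lite behavior.

```shell
# Hypothetical helper: print "min max" of the bins whose count is at
# least a given fraction of the peak count.
# Args: binned frequency file (assumed "bin count" per line),
#       optional fraction of the peak (default 0.1).
insert_size_range() {
    awk -v frac="${2:-0.1}" '
        NF >= 2 {
            n++
            bin[n] = $1 + 0                   # force numeric values
            cnt[n] = $2 + 0
            if (cnt[n] > peak) peak = cnt[n]
        }
        END {
            found = 0
            for (i = 1; i <= n; i++) {
                if (cnt[i] < frac * peak) continue
                if (!found || bin[i] < min) min = bin[i]
                if (!found || bin[i] > max) max = bin[i]
                found = 1
            }
            if (found) print min, max
        }' "$1"
}
```

The two printed numbers are only a starting point; inspect the distribution itself before settling on the minimum and maximum insert sizes.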
Insert Size Distribution<br />
[Figure: insert size distribution plot; horizontal axis 0 to 3000, vertical axis 0 to 800]<br />
Perform Mate-pair Rescue<br />
• pairing_by_group.pl<br />
• -F3 – F3 match file (.csfasta.ma)<br />
• -R3 – R3 match file (.csfasta.ma)<br />
• -e – Total errors allowed during mapping<br />
• -output_dir – Output directory<br />
• -min_insert_size – From distribution<br />
• -max_insert_size – From distribution<br />
• -ref – Multi FASTA reference file<br />
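The pairing step is therefore a two-pass workflow: one run of pairing_by_group.pl with -find_pairing_dist to obtain the distribution, then a second run with the chosen insert-size bounds and the reference for mate-pair rescue. The sketch below prints both command lines. As with matching, the `-flag=value` form is an assumption carried over from the slides' `-a=1` example, and `pairing_cmds` is a hypothetical helper name.

```shell
# Hypothetical helper: print the two pairing_by_group.pl invocations.
# Args: F3 .ma file, R3 .ma file, total errors, output dir,
#       min insert size, max insert size, multi-FASTA reference.
pairing_cmds() {
    # Assumption: options take the -flag=value form.
    common="pairing_by_group.pl -F3=$1 -R3=$2 -e=$3 -output_dir=$4"
    echo "$common -find_pairing_dist"                               # pass 1
    echo "$common -min_insert_size=$5 -max_insert_size=$6 -ref=$7"  # pass 2
}
```

In practice the second command is run only after the first has finished and the bounds have been read off the distribution.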
Mate-pair Descriptions<br />
• Mate-pairs are annotated with a three-letter code<br />
Section Outline – <strong>Corona</strong> <strong>Lite</strong> <strong>Introduction</strong><br />
• Workflow and Setup<br />
• Matching pipeline<br />
• Pairing pipeline<br />
• Variation pipeline<br />
SNP Pipeline<br />
• Preparation<br />
• Single tag: split_by_chromosome.pl<br />
• -f – Unique match file (.unique.csfasta.ma)<br />
• -c – Output chromosome directory<br />
• Mate pair: multi_chr_pairing_parser.pl<br />
• -mates – Mates file from pairing pipeline (.mates)<br />
• -o_dir – Output directory<br />
SNP Pipeline<br />
• Consensus and SNP calling<br />
• consensus_prep_and_wrapper_cmap_save_script.pl<br />
• -mates/match_dir – Output from preparation step<br />
• -cmap – Chromosome map file<br />
• -mlf3/mlr3 – Tag length<br />
• -ef3/er3 – Mismatches allowed<br />
• -o_dir – Output directory<br />
• -insert_start/_end – Pairing size for mate pair run<br />
SNP Pipeline<br />
• Consensus sequence is generated from the alignment to the reference sequence<br />
• Files<br />
• snps.txt<br />
• snps_sorted.txt<br />
• snp_probs.dat<br />
• bp_consensus_confirmed_sequence_with_Ns.fasta<br />
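After a SNP run, a quick check that the four files listed above were produced can catch a failed job early. The sketch assumes all four files sit directly in one output directory, which the slides do not state. `missing_snp_outputs` is a hypothetical helper name.

```shell
# Hypothetical helper: list which of the expected SNP pipeline output
# files are missing from a directory. Returns 0 if none are missing.
missing_snp_outputs() {
    dir="$1"
    rc=0
    for f in snps.txt snps_sorted.txt snp_probs.dat \
             bp_consensus_confirmed_sequence_with_Ns.fasta; do
        if [ ! -e "$dir/$f" ]; then
            echo "$f"                         # report each missing file name
            rc=1
        fi
    done
    return $rc
}
```

An empty result with exit status 0 means every expected file is present in the directory that was checked.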
<strong>Corona</strong> <strong>Lite</strong> Overview<br />
Quiz<br />
• What do you need to do before running <strong>Corona</strong> <strong>Lite</strong>?<br />
• What are the three main steps of <strong>Corona</strong> <strong>Lite</strong>?<br />
• What is the workflow of each pipeline in <strong>Corona</strong> <strong>Lite</strong>?<br />
• What is the meaning of the three-letter annotations of mate-pairs (e.g., AAA, ABA, etc.)?<br />
• What are the main differences between Corona Lite and Global SETS?<br />
Appendix
General – Global SETS Versus Corona Lite<br />
Category | SOLiD Global SETS v3.0 | Corona_Lite v4.2<br />
Supported OS | Linux CentOS, Scyld Clusterware, PBS (Torque); will test LSF, PBS Pro and SGE by June 2009 | Linux, PBS, LSF, SGE<br />
Programming language | Java (algorithms in C++) | Scripting languages (some algorithms in C++)<br />
Analysis setup and execution | Automatic through SETS GUI; integrated command line | Integrated command; can run in batch mode<br />
Integrate with custom pipeline | Yes (SAI); GUI and command line | Yes, command line interface<br />
Speed | Optimized for compute performance for complex genome analysis | Supports complex genome analysis<br />
Warranty | Yes | No<br />
AB support to end users | Yes | Yes<br />
License fee | Comes with SOLiD 3 System; contact AB sales | Free, open source<br />