Corona Lite Introduction


Section Outline – Corona Lite Introduction

• Workflow and Setup
• Matching pipeline
• Pairing pipeline
• Variation pipeline

2 © 2009 Applied Biosystems


Corona Lite Overview


GlobalSETS versus Corona Lite

Category SOLiD Global SETS v3.0 Corona_Lite v4.2

Mapping Algorithm MapReads MapReads

Mapping scheme
- progressive
- full length with fixed number of mismatches

Yes, for max throughput. (default)

Yes

Repeat Classifier Yes, new in v3.0 No

MatchingRepeat, Random, and Consolidate

Yes, new in v3.0

No

Yes

SNP algorithm diBayes SNP caller

Multiple run combination analysis

Integrated small indel analysis

No

Jun-09

No

Yes.

Yes


Global SETS Versus Corona Lite

Category | SOLiD Global SETS v3.0 | Corona_Lite v4.2
Matching: Fasta-like .ma files | Yes | Yes
Matching: gff v.2 | Default | Optional using MaToGff.sh
Pairing: .mates | Yes | Yes
Pairing: gff v.2 | Default | Optional using MatesToGff.sh
SNP pipe: SNP summary | Gff v.3 | SNP list text file
SNP pipe: Consensus base sequence | Yes | Yes
Stats files | New format | Old stats file


Corona Lite Setup

• Before you start
  • Set the correct environment
  • Make cmap file
  • Validate reference
  • Generate double-encoded reference


Corona Lite Setup – Environment Variables

• Set up environment variables
  • for csh/tcsh:
    • setenv CORONAROOT /share/apps/corona_lite
    • source $CORONAROOT/etc/profile.d/corona.csh
  • for sh/ksh/bash:
    • export CORONAROOT=/share/apps/corona_lite
    • source $CORONAROOT/etc/profile.d/corona.sh
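
A quick preflight check can catch a missing CORONAROOT before any pipeline script is launched. The following is a minimal sketch and not part of Corona Lite: the function names are hypothetical, and the only thing assumed about the installation is the profile script location shown above.

```python
import os
from pathlib import Path


def corona_profile_script(root, shell):
    """Return the profile script path for a shell family (hypothetical helper)."""
    suffix = "csh" if shell in ("csh", "tcsh") else "sh"
    return Path(root) / "etc" / "profile.d" / f"corona.{suffix}"


def check_corona_environment(env=None, shell="bash"):
    """Return a list of problems; an empty list means the setup looks plausible."""
    env = os.environ if env is None else env
    root = env.get("CORONAROOT")
    if not root:
        return ["CORONAROOT is not set"]
    problems = []
    if not Path(root).is_dir():
        problems.append(f"CORONAROOT does not point to a directory: {root}")
    script = corona_profile_script(root, shell)
    if not script.is_file():
        problems.append(f"profile script not found: {script}")
    return problems


if __name__ == "__main__":
    for problem in check_corona_environment():
        print("setup problem:", problem)
```

The check only inspects paths; it does not source the profile script, which still has to be done in the shell that runs the pipeline.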


Corona Lite Setup – Chromosome Map (cmap) File

• Prepare the chromosome map file (tab-delimited) with these columns:
  • Chromosome ID
  • Chromosome Name
  • FASTA Reference
  • Double-Encoded Reference
• For example:

1 chr1 /path/to/file/chr1.fa /path/to/file/de_chr1.fa
2 chr2 /path/to/file/chr2.fa /path/to/file/de_chr2.fa
3 chr3 /path/to/file/chr3.fa /path/to/file/de_chr3.fa
4 chr4 /path/to/file/chr4.fa /path/to/file/de_chr4.fa
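
For genomes with many chromosomes, the cmap file is easier to generate than to type. The sketch below writes and re-reads a four-column, tab-delimited file in the layout of the example above. The helper names are hypothetical, and the `<name>.fa` / `de_<name>.fa` naming simply mirrors the example paths; it is an assumption, not a Corona Lite requirement.

```python
import csv
import io


def build_cmap_rows(chromosomes, fasta_dir):
    """Build (id, name, fasta, double-encoded fasta) rows, numbering from 1."""
    rows = []
    for idx, name in enumerate(chromosomes, start=1):
        rows.append((str(idx), name,
                     f"{fasta_dir}/{name}.fa",
                     f"{fasta_dir}/de_{name}.fa"))
    return rows


def write_cmap(rows, handle):
    """Write rows as tab-delimited text, one chromosome per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


def read_cmap(handle):
    """Parse a cmap file and reject lines that do not have four fields."""
    entries = []
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {line_no}: expected 4 tab-separated fields, got {len(fields)}")
        entries.append(tuple(fields))
    return entries


if __name__ == "__main__":
    buffer = io.StringIO()
    write_cmap(build_cmap_rows(["chr1", "chr2"], "/path/to/file"), buffer)
    print(buffer.getvalue(), end="")
```

Reading the file back with `read_cmap` is a cheap way to catch spaces used in place of tabs.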


Corona Lite Setup – Validate and Double-Encode Reference

• Validate reference
  • reference_validation.pl -r chr1.fa -s 9999999999 -o chr1_validated.fa
• Generate double-encoded sequence
  • encodeFasta.py -n -l sequence.fasta > de_sequence.fasta
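
To make the idea of a double-encoded reference concrete, the sketch below converts a base sequence to SOLiD color calls (each color is determined by a pair of adjacent bases) and then rewrites the colors 0 to 3 as the letters A, C, G, T. It illustrates the general technique only: it is not the source of encodeFasta.py, it does not model the -n and -l options, and details such as how the first base or ambiguous bases are handled by the real tool are assumptions here.

```python
# Two-bit codes for bases; the color of a dinucleotide is the XOR of its codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
COLOR_TO_LETTER = "ACGT"  # color 0 -> A, 1 -> C, 2 -> G, 3 -> T


def to_colors(bases):
    """Translate bases into color calls, one per adjacent base pair."""
    bases = bases.upper()
    colors = []
    for left, right in zip(bases, bases[1:]):
        if left in BASE_TO_BITS and right in BASE_TO_BITS:
            colors.append(str(BASE_TO_BITS[left] ^ BASE_TO_BITS[right]))
        else:
            colors.append(".")  # assumption: unknown color for ambiguous bases
    return "".join(colors)


def double_encode(colors):
    """Rewrite color digits as letters; anything else becomes N (assumption)."""
    return "".join(COLOR_TO_LETTER[int(c)] if c in "0123" else "N" for c in colors)


if __name__ == "__main__":
    colors = to_colors("ATGC")
    print(colors, double_encode(colors))
```

Because the result uses only the letters A, C, G, T and N, it can be stored and searched like an ordinary FASTA sequence even though each letter stands for a color.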


Matching Pipeline


Things To Consider

• Number of hits to report (-z)
  • Default is 10 per chromosome
  • What does it mean if a read hits 10 times?
• Recommended mismatches
  • 2 for 25 bp reads
  • 3-4 for 35 bp reads
  • 4-6 for 50 bp reads
  • Can consider counting valid adjacent mismatches as 1 (-a=1)
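
The reason adjacent mismatches deserve special treatment is that, in color space, a single changed base alters two neighboring colors. The sketch below demonstrates this and counts such a pair as one mismatch. The validity test used here is the general color-space property that the XOR of the two colors is preserved when only the middle base changes; the exact rule MapReads applies for -a=1 is not given in these slides, so treat this as an assumption.

```python
BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """One color per adjacent base pair (XOR of two-bit base codes)."""
    return [BITS[a] ^ BITS[b] for a, b in zip(bases, bases[1:])]


def mismatch_positions(ref_colors, read_colors):
    """Indices where the read color differs from the reference color."""
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]


def is_valid_adjacent(ref_colors, read_colors, i):
    """True if colors i and i+1 both differ and are consistent with one base change."""
    if i + 1 >= min(len(ref_colors), len(read_colors)):
        return False
    both_differ = (ref_colors[i] != read_colors[i]
                   and ref_colors[i + 1] != read_colors[i + 1])
    same_xor = (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
    return both_differ and same_xor


def count_mismatches(ref_colors, read_colors, adjacent_as_one=False):
    """Count color mismatches, optionally merging each valid adjacent pair into one."""
    positions = mismatch_positions(ref_colors, read_colors)
    if not adjacent_as_one:
        return len(positions)
    total, k = 0, 0
    while k < len(positions):
        p = positions[k]
        paired = (k + 1 < len(positions) and positions[k + 1] == p + 1
                  and is_valid_adjacent(ref_colors, read_colors, p))
        total += 1
        k += 2 if paired else 1
    return total


if __name__ == "__main__":
    ref = to_colors("ATGCA")
    read = to_colors("ATCCA")  # one base changed: G -> C
    print(count_mismatches(ref, read), count_mismatches(ref, read, adjacent_as_one=True))
```

With this counting, a read carrying a true SNP uses up one mismatch from the budget instead of two, which is why the option matters most for short reads.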


Matching Parameters (Required)

• matching_large_genomes_cmap_save.pl
  • -csfasta – F3 or R3 reads
  • -dir – Output directory
  • -cmap – Chromosome map file
  • -t – Tag length
  • -e – Number of errors allowed
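
When the matching step is driven from a script, assembling the argument list in one place keeps the required parameters together. The sketch below uses only the five required options listed above. The helper name and file names are hypothetical, and passing each value as a separate argument after its flag is an assumption: the slides write one option as -a=1, so check the script's usage text for the syntax it actually expects.

```python
import shlex


def matching_command(csfasta, out_dir, cmap, tag_length, max_errors):
    """Build the argument list for the matching script (assumed 'flag value' syntax)."""
    if tag_length <= 0:
        raise ValueError("tag length must be positive")
    if max_errors < 0:
        raise ValueError("number of errors cannot be negative")
    return [
        "matching_large_genomes_cmap_save.pl",
        "-csfasta", csfasta,
        "-dir", out_dir,
        "-cmap", cmap,
        "-t", str(tag_length),
        "-e", str(max_errors),
    ]


if __name__ == "__main__":
    # Hypothetical file names; 35 bp tags with 3 errors follow the recommendation range.
    cmd = matching_command("sample_F3.csfasta", "matching_out", "genome.cmap", 35, 3)
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a string avoids quoting problems if a path contains spaces.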


Matching Parameters (Optional)

• matching_large_genomes_cmap_save.pl
  • -p – Pattern mask for reads
  • -a – Count adjacent errors as one: 0 = no; 1 = valid adjacent errors; 2 = all adjacent errors (defaults to 0)
  • -z – Maximum number of hits per chromosome (defaults to 10)
  • -incremental – Remove reads that have already mapped


Submitting Jobs

• For PBS, use submit_scripts_to_PBS.pl
• Submission scripts exist for LSF, SGE and SMP machines
• Required
  • -j – Job list file
• Optional
  • -h – Usage description
  • -q – Specify a queue
  • -i – Interactive queue


Pairing Pipeline


Find Insert Size

• pairing_by_group.pl
  • -F3 – F3 match file (.csfasta.ma)
  • -R3 – R3 match file (.csfasta.ma)
  • -e – Total errors allowed during mapping
  • -output_dir – Output directory
  • -find_pairing_dist – Flag for finding distance distribution
• Look at the pairingDist.freq.binned file
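
Reading insert-size limits off the binned distribution can also be done programmatically. The sketch below takes already-parsed (distance, count) pairs and returns the contiguous range around the peak whose counts stay above a fraction of the peak. The layout of pairingDist.freq.binned is not described in these slides, so parsing is left out; the function name, the 10% cutoff, and the sample numbers are all assumptions for illustration.

```python
def insert_size_bounds(bins, fraction_of_peak=0.1):
    """Return (min_distance, max_distance) around the histogram peak.

    bins: iterable of (distance, count) pairs.
    fraction_of_peak: bins adjacent to the peak are kept while their count
    is at least this fraction of the peak count (arbitrary default).
    """
    bins = sorted(bins)
    if not bins:
        raise ValueError("no bins given")
    counts = [count for _, count in bins]
    peak = max(range(len(bins)), key=lambda i: counts[i])
    cutoff = counts[peak] * fraction_of_peak
    low = peak
    while low > 0 and counts[low - 1] >= cutoff:
        low -= 1
    high = peak
    while high < len(bins) - 1 and counts[high + 1] >= cutoff:
        high += 1
    return bins[low][0], bins[high][0]


if __name__ == "__main__":
    # Made-up histogram, not real run data.
    example = [(0, 5), (500, 20), (1000, 300), (1500, 700),
               (2000, 400), (2500, 30), (3000, 2)]
    print(insert_size_bounds(example))
```

The resulting pair is only a starting point; the distribution should still be inspected by eye before the limits are used for mate-pair rescue.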


Insert Size Distribution

[Figure: insert size distribution plot; horizontal axis from 0 to 3000, vertical axis from 0 to 800]


Perform Mate-pair Rescue

• pairing_by_group.pl
  • -F3 – F3 match file (.csfasta.ma)
  • -R3 – R3 match file (.csfasta.ma)
  • -e – Total errors allowed during mapping
  • -output_dir – Output directory
  • -min_insert_size – Lower bound, taken from the insert size distribution
  • -max_insert_size – Upper bound, taken from the insert size distribution
  • -ref – Multi FASTA reference file


Mate-pair Descriptions

• Mate-pairs are annotated with a three-letter code


Variation Pipeline


SNP Pipeline

• Preparation
  • Single tag: split_by_chromosome.pl
    • -f – Unique match file (.unique.csfasta.ma)
    • -c – Output chromosome directory
  • Mate pair: multi_chr_pairing_parser.pl
    • -mates – Mates file from pairing pipeline (.mates)
    • -o_dir – Output directory


SNP Pipeline

• Consensus and SNP calling
  • consensus_prep_and_wrapper_cmap_save_script.pl
    • -mates/match_dir – Output from preparation step
    • -cmap – Chromosome map file
    • -mlf3/mlr3 – Tag length
    • -ef3/er3 – Mismatches allowed
    • -o_dir – Output directory
    • -insert_start/_end – Pairing size for mate pair run

• -insert_start/_end – Pairing size for mate pair run<br />


SNP Pipeline

• Consensus sequence generated from alignment to the reference sequence
• Files
  • snps.txt
  • snps_sorted.txt
  • snp_probs.dat
  • bp_consensus_confirmed_sequence_with_Ns.fasta


Corona Lite Overview


Quiz

• What do you need to do before running Corona Lite?
• What are the three main steps of Corona Lite?
• What is the workflow of each pipeline in Corona Lite?
• What is the meaning of the three-letter annotations of mate-pairs (e.g., AAA, ABA, etc.)?
• What are the main differences between Corona Lite and GlobalSETS?


Appendix


General – Global SETS Versus Corona Lite

Category | SOLiD Global SETS v3.0 | Corona_Lite v4.2
Supported OS | Linux CentOS, Scyld Clusterware, PBS (Torque); will test LSF, PBS Pro and SGE by June 2009 | Linux, PBS, LSF, SGE
Programming language | Java (algorithms in C++) | Scripting languages (some algorithms in C++)
Analysis set up and execution | Automatic through SETS GUI; integrated command line | Integrated command; can run batch mode
Integrate with custom pipeline | Yes (SAI); GUI and command line | Yes, command line interface
Speed | Optimized for compute performance for complex genome analysis | Supports complex genome analysis
Warranty | Yes | No
AB support to end users | Yes | Yes
License fee | Comes with SOLiD 3 System; contact AB sales | Free, open source
