Corona Lite Introduction

Section Outline – Corona Lite Introduction 

• Workflow and Setup 

• Matching pipeline 

• Pairing pipeline 

• Variation pipeline 

© 2009 Applied Biosystems

Corona Lite Overview 


Global SETS versus Corona Lite 

Category SOLiD Global SETS v3.0 Corona_Lite v4.2 

Mapping Algorithm MapReads MapReads 

Mapping scheme 

- progressive 

- full length with fixed number of mismatches 

Yes, for max throughput (default) 

Yes 

Repeat Classifier Yes, new in v3.0 No 

MatchingRepeat, Random, and Consolidate 

Yes, new in v3.0 

No 

Yes 

SNP algorithm diBayes SNP caller 

Multiple run combination analysis 

Integrated small indel analysis 

No 

Jun-09 

No 

Yes. 

Yes 


Global SETS Versus Corona Lite 


Category | Global SETS | Corona Lite
Matching: FASTA-like .ma files | Yes | Yes
Matching: gff v.2 | Default | Optional, using MaToGff.sh
Pairing: .mates | Yes | Yes
Pairing: gff v.2 | Default | Optional, using MatesToGff.sh
SNP pipe: SNP summary | Gff v.3 | SNP list text file
SNP pipe: consensus base sequence | Yes | Yes
Stats files | New format | Old stats file


Corona Lite Setup 

• Before you start 

• Set the correct environment 

• Make cmap file 

• Validate reference 

• Generate double encode reference 


Corona Lite Setup – Environment Variables 

• Set up environment variables 

• for csh/tcsh: 

• setenv CORONAROOT /share/apps/corona_lite 

• source $CORONAROOT/etc/profile.d/corona.csh 

• for sh/ksh/bash: 

• export CORONAROOT=/share/apps/corona_lite 

• source $CORONAROOT/etc/profile.d/corona.sh 


Corona Lite Setup – Chromosome Map (cmap) File 

• Prepare the chromosome map file (tab-delimited): 

• Chromosome ID 

• Chromosome Name 

• FASTA Reference 

• Double-Encoded Reference 

• For example: 

1 chr1 /path/to/file/chr1.fa /path/to/file/de_chr1.fa 
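
As a hedged illustration of the four-column layout above, the following Python sketch writes one tab-delimited cmap line and reads it back. The helper names (`write_cmap`, `read_cmap`) and the field labels are hypothetical and are not part of Corona Lite; the paths are the placeholder paths from the example.

```python
import csv
import io

# Column order follows the slide: ID, name, FASTA reference, double-encoded reference.
# The field labels themselves are illustrative, not Corona Lite names.
CMAP_FIELDS = ("chrom_id", "chrom_name", "fasta", "de_fasta")


def write_cmap(rows, handle):
    """Write (id, name, fasta, de_fasta) tuples as tab-delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if len(row) != len(CMAP_FIELDS):
            raise ValueError(f"expected {len(CMAP_FIELDS)} columns, got {len(row)}")
        writer.writerow(row)


def read_cmap(handle):
    """Parse tab-delimited cmap lines into dicts, skipping blank lines."""
    entries = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != len(CMAP_FIELDS):
            raise ValueError(f"malformed cmap line: {line!r}")
        entries.append(dict(zip(CMAP_FIELDS, parts)))
    return entries


buffer = io.StringIO()
write_cmap([("1", "chr1", "/path/to/file/chr1.fa", "/path/to/file/de_chr1.fa")], buffer)
cmap_text = buffer.getvalue()
cmap_entries = read_cmap(io.StringIO(cmap_text))
```

Checking the column count on both write and read is a cheap way to catch a line that was separated with spaces instead of tabs.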





Corona Lite Setup – Validate and Double Encode Ref 

• Validate reference 

• reference_validation.pl -r chr1.fa -s 9999999999 -o chr1_validated.fa 

• Generate double-encoded sequence 

• encodeFasta.py -n -l sequence.fasta > de_sequence.fasta 
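
To make the idea of a double-encoded reference concrete, here is a minimal Python sketch of the general SOLiD convention: each pair of adjacent bases becomes a color digit (0 to 3), and double encoding writes those digits as the letters A, C, G, T. This is an illustration of the concept only. How encodeFasta.py treats FASTA headers, the leading base, line wrapping, or ambiguous bases such as N is not shown in the slides, and the function names here are hypothetical.

```python
# 2-bit base codes chosen so that XOR of two adjacent bases gives the SOLiD color:
# AA/CC/GG/TT -> 0, AC/CA/GT/TG -> 1, AG/GA/CT/TC -> 2, AT/TA/CG/GC -> 3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

# Double encoding writes color digit 0, 1, 2, 3 as the letter A, C, G, T.
COLOR_TO_LETTER = "ACGT"


def to_colors(bases):
    """Return one color digit per adjacent base pair (no leading base kept).

    Ambiguous bases such as N are not handled in this sketch and raise KeyError.
    """
    bases = bases.upper()
    return "".join(
        str(BASE_CODE[first] ^ BASE_CODE[second])
        for first, second in zip(bases, bases[1:])
    )


def double_encode(colors):
    """Rewrite a string of color digits as letters."""
    return "".join(COLOR_TO_LETTER[int(color)] for color in colors)


colors = to_colors("ATGC")
encoded = double_encode(colors)
```

A sequence of n bases yields n - 1 colors, so the double-encoded string in this sketch is one character shorter than its input.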








Things To Consider 

• Number of hits to report (-z) 

• Default is 10 per chromosome 

• What does it mean if a read hits 10 times? 

• Recommended mismatches 

• 2 for 25bp reads 

• 3-4 for 35bp reads 

• 4-6 for 50bp reads 

• Can consider counting valid adjacent mismatches as 1 (-a=1) 
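
The recommended mismatch counts above can be kept as a small lookup when scripting runs. The sketch below is only a convenience wrapper around the three values on the slide; the function name is hypothetical, and read lengths not listed on the slide are rejected rather than guessed.

```python
# Read length (bp) -> (minimum, maximum) recommended mismatches, as listed on the slide.
RECOMMENDED_MISMATCHES = {
    25: (2, 2),
    35: (3, 4),
    50: (4, 6),
}


def recommended_mismatches(read_length):
    """Return the (low, high) mismatch range recommended for a read length."""
    try:
        return RECOMMENDED_MISMATCHES[read_length]
    except KeyError:
        raise ValueError(
            f"no recommendation listed for {read_length} bp reads"
        ) from None


low, high = recommended_mismatches(35)
```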


Matching Parameters (Required) 

• matching_large_genomes_cmap_save.pl 

• -csfasta – F3 or R3 reads 

• -dir – Output directory 

• -cmap – Chromosome map file 

• -t – Tag length 

• -e – Number of errors allowed 


Matching Parameters (Optional) 

• matching_large_genomes_cmap_save.pl 

• -p – Pattern mask for reads 

• -a – Count adjacent errors as one: 0 = no; 1 = valid adjacent errors; 2 = all adjacent errors (default: 0) 

• -z – Maximum number of hits per chromosome (default: 10) 

• -incremental – Remove reads that have already mapped 
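
As a sketch of how the required and optional parameters fit together, the Python helper below assembles an argument list for the matching script without executing it. The `-flag=value` spelling is an assumption inferred from the `-a=1` example in these slides; check the script's own usage message for the exact syntax. The helper name and the file names are hypothetical.

```python
def build_matching_command(csfasta, out_dir, cmap, tag_length, errors,
                           adjacent=None, max_hits=None):
    """Assemble an argument list for the matching script; nothing is executed.

    Assumption: options are written as -flag=value, as in the slides' -a=1.
    """
    command = [
        "matching_large_genomes_cmap_save.pl",
        f"-csfasta={csfasta}",   # F3 or R3 reads
        f"-dir={out_dir}",       # output directory
        f"-cmap={cmap}",         # chromosome map file
        f"-t={tag_length}",      # tag length
        f"-e={errors}",          # number of errors allowed
    ]
    if adjacent is not None:
        if adjacent not in (0, 1, 2):
            raise ValueError("-a accepts 0, 1, or 2")
        command.append(f"-a={adjacent}")
    if max_hits is not None:
        command.append(f"-z={max_hits}")
    return command


# Hypothetical file names, 35 bp reads, 3 errors, valid adjacent errors counted as one.
example_command = build_matching_command(
    "reads_F3.csfasta", "matching_out", "genome.cmap",
    tag_length=35, errors=3, adjacent=1,
)
print(" ".join(example_command))
```

Leaving `-a` and `-z` out of the list when they are not set lets the script fall back to its documented defaults (0 and 10).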


Submitting Jobs 

• For PBS, use submit_scripts_to_PBS.pl 

• Submission scripts exist for LSF, SGE and SMP machines 

• Required Options 

• -j – Job list file 

• Optional 

• -h – Usage description 

• -q – Specify a queue 

• -i – Interactive queue 








Find Insert Size 

• pairing_by_group.pl 

• -F3 – F3 match file (.csfasta.ma) 

• -R3 – R3 match file (.csfasta.ma) 

• -e – Total errors allowed during mapping 

• -output_dir – Output directory 

• -find_pairing_dist – Flag for finding distance distribution 

• Look at the pairingDist.freq.binned file 


Insert Size Distribution 

[Figure: insert size distribution plot; horizontal axis ticks 0 to 3000, vertical axis ticks 0 to 800] 


Perform Mate-pair Rescue 

• pairing_by_group.pl 

• -F3 – F3 match file (.csfasta.ma) 

• -R3 – R3 match file (.csfasta.ma) 

• -e – Total errors allowed during mapping 

• -output_dir – Output directory 

• -min_insert_size – From distribution 

• -max_insert_size – From distribution 

• -ref – Multi FASTA reference file 
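
One possible way to pick -min_insert_size and -max_insert_size from the distribution is to trim an equal share of pairs from each tail. The Python sketch below does this for a list of (insert size, count) pairs. The 95% cut-off is an illustrative assumption, not a Corona Lite rule, and the layout of pairingDist.freq.binned is not described in these slides, so parsing that file is left out; the function name and the sample counts are hypothetical.

```python
def insert_size_bounds(histogram, keep_fraction=0.95):
    """Return (min_size, max_size) after trimming equal tails of the distribution.

    histogram is an iterable of (insert_size, count) pairs.
    """
    bins = sorted(histogram)
    total = sum(count for _, count in bins)
    if total <= 0:
        raise ValueError("histogram has no counts")
    tail = total * (1.0 - keep_fraction) / 2.0

    # Walk up from the smallest bin until more than one tail's worth is covered.
    cumulative = 0
    low = bins[0][0]
    for size, count in bins:
        cumulative += count
        if cumulative > tail:
            low = size
            break

    # Walk down from the largest bin in the same way.
    cumulative = 0
    high = bins[-1][0]
    for size, count in reversed(bins):
        cumulative += count
        if cumulative > tail:
            high = size
            break

    return low, high


# Made-up counts for illustration only.
example_histogram = [(500, 1), (1000, 10), (1500, 80), (2000, 8), (2500, 1)]
min_insert, max_insert = insert_size_bounds(example_histogram)
```

A narrower keep_fraction gives tighter bounds at the cost of leaving more true pairs outside the expected range.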


Mate-pair Descriptions 

• Mate-pairs are annotated with a three-letter code 








SNP Pipeline 

• Preparation 

• Single tag: split_by_chromosome.pl 

• -f – Unique match file (.unique.csfasta.ma) 

• -c – Output chromosome directory 

• Mate pair: multi_chr_pairing_parser.pl 

• -mates – Mates file from pairing pipeline (.mates) 

• -o_dir – Output directory 


SNP Pipeline 

• Consensus and SNP calling 

• consensus_prep_and_wrapper_cmap_save_script.pl 

• -mates/match_dir – Output from preparation step 

• -cmap – Chromosome map file 

• -mlf3/mlr3 – Tag length 

• -ef3/er3 – Mismatches allowed 

• -o_dir – Output directory 

• -insert_start/_end – Pairing size for mate pair run 


SNP Pipeline 

• Consensus sequence generated from alignment to the reference sequence 

• Files 

• snps.txt 

• snps_sorted.txt 

• snp_probs.dat 

• bp_consensus_confirmed_sequence_with_Ns.fasta 


Corona Lite Overview 


Quiz 

• What do you need to do before running Corona Lite? 

• What are the three main steps of Corona Lite? 

• What is the workflow of each pipeline in Corona Lite? 

• What is the meaning of the three-letter annotations of mate-pairs (e.g., AAA, ABA, etc.)? 

• What are the main differences between Corona Lite and Global SETS? 


Appendix

General – Global SETS Versus Corona Lite 


Category | Global SETS | Corona Lite
Supported OS | Linux CentOS, Scyld Clusterware, PBS (Torque); will test LSF, PBS Pro and SGE by June 2009 | Linux, PBS, LSF, SGE
Programming language | Java (algorithms in C++) | Scripting languages (some algorithms in C++)
Analysis setup and execution | Automatic through SETS GUI; integrated command line | Integrated command; can run in batch mode
Integrate with custom pipeline | Yes (SAI); GUI and command line | Yes, command line interface
Speed | Optimized for compute performance for complex genome analysis | Supports complex genome analysis
Warranty | Yes | No
AB support to end users | Yes | Yes
License fee | Comes with SOLiD 3 System; contact AB sales | Free, open source
