Accurate Sequence Alignment Using Distributed Filtering on GPU Clusters
Reza Farivar*, Shivaram Venkataraman§, Roy H. Campbell*
*University of Illinois at Urbana-Champaign   §University of California, Berkeley


Problem Introduction
• What is gene sequencing?
• 1st phase: getting the reads
• 2nd phase: aligning them to a reference DNA


Problem Description
• Human genome sequence
  – 3.5 billion nucleotides (A, C, T, G)
  – Fewer than 30 chromosomes
  – Unidentified bases (N)
• Reads
  – Explore a sequence by designing and searching for short DNA reads
• Types of disagreements
  – Insertions, deletions, and mismatches


Why Is the Problem So Hard?
• Divide the reference DNA into patterns using a sliding window
• Brute force: (3 × 10^9 patterns) × (20 × 10^6 reads) pairwise comparisons
• Using a quad core, with each core working on two datasets in parallel using SSE, this would take 195 days to complete! (see the back-of-envelope sketch below)
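For scale, a back-of-envelope reading of those figures; the implied throughput is our inference from the slide's numbers, not a figure from the talk:

    // Back-of-envelope scale of the brute-force search.
    constexpr double kPatterns    = 3.0e9;                    // sliding-window patterns
    constexpr double kReads       = 20.0e6;                   // short reads
    constexpr double kComparisons = kPatterns * kReads;       // 6e16 pairwise checks
    constexpr double kSeconds     = 195.0 * 24 * 3600;        // the quoted 195 days
    constexpr double kImpliedRate = kComparisons / kSeconds;  // ~3.6e9 checks per second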


Our Approach
PHASE 1
• Generate masks for the filter phase of the algorithm
• Reduce the number of masks
• Filter the pattern/read pairs
PHASE 2
• Align the surviving pairs using an accelerator like GPUs, getting rid of the false positives


Phase 1 - Masks
• Related to the q-gram lemma
  – Based on the pigeonhole principle
• If n "disagreements" are spread out over a string and we divide it into n+k pieces, then at least k pieces will be exact matches (a small sketch follows)
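A minimal host-side sketch of the pigeonhole idea (illustrative; splitIntoPieces is our name, not the authors'). If a read aligns somewhere with at most n disagreements, those disagreements can touch at most n of the n+k pieces, so at least k pieces still match the reference exactly, and an exact search over the pieces can never miss a true match:

    #include <string>
    #include <vector>

    // Split a read into n+k roughly equal pieces for the pigeonhole filter.
    std::vector<std::string> splitIntoPieces(const std::string& read, int numPieces) {
        std::vector<std::string> pieces;
        const int len = static_cast<int>(read.size()) / numPieces;
        for (int i = 0; i < numPieces; ++i)
            pieces.push_back(read.substr(i * len, len));
        return pieces;
    }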


Distributed Filtering
• Using a sliding window, create patterns
• For each mask, build a "masked array" from all patterns
• For each read, apply the proper mask value and search in the corresponding masked array
• If found, this pattern/read pair is a potential match for phase 2 (a minimal sketch follows)
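A minimal sketch of the filter, assuming a bitmask in string form ('1' keeps a position, '0' zeroes it out, as in the figures that follow); applyMask, buildMaskedArray, and passesFilter are our illustrative names:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Zero out the positions where the mask bit is 0 ('0' stands in for a
    // masked-out base). Patterns and reads are masked the same way.
    std::string applyMask(const std::string& s, const std::string& mask) {
        std::string out = s;
        for (size_t i = 0; i < s.size() && i < mask.size(); ++i)
            if (mask[i] == '0') out[i] = '0';
        return out;
    }

    // Build one sorted masked array from all sliding-window patterns.
    std::vector<std::string> buildMaskedArray(const std::vector<std::string>& patterns,
                                              const std::string& mask) {
        std::vector<std::string> arr;
        for (const auto& p : patterns) arr.push_back(applyMask(p, mask));
        std::sort(arr.begin(), arr.end());
        return arr;
    }

    // Phase-1 test: a read survives if its masked form appears in the
    // sorted masked array; survivors go on to the exact alignment phase.
    bool passesFilter(const std::vector<std::string>& maskedArray,
                      const std::string& read, const std::string& mask) {
        return std::binary_search(maskedArray.begin(), maskedArray.end(),
                                  applyMask(read, mask));
    }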


Creating the Distributed Filter
[Figure: a reference sequence (TGCCTTCATTTCGTTATGTACCCAGTAGTCATAAAAGCACTAGCTTGCCAAGTT) is cut into sliding-window patterns; each pattern is masked (e.g., with mask 1 1 0 1 0 1 0 1 1, where 0 replaces the base at a masked-out position) and stored in sorted masked arrays, forming the distributed pigeonhole filter.]


Masked Read Matching
[Figure: a short read (CCATCA) is masked with each mask value (e.g., 1 1 0 1 0 1 0 1 1) and searched in the corresponding sorted masked array; a hit marks the pattern/read pair as a phase-2 candidate.]


Performance of Filtering

Filter Passed %   Avg. No. of Patterns   Masks Used
85.38             13.6                   1-1-1-2
13.39             21.38                  1-1-1-3
 2.84             46.66                  1-1-1-4


Limited Fast Memory
• NW in shared memory
  – Shared memory available per streaming multiprocessor (on Tesla) = 16 KB
  – For 32-long strings, 1024 bytes are required → 16 threads per SM
  – For 128-long strings → 1 thread per SM! (occupancy math sketched below)
• Problem with 16 threads per SM
  – Very low GPU utilization
  – Unable to hide read-after-write (RAW) latency
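The back-of-envelope occupancy math behind those numbers, assuming one byte per score-matrix cell (our assumption; the slide does not state the cell size):

    // A length-L Needleman-Wunsch matrix at one byte per cell.
    constexpr int kSharedPerSM = 16 * 1024;                   // bytes per SM (Tesla)
    constexpr int matrixBytes(int len) { return len * len; }  // score matrix footprint
    static_assert(kSharedPerSM / matrixBytes(32)  == 16, "32-long strings: 16 threads/SM");
    static_assert(kSharedPerSM / matrixBytes(128) ==  1, "128-long strings: 1 thread/SM");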


Matrix in Global Memory
• Allows hundreds of threads per streaming multiprocessor
• Many threads can be used to hide the latency incurred by each thread, the normal GPU practice
PROBLEM
• In the best case, we are global-memory bound
• The algorithm does not support coalesced memory accesses during back-tracing
• This results in serializing the threads


Algorithm Walk-Through
1. Divide the matrix into quadrants and store the boundaries
2. Start from the bottom-right quadrant (i, j = 31)
   – Fill the alignment matrix for the quadrant
   – Back-trace
   – Decide on a new quadrant
3. Repeat step 2 until i, j = 0 (a refill sketch follows)
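Step 2's refill, sketched on the host for clarity; the 8x8 quadrant size, the scores, and fillQuadrant itself are our illustrative assumptions, not the authors' code. H is one quadrant of the Needleman-Wunsch matrix, re-seeded from its stored boundary row and column; the back-trace then runs inside H:

    #include <algorithm>

    // Re-fill one 8x8 quadrant from its stored first row and first column.
    // a and b are the substrings of the two sequences this quadrant covers.
    void fillQuadrant(int H[9][9], const char* a, const char* b,
                      const int* topRow, const int* leftCol) {
        const int GAP = -1, MATCH = 1, MISMATCH = -1;        // illustrative scores
        for (int j = 0; j <= 8; ++j) H[0][j] = topRow[j];    // stored boundary row
        for (int i = 0; i <= 8; ++i) H[i][0] = leftCol[i];   // stored boundary column
        for (int i = 1; i <= 8; ++i)
            for (int j = 1; j <= 8; ++j) {
                const int diag = H[i-1][j-1] + (a[i-1] == b[j-1] ? MATCH : MISMATCH);
                H[i][j] = std::max({diag, H[i-1][j] + GAP, H[i][j-1] + GAP});
            }
    }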


Storing the Boundaries
[Figure: two 32-long sequences (ATCGAATTAAATTTT... and ATCGATTCGGAT...) and their alignment matrix divided into quadrants.]
• Calculating each quadrant requires the first row and the first column of the quadrant (256 bytes + 81 bytes)
• Storing the boundaries in shared memory → only 2 warps per SM → 6% occupancy
• Storing them in global memory → shared memory usage of 81 bytes → 192 threads per SM → 19% GPU utilization


Experimental Results: Speedup
• NVIDIA GTX 275 vs. Intel Core i7 920
[Figure: speedup (roughly 4.5x to 10.5x) vs. number of pairs to compare, for three variants: GPU registers, GPU 32-bit coalesced, GPU 64-bit coalesced.]


Run Time Breakdown
[Figure: run-time details in seconds (0 to 14) for: read from disk, creating the masked array, binary search for all reads in one masked array, GPU processing time, and miscellaneous.]


Future Work
• Scaling to public clouds
  – Distributed filtering using a cluster of GPUs
  – Integration with standard bioinformatics file formats
• Length of short sequences
  – Currently 32; we are extending to 100 and longer


Backup Slides


Goals and Assumptions
• Goals:
  – Make the kernel compute bound
    • Minimize global memory accesses; ensure coalescing
  – Maximize use of fast memory
    • Shared memory, registers
  – Eliminate diverging flows
• Assumptions:
  – If we eliminate the memory bottleneck, we will have spare cycles to re-compute when needed
  – Bank-conflict-free shared memory access is as fast as registers (see the padding sketch below)
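That last assumption holds only when a warp's accesses land in distinct banks. A minimal illustration of the standard padding trick (our example, not from the talk): with 32-word rows, thread t touching tile[t][i] hits address t*32+i, so the whole warp lands in the same bank on every step; padding rows to 33 words changes the address to t*33+i and spreads the warp across banks.

    // Bank-conflict-free shared memory access via row padding.
    __global__ void bankConflictFreeDemo(int* out) {
        __shared__ int tile[32][33];                       // 33, not 32: the one-word pad
        const int t = threadIdx.x;
        for (int i = 0; i < 32; ++i) tile[t][i] = t + i;   // fill own row
        __syncthreads();
        int sum = 0;
        for (int i = 0; i < 32; ++i) sum += tile[t][i];    // conflict-free reads
        out[t] = sum;
    }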


The Proposed Algorithm
• Dynamically construct smaller quadrants while back-tracing
• Solve each quadrant in shared memory
[Figure: the 32x32 alignment matrix of two sequences (e.g., ATCGAATTAAATTTT... vs. ATCGATTCGGAT...), with small quadrants constructed along the back-trace path.]


Optimization with Registers
• For 32-long strings, the boundaries take 256 bytes
• Trick: store the 256 bytes in 32 registers
• Problem: registers are not array-addressable
  – Using the registers conditionally means divergent code
• Solution: use registers as bulk storage, paging into shared memory (sketched below)
• Shared memory usage of 81 bytes → 19% GPU utilization, no global memory traffic
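A hedged sketch of the register-paging trick (our illustration; the real kernel holds all 32 boundary registers and computes their values in the fill phase). A local array indexed with a runtime value would spill, so the boundary words live in named registers and are copied, with constant indices, into a per-thread slice of shared memory where dynamic indexing is cheap:

    // Registers as bulk storage, paged into shared memory on demand.
    __global__ void registerPagingDemo(int* out) {
        int b0 = 0, b1 = 1, b2 = 2, b3 = 3;       // boundary words held in registers
        __shared__ int page[64 * 4];              // 4-word slice per thread, 64-thread block
        int* mine = &page[threadIdx.x * 4];
        mine[0] = b0; mine[1] = b1; mine[2] = b2; mine[3] = b3;  // page out (constant indices)
        const int j = threadIdx.x & 3;            // a runtime index the back-trace computes
        const int v = mine[j];                    // dynamic indexing, now legal and fast
        b0 = mine[0]; b1 = mine[1]; b2 = mine[2]; b3 = mine[3];  // page back into registers
        out[threadIdx.x] = v + b0 + b1 + b2 + b3;
    }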


Runtime vs. Speedup
[Figure: normalized run time (0 to 180) vs. number of pairs to compare, comparing normalized OpenMP CPU time with normalized GPU time.]


Experimental Results
• Compared to a regular Needleman-Wunsch implementation on an Intel i7 processor, our GPU implementation is 22x faster
• Compared to a multithreaded implementation of regular Needleman-Wunsch using OpenMP on 4 Intel i7 cores with Hyper-Threading (8 hardware threads), our implementation is up to 8x faster
• What does this mean? Gene alignment for a human DNA with 3 billion reads will now take 27 minutes instead of 3 hours!
