GPU-Based Real-Time Volumetric Ultrasound Image Reconstruction for a Ring Array

Choe et al., IEEE Transactions on Medical Imaging, vol. 32, no. 7, July 2013
Fig. 3. Real-time image reconstruction procedure.

… performance advantage over using standard pageable memory [16]. In addition, page-locked memory enables the asynchronous data copy required to implement task parallelism with CUDA streams, which will be discussed shortly. The raw data are then transferred to GPU memory for real-time image reconstruction and display. The signal processing procedure for image reconstruction is summarized in Fig. 3.

1) Data Transfer to GPU Memory, Hadamard Decoding, Analytic Signal Conversion, and Aperture Weighting: For an imaging depth of 25 mm and a sampling rate of 45 MHz, the system acquires 1536 data samples per A-scan. With each data sample stored as a 2-byte unsigned integer, 12 MB of raw data are obtained per frame from 64 transmit events and 64 receive channels. It takes about 8 ms to copy this amount of data to the GPU's global memory from standard pageable CPU memory; utilizing page-locked memory reduces this to 4.1 ms.

The received raw data are Hadamard-coded because the transmit pulses are spatially encoded with the Hadamard matrix. Therefore, the first step in the signal processing is to decode the raw data to obtain the A-scan for each transmit-receive element pair in the data set, which can be done by multiplying the raw data by the Hadamard matrix. For this step, we used a fast Hadamard transform algorithm with a computational complexity of $O(N \log N)$ [17], instead of simple matrix multiplication with its $O(N^2)$ complexity. The fast Hadamard transform follows the recursive definition of the Hadamard matrix,

$$H_{2N} = \begin{bmatrix} H_N & H_N \\ H_N & -H_N \end{bmatrix},$$

to recursively break the transform down into two half-sized transforms. A CUDA thread block is assigned to decode each A-scan, with the threads in the block accessing the data samples in sequential order to maximize the memory access efficiency. This algorithm consumes 4.5 ms of GPU time per frame in our implementation.

After Hadamard decoding, the software computes the analytic signal of the RF data. Ideally, this is done by the Hilbert transform, which involves a direct and an inverse Fourier transform.

Fig. 4. (a) Processing time for data transfer to GPU memory, Hadamard decoding, and analytic signal conversion combined with aperture weighting. (b) Hadamard decoding combined with analytic signal conversion and aperture weighting. (c) Task parallelism in data transfer and signal processing using CUDA streams.

Fig. 5. Delay-and-sum operations using multiple CUDA threads. $M$ threads run in parallel to reconstruct one image pixel; $M = 8$ in this implementation.

To save the time taken in the analytic signal conversion, we adopted the direct sampling process [18]. Since we sample the RF signal at four times the center frequency, one data sample corresponds to a quarter of the center-frequency period, i.e., a 90° phase shift; therefore, the quadrature component of the analytic signal can be approximated by the RF data delayed by one data sample. Aperture weighting is easily combined with direct sampling by multiplying by the corresponding weight when sampling the in-phase and the quadrature components of the RF signal. Direct sampling, including aperture weighting, takes 1.1 ms per frame, compared with 8.7 ms for the ideal Hilbert transform implemented using the Nvidia CUDA fast Fourier transform library (cuFFT).
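The decoding step can be pictured with the following CUDA sketch. It is a minimal illustration, not the authors' code: for simplicity it assigns one thread per (receive channel, time sample) pair and runs the full length-64 butterfly in registers, whereas the paper assigns a thread block per A-scan; the data layout and names are assumptions.

#define N_TX 64   // transmit events (Hadamard-coded)
#define N_RX 64   // receive channels

// In-place fast Hadamard transform across the transmit dimension.
// raw[(tx * N_RX + rx) * nSamples + s] holds the coded 2-byte samples;
// adjacent threads read adjacent samples, so global loads are coalesced.
__global__ void hadamardDecode(const unsigned short *raw, float *decoded,
                               int nSamples)
{
    int s  = blockIdx.x * blockDim.x + threadIdx.x;  // time sample
    int rx = blockIdx.y;                             // receive channel
    if (s >= nSamples) return;

    float v[N_TX];
    for (int tx = 0; tx < N_TX; ++tx)
        v[tx] = (float)raw[(tx * N_RX + rx) * nSamples + s];

    // log2(64) = 6 butterfly stages, following H_2N = [H_N H_N; H_N -H_N].
    for (int h = 1; h < N_TX; h <<= 1)
        for (int i = 0; i < N_TX; i += h << 1)
            for (int j = i; j < i + h; ++j) {
                float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }

    for (int tx = 0; tx < N_TX; ++tx)
        decoded[(tx * N_RX + rx) * nSamples + s] = v[tx];
}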
Furthermore, we can combine these processes with Hadamard decoding: we modified the last iteration of the fast Hadamard transform to generate both the in-phase and the quadrature components of the weighted signal. The three processes, Hadamard decoding, analytic signal conversion, and aperture weighting, take 4.9 ms when combined.
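In code, the direct-sampling conversion amounts to little more than a weighted copy with a one-sample delay. The sketch below shows it as a separate pass for clarity, whereas the implementation described above folds it into the last iteration of the fast Hadamard transform; the layout and the weight array are assumptions.

// Direct sampling: I is the RF sample itself, Q is the RF signal delayed by
// one sample (a 90° phase shift at fs = 4 * fc), both scaled by the aperture
// weight of the transmit-receive pair.
__global__ void directSamplingIQ(const float *rf, float2 *iq,
                                 const float *weight, int nPairs, int nSamples)
{
    int s    = blockIdx.x * blockDim.x + threadIdx.x;  // time sample
    int pair = blockIdx.y;                             // transmit-receive pair
    if (s >= nSamples || pair >= nPairs) return;

    const float *a = rf + (size_t)pair * nSamples;
    float w = weight[pair];
    float i = a[s];
    float q = (s > 0) ? a[s - 1] : 0.0f;               // one-sample delay
    iq[(size_t)pair * nSamples + s] = make_float2(w * i, w * q);
}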


TABLE III. RAW DATA SIZE AND SIGNAL PROCESSING TIME.

TABLE IV. IMAGE DATA SIZE AND SIGNAL PROCESSING TIME.

TABLE V. CALCULATED AND MEASURED FRAME RATES (IN FRAMES PER SECOND) FOR DISPLAYING THREE CROSS-SECTIONAL IMAGES.

TABLE VI. EXPERIMENTAL CONDITIONS FOR DISPLAYING THREE CROSS-SECTIONAL IMAGES.

Fig. 6. Location of 10 wire targets.

Using multiple CUDA streams, GPU kernel operations can run in parallel with data copies to or from GPU memory. Parallelizing the data transfer and the signal processing is effective because the two take comparable amounts of time. To implement this task parallelism, the RF data are split into 64 blocks, each containing the data received from one channel. The 64 data blocks are processed by two CUDA streams, one responsible for the odd channels' data and the other for the even channels'. While one stream transfers a block of data to GPU memory using the copy engine, the other stream performs signal processing on the previously transferred data block using the kernel engine, as illustrated in Fig. 4(c). The signal processing time for one data block is longer than 1/64 of the processing time for the entire data set, because fewer CUDA threads run in parallel on a smaller block. Nevertheless, the total processing time is shorter with this task parallelism: combined in this way, the four processes take 6.4 ms per frame instead of 4.1 ms + 4.9 ms = 9.0 ms.
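The host-side pipeline might look as follows. This is a hedged sketch, not the paper's code: decodeChannel stands in for the per-block signal processing, and the sizes, names, and launch configuration are assumptions. Note that cudaMemcpyAsync requires the page-locked host memory discussed earlier.

#include <cuda_runtime.h>
#include <cstring>

// Placeholder for the per-channel-block processing (Hadamard decoding,
// analytic signal conversion, and aperture weighting).
__global__ void decodeChannel(unsigned short *raw, int channel, int nSamples) { /* ... */ }

void processFrame(const unsigned short *frame, int nSamples)
{
    const int N_CH = 64;                                  // receive channels
    const size_t chWords = (size_t)64 * nSamples;         // 64 transmits per channel
    const size_t chBytes = chWords * sizeof(unsigned short);

    unsigned short *hostPinned, *devRaw;
    cudaHostAlloc(&hostPinned, N_CH * chBytes, cudaHostAllocDefault); // page-locked
    cudaMalloc(&devRaw, N_CH * chBytes);
    memcpy(hostPinned, frame, N_CH * chBytes);

    cudaStream_t stream[2];                               // 0: even, 1: odd channels
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int ch = 0; ch < N_CH; ++ch) {
        cudaStream_t s = stream[ch & 1];
        // While this stream copies channel ch on the copy engine, the other
        // stream's kernel can run on the kernel engine.
        cudaMemcpyAsync(devRaw + ch * chWords, hostPinned + ch * chWords,
                        chBytes, cudaMemcpyHostToDevice, s);
        decodeChannel<<<(nSamples + 255) / 256, 256, 0, s>>>(devRaw, ch, nSamples);
    }
    cudaDeviceSynchronize();                              // wait for both streams

    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
    cudaFree(devRaw);
    cudaFreeHost(hostPinned);
}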
2) Delay-and-Sum: The Hadamard-decoded and weighted analytic RF data are stored in a 2-D texture memory. With the two dimensions representing the transmit-receive channel pair and the time delay, the data samples constructing adjacent image pixels are spatially localized. Fig. 5 shows this texture memory structure and illustrates how multiple CUDA threads run in parallel to reconstruct the image data. To reconstruct an image with $N$ pixels, $NM$ CUDA threads are created, where $M$ is the number of threads assigned to one image pixel. In our applications with around 10 000 pixels, $M$ was empirically chosen to be eight: a larger $M$ makes the memory access inefficient, while a smaller $M$ increases the computational overhead per thread, increasing the total processing time. With eight threads assigned to each image pixel, each thread performs delay-and-sum operations on the A-scans received by eight of the 64 receive channels. The outputs of the eight threads are then summed to produce the complex image datum for one pixel. The threads reconstructing adjacent image pixels are grouped together to optimize the memory access pattern by exploiting the spatial locality of the data samples in the 2-D texture memory. For fast delay calculation, the transducer element locations are stored in the GPU's constant memory.
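A kernel following this thread assignment might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the texture is assumed to hold interleaved I/Q samples indexed by (delay, transmit-receive pair), delayFor() is a hypothetical round-trip delay computation from element positions kept in constant memory, and the block shape (8 threads per pixel, 32 pixels per block) is illustrative.

#define N_TX 64
#define N_RX 64
#define M     8                               // threads per image pixel

__constant__ float2 elemPos[N_RX];            // transducer element locations

// Hypothetical round-trip propagation delay (seconds) from transmit element
// tx to the pixel and back to receive element rx; c = 1540 m/s assumed.
__device__ float delayFor(float3 pix, int tx, int rx)
{
    float2 t = elemPos[tx], r = elemPos[rx];
    float dt = sqrtf((pix.x - t.x) * (pix.x - t.x) + (pix.y - t.y) * (pix.y - t.y) + pix.z * pix.z);
    float dr = sqrtf((pix.x - r.x) * (pix.x - r.x) + (pix.y - r.y) * (pix.y - r.y) + pix.z * pix.z);
    return (dt + dr) / 1540.0f;
}

__global__ void delayAndSum(cudaTextureObject_t tex, const float3 *pixelPos,
                            float2 *image, int nPixels, float fs)
{
    __shared__ float2 partial[32][M];         // per-pixel partial sums
    int m = threadIdx.x;                      // channel slice (0..7)
    int p = blockIdx.x * blockDim.y + threadIdx.y;

    float2 acc = make_float2(0.0f, 0.0f);
    if (p < nPixels) {
        float3 pix = pixelPos[p];
        for (int rx = m; rx < N_RX; rx += M)  // 8 receive channels per thread
            for (int tx = 0; tx < N_TX; ++tx) {
                float d  = delayFor(pix, tx, rx) * fs;            // delay in samples
                float2 v = tex2D<float2>(tex, d + 0.5f,
                                         tx * N_RX + rx + 0.5f);  // (delay, pair)
                acc.x += v.x;
                acc.y += v.y;
            }
    }
    partial[threadIdx.y][m] = acc;
    __syncthreads();

    if (m == 0 && p < nPixels) {              // reduce the eight partial sums
        float2 sum = make_float2(0.0f, 0.0f);
        for (int k = 0; k < M; ++k) {
            sum.x += partial[threadIdx.y][k].x;
            sum.y += partial[threadIdx.y][k].y;
        }
        image[p] = sum;
    }
}

Launching with a block of dim3(M, 32) keeps the eight threads of one pixel, and the threads of adjacent pixels, close together, in line with the spatial-locality argument above.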
Fig. 7. Snapshot of real-time cross-sectional images of the wire phantom displayed at 45 frames per second. (a) A B-mode image showing the 10 wires. (b) Another B-mode image orthogonal to (a), showing the two wires at the center (3 and 8 in Fig. 6). (c) A constant-R image at the depth of wire 3 in Fig. 6.

TABLE VII. CALCULATED AND EXPERIMENTAL IMAGE FRAME RATES (IN FRAMES PER SECOND) AND VOLUME RATES (IN VOLUMES PER SECOND) FOR DISPLAYING ONE ROTATING B-MODE IMAGE AND ITS MAXIMUM INTENSITY PROJECTION.

TABLE VIII. EXPERIMENTAL CONDITIONS FOR DISPLAYING ONE B-MODE IMAGE AND ITS MAXIMUM INTENSITY PROJECTION.

Fig. 8. Metal spring phantom.

3) Envelope Detection and Image Postprocessing: As many threads as there are image pixels are launched in parallel to produce the final image displayed on the screen. Each thread calculates the magnitude of one pixel from the complex image data and optionally applies logarithmic or gamma compression for display. While being displayed on the screen, the final image data are transferred back to CPU memory for on-demand saving.
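This stage reduces to a per-pixel magnitude and compression, as in the following sketch; the normalization and the dynamic-range parameter are assumptions for illustration.

// One thread per pixel: envelope = |I + jQ|, then logarithmic compression.
// Assumes the envelope has been normalized so that the peak maps to 0 dB;
// dynamicRangeDb (e.g., 50) sets the displayed dynamic range.
__global__ void envelopeDetect(const float2 *complexImage, float *display,
                               int nPixels, float dynamicRangeDb)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPixels) return;

    float2 v   = complexImage[p];
    float  env = sqrtf(v.x * v.x + v.y * v.y);
    float  db  = 20.0f * log10f(env + 1e-12f);        // offset avoids log(0)
    display[p] = fminf(fmaxf(1.0f + db / dynamicRangeDb, 0.0f), 1.0f);
}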
III. REAL-TIME IMAGING PERFORMANCE

The signal processing time per image frame is a function of the input and output data sizes. The input data size grows as we increase the imaging depth or the sampling rate, lengthening the processing time for data transfer, Hadamard decoding, and analytic signal conversion. The output data size, on the other hand, grows when we reconstruct more image pixels by increasing the field of view or the pixel density, increasing the time spent in the delay-and-sum and envelope detection operations. Tables III and IV summarize how the signal processing time changes with the data size.

A. Cross-Sectional Imaging

To deliver the volume information on the screen in real time, the software displays three cross-sectional images: two B-mode planes orthogonal to each other and one constant-R plane. The frame rate varies with the number of data samples acquired per A-scan and the number of pixels reconstructed for these planes; therefore, we can increase the frame rate by reducing the field of view or lowering the image pixel density. Calculated and experimentally measured frame rates under various conditions are listed in Table V. The calculated frame rates are based only on the time spent for image reconstruction on the GPU. The experimental frame rates are lower because of the additional CPU overhead for raw data reception and, in some cases, the 60-Hz monitor refresh rate.
Fig. 9. (a) Real-time rotating B-mode image of the spring phantom and (b) its maximum intensity projection, captured when the B-mode image crossed the … plane. Frames were updated at 60 Hz, resulting in 0.33 volumes per second with a rotation angle step of 1°. The frame rate was limited by the monitor refresh rate.

A test phantom was made using ten 150-μm fluorocarbon fishing wires [Fig. 6] to demonstrate the real-time display of three cross-sectional images. With an imaging depth of 25 mm, we acquired 1536 RF data samples at a 45-MHz sampling rate for each A-scan. A total of 24 059 pixels were reconstructed for the three images and then scan-converted using OpenGL for display on the screen. More details on the imaging conditions can be found in Table VI, and a snapshot of the resulting real-time images of the three cross sections is presented in Fig. 7. We achieved a rate of 45 fps to reconstruct and display these images, a significant improvement over the 10 fps reported in [4]. The image pixels were sampled uniformly in a Cartesian $(x, y, z)$ coordinate system in [4], which resulted in more pixels for the same field of view than in Fig. 7, where the pixels were sampled more efficiently in $(R, \theta, \phi)$ space. For the same number of 24 059 pixels, our previous CPU-based software reconstructs 19 frames per second.

B. Volumetric Imaging

An alternative way to effectively display a volume is to rotate a B-mode image. In this implementation, we display only one B-mode plane that rotates about the array axis by a small angle from frame to frame, covering the whole volume after a 180° rotation. For better visualization of the volume, the orthogonal projection of this rotating plane is calculated, and the maximum intensity projection (MIP) for the whole volume is displayed in real time. Computing the MIP takes about 0.1 ms, which is negligible compared with the time taken by image reconstruction. The volume rate for this implementation depends on the rotation angle step as well as the input and output data sizes. Table VII shows the volume rates measured experimentally under different conditions.

A metal spring with a 6-mm diameter [Fig. 8] was imaged to display one rotating B-mode plane and its MIP in real time. Each A-scan contained 1024 data samples for an imaging depth of 15 mm, and the B-mode plane consisted of 11 648 pixels. Table VIII summarizes the conditions for this experiment. An imaging rate of 104 fps would have been achieved if limited only by the reconstruction time, but it was capped by the monitor refresh rate at 60 fps. The supplemental video shows the rotating plane and its MIP, and Fig. 9 depicts one screenshot taken from this experiment. The B-mode plane [Fig. 9(a)] rotated by 1° in every frame, covering the whole volume after a 180° rotation, which resulted in a volume rate of 0.33 volumes per second. The MIP image [Fig. 9(b)] was updated with the running MIP as the plane rotated in each frame, yielding a complete MIP image after every 180° rotation, after which it was reset for the next volume scan. In the MIP of the supplemental video, only half of the plane was projected in each frame to better visualize the rotating plane; therefore, a complete MIP image was created in every 360° rotation.
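Per frame, updating the running MIP is a small kernel: each pixel of the newly reconstructed plane is projected orthogonally onto the viewing plane and folded into a running maximum. The sketch below is hypothetical; the actual projection geometry is not given in the paper, and collisions between plane pixels projecting onto the same MIP pixel are ignored here.

// Orthographic projection of a plane rotated by angleRad about the array
// axis: the lateral coordinate is foreshortened by cos(angle), depth maps
// directly. The MIP buffer is reset after each 180-degree rotation.
__global__ void updateRunningMip(const float *plane, float *mip,
                                 int planeW, int planeH,
                                 float angleRad, int mipW)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;   // lateral index
    int iz = blockIdx.y * blockDim.y + threadIdx.y;   // depth index
    if (ix >= planeW || iz >= planeH) return;

    float u = ix - 0.5f * planeW;                     // signed lateral offset
    int   x = (int)lrintf(0.5f * mipW + u * cosf(angleRad));
    int   q = iz * mipW + x;

    mip[q] = fmaxf(mip[q], plane[iz * planeW + ix]);  // running maximum
}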
IV. CONCLUSION

We developed GPU-based real-time volumetric imaging software for a CMUT ring array and experimentally demonstrated its performance. With massive parallelism in the synthetic beamforming operations and more efficient sampling of image pixels, the GPU-based software reconstructs real-time images 4.5 times faster than the previous multi-core CPU-based software when displaying three cross-sectional images. The faster computation on the GPU platform also enables an alternative volume representation, in which a fast-rotating B-mode plane and its MIP are displayed in real time. Both imaging modes were experimentally tested using a fishing-wire phantom and a metal spring phantom, and both successfully generated real-time volumetric images as presented in this paper.

The achieved frame rates and volume rates are primarily limited by the GPU computational speed. As more powerful graphics cards become available, our software performance is expected to improve further. Relevant advances include more efficient architectures, more GPU cores, and larger shared memory space for storing the raw data samples that are currently kept in slower global memory. In addition, we can increase the throughput using multiple graphics …
