High Performance Computing for Hyperspectral ... - IEEE Xplore

PLAZA et al.: HIGH PERFORMANCE COMPUTING FOR HYPERSPECTRAL REMOTE SENSING 533

not satisfy the ANC and the ASC constraints usually imposed in the linear mixture model. An estimate satisfying the ASC constraint can be obtained by solving the optimization problem in (4). Similarly, imposing the ANC constraint results in the optimization problem in (5). As indicated in [48], a non-negative constrained least squares (NCLS) algorithm can be used to obtain a solution to the ANC-constrained problem described in (5) in iterative fashion [49]. To take care of the ASC constraint, a new endmember signature matrix and a modified version of the abundance vector are introduced in (6), together with a parameter that controls the impact of the ASC constraint. Using the two expressions in (6), a fully constrained estimate can be obtained directly from the NCLS algorithm by replacing the endmember signature matrix and the abundance vector used in the NCLS algorithm with their modified counterparts.

IV. PARALLEL IMPLEMENTATIONS OF THE HYPERSPECTRAL UNMIXING CHAIN

This section first develops a parallel implementation of the hyperspectral unmixing chain described in Section III, specifically designed to run on massively parallel, homogeneous clusters. Then, the parallel version is transformed into a heterogeneity-aware implementation by introducing an adaptive data partitioning algorithm developed to capture the specificities of the underlying heterogeneous networks of distributed workstations. Finally, both FPGA and GPU implementations are also described.

A. Cluster-Based Implementation

To reduce code redundancy and enhance reusability, our goal when designing the cluster-based implementation was to reuse much of the code for the sequential algorithm. For that purpose, we adopted a spatial-domain decomposition approach [50] that subdivides the image cube into multiple blocks made up of entire pixel vectors, and assigns each block to a different processing element in the cluster.
It should be noted that our hyperspectral unmixing chain is based mainly on calculations in which pixels are always treated as entire spectral signatures. Therefore, a spectral-domain partitioning scheme (i.e., subdividing the multi-band data into sub-volumes across the spectral dimension) is not appropriate in our application domain, because the calculations made for each hyperspectral pixel would need to originate from several processing elements, thus requiring intensive inter-processor communication [13]. Instead, our implementation adopts a simple master-slave approach in which the master processor distributes spatial-domain partitions of the data to the workers and coordinates their actions. Then, the master gathers the partial results provided by the workers and produces a global result. The parallel algorithm is given by the following steps:

1) Data partitioning. The master processor produces a set of equally sized spatial-domain partitions of the hyperspectral image and scatters all partitions by indicating all partial data structure elements which are to be accessed and sent to each of the workers.
2) Skewer generation. The master generates a set of random unit vectors (skewers) and broadcasts the entire set of skewers to all the workers.
3) Extreme projections. For each skewer, each worker projects all the sample pixel vectors in its local partition onto that skewer to find the sample vectors at its extreme projections, and forms an extrema set for that skewer. Each worker then calculates the number of times each pixel vector in the local partition is selected as extreme using expression (7).
4) Candidate selection. Each worker now sends the number of times that each pixel vector in the local partition has been selected as extreme to the master, which forms a final matrix of pixel purity indexes by combining all the individual matrices provided by the workers.
5) Endmember selection.
The master selects those pixels whose pixel purity index meets the selection threshold and forms a unique set of endmembers by calculating the SA for all possible pixel vector pairs and discarding those pixels which result in angle values below the angle threshold.
6) Fully constrained spectral unmixing. The master broadcasts the set of endmembers to all the workers, and each worker locally computes a fully constrained abundance estimation for each pixel in its local partition. After the workers have computed their estimations locally, the master simply gathers the individual abundance estimation results.

To conclude this subsection, it is important to emphasize that several steps of this algorithm are purely sequential, meaning that the master node performs them on its own. Nevertheless, the execution time of these purely sequential steps is insignificant in comparison to the total execution time (i.e., less than 1%). Moreover, as shown by the algorithm description, some communication steps between master and workers are required. However, the impact of communications was not particularly significant in our application, because most of the computations involved in endmember extraction and abundance estimation can be performed independently at each worker without additional memory management. In turn, other applications of cluster computing in remote sensing may have different results depending upon such issues
as degree of parallelization, amount of communication overhead in algorithms, and load balancing strategies used.

534 IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 4, NO. 3, SEPTEMBER 2011

B. Heterogeneous Implementation

In order to balance the load of the processors in a heterogeneous parallel environment, each processor should execute an amount of work that is proportional to its speed [17]. Therefore, the parallel algorithm in Section IV-A needs to be adapted for efficient execution on heterogeneous computing environments. Two major goals of data partitioning in heterogeneous networks are [18]: 1) to obtain an appropriate set of workload fractions that best fit the heterogeneous environment, and 2) to translate the chosen set of values into a suitable decomposition of the total workload, taking into account the properties of the heterogeneous system. To accomplish these goals, we have developed a workload estimation algorithm (WEA) for heterogeneous networks that assumes that the workload of each processor must be directly proportional to its local memory and inversely proportional to its cycle-time. Below, we provide a description of WEA, which replaces the data partitioning step in the parallel algorithm described in Section IV-A. Steps 2–6 of the parallel algorithm in Section IV-A would be executed immediately after WEA. The input to WEA is a hyperspectral data cube, and the output is a set of spatial-domain heterogeneous partitions of that cube:

1) Obtain necessary information about the heterogeneous system, including the number of workers, each processor’s identification number, and the processor cycle-times.
2) Set the initial workload fraction of every processor. In other words, this step first approximates the workload fractions so that the amount of work assigned to each processing node is proportional to its speed.
3) Iteratively increment some workload fraction until the set of workload fractions best approximates the total workload to be completed, i.e., for each remaining unit of workload, find the processor that satisfies the selection condition, and then increment its workload fraction.
4) Produce partitions of the input hyperspectral data set, so that the spectral channels corresponding to the same pixel vector are never stored in different partitions. To achieve this goal, we have adopted a methodology which consists of three main steps:
a) The hyperspectral data set is first partitioned, using spatial-domain decomposition, into a set of vertical slabs which retain the full spectral information at the same partition. The number of rows in each slab is set to be proportional to the estimated workload fractions, assuming that no upper bound exists on the number of pixel vectors that can be stored by the local memory at the considered node.
b) For each processor, check if the number of pixel vectors assigned to it is greater than the upper bound. For all the processors whose upper bounds are exceeded, assign them a number of pixels equal to their upper bounds. Now, we solve the partitioning problem of a set with the remaining pixel vectors over the remaining processors. We recursively apply this procedure until all pixel vectors in the input data have been assigned, thus obtaining an initial workload distribution for each processor. It should be noted that, with the proposed algorithm description, it is possible that all processors exceed their upper bounds. This situation was never observed in our experiments. However, if the considered network includes processing units with low memory capacity, this situation could be handled by allocating an amount of data equal to the upper bound to those processors, and then processing the remaining data as an offset in a second algorithm iteration.
c) Iteratively recalculate the workload assigned to each processor using expression (8), which involves the set of neighbors of each processing node and the workload of each node (i.e., the number of pixel vectors assigned to this processor) after each iteration. This scheme has been demonstrated in previous work to converge to an average workload [51].

The parallel heterogeneous algorithm has been implemented using the C++ programming language with calls to standard message passing interface (MPI) (footnote 10) library functions (also used in our cluster-based implementation).

C. FPGA Implementation

Our strategy for implementation of the hyperspectral unmixing chain in reconfigurable hardware is aimed at enhancing replicability and reusability of slices in FPGA devices through the utilization of systolic array design [14]. Fig. 2 describes our systolic architecture. Here, local results remain static at each processing element, while the pixel vectors are input to the systolic array from top to bottom. Similarly, the skewers are fed to the systolic array from left to right. In Fig. 2, asterisks represent delays. The dot-product processing nodes in Fig. 2 perform the individual products for the skewer projections. On the other hand, the maxima and minima nodes respectively compute the maxima and minima projections after the dot product calculations have been completed. In fact, the maxima and minima nodes avoid broadcasting the pixel while simplifying the collection of the results. Based on the systolic array described above (which also allows implementation of the fully constrained spectral unmixing stage), we have implemented the full hyperspectral unmixing chain using the very high speed integrated circuit hardware description language (VHDL) (footnote 11) for the specification of the systolic array. Further, we have used the Xilinx ISE environment and the Embedded Development Kit (EDK) environment (footnote 12) to

10 http://www.mcs.anl.gov/mpi
11 http://www.vhdl.org
12 http://www.xilinx.com/ise/embedded/edk_pstudio.htm
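The skewer projections that the systolic array computes (dot products followed by maxima and minima) are the same quantities used in the extreme-projections step of Section IV-A. The following is a minimal sequential sketch of that computation for one partition, not the authors' VHDL or MPI code. The function names, the use of Gaussian sampling to draw random directions, and the tie-breaking rule (the first pixel wins) are all assumptions made for illustration.

```python
import math
import random


def random_unit_vector(num_bands, rng):
    """Draw a random direction and scale it to unit length.

    Assumption: Gaussian components are used so that the direction
    is uniformly distributed; the source does not state the method.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(num_bands)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def extreme_counts(pixels, skewers):
    """Count how often each pixel is an extreme of a skewer projection.

    `pixels` is a list of spectral signatures (lists of floats) from
    one partition; `skewers` is a list of unit vectors of equal length.
    """
    counts = [0] * len(pixels)
    if not pixels:
        return counts
    indices = range(len(pixels))
    for skewer in skewers:
        # Dot product of every pixel vector with the current skewer.
        projections = [sum(p * s for p, s in zip(pixel, skewer))
                       for pixel in pixels]
        # The pixels with the largest and smallest projection are the
        # two extremes for this skewer (ties go to the first index).
        counts[max(indices, key=projections.__getitem__)] += 1
        counts[min(indices, key=projections.__getitem__)] += 1
    return counts
```

A typical call would build the skewers with `rng = random.Random(0)` and `[random_unit_vector(num_bands, rng) for _ in range(K)]`, where `K` is a hypothetical skewer count, and then pass each worker's local pixels to `extreme_counts`. The per-partition counts are what each worker would send back to the master in the candidate-selection step.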