
as degree of parallelization, amount of communication overhead in algorithms, and load balancing strategies used.

B. Heterogeneous Implementation

In order to balance the load of the processors in a heterogeneous parallel environment, each processor should execute an amount of work that is proportional to its speed [17]. Therefore, the parallel algorithm in Section IV-A needs to be adapted for efficient execution on heterogeneous computing environments. Two major goals of data partitioning in heterogeneous networks are [18]: 1) to obtain an appropriate set of workload fractions $\{\alpha_i\}$ that best fit the heterogeneous environment, and 2) to translate the chosen set of values into a suitable decomposition of the total workload $W$, taking into account the properties of the heterogeneous system. In order to accomplish these goals, we have developed a workload estimation algorithm (WEA) for heterogeneous networks that assumes that the workload of each processor must be directly proportional to its local memory and inversely proportional to its cycle-time. Below, we provide a description of WEA, which replaces the data partitioning step in the parallel algorithm described in Section IV-A. Steps 2–6 of the parallel algorithm in Section IV-A would be executed immediately after WEA. The input to WEA is a hyperspectral data cube $\mathbf{F}$, and the output is a set of spatial-domain heterogeneous partitions of $\mathbf{F}$:

1) Obtain the necessary information about the heterogeneous system, including the number of workers $P$, each processor's identification number $i \in \{1, \ldots, P\}$, and the processor cycle-times $\{w_i\}_{i=1}^{P}$.

2) Set $\alpha_i = \lfloor W / (w_i \sum_{j=1}^{P} 1/w_j) \rfloor$ for all $i \in \{1, \ldots, P\}$. In other words, this step first approximates the $\alpha_i$ so that the amount of work assigned to each processing node is proportional to its speed and $\sum_{i=1}^{P} \alpha_i \leq W$ for all processors.

3) Iteratively increment some $\alpha_i$ until the set of $\{\alpha_i\}_{i=1}^{P}$ best approximates the total workload $W$ to be completed, i.e., for $m = \sum_{i=1}^{P} \alpha_i$ to $W$, find $k \in \{1, \ldots, P\}$ so that $w_k \cdot (\alpha_k + 1) = \min\{w_i \cdot (\alpha_i + 1)\}_{i=1}^{P}$, and then set $\alpha_k = \alpha_k + 1$ (a software sketch of steps 2 and 3 is given after this list).

4) Produce $P$ partitions of the input hyperspectral data set, so that the spectral channels corresponding to the same pixel vector are never stored in different partitions. In order to achieve this goal, we have adopted a methodology which consists of three main steps:

a) The hyperspectral data set is first partitioned, using spatial-domain decomposition, into a set of vertical slabs which retain the full spectral information in the same partition. The number of rows in each slab is set to be proportional to the estimated values of $\alpha_i$, assuming that no upper bound exists on the number of pixel vectors that can be stored in the local memory of the considered node.

b) For each processor $i$, check whether the number of pixel vectors assigned to it is greater than its upper bound. All the processors whose upper bounds are exceeded are assigned a number of pixels equal to their upper bounds. We then solve the partitioning problem for the set of remaining pixel vectors over the remaining processors, and we recursively apply this procedure until all pixel vectors in the input data have been assigned, thus obtaining an initial workload distribution $W_i^{(0)}$ for each processor $i$. It should be noted that, with the proposed algorithm description, it is possible that all processors exceed their upper bounds. This situation was never observed in our experiments. However, if the considered network includes processing units with low memory capacity, it could be handled by allocating an amount of data equal to the upper bound to those processors, and then processing the remaining data as an offset in a second algorithm iteration.

c) Iteratively recalculate the workload assigned to each processor using the following expression:

$$W_i^{(t+1)} = \frac{W_i^{(t)} + \sum_{j \in N(i)} W_j^{(t)}}{1 + |N(i)|} \qquad (8)$$

where $N(i)$ denotes the set of neighbors of processing node $i$, and $W_i^{(t)}$ denotes the workload of processor $i$ (i.e., the number of pixel vectors assigned to this processor) after the $t$-th iteration. This scheme has been demonstrated in previous work to converge to an average workload [51].
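To make steps 2 and 3 of WEA concrete, the following C++ fragment sketches how the speed-proportional allocation could be coded. It is an illustrative reconstruction rather than the authors' implementation: the function name estimateWorkloads, the use of double-precision cycle-times, and the omission of the memory upper bounds handled in step 4b are assumptions.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Illustrative sketch (not the authors' code) of WEA steps 2 and 3:
// given processor cycle-times w[i] and the total workload W (number of
// pixel vectors), compute workload fractions alpha[i] proportional to
// processor speed (1 / w[i]) such that their sum equals W exactly.
std::vector<long> estimateWorkloads(const std::vector<double>& w, long W) {
    const std::size_t P = w.size();
    std::vector<long> alpha(P, 0);

    // Step 2: alpha_i = floor( W / (w_i * sum_j 1/w_j) ), so that the work
    // assigned to each node is proportional to its speed and
    // sum_i alpha_i <= W.
    double invSpeedSum = 0.0;
    for (std::size_t j = 0; j < P; ++j) invSpeedSum += 1.0 / w[j];
    long assigned = 0;
    for (std::size_t i = 0; i < P; ++i) {
        alpha[i] = static_cast<long>(std::floor(W / (w[i] * invSpeedSum)));
        assigned += alpha[i];
    }

    // Step 3: hand out the remaining W - assigned pixel vectors one at a
    // time, always to the processor k that minimizes w_k * (alpha_k + 1),
    // i.e., the one that would finish its incremented share soonest.
    for (long m = assigned; m < W; ++m) {
        std::size_t k = 0;
        for (std::size_t i = 1; i < P; ++i) {
            if (w[i] * (alpha[i] + 1) < w[k] * (alpha[k] + 1)) k = i;
        }
        ++alpha[k];
    }
    return alpha;
}
```

With such an allocation, the number of rows in each vertical slab of step 4a is simply chosen proportional to alpha[i] for each worker.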

The parallel heterogeneous algorithm has been implemented using the C++ programming language with calls to standard message passing interface (MPI)¹⁰ library functions (also used in our cluster-based implementation).
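Since the text only states that standard MPI library functions are used, the following fragment is a hypothetical illustration of how the heterogeneous partitions could be distributed from a root node; the names distributePartitions, sendCounts, displs, and samplesPerPixel, as well as the root-based scatter pattern, are assumptions and not details taken from the paper.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical distribution of spatial-domain partitions: the root holds
// the full cube as a flat float array (pixel vectors stored contiguously)
// and sends alpha[i] * samplesPerPixel floats to worker i via MPI_Scatterv.
void distributePartitions(const float* cube, const std::vector<long>& alpha,
                          int samplesPerPixel, std::vector<float>& localSlab,
                          MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<int> sendCounts(size), displs(size);
    int offset = 0;
    for (int i = 0; i < size; ++i) {
        sendCounts[i] = static_cast<int>(alpha[i]) * samplesPerPixel;
        displs[i] = offset;
        offset += sendCounts[i];
    }

    localSlab.resize(sendCounts[rank]);
    // Whole pixel vectors travel together, so no pixel is ever split
    // across partitions (the requirement stated in step 4 of WEA).
    MPI_Scatterv(cube, sendCounts.data(), displs.data(), MPI_FLOAT,
                 localSlab.data(), sendCounts[rank], MPI_FLOAT,
                 /*root=*/0, comm);
}
```

MPI_Scatterv (rather than MPI_Scatter) fits this setting because the per-worker counts differ in a heterogeneous network.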

C. FPGA Implementation

Our strategy for implementation of the hyperspectral unmixing chain in reconfigurable hardware is aimed at enhancing replicability and reusability of slices in FPGA devices through the utilization of a systolic array design [14]. Fig. 2 describes our systolic architecture. Here, local results remain static at each processing element, while the pixel vectors (each with one component per spectral band) are input to the systolic array from top to bottom. Similarly, the skewers, with the same number of components, are fed to the systolic array from left to right. In Fig. 2, asterisks represent delays. The dot-product nodes in Fig. 2 perform the individual products for the skewer projections. On the other hand, the max and min nodes respectively compute the maxima and minima of the projections after the dot product calculations have been completed. In fact, the max and min nodes avoid broadcasting the pixel while simplifying the collection of the results.
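As a point of reference for what the hardware computes, the following C++ loop nest emulates the systolic array's arithmetic in software: each pixel vector is projected onto each skewer via a dot product, and the maximum and minimum projection values (the quantities produced by the max and min nodes) are tracked per skewer. The names and data layout are illustrative assumptions; the real design streams these operands through the processing elements rather than looping over them in memory.

```cpp
#include <vector>
#include <limits>
#include <cstddef>

struct Extrema { double maxProj, minProj; std::size_t argMax, argMin; };

// Software emulation (for reference only) of the projections computed by
// the systolic array: pixels[p] and skewers[k] are vectors with one
// component per spectral band; dot-product nodes form the projections,
// max/min nodes keep the extreme projections and their pixel indices.
std::vector<Extrema> projectOntoSkewers(
        const std::vector<std::vector<double>>& pixels,
        const std::vector<std::vector<double>>& skewers) {
    std::vector<Extrema> out(skewers.size(),
        {-std::numeric_limits<double>::infinity(),
          std::numeric_limits<double>::infinity(), 0, 0});

    for (std::size_t k = 0; k < skewers.size(); ++k) {
        for (std::size_t p = 0; p < pixels.size(); ++p) {
            double proj = 0.0;
            for (std::size_t d = 0; d < skewers[k].size(); ++d)
                proj += pixels[p][d] * skewers[k][d];        // dot-product node
            if (proj > out[k].maxProj) { out[k].maxProj = proj; out[k].argMax = p; }  // max node
            if (proj < out[k].minProj) { out[k].minProj = proj; out[k].argMin = p; }  // min node
        }
    }
    return out;
}
```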

Based on the systolic array described above (which also allows implementation of the fully constrained spectral unmixing stage), we have implemented the full hyperspectral unmixing chain using the very high speed integrated circuit hardware description language (VHDL)¹¹ for the specification of the systolic array. Further, we have used the Xilinx ISE environment and the Embedded Development Kit (EDK) environment¹² to

¹⁰ http://www.mcs.anl.gov/mpi.
¹¹ http://www.vhdl.org.
¹² http://www.xilinx.com/ise/embedded/edk_pstudio.htm.

