The Hierarchically Tiled Arrays Programming Approach

…the 6 × 6 matrix is distributed on a 2 × 2 mesh of processors, as the last parameter of the HTA constructor indicates. In the future we plan to offer more mappings and a representation for hierarchical organizations of processors.

3. PARALLEL PROGRAMMING USING HTAS

In this section we illustrate the use of HTAs with five simple code examples.

3.1 Sparse Matrix-Vector Product

Our first example, sparse matrix-vector product (Fig. 4), illustrates the expressivity and simplicity of our parallel programming approach. This code multiplies a sparse matrix MX by a dense vector V using P processors.

    a = hta(MX, {dist}, [P 1]);
    b = hta(P, 1, [P 1]);
    b{:} = V;
    r = a * b;

    Figure 4: Sparse matrix-vector product.

We begin by distributing the contents of the sparse matrix MX in chunks of rows into an HTA a by calling an HTA constructor. The P servers handling the HTA are organized into a single column. We rely on the dist argument to distribute the array MX in such a way that it results in a uniform computational load across the servers.

Next we create an empty HTA b distributed across all processors. We assign the vector V to each of the tiles in HTA b (b{:} = V). With this assignment, the vector V is copied to each tile of the HTA b. Since HTA b is distributed across the P processors, this copy requires that the client broadcast V to all the servers that hold a tile of the HTA b. Notice that since HTAs a and b have the same number of tiles and are mapped to the same processor mesh, each processor holding a tile of a will now hold a copy of V too.

The multiplication itself is in the last line of the code, where the binary operator * is invoked on a and b. The effect is that corresponding tiles of a and b, which are located in the same server, are multiplied, giving rise to a distributed matrix-vector multiply. The result is an HTA r, distributed across the servers with the same mapping as the inputs. This HTA can be flattened back into a vector containing the result of the multiplication of a by b by using the r(:) notation.

The code completely hides the fact that MX is sparse because MATLAB™ provides the very same syntax for dense and sparse computations, a feature our HTA class implementation in MATLAB™ has inherited.

3.2 Cannon's Algorithm for Matrix Multiplication

While the previous example only required communication between the client, which executes the main thread, and each individual server, Cannon's matrix multiplication algorithm [4] is an example of code that also requires communication between the servers. The algorithm has O(n) time complexity and uses O(n²) processors (servers). In our implementation of the algorithm, the operands, denoted a and b respectively, are HTAs tiled in two dimensions which are mapped onto a mesh of n × n processors.

In each iteration of the algorithm's main loop, each server executes a matrix multiplication of the tiles of a and b that currently reside on that server. The result of the multiplication is accumulated in a (local) tile of the result HTA, c. After the computation, the tiles of a and b are circularly shifted as follows: the tiles of b are shifted along the first dimension and the tiles of a are shifted along the second dimension. The effect of this operation is that the tiles of a are sent to the left processor in the mesh and the tiles of b are sent to the upper processor in the mesh. The left-most processor transfers its tile of a to the right-most processor in its row, and the bottom-most processor transfers its tile of b to the top-most processor in its column. At the end of n iterations each server holds the correct value for its associated tile in the output HTA c = a * b. Fig. 5 shows the main loop of Cannon's algorithm using HTAs.

    for i = 1:n
        c = c + a * b;
        a = circshift(a, [0, -1]);
        b = circshift(b, [-1, 0]);
    end

    Figure 5: Main loop in Cannon's algorithm.
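The text above describes the behaviour of each iteration but not how a, b and c are set up before the loop. The sketch below is a hedged illustration of the standard initial skew of Cannon's algorithm [4], written with the HTA operations already used in this section; the tile size p, the constructor call that creates c, and the per-row and per-column indexed shifts are assumptions made for illustration, not code from the paper.

    % Possible set-up before the loop of Fig. 5 (a sketch, not the paper's code).
    % a and b are n x n tiled HTAs whose tiles are p x p blocks, mapped onto
    % an n x n mesh of servers; c accumulates the result with the same tiling.
    c = hta(n, n, [n n]);
    c{:,:} = zeros(p);                            % zero block in every tile
    % Initial alignment: rotate row i of a's tiles left by i-1 positions and
    % column i of b's tiles up by i-1 positions, so that after the skew each
    % server holds a pair of tiles whose product contributes to its own c tile.
    for i = 2:n
        a{i,:} = circshift(a{i,:}, [0, -(i-1)]);
        b{:,i} = circshift(b{:,i}, [-(i-1), 0]);
    end

After this alignment, the n iterations of the loop in Fig. 5 accumulate the full product in c; a final unskewing pass, not shown here, would be needed if a and b had to be restored to their original layout.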
3.3 Jacobi Relaxation

Referencing arbitrary elements of HTAs results in complex communication patterns. The blocked Jacobi relaxation code in Fig. 6 computes the new value of each element as the average of its four neighbors. Each block of d × d elements of the input matrix is represented by a tile of the HTA v. In addition, the tiles also contain extra rows and columns for use as border regions for exchanging information with the neighbors. As a result, each tile has d+2 rows and d+2 columns. This situation is depicted in Fig. 7, which shows an example 3 × 3 HTA v with d = 3. The inner part of each tile, surrounded by a dotted line in the picture, contains the data from the input matrix. The first and last row and column in each tile are the shadows that will receive data from the internal tile of the neighbors in order to calculate locally the average of the four neighbors for each element. This exchange of shadows is executed in the first four statements of the main loop and it is also illustrated in Fig. 7, which shows the execution of the statement v{2:n,:}(1,:) = v{1:n-1,:}(d+1,:). As we see, by using HTA addressing, communication and computation can be easily identified by looking at the tiled indexes. (A sketch of one possible main loop is shown below, after Section 3.4.)

In this case the flattened version of the HTA v does not quite represent the desired end result, because of the existence of the border exchange regions. However, the desired matrix can be obtained by first removing the border regions and applying the flattening operator afterward: (v{:,:}(2:d+1,2:d+1))(:,:).

3.4 Embarrassingly Parallel Programs

Easy programming of embarrassingly parallel and MIMD-style codes is also possible using HTAs thanks to the parHTAFunc…
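Fig. 6 itself is not reproduced in this excerpt, but the description in Section 3.3 determines most of its structure. The sketch below shows one possible main loop consistent with that description; it is not the paper's actual figure. The first shadow-exchange statement is the one quoted verbatim in the text, the other three are its symmetric counterparts, and the auxiliary HTA u, the iteration count niter and the exact form of the averaging expression are assumptions made for illustration.

    % Sketch of a blocked Jacobi main loop in the style of Fig. 6 (assumptions
    % noted above).  v is an n x n tiled HTA whose tiles are (d+2) x (d+2);
    % rows/columns 1 and d+2 of every tile are shadow regions, and u is an
    % auxiliary HTA with the same tiling.
    for iter = 1:niter
        v{2:n,:}(1,:)     = v{1:n-1,:}(d+1,:);    % shadow row from the tile above
        v{1:n-1,:}(d+2,:) = v{2:n,:}(2,:);        % shadow row from the tile below
        v{:,2:n}(:,1)     = v{:,1:n-1}(:,d+1);    % shadow column from the tile to the left
        v{:,1:n-1}(:,d+2) = v{:,2:n}(:,2);        % shadow column from the tile to the right
        % four-neighbor average on the interior d x d part of every tile
        u{:,:}(2:d+1,2:d+1) = 0.25 * (v{:,:}(1:d,2:d+1) + v{:,:}(3:d+2,2:d+1) ...
                                    + v{:,:}(2:d+1,1:d) + v{:,:}(2:d+1,3:d+2));
        v{:,:}(2:d+1,2:d+1) = u{:,:}(2:d+1,2:d+1);
    end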
