Christian Lessig


Eigenvalue Computation using Bisection

processing step. This reduces divergence of the computations and therefore improves efficiency (cf. [10, Chapter 5.1.1.2]).

Eigenvalue Count Computation

At the beginning of this report we introduced Algorithm 1 to compute the eigenvalue count C(x). Although the algorithm is correct in a mathematical sense, it is not monotonic when computed numerically. It is therefore possible for the algorithm to return eigenvalue counts C(x_1) > C(x_2) although x_1 < x_2. Clearly, this will lead to problems in practice. In our implementation we therefore employ Algorithm FLCnt_IEEE from the paper by Demmel et al. [6], which is guaranteed to be monotonic.

Data Representation

Algorithm 3 requires storing the active intervals and the non-zero elements of the input matrix. Each interval is represented by its left and right bounds and the eigenvalue counts at these bounds. The input matrix can be represented by two vectors containing the main diagonal and the first upper (or lower) diagonal, respectively.

Summarizing the guidelines from the CUDA programming guide [10], to obtain optimal performance on an NVIDIA compute device it is important to represent data so that

• (high-latency) data transfers to global memory are minimized,
• uncoalesced (non-aligned) data transfers to global memory are avoided, and
• shared memory is employed as much as possible.

We would therefore like to perform all computations entirely in shared memory and registers. Slow global memory access would then only be necessary at the beginning of the computations to load the data and at the end to store the result. The limited size of shared memory – devices with compute capability 1.x have 16 KB, which corresponds to 4096 32-bit variables – unfortunately makes this impossible. For matrices with more than 2048 × 2048 elements, shared memory would not even be sufficient to store the matrix representation. We therefore store only the active intervals in shared memory; the two vectors representing the input matrix are loaded from global memory whenever the eigenvalue count has to be computed in Step 3.1.b.

Shared memory is not only limited in size but also restrictive in that it is shared only among the threads of a single thread block. With a maximum of 512 threads per block, and using one thread to perform Step 3.1 for each interval, we are thus limited to matrices with at most 512 × 512 elements. Although such an implementation is very efficient, it defeats our goal of an algorithm that can process matrices of arbitrary size. An extension of Algorithm 3 to arbitrary size matrices will therefore be discussed in the next section. For the remainder of this section, however, we restrict ourselves to the simpler case of matrices with at most 512 eigenvalues.

For simplicity, in practice the list L, containing the information about the active intervals, is represented by four separate arrays:

__shared__ float s_left[MAX_INTERVALS_BLOCK];
__shared__ float s_right[MAX_INTERVALS_BLOCK];

July 2012
