Christian Lessig
Eigenvalue Computation using Bisection

processing step. This reduces divergence of the computations and therefore improves efficiency (cf. [10, Chapter 5.1.1.2]).
Eigenvalue Count Computation

At the beginning of this report we introduced Algorithm 1 to compute the eigenvalue count C(x). Although the algorithm is correct in a mathematical sense, it is not monotonic when computed numerically. It is therefore possible for the algorithm to return eigenvalue counts with C(x_1) > C(x_2) although x_1 < x_2. Clearly, this will lead to problems in practice. In our implementation we therefore employ Algorithm FLCnt_IEEE from the paper by Demmel et al. [6], which is guaranteed to be monotonic.
Data Representation

The data that has to be stored for Algorithm 3 are the active intervals and the non-zero elements of the input matrix. Each interval is represented by its left and right interval bounds and the eigenvalue counts for the bounds. The input matrix can be represented by two vectors containing the main and the first upper (or lower) diagonal, respectively.
Summarizing the guidelines from the CUDA programming guide [10], to obtain optimal performance on an NVIDIA compute device it is important to represent data so that

- (high-latency) data transfers to global memory are minimized,
- uncoalesced (non-aligned) data transfers to global memory are avoided, and
- shared memory is employed as much as possible.
We would therefore like to perform all computations entirely in shared memory and registers. Slow global memory access would then only be necessary at the beginning of the computations to load the data and at the end to store the result. The limited size of shared memory – devices with compute capability 1.x have 16 KB, which corresponds to 4096 32-bit variables – unfortunately makes this impossible. For matrices with more than 2048 × 2048 elements, shared memory would not even be sufficient to store the matrix representation. We therefore store only the active intervals in shared memory, and the two vectors representing the input matrix are loaded from global memory whenever the eigenvalue count has to be computed in Step 3.1.b.
Shared memory is not only limited in its size but also restrictive in that it is shared only among the threads of a single thread block. With a maximum of 512 threads per block, and with one thread performing Step 3.1 for each interval, we are thus limited to matrices with at most 512 × 512 elements. Although such an implementation is very efficient, it defeats our goal of an algorithm that can process arbitrary size matrices. An extension of Algorithm 3 to arbitrary size matrices will therefore be discussed in the next section. For the remainder of this section, however, we will restrict ourselves to the simpler case of matrices with at most 512 eigenvalues.
For simplicity, in practice the list L, containing the information about the active intervals, is represented by four separate arrays:

__shared__ float s_left[MAX_INTERVALS_BLOCK];
__shared__ float s_right[MAX_INTERVALS_BLOCK];
July 2012