Christian Lessig
Eigenvalue Computation using Bisection

processing step. This reduces divergence of the computations and therefore improves efficiency (cf. [10, Chapter 5.1.1.2]).
Eigenvalue Count Computation

At the beginning of this report we introduced Algorithm 1 to compute the eigenvalue count C(x). Although the algorithm is correct in a mathematical sense, it is not monotonic when computed numerically. It is therefore possible for the algorithm to return eigenvalue counts with C(x_1) > C(x_2) although x_1 < x_2. Clearly, this will lead to problems in practice. In our implementation we therefore employ Algorithm FLCnt_IEEE from the paper by Demmel et al. [6], which is guaranteed to be monotonic.
Data Representation

The data that has to be stored for Algorithm 3 are the active intervals and the non-zero elements of the input matrix. Each interval is represented by its left and right interval bounds and the eigenvalue counts for the bounds. The input matrix can be represented by two vectors containing the main and the first upper (or lower) diagonal, respectively.
Summarizing the guidelines from the CUDA programming guide [10], to obtain optimal performance on an NVIDIA compute device it is important to represent data so that

- (high-latency) data transfers to global memory are minimized,
- uncoalesced (non-aligned) data transfers to global memory are avoided, and
- shared memory is employed as much as possible.
We would therefore like to perform all computations entirely in shared memory and registers. Slow global memory access would then only be necessary at the beginning of the computations to load the data and at the end to store the result. The limited size of shared memory – devices with compute capability 1.x have 16 KB, which corresponds to 4096 32-bit variables – unfortunately makes this impossible. For matrices with more than 2048 × 2048 elements, shared memory would not even be sufficient to store the matrix representation. We therefore store only the active intervals in shared memory, and the two vectors representing the input matrix are loaded from global memory whenever the eigenvalue count has to be computed in Step 3.1.b.
Shared memory is not only limited in its size but also restrictive in that it is shared only among the threads of a single thread block. With a maximum of 512 threads per block, and with one thread performing Step 3.1 for each interval, we are thus limited to matrices with at most 512 × 512 elements. Although such an implementation is very efficient, it defeats our goal of an algorithm that can process arbitrary size matrices. An extension of Algorithm 3 to arbitrary size matrices will therefore be discussed in the next section. For the remainder of this section, however, we will restrict ourselves to the simpler case of matrices with at most 512 eigenvalues.
For simplicity, in practice the list L, containing the information about the active intervals, is represented by four separate arrays:

__shared__ float s_left[MAX_INTERVALS_BLOCK];
__shared__ float s_right[MAX_INTERVALS_BLOCK];
July 2012