11.07.2015 Views

A Compiler for Parallel Execution of Numerical Python Programs on ...



Thus we have arrived at a contradiction and proved that for RCSLMADs, each d-tuple in the domain D maps to a distinct location in memory.

Theorem 2. The size S of the set represented by the RCSLMAD L is given by $S = \prod_{k=1}^{d} u_k$.

Proof. The size S of the set is equal to the size of the domain D because each element in the domain maps to a distinct memory location. The size S can simply be computed as a product of the upper bounds because the domain is rectangular.

A loop nest may have multiple RCSLMADs and they may overlap at one or more memory locations. If each RCSLMAD is transferred independently, then the same memory location on the CPU may be copied to multiple copies on the GPU. If there is a read/write dependence involving a memory location, then creating multiple copies on the GPU is cumbersome because the copies must be kept consistent. A better solution is to compute a set union of the RCSLMADs and then transfer the union to the GPU. This way there will be exactly one copy of M on the GPU. The compiler also ensures that no other CPU thread writes to the memory locations copied to the GPU before the GPU execution and data transfers are complete.
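The size result and the overlap problem can be illustrated with a small sketch. It assumes, for illustration only, that an RCSLMAD is described by a base address, one stride per dimension, and one upper bound u_k per dimension, with indices running from 0 to u_k - 1; the function name `addresses` and the example descriptors are hypothetical and not taken from the source.

```python
from itertools import product
from math import prod

def addresses(base, strides, bounds):
    """Enumerate the memory locations of a descriptor (assumed form:
    base + sum(stride_k * i_k) with 0 <= i_k < bounds[k])."""
    return {
        base + sum(s * i for s, i in zip(strides, idx))
        for idx in product(*(range(u) for u in bounds))
    }

# Two hypothetical 2-D descriptors over a row-major array of width 8.
# Each touches a 3 x 4 block; the second block is shifted by 2 columns.
a = addresses(0, (8, 1), (3, 4))
b = addresses(2, (8, 1), (3, 4))

# Distinct index tuples map to distinct addresses here, so the size
# equals the product of the upper bounds, as in Theorem 2.
size_a = len(a)
expected = prod((3, 4))

# Transferring a and b separately would duplicate the overlapping
# locations; the set union holds each location exactly once.
overlap = a & b
union = a | b
```

In this example each descriptor covers 12 locations, 6 of which are shared, so independent transfers would move 24 elements while the union contains only 18.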
For correctness, it is not necessary to compute an exact union. If the union is represented by U, then it is sufficient to compute a superset of U. We now present the problem statement, including the design criteria of an analysis to compute the union:

Problem Statement 1. Given a list $\{L_1, L_2, \ldots, L_n\}$ of RCSLMADs defined over a d-dimensional domain D:

1. Compute the set M of memory locations to be transferred to the GPU. The set M should be a superset of the union U of RCSLMADs. The number of elements in $M - U$ should be as small as possible. M should be easily representable by the compiler.
2. Compute the size S of the set M.
3. Construct a map F with domain M. The range R of F should be a set of valid memory locations in the GPU address space. If $F(m)$, $m \in M$, represents the address of the memory location m on the GPU, then F should be such that $F(m_1) \neq F(m_2)$ if $m_1 \neq m_2$. The range R should be as small as possible.
   The computational complexity and memory requirement for computing F should be as low as possible.

5.2 General solution

Consider the case where the list contains a single RCSLMAD L defined over a domain D. Without loss of generality, assume that the ordering function of L is the identity function. In such a case, an exact solution for Problem 1 can be calculated by Algorithm 1.
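To make the three outputs of Problem Statement 1 concrete, the following sketch computes M, S, and F using a deliberately naive choice: M is the smallest contiguous address interval covering the union U. This is not Algorithm 1 and not the author's method; it is only a baseline that satisfies the correctness requirements while showing the cost measured by the size of M - U. The descriptor form (base, strides, bounds) and all function names are assumptions made for illustration.

```python
from itertools import product

def addresses(base, strides, bounds):
    """Locations of a descriptor, assuming the form
    base + sum(stride_k * i_k) with 0 <= i_k < bounds[k]."""
    return {
        base + sum(s * i for s, i in zip(strides, idx))
        for idx in product(*(range(u) for u in bounds))
    }

def bounding_interval_transfer(descriptors):
    """Naive superset: M = [lo, hi], the interval spanning the union U."""
    union = set().union(*(addresses(*d) for d in descriptors))
    lo, hi = min(union), max(union)
    size = hi - lo + 1            # S, the number of elements in M
    wasted = size - len(union)    # |M - U|, elements copied needlessly

    def f(m):
        # F maps a CPU location in M to an offset in a GPU buffer of
        # length `size`; distinct m give distinct offsets.
        if not lo <= m <= hi:
            raise ValueError("location is outside M")
        return m - lo

    return lo, size, f, wasted

# Hypothetical descriptors: two overlapping 3 x 4 blocks, row width 8.
descs = [(0, (8, 1), (3, 4)), (2, (8, 1), (3, 4))]
lo, size, f, wasted = bounding_interval_transfer(descs)
```

For these descriptors the interval holds 22 locations while the union holds 18, so 4 locations are transferred unnecessarily. The map F is a single subtraction, which keeps its cost low, but the gap between M and U can grow large for widely strided accesses, which is the trade-off the design criteria ask an analysis to minimize.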
