Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
Compile-time Loop Splitting for Distributed Memory ... - Stanford AI Lab
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Perf. Improvement<br />
Perf. Improvement<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
|<br />
0<br />
1<br />
|<br />
0<br />
1<br />
|<br />
|<br />
|<br />
|<br />
|<br />
�� �� ��<br />
�<br />
�<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
�� �<br />
�� �<br />
|<br />
|<br />
|<br />
|<br />
2<br />
|<br />
2<br />
|<br />
4<br />
|<br />
4<br />
|<br />
8<br />
|<br />
8<br />
� Rational<br />
� Interval<br />
��<br />
|<br />
16<br />
� Rational, opt<br />
� Interval, opt<br />
�<br />
�<br />
� �<br />
� �<br />
�<br />
�<br />
|<br />
16<br />
� ��<br />
�<br />
|<br />
32<br />
|<br />
32<br />
|<br />
64<br />
# Processors<br />
|<br />
64<br />
# Processors<br />
|<br />
128<br />
Matrix Multiply 40x40<br />
|<br />
128<br />
Matrix Multiply 40x40<br />
Perf. Improvement<br />
Perf. Improvement<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
8<br />
7<br />
6<br />
5<br />
4<br />
3<br />
2<br />
1<br />
|<br />
0<br />
1<br />
|<br />
0<br />
1<br />
|<br />
|<br />
|<br />
|<br />
|<br />
�<br />
� �<br />
�<br />
� �<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
�<br />
|<br />
�<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
|<br />
2<br />
� �<br />
�<br />
�<br />
|<br />
2<br />
|<br />
4<br />
|<br />
4<br />
� ��<br />
�<br />
|<br />
8<br />
�<br />
� �<br />
�<br />
|<br />
8<br />
� Rational<br />
� Interval<br />
|<br />
16<br />
� Rational, opt<br />
� Interval, opt<br />
|<br />
16<br />
� ��<br />
�<br />
|<br />
32<br />
� ��<br />
|<br />
32<br />
|<br />
64<br />
# Processors<br />
|<br />
128<br />
Matrix Transposition 100x50<br />
�<br />
|<br />
64<br />
# Processors<br />
|<br />
128<br />
Matrix Transposition 100x50<br />
Figure 4-2: The per<strong>for</strong>mance improvement in multiplication of 40x40 matrices and transposition<br />
of a 100x50 matrix: without additional compiler optimizations (top) and with the<br />
optimizations (bottom).<br />
44