Building Source-to-Source Compilers for Heterogeneous Targets
Thesis presented at Télécom Bretagne,
under the seal of the Université Européenne de Bretagne,
to obtain the title of
Docteur de Télécom Bretagne,
in joint accreditation with the Université de Bretagne Occidentale.

Building Source-to-Source Compilers for Heterogeneous Targets

presented by
Serge Guelton

Doctoral school: SICMA
Laboratory: TB/INFO

Thesis defended on 7 October 2011 before a jury composed of:

Albert Cohen, Professor, INRIA
François Irigoin, Professor, MINES ParisTech
Ronan Keryell, Enseignant-chercheur, Télécom Bretagne & HPC Project
Fabrice Lemonnier, Project Manager, Thales Research & Technology
Bernard Pottier, Professor, Université de Bretagne Occidentale
Patrice Quinton, Professor, École Normale Supérieure de Cachan-Bretagne
Sanjay Rajopadhye, Professor, Colorado State University
Eugene Ressler, Professor, United States Military Academy
Order number: 2011telb0203

Under the seal of the Université Européenne de Bretagne
Télécom Bretagne
In joint accreditation with the Université de Bretagne Occidentale
École Doctorale SICMA

Building Source-to-Source Compilers for Heterogeneous Targets

Doctoral Thesis
Field: Information and Communication Sciences and Technologies
Presented by: Serge Guelton
Laboratory: TB/INFO
Thesis director: François Irigoin
Thesis supervisor: Ronan Keryell
Defended on 7 October 2011

Jury:
Albert Cohen, Professor, INRIA
François Irigoin, Professor, MINES ParisTech
Ronan Keryell, Enseignant-chercheur, Télécom Bretagne & HPC Project
Fabrice Lemonnier, Project Manager, Thales Research & Technology
Bernard Pottier, Professor, Université de Bretagne Occidentale
Patrice Quinton, Professor, École Normale Supérieure de Cachan-Bretagne
Sanjay Rajopadhye, Professor, Colorado State University
Eugene Ressler, Professor, United States Military Academy
Abstract

Heterogeneous computers, platforms that combine multiple specialized devices to achieve high throughput or low energy consumption, are difficult to program. Hardware vendors usually provide compilers from a C dialect to their machines, but taking advantage of them frequently requires a complete application rewrite.

In this thesis, we propose a new approach to building bridges between regular applications written in C and these dialects. Starting from an analysis of the hardware constraints, we propose original code transformations that can meet those constraints, and we combine them using a fully programmable pass manager that can handle the complex compilation flows required by code generation for multiple targets. This makes it possible to build a collection of compilers that share the same infrastructure while targeting different architectures.

All code transformations are performed at the source level using the PIPS source-to-source compiler framework. The new transformations are specified using denotational semantics on a simplified language, and they have been combined to build four different compilers: an OpenMP directive generator, a retargetable multimedia instruction generator for SSE, AVX and NEON, an assembly code generator for an FPGA-based image processor, and a CUDA generator.
Résumé

Les machines hétérogènes, des ordinateurs reposant sur la combinaison d'unités de calcul spécialisées pour obtenir des performances élevées et une consommation énergétique moindre, sont difficiles à programmer. Les vendeurs de matériel fournissent généralement un compilateur d'un dialecte du C pour leurs machines, mais il faut dans ce cas réécrire complètement l'application cible pour en profiter.

Dans cette thèse, nous proposons une nouvelle approche pour faire le pont entre des applications classiques écrites en C et ces dialectes. Partant d'une analyse des contraintes imposées par le matériel, nous proposons un ensemble de transformations de code originales qui permettent de répondre à ces contraintes, et nous les combinons à l'aide d'un gestionnaire de passes complètement programmable qui peut gérer les flots de compilation complexes mis en œuvre lors de la génération de code multi-cible. Cela permet d'obtenir un ensemble de compilateurs réutilisant les mêmes éléments de base tout en ciblant des architectures différentes.

Toutes les transformations s'appliquent au niveau source en se basant sur l'infrastructure de compilation PIPS. Les nouvelles transformations proposées sont explicitées en se basant sur la sémantique dénotationnelle et un langage cible simplifié. Elles sont utilisées pour assembler quatre compilateurs différents : un générateur de directives OpenMP, un générateur reciblable d'instructions multimédia pour SSE, AVX et NEON, un générateur de code assembleur pour une machine à base de FPGA spécialisée dans le traitement d'images et un générateur de code CUDA.
Remerciements

Acknowledgments are something better done in one's native tongue, for it is difficult to express the subtlety of feelings in another language…

Cette thèse a été financée par l'Agence Nationale de la Recherche dans le cadre du projet FREIA. Elle a été effectuée en collaboration avec Thales TRT et MINES ParisTech.

Il y a cinq ans de ça, suivant à Grenoble la souriante étudiante qui allait devenir ma femme, j'ai fait mes débuts dans le monde du travail dans une équipe de recherche nommée MESCAL, sous la direction d'un certain Jean-Marc Vincent. Ce chercheur passionné et chaleureux m'a le plus innocemment du monde [1] fait plonger dans un monde merveilleux, plongeon dont je ne ressors qu'après la rédaction de ce manuscrit. Ce sont ces Grenoblois, Thierry, Jean-Louis, Vincent, Frédéric, Arnaud, Bruno et leur joyeuse bande de thésards, Xavier, Sébastien, Maxime, Jean-Noël, Swann, qui m'ont fait pencher du côté lumineux.

Un autre Jean-Louis, rennais celui-là, a su en profiter et équilibrer par la raison la folie qui guette du haut des montagnes grenobloises. Il a pourtant commis une faute irréparable en me jetant dans les rets d'un de mes anciens professeurs bretons, le machiavélique Ronan Keryell, me soustrayant par là même aux saintes intentions d'un ex-thésard grenoblois nouvellement luxembourgeois. Me promettant l'équilibre parfait entre recherche et développement, s'inspirant des démons grenoblois pour mieux m'attirer à lui, il parvint sans peine à me faire franchir la ligne séparant l'ingénierie de la recherche pour me faire commencer la thèse dont ce manuscrit est le fruit.

Que serais-je devenu entre les mains de ce personnage s'il n'avait pas eu l'égarement de me placer sous la direction de François Irigoin ? Il est sûr que ces trois années auraient eu une toute autre saveur, et il m'est encore maintenant difficile de mesurer les bienfaits de sa présence, toujours prompte (bien que parfois inefficace) à contrebalancer les excès d'enthousiasme provoqués par mes travaux.

Ce fut une thèse itinérante, qui a commencé à Rennes, entouré des joyeux symbiotes Dominique, Rayan, Jacques, Pierre et les autres, pour terminer à Brest avec la délicieuse Armelle, les sympathiques Eliya, Zhe, Xu et Jiayi, les indélogeables habitants du D3 128, les joyeux Grégoire, Frédéric, Adrien, Sébastien et les fidèles membres du club de TKD, entrecoupée de séjours bellifontains avec Corinne, Fabien, Pierre, Laurent, Amira… Sans oublier le canal #pipsien et Mehdi, Pierre, Béatrice.

La rédaction de ce manuscrit a tout particulièrement profité des conseils et des relectures avisées de François Irigoin, Ronan Keryell, Béatrice Creusillet, Pierre Jouvelot, Fabien Dagnat et Adrien Guinet. A big thanks to Aimée Johansen for her English advice and for spotting my numerous mistakes. Mes rapporteurs Albert Cohen, Sanjay Rajopadhye et Eugene Ressler n'ont pas hésité à pointer du doigt les nombreux défauts de la version qui leur a été soumise [2] et ont grandement contribué à son amélioration.

Les efforts typographiques consentis à ce document doivent beaucoup à Yannis Haralambous et son Chicago Manual of Style, les atrocités restantes sont l'unique fruit de ma paresse. Le plaisir de la rédaction doit beaucoup au système de composition LaTeX, au paquet TikZ et à l'éditeur de texte Vim.

Mes parents m'ont toujours poussé à faire des études pour garder le plus de portes ouvertes, et m'envoyaient à l'école en me disant « amuse-toi bien ». J'ai essayé de suivre le premier de ces conseils, et il ne fut pas bien difficile de suivre le second.

Et pour les moments difficiles, les baisses de moral, les longues nuits de soumission d'article, les bien plus longs mois de rédaction, pour les sourires radieux, les attentions quotidiennes, 灰太狼十分感谢伟大的红太狼和狼宝宝.

[1] Avec le recul, il y a peut-être là une volonté maligne de conversion de sa part.
[2] Qu'ils soient remerciés pour cette masse de travail supplémentaire :-)
Contents

Remerciements v
Acknowledgments, in French.
Résumé en français 1
Dissertation summary in French.
1 Introduction 21

2 Heterogeneous Computing Paradigm 25
2.1 Heterogeneous Computing Model . . . 26
2.2 Influence on Programming Model . . . 29
2.3 Hardware Constraints . . . 31
2.4 Note About the C Language . . . 34
2.5 OpenCL Programming Model Analysis . . . 37
2.6 Other Programming Models . . . 40
2.7 Conclusion . . . 42
3 Compiler Design for Heterogeneous Architectures 43
3.1 Extending Compiler Infrastructures . . . 44
3.1.1 Existing Compiler Infrastructures . . . 44
3.1.2 A Simple Model for Code Transformations . . . 48
3.1.2.1 Transformations: Definition and Compositions . . . 48
3.1.2.2 Parametric Transformations . . . 50
3.1.2.3 From Model to Implementation . . . 50
3.1.3 Programmable Pass Management . . . 51
3.1.3.1 A Class Hierarchy for Pass Management . . . 51
3.1.3.2 Control Flow and Pass Management . . . 53
3.2 On Source-to-Source Compilers . . . 55
3.2.1 Exploring Source-to-Source Opportunities . . . 55
3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure . . . 57
3.3 PyPS, a High-Level Pass Manager API . . . 58
3.3.1 API Description . . . 58
3.3.2 Usage Example . . . 63
3.4 Related Work . . . 63
3.5 Conclusion . . . 67
4 Representing the Instruction Set Architecture in C 69
4.1 C as a Common Denominator . . . 70
4.1.1 C Dialects and Heterogeneous Computing . . . 70
4.1.2 From the ISA to C . . . 71
4.2 Native Data Types . . . 73
4.2.1 Scalar Types . . . 73
4.2.2 Records . . . 73
4.2.3 Arrays . . . 75
4.3 Registers . . . 76
4.4 Instructions . . . 77
4.4.1 Instruction Selection . . . 79
4.4.2 N-Address Code Generation . . . 80
4.5 Memory Architecture . . . 83
4.6 Function Calls . . . 83
4.6.1 Removing Function Calls . . . 83
4.6.2 Outlining . . . 84
4.6.2.1 Outlining Algorithm . . . 84
4.6.2.2 Using Outlining to Reduce Compilation Time . . . 87
4.7 Library Calls . . . 88
4.8 Conclusion . . . 89
5 Parallelism with Multimedia Instructions 91
5.1 Super-word Level Parallelization . . . 92
5.1.1 Related Work . . . 92
5.1.2 A Meta-Multimedia Instruction Set . . . 97
5.1.2.1 Sequential Implementation . . . 98
5.1.2.2 Target-Specific Implementation . . . 99
5.1.2.3 Conclusion . . . 99
5.1.3 Notations . . . 100
5.1.4 Generation of Optimized SIMD Instructions . . . 101
5.1.4.1 Statement Closeness . . . 101
5.1.4.2 Parametric Vector Instruction Generation Algorithm . . . 101
5.1.5 Pattern Discovery . . . 102
5.1.6 Loop Tiling . . . 102
5.1.7 Combining Loop Vectorization and Super-word Level Parallelism . . . 104
5.2 Reduction Parallelization . . . 106
5.2.1 Reduction Detection Inside a Sequence . . . 106
5.2.2 Delegating to a Library . . . 108
5.3 Computational Intensity . . . 110
5.3.1 Execution Time versus Transfer Time Estimation . . . 110
5.3.2 Limitations of the Model . . . 111
5.4 Conclusion . . . 111
6 Transformations for Memory Size and Distribution 113
6.1 Statement Isolation . . . 114
6.1.1 Formulation of Statement Isolation . . . 114
6.1.1.1 Expression Renaming . . . 116
6.1.1.2 Type Renaming . . . 118
6.1.1.3 Statement Renaming . . . 119
6.1.1.4 Restricted Statement Isolation . . . 119
6.1.2 Statement Isolation and Convex Array Regions . . . 120
6.2 Memory Footprint Reduction . . . 121
6.2.1 Memory Footprint Estimate . . . 121
6.2.2 Symbolic Rectangular Tiling . . . 122
6.3 Redundant Load-Store Optimization . . . 123
6.3.1 Redundant Load Elimination . . . 125
6.3.1.1 Sequences . . . 125
6.3.1.2 Tests . . . 126
6.3.1.3 Loops . . . 126
6.3.1.4 Interprocedurally . . . 127
6.3.2 Redundant Store Elimination . . . 127
6.3.3 Sequences . . . 127
6.3.4 Tests . . . 127
6.3.5 Loops . . . 128
6.3.6 Interprocedurally . . . 128
6.3.7 Combining Load and Store Elimination . . . 128
6.3.7.1 Sequence . . . 128
6.3.7.2 Loops . . . 129
6.3.8 Main Algorithm . . . 130
6.4 Conclusion . . . 130
7 Compiler Implementations and Experiments 133
7.1 A Simple OpenMP Compiler . . . 134
7.1.1 Architecture Description . . . 134
7.1.2 Compiler Implementation . . . 134
7.1.3 Experiments & Validation . . . 137
7.2 A GPU Compiler . . . 139
7.2.1 Architecture Description . . . 139
7.2.2 Compiler Implementation . . . 141
7.2.3 Experiments & Validation . . . 144
7.3 An FPGA Image Processor Accelerator Compiler . . . 144
7.3.1 Architecture Description . . . 146
7.3.2 Terapix Compiler Implementation . . . 147
7.3.2.1 Input Code Splitting . . . 147
7.3.3 Experiments & Validation . . . 150
7.4 A Retargetable Multimedia Instruction Set Compiler . . . 152
7.4.1 Architecture Description . . . 152
7.4.2 Compiler Implementation . . . 152
7.4.3 Multimedia Instruction Set on Desktop and Embedded Processors . . . 152
7.4.4 Results & Analyses . . . 157
7.5 Conclusion . . . 159

8 Conclusion 161

A The PIPS Compiler Infrastructure 165

B The LuC Language 169
B.1 Syntactic Clauses . . . 169
B.2 Semantic Clauses . . . 170

C Using PyPS to Drive a Compilation Benchmark 171

D Using C to Emulate SSE Intrinsics 173

Glossary 177
Acronyms 179
Bibliography 183
Personal Bibliography 197
Index 198
List of Listings

3.1 GCC pass manager initialization. . . . 47
3.2 Dynamic phase ordering using the LLVM pass manager command-line interface. . . . 47
3.3 Usage of exceptions at the pass manager level. . . . 54
3.4 Example of workspace composition at the pass manager level using PyPS. . . . 60
3.5 Streaming SIMD Extension (SSE) C intrinsics generated for a scalar product. . . . 62
3.6 Sequential intrinsic implementation of an SSE scalar product. . . . 62
3.7 Native intrinsic implementation of an SSE scalar product. . . . 62
3.8 Fuzz testing with PyPS. . . . 64
4.1 Broadcast of a single value in SSE. . . . 72
4.2 Vector type emulation in C. . . . 72
4.3 Sequential implementation of _mm_set1_ps. . . . 73
4.4 Example of structure removal. . . . 75
4.5 Structure removal in the presence of a function call. . . . 76
4.6 Two-step transformation from multi-dimensional arrays to pointers. . . . 77
4.7 Using a naming convention to distinguish registers for the Terapix architecture. . . . 78
4.8 Outlining of the inner loop of an erosion kernel. . . . 86
4.9 Enhanced outlining of the inner loop of an erosion kernel. . . . 86
5.1 Excerpt from the libmpcodecs/vf_gradfun.c file from the MPlayer source tree. . . . 93
5.2 Sample representation of a vector register using a C type. . . . 97
5.3 Sample tree pattern in Polish notation used for SLP. . . . 98
5.4 FMA sequential version for a vector of 4 floats. . . . 99
5.5 FMA generic operation implemented for the NEON instruction set. . . . 99
5.6 Vectorized output for a matrix multiply. . . . 100
5.8 Conditional loop tiling on a matrix-vector product. . . . 104
5.9 Horizontal erosion code sample. . . . 110
6.1 Illustration of statement isolation on a scalar assignment. . . . 114
6.2 Code after statement isolation. . . . 121
6.3 Symbolic tiling of the outermost loop of a horizontal erosion with information about in and out regions. . . . 124
6.4 Illustration of the redundant load-store elimination algorithm. . . . 131
7.1 Original PyPS script for OpenMP code generation. . . . 136
7.2 Makefile stub for OpenMP compilation. . . . 137
7.3 Terapix assembly for a 3 × 3 convolution kernel. . . . 148
7.4 Illustration of Terapix code generation, host part. . . . 153
7.5 Illustration of Terapix code generation, accelerator part. . . . 154
7.6 Illustration of Terapix compacted assembly. . . . 155
A.1 A simple loop to illustrate PIPS analyses. . . . 166
A.2 Example of precondition analysis. . . . 166
A.3 Example of transformers analysis. . . . 167
A.4 Example of cumulated memory effects analysis. . . . 168
A.5 Example of convex array regions analysis. . . . 168
List of Figures

2.1 Heterogeneous computing model. . . . 27
2.2 von Neumann architecture vs. OpenCL architecture. . . . 28
2.3 Impact of heterogeneous architecture on compilation. . . . 30
2.4 Example of a hardware feature diagram. . . . 32
2.5 Multicore with vector unit feature diagram. . . . 33
2.6 Comparison of C89 and C99 syntax on a complex matrix-vector multiply. . . . 35
2.7 Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99. . . . 36
2.8 Comparison of two versions of the CoreMark benchmark: C89 vs. C99. . . . 37
2.9 Comparison of two versions of the Linpack benchmark: C89 vs. C99. . . . 38
2.10 Compilation flow in OpenCL. . . . 39
3.1 A classical three-phase retargetable compiler architecture. . . . 44
3.2 Improved compilation flow for heterogeneous computing. . . . 45
3.3 PIPS as a generic compiler infrastructure sample. . . . 46
3.4 PyPS class hierarchy. . . . 52
3.5 Usage of conditionals at the pass manager level. . . . 53
3.6 Usage of loops at the pass manager level. . . . 54
3.7 Source-to-source cooperation with external tools. . . . 56
3.8 Heterogeneous compilation stages. . . . 57
3.9 Source-to-source heterogeneous compilation scheme. . . . 58
4.1 Using outlining to reduce analysis and compilation time on an unrolled sequence of matrix multiplications. . . . 88
5.1 Various ways to program for the SSE MIS. . . . 94
5.2 Comparison of LLVM, GCC and ICC vectorizers using Linpack. . . . 95
5.3 Multimedia instruction set history for x86 processors. . . . 95
5.4 Vectorized implementation of a float vector multiply-addition operation (epilogue omitted). . . . 98
5.5 Parallelizing reductions in a sequence. . . . 107
5.6 Effect of reduction parallelization on an unrolled loop. . . . 108
5.7 Manually vectorizing an inner product vs. using a library. . . . 109
7.1 Multicore hardware feature diagram. . . . 134
7.2 Source-to-source compilation scheme for OpenMP. . . . 135
7.3 Performance of an OpenMP directive generator prototype on the PolyBench benchmark. . . . 138
7.4 NVIDIA Fermi architecture. . . . 140
7.5 GPU hardware feature diagram. . . . 140
7.6 Source-to-source compilation scheme for GPU. . . . 142
7.7 Splitting an array sum example code into host, loop proxy and accelerator parts. . . . 143
7.8 Median execution time on a GPU for DSP kernels. . . . 145
7.9 Terapix architecture. . . . 146
7.10 Terapix hardware feature diagram. . . . 148
7.11 Terapix redundant computations. . . . 149
7.12 Source-to-source compilation scheme for Terapix. . . . 149
7.13 MIS hardware feature diagram. . . . 156
7.14 Source-to-source compilation scheme for MIS. . . . 156
7.15 Pass reuse among 4 PyPS-based compilers. . . . 159
List of Algorithms<br />
1 Fuzz testing at the pass manager level. . . . . . . . . . . . . . . . . . . . . . 63<br />
2 Compilation complexity reduction with outlining. . . . . . . . . . . . . . . . 87<br />
3 Parametric vec<strong>to</strong>r instruction generation algorithm. . . . . . . . . . . . . . . 103<br />
4 Hybrid vec<strong>to</strong>rization at the pass manager level. . . . . . . . . . . . . . . . . 105<br />
5 Memory footprint reduction algorithm. . . . . . . . . . . . . . . . . . . . . . 123<br />
6 Redundant load s<strong>to</strong>re elimination algorithm at the pass manager level. . . . 130<br />
7 Parallel loop generation algorithm <strong>for</strong> openmp. . . . . . . . . . . . . . . . . 135<br />
8 terapix kernel extraction algorithm at the pass manager level. . . . . . . . 150<br />
9 C-<strong>to</strong>-terapix translation algorithm at the pass manager level. . . . . . . . 151<br />
List of Tables

3.1 Comparison of source-to-source compilation infrastructures . . . 55
3.2 sloccount reports for the GCC and LLVM compilers . . . 65
4.1 C dialects and targeted hardware . . . 71
7.1 sloccount report for an OpenMP directive generator prototype written in PyPS . . . 137
7.2 sloccount report for a CUDA generator prototype written in PyPS . . . 144
7.3 Description of a TERAPIX microinstruction . . . 147
7.4 Ratio between TERAPIX microcode cycle counts for automatic and manual code generation . . . 151
7.5 sloccount report for a TERAPIX assembly generator prototype written in PyPS . . . 151
7.6 sloccount report for an AVX intrinsic generator prototype written in PyPS . . . 159
7.7 Summary of the sloccount reports for the compiler prototypes written in PyPS . . . 160
Notations Quick Reference Sheet

This thesis makes use of various notations from different fields of computer science. They are concisely and informally summarized here. Occasionally, a symbol may be used with a different meaning than the one shown here; context should always be sufficient to make these cases clear. For example, P is used to denote both the power set of a set and, as indicated here, the domain of all programs.

Sets and domains are in general denoted with capital letters in a cursive font, for example S. Set members are lowercase, for example s. Semantic functions are denoted in bold capitals, for example S.

Where the notations given here are unclear, the reader is urged to continue, referring back to this sheet as needed.
Set Operators

P(A)     The power set of A.
|A|      The number of elements in the set A.
Ā        The convex hull of set A.
⌈A⌉      The rectangular hull of set A.
A ∪ B    The union of sets A and B.
A ∪̄ B    The convex union of sets A and B.
A^k      The set of all k-tuples of A.
A^*      ∪_{k∈N} A^k.
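As a concrete illustration of the rectangular hull, the following Python sketch (illustrative only, not taken from the thesis) computes ⌈A⌉ for a set of 2-D integer points as the smallest axis-aligned box containing them; the convex hull Ā is always contained in this box:

```python
def rectangular_hull(points):
    """Smallest axis-aligned box containing all points.

    Returns ((xmin, ymin), (xmax, ymax)). The rectangular hull always
    contains the convex hull, which in turn contains the set itself,
    which is why it is a convenient coarse over-approximation.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

A = {(0, 0), (3, 1), (1, 4)}
lo, hi = rectangular_hull(A)
# The box [0,3] x [0,4] covers every point of A, plus points A does not contain.
```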
Syntactic Domains

P    The domain of programs p.
F    The domain of functions f.
S    The domain of statements s.
E    The domain of expressions e.
I    The domain of identifiers id.
L    The domain of memory locations l.
R    The domain of references r.
C    The domain of constants cst.
Op   The domain of operators op.
D    The domain of declarations d.
T    The domain of types t.
Semantic Domains

V                            The domain of denotable values v.
Σ : L → (V ∪ {unbound})      The domain of stores σ.
Semantic Functions

S : S × Σ → Σ       S evaluates statement s in store σ0 to produce store σ1.
P : P × V* → V*     P evaluates the body of function main from the program p in the store {⟨i_stdin, [v0, . . . , vn−1]⟩} and returns the list of values accumulated in i_stdout, where i_stdin is an identifier reserved for standard input and i_stdout is an identifier reserved for standard output.
E : E × Σ → V       E evaluates expression e in store σ to produce value v.
D : D × Σ → Σ       D evaluates declaration d in store σ0 to produce store σ1.
R : R × Σ → P(L)    R evaluates reference r in store σ to produce locations l.
T : T × Σ → N       T returns the number of memory cells occupied by type t.
I : I → P(L)        I returns the locations l associated with the identifier i.
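To make these signatures concrete, here is a toy Python interpreter (an illustrative sketch, not the semantics machinery used in the thesis) implementing miniature versions of E and S over a store represented as a dictionary from locations to values:

```python
UNBOUND = object()  # stands for the 'unbound' element of the store codomain

def E(expr, store):
    """E : E x Sigma -> V -- evaluate an expression in a store."""
    kind = expr[0]
    if kind == "cst":            # constant
        return expr[1]
    if kind == "ref":            # variable reference: look its location up
        v = store.get(expr[1], UNBOUND)
        assert v is not UNBOUND, "unbound reference"
        return v
    if kind == "+":              # binary operator
        return E(expr[1], store) + E(expr[2], store)
    raise ValueError(kind)

def S(stmt, store):
    """S : S x Sigma -> Sigma -- evaluate a statement, produce a new store."""
    kind = stmt[0]
    if kind == "assign":         # id := e
        return {**store, stmt[1]: E(stmt[2], store)}
    if kind == "seq":            # s1; s2
        return S(stmt[2], S(stmt[1], store))
    raise ValueError(kind)

prog = ("seq", ("assign", "x", ("cst", 2)),
               ("assign", "y", ("+", ("ref", "x"), ("cst", 3))))
final = S(prog, {})  # evaluating the program from the empty store
```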
Syntactic Operators

if E then E1 else E2    The statement denoted by the syntactic clause “if E then E1 else E2”.
Miscellaneous Operators

f ◦ g                      The function f composed with the function g.
f[x → y]                   A function identical to f except that it returns y for the element x.
loc : I × V × Σ → Σ        loc produces v new locations from an identifier i, adds them to the store σ and returns this new store.
unbind(σ, id) : Σ × I → Σ  unbind(σ, id) = σ[l → unbound | l ∈ I(id)], i.e. the function that removes from memory all the locations accessible through the identifier id.
formal(id) : I → I         The formal parameter of function id.
body(id) : I → S           The body of function id.
refs(s) : S → P(R)         The set of all references syntactically found in statement s.
Array Regions

An array region is a function that maps a statement and a memory state to a set of references associated with this memory state.

R_r^= : S × Σ → P(R)   The exact array region of the references read by statement s evaluated in store σ.
R_r : S × Σ → P(R)     An over-approximated array region of the references read by statement s evaluated in store σ.
R_i^= : S × Σ → P(R)   The exact array region of the references imported by statement s evaluated in store σ.
R_i : S × Σ → P(R)     An over-approximated array region of the references imported by statement s evaluated in store σ.
R_w^= : S × Σ → P(R)   The exact array region of the references written by statement s evaluated in store σ.
R_w : S × Σ → P(R)     An over-approximated array region of the references written by statement s evaluated in store σ.
R_o^= : S × Σ → P(R)   The exact array region of the references exported by statement s evaluated in store σ.
R_o : S × Σ → P(R)     An over-approximated array region of the references exported by statement s evaluated in store σ.
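The distinction between exact and over-approximated regions can be illustrated with a small Python sketch (illustrative only, not PIPS's actual region analysis): for a strided loop, the exact written region enumerates the touched elements, while a convex interval over-approximation also covers elements that are never written:

```python
def loop_regions(lb, ub, read_index, write_index):
    """Exact and over-approximated regions for the loop
        for i in [lb, ub): a[write_index(i)] = f(b[read_index(i)])
    Exact regions enumerate the touched elements; the over-approximation
    is a convex interval, as used when exactness cannot be established."""
    reads = {read_index(i) for i in range(lb, ub)}
    writes = {write_index(i) for i in range(lb, ub)}
    interval = lambda s: (min(s), max(s))  # convex over-approximation
    return reads, writes, interval(reads), interval(writes)

# for i in [0, 4): a[2*i] = f(b[i+1])  -- strided write, shifted read
reads, writes, r_over, w_over = loop_regions(0, 4,
                                             lambda i: i + 1,   # read index
                                             lambda i: 2 * i)   # write index
# The interval [0, 6] over-approximates the written set {0, 2, 4, 6}:
# it also contains the odd indices, which the loop never touches.
```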
Résumé en français

This section contains, for each chapter of the thesis, a translation of its introduction and a summary of each of its sections. The interested reader is invited to refer to the corresponding chapter for details on any particular topic.

Building Source-to-Source Compilers for Heterogeneous Targets
1. Introduction

Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by the activity of processors run at their maximum frequency. William J. Dally called this phenomenon "The End of Denial Architecture" [Dal09]. To overcome these physical limitations, chip makers increased the number of computing cores per processor, leading to multi-core architectures. Using several computing cores makes it possible to increase the overall available computing power without raising the frequency. The same heat dissipation problem will nevertheless arise when the integration density of these computing nodes becomes such that silicon can no longer dissipate the generated heat, even at lower frequencies. To keep improving application performance, coprocessors, or accelerators, have appeared and paved the way for heterogeneous computing, where several computing units with different capabilities collaborate. There are many kinds of coprocessors, ranging from specialized processors, very efficient on a small number of applications, to more general-purpose ones. Two paths toward high-performance computing are thus now possible: using homogeneous multi-cores, or using a combination of specialized accelerators. The border between these two categories is blurry, as shown by Intel's Larrabee [SCS+08], where several homogeneous computing units are each fitted with a 512-bit vector computing unit.
Thirty years ago, many parallel machines were built; the Connection Machine and the MasPar MP2 are two notable examples. At that time, supercomputers like the Cray-1 cost $5 to $8 million and delivered an average performance of 80 Mflops. Released in 2006, the PlayStation 3 (PS3) game console, based on the Cell BE architecture, cost $500 and could reach 230 GFlops (FLoating point Operations per Second) thanks to the combination of a general-purpose processor and several specialized processors. Fewer than 100 Cray-1s were built, whereas more than 50 million PS3s have been manufactured. This market shift inevitably created a need for development tools. In 2006, parallelism was no longer the business of a few specialists, as it had been in the Cray era.
There is more to learn from the PS3 experience: although launched by Sony one year before Microsoft's Xbox 360, the latter was more successful, partly because of a richer game catalog. It turned out that many video game studios found the PS3 and its heterogeneous architecture too difficult to program. Indeed, the Cell BE architecture uses separate memory spaces and requires manual control of memory transfers, management of 128-bit vector registers, and manual cache management. This complexity was perhaps too much for the average developer.
There are three ways to cope with such complexity: hire an expert engineer of the domain who fully masters the target architecture; develop a specialized library that hides the machine's behavior behind a simplified Application Programming Interface (API); or build a compiler that translates a high-level language into machine language.

The first option is the most flexible, but also the most expensive. The second is efficient but lacks flexibility, especially since extracting maximum performance from complex hardware can rarely be done through a simple API. The last approach combines the advantages of the previous two, provided a compiler can actually be built at a reasonable cost. nVidia, for instance, successfully adopted this approach for their General Purpose GPUs (GPGPUs): most nVidia GPGPU programmers use an extension of the C/C++ language, the Compute Unified Device Architecture (CUDA) language, together with the nvcc compiler provided by nVidia.
Heterogeneous computing indeed puts entirely new constraints on the compilation flow. According to the dragon book [ALSU06],

Definition. A compiler is a program that can read a program in one language, the source language, and translate it into an equivalent program in another language, the target language.

Nothing is said about a compiler that may need to transform its input program into several output programs, one for each accelerator present on the targeted heterogeneous machine, each written in a different language. Existing machine architectures did not call for such processing. When one source file yields several output files, the compiler needs to model the system as a whole in order to make sound decisions, for instance with respect to overall performance.
The most complex case is certainly that of existing code written without any explicit parallel primitives. Parallelizing compilers failed in the 1980s because of the difficulty of extracting parallelism from sequential code. Automatic parallelization is impossible in the general case, since parallel algorithms can be completely different from sequential ones. Even in cases where the parallel algorithm is identical to the sequential one, cases where the parallelism could be detected automatically by a tool, the task may be made difficult by code transformations meant to improve performance in the sequential setting. As parallelism becomes more and more widespread, the way algorithms are designed also evolves and starts to take the parallel dimension into account. For this reason, it seems reasonable to focus on compiling explicitly parallel programs rather than on parallelism extraction.
Heterogeneous machines sharpen the analogy given in [ABC+06], where software stands as a bridge between hardware and applications. One approach to building such bridges is compilation, the compiler being what brings applications and hardware together. Many of the components needed to build such bridges in a heterogeneous environment already exist in a literature more than thirty years rich, but not all of them. In [Pat10], David Patterson surveys many aspects of parallelism research since the 1960s. As one might expect, his conclusion is that there is no free lunch in this field. Although isolated cases have met with some success, no global solution to the parallelism problem has emerged; there have been particular solutions to particular cases. One might therefore think that writing a compiler able to handle heterogeneous computing is an impossible task. Yet there are encouraging signs. These particular solutions can be seen as the foundations on which to build more complex ones. If these solutions are designed to be reusable, then creating new compilers for new targets becomes simpler. The idea of building compilers by composing building blocks is at the heart of this thesis. Many questions follow: What are the relevant building blocks in the context of heterogeneous computing? How should these building blocks be composed depending on the target? Is it possible to capture all hardware targets in a single internal representation? Finally, is there a methodology for building compilers for a heterogeneous target?
In a survey of hardware/software co-design techniques published in 1994 [Wol94], Wayne H. Wolfe stated that

    To be able to continue to exploit the CPU performance increases made possible by Moore's law (. . .), we must develop new design methods and new algorithms that let designers predict implementation costs, incrementally refine machine models across several levels of abstraction, and create a first working implementation.
Seventeen years later, Wolfe's challenge has still not been met. Neither source code nor developers have evolved as fast as hardware. As a consequence, applications have not benefited from new hardware architectures except at the cost of intense efforts. The amount of existing code is such that using anything other than a compiler to bridge the widening gap between hardware and software does not seem economically viable. The advice Wolfe gives to hardware designers holds just as well for compiler designers, the very people who must design tools able to translate the existing code base into code that can exploit the capabilities of parallel coprocessors: incrementally refine compiler designs using several levels of abstraction and create a first working implementation. This task is made all the harder because the tools used in a classical compilation flow are often specific to one programming language, and are not always suited to heterogeneous compilation.
This thesis follows the three-step approach proposed by Wolfe to assemble compilers targeting a range of heterogeneous machines, from the classical multi-core machine to a processor based on Field Programmable Gate Arrays (FPGAs) and specialized in image processing, by way of Graphical Processing Units (GPUs) and the vector instruction units integrated into General Purpose Processors (GPPs). The goal is not to offer the best compiler for each target, but rather to build reasonably efficient compilers while reusing as many building blocks as possible.
To reach this goal, we begin with a survey of heterogeneous machines and existing programming models in Chapter 2. Three families of hardware constraints emerge from this survey, corresponding to as many sources of heterogeneity: the Instruction Set Architecture (ISA), the memory organization and the source of acceleration, namely parallelism. Chapter 3 shows that existing compilation infrastructures lack the flexibility required to compose the fine-grained code transformations needed to satisfy hardware constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 examine the three identified families of constraints and detail several original, target-independent code transformations. They form the building blocks of our approach.
This thesis is not merely a theoretical exercise. Building on the ideas developed in this manuscript, several compilers have been implemented. Their design and implementation, along with performance reports, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws several conclusions.

All the transformations and compilers described in this document have been implemented in the Paralléliseur Interprocédural de Programmes Scientifiques (PIPS) compilation infrastructure, developed by the Centre de Recherche en Informatique (CRI) of MINES ParisTech with contributions from the startup HPC Project, Télécom SudParis and Télécom Bretagne. A more detailed view of the project is given in Appendix A. A large part of the ideas developed in this thesis are already integrated into the Par4All tool developed by HPC Project.
2. Computing on Heterogeneous Machines

On February 15, 2007, nVidia introduced, through the CUDA language and its associated Software Development Kit (SDK), a new way to develop general-purpose applications on GPUs, a kind of hardware previously mostly limited to graphics computing. Since then, GPGPUs have been used to simulate physical phenomena and to perform video processing or encryption, and many heterogeneous machines now offer efficient solutions for scientific computing: FPGAs from Xilinx, Systems on Chip (SoC) or MultiProcessor Systems-on-Chip (MP-SoC) from Texas Instruments, or microcontrollers from Atmel.
In November 2010, the Tianhe-1A supercomputer, a cluster containing more than 7k Tesla boards, reached the top of the TOP500. Since then, heterogeneous computing has stood as a viable alternative to multi-core computing. However, heterogeneous computing is very different from homogeneous computing, particularly regarding the following three critical aspects: memory, parallelism and instruction sets. This leads to an intertwining of concepts that are specific to each target yet independent of the high-level code organization. This complexity was nicely illustrated at the Supercomputing conference in 2010: during a session, a panel of experts discussed the three Ps of heterogeneous computing.
The first P stands for Performance: unlike general-purpose processors, which must deliver average performance across all kinds of applications, the goal of a hardware accelerator is to deliver peak performance on specific applications. This goal is reached through parallelism, of the Single Instruction stream, Multiple Data stream (SIMD), Multiple Instruction stream, Multiple Data stream (MIMD) or pipeline kind, or through hard-wired optimized routines.
The second P stands for Power: energy consumption is a critical constraint both for embedded systems, to maximize usage between two battery charges, and for supercomputers, to minimize electricity costs. Hardware accelerators are good candidates for improving the FLOPS-per-Watt ratio, e.g. because less energy is spent on operations unrelated to the computation.
The third P stands for Programmability, the main weakness of hardware accelerators. Indeed, developing for a hardware accelerator implies a shift from sequential programming to circuit programming through the VHSIC Hardware Description Language (VHDL) [TM08] for FPGA boards, to 3D programming through the Open Graphics Library (OpenGL) or DirectX for GPUs, or to parallel programming, e.g. through CUDA or the Open Computing Language (OpenCL). Moreover, the execution model introduces the additional complexity of shared memory management.
Chapter 2 presents the heterogeneous computing model in detail in Section 2.1 and its consequences on the programming model in Section 2.2. The concept of hardware constraint is introduced in Section 2.3 as a way to model the interactions between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of the C language as an input language for programming these machines. Section 2.5 gives an analysis of a standardized programming model, OpenCL. Other models are presented in Section 2.6.
2.1 Heterogeneous computing models

There are many kinds of heterogeneous machines, depending on the type of accelerators involved: GPGPUs, FPGA boards, Application-Specific Integrated Circuits (ASICs), etc. Scheduling across these different elements is generally handled by a master processor, typically a GPP, according to their capabilities. This implies a switch to a distributed computing model. Each accelerator draws its speedup from architectural specificities that imply a new architectural model. Accelerators may also have a memory separate from the others, with its own hierarchy, which adds the difficulty of a new memory model.
2.2 Influence on the programming model

The complexity of the memory model has a significant impact on the programming model: distributed memory must either be abstracted away or managed through remote calls. Memory size constraints can also be blocking, notably in the field of embedded processors. Likewise, the specificities of the architectural model mean that the same code must be adapted to several ISAs, often implying several versions of the same code, one per target. Finally, the execution model, closely tied to various forms of parallelism, brings back the many difficulties of expressing parallelism at the programming model level.
2.3 Hardware constraints

We propose to model the difficulties of heterogeneous computing as hardware constraints. A constraint may relate to the ISA, to memory or to acceleration. It is either mandatory, in which case it must be satisfied to use the target at all, or optional, in which case better behavior can be expected from the accelerator when it is satisfied. An accelerator is then described by a constraint diagram that helps the developer grasp the target.
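This constraint model can be sketched in a few lines of Python (names and constraint values below are hypothetical, chosen purely for illustration): a target is a set of constraints, each tagged with its kind and whether it is mandatory, and a kernel can only be mapped onto the target when every mandatory constraint is satisfied:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    """One hardware constraint, in the spirit of the feature diagrams of Chapter 2."""
    name: str
    kind: str        # "isa", "memory" or "acceleration"
    mandatory: bool  # mandatory constraints must hold to use the target at all

# Hypothetical description of a GPU-like accelerator (illustrative values).
gpu = [
    Constraint("no recursion in kernels", "isa", True),
    Constraint("explicit host/device transfers", "memory", True),
    Constraint("coalesced accesses", "memory", False),
    Constraint("massive data parallelism", "acceleration", False),
]

def usable(target, satisfied):
    """A kernel may run on the target iff every mandatory constraint holds;
    optional constraints only influence how fast it runs."""
    return all(c.name in satisfied for c in target if c.mandatory)

ok = usable(gpu, {"no recursion in kernels", "explicit host/device transfers"})
```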
2.4 Notes on the C language

Historically, compilers targeting architectures that depart from the von Neumann model have often been based on the C language, which leads us to consider its use for heterogeneous machines. In this document, we deal with high-level codes written in C99 that use variable-length multidimensional arrays, which degrades performance little compared with C89 code using pointers but significantly improves the quality of code analysis results.
2.5 Analysis of the OpenCL programming model

OpenCL is a standard offering a model and a C API for programming in a heterogeneous environment. The various hardware constraints we have mentioned are exposed there at the developer level. Parallelism exists in several forms: vector instructions, across kernels and across contexts. Some combinations of types and qualifiers are forbidden so as to cover a larger number of architectures, and the developer manually manages memory transfers, at several levels, using Direct Memory Accesses (DMAs). The code specific to an accelerator is derived at run time from a generic description, so each hardware vendor must provide its own, along with an implementation of the user API, in order to comply with the standard.
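The explicit transfer management described above can be modeled by a toy sketch (plain Python, deliberately not the real OpenCL API, whose calls and names differ): device memory is invisible to the host except through DMA-like read and write operations enqueued on a command queue:

```python
class DeviceBuffer:
    """Toy model of accelerator memory: only reachable via explicit transfers."""
    def __init__(self, size):
        self.data = [0] * size

class CommandQueue:
    """Toy model of a command queue counting DMAs, the costly operations
    a compiler for such a target tries to minimize."""
    def __init__(self):
        self.transfers = 0

    def write(self, dev, host):       # host -> device DMA
        dev.data[:] = host
        self.transfers += 1

    def read(self, dev):              # device -> host DMA
        self.transfers += 1
        return list(dev.data)

def run_kernel(dev):
    """Stand-in for a kernel launch: squares each element in device memory."""
    dev.data[:] = [x * x for x in dev.data]

q = CommandQueue()
buf = DeviceBuffer(4)
q.write(buf, [1, 2, 3, 4])   # explicit transfer, as the standard requires
run_kernel(buf)
result = q.read(buf)         # results must be copied back explicitly
```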
2.6 Other programming models

There is no standard other than OpenCL for programming heterogeneous machines. However, many approaches have been proposed for particular targets: compiling a subset of the C language, for instance for VHDL generation; extending an existing language with target-specific concepts, e.g. CUDA; or annotating sequential code with directives, e.g. Hybrid Multicore Parallel Programming (HMPP).
3. Designing Compilers for Heterogeneous Machines

Until the end of the 20th century, one kind of hardware architecture dominated the general-purpose machine market: the von Neumann model. Consequently, the main compilers were built to target machines implementing this model efficiently: a front end parses the input code and translates it into an intermediate representation, a middle end applies various optimizations, at the code block, loop, function or program level, and a back end generates target-specific assembly code. These three components have been widely studied and are well understood, as illustrated by the regular updates of the Dragon Book [ALSU06].
The growing complexity of compiler architecture itself, crystallized by the difficulty of adapting to heterogeneous targets, favors a more modular approach. Indeed, it seems reasonable to reuse existing compilers, applications and libraries that each perform one specific task well; combining them instead of reinventing the wheel should be possible. PLuTo [BBK+08] is a good example of such an approach: the tool combines a C front end, a polyhedral optimizer, a CUDA code generator and nVidia's CUDA compiler to generate code that runs efficiently on GPUs.
Application build schemes also get more complex when object files generated from different languages by different compilation chains must be assembled. For instance, the sloccount tool counts 210 Source Lines Of Code (SLOC) in common.mk, a generic Makefile shipped with the CUDA SDK. This makes code generation harder: one must not only generate code for different targets, but also find a way to link the results within a single application. This task, which can already be non-trivial in a homogeneous environment, can become very complex in a heterogeneous one.
Chapter 3 studies the impact of the heterogeneous machine model presented in Chapter 2 on compiler construction. It proposes combining a generic compilation infrastructure with a programmable pass manager to tackle this problem. The approach focuses on modularity, reusability, retargetability and the flexibility of the overall scheme.
First, Section 3.1 studies how well production compilation infrastructures fit the new target that heterogeneous machines represent, and proposes a model to support programmable pass management, a critical aspect of compiler modularity. Then, Section 3.2 argues in favor of source-to-source transformations, using the C source file as a communication medium between tools. Finally, Section 3.3 introduces a high-level API for building pass managers. It exposes enough abstraction of the compiler's internal behavior for developers to reason effectively at the pass level. The complete scheme is illustrated by the Pythonic PIPS (PyPS) interface developed on top of the PIPS compilation infrastructure. Related work is reviewed in Section 3.4.
3.1 Extending compilation infrastructures

Classical compilation infrastructures rely on a three-level architecture: one or more front ends, a common middle end, and one or more back ends. The diversity of accelerators found in a heterogeneous machine makes a single middle end impractical: one is needed per target, which in turn limits the possibility of sharing several back ends. The problems that arise are code reuse across middle ends, and the composition of transformations into complex schemes. We propose a model of code transformations that guarantees the validity of their composition and application. This model is used by the pass manager, the entity responsible for scheduling code transformations within the middle end. A class hierarchy is derived from this model, yielding an API usable by the pass manager.
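This pass model can be sketched in a few lines of Python. The sketch below is illustrative only, not the actual PIPS/PyPS class hierarchy: each pass declares the properties it requires and provides, and the pass manager refuses an invalid composition before running anything.

```python
# Illustrative sketch (not the real PIPS/PyPS API): passes declare the
# properties they require and provide; the pass manager checks that a
# schedule is valid before running it.
class Pass:
    name = "pass"
    requires = set()     # properties that must hold before the pass runs
    provides = set()     # properties guaranteed after the pass
    invalidates = set()  # properties destroyed by the pass

    def run(self, code):
        return code

class Privatize(Pass):
    name = "privatize"
    provides = {"privatized"}

class Parallelize(Pass):
    name = "parallelize"
    requires = {"privatized"}
    provides = {"parallel"}

def schedule(passes):
    """Return pass names in order, rejecting invalid compositions."""
    state, order = set(), []
    for p in passes:
        missing = p.requires - state
        if missing:
            raise ValueError(f"{p.name} missing {sorted(missing)}")
        state -= p.invalidates
        state |= p.provides
        order.append(p.name)
    return order
```

The point of the model is that validity is checked once, at the pass-manager level, instead of being re-implemented inside every transformation.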
3.2 On source-to-source compilers

A source-to-source compiler takes as input code written in a high-level language and produces code written in a high-level language. This approach is often used by parallelizing compilers, which are thereby spared the need to handle binary code generation. It is particularly relevant to heterogeneous computing, since target-specific source-to-binary compilers often exist but generate code of middling quality. A source-to-source compiler can interface with such tools, or with other source-to-source compilers, to generate better code. Put differently, it seems beneficial to use a variety of complementary tools to attack a variety of targets.
3.3 PyPS, a high-level API for pass management

We implemented the aforementioned API in the Python language on top of the PIPS source-to-source compilation infrastructure. Using a scripting language shortens development cycles and makes it possible to quickly prototype complex chains of code transformations while working solely at the pass-manager level. All the compilers assembled during this thesis are based on this implementation, named PyPS.
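The flavor of this approach can be conveyed with a toy, PyPS-like driver. The class and phase names below are illustrative inventions, not the real PyPS API: a workspace wraps the source files, and a whole compilation scheme is an ordinary Python function.

```python
# Toy, PyPS-like driver (illustrative only; phase names are invented).
# Transformations are method calls on a workspace, so a compilation
# scheme is just a Python script.
class Workspace:
    def __init__(self, *sources):
        self.sources = list(sources)
        self.log = []

    def apply(self, phase, **opts):
        self.log.append((phase, opts))  # a real driver would run the pass
        return self

def compile_for_openmp(ws):
    return (ws.apply("privatize_variables")
              .apply("coarse_grain_parallelization")
              .apply("ompify_code"))
```

Because schemes are plain Python, they can be composed, parameterized, and versioned like any other program, which is what makes rapid prototyping possible.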
3.4 Related work

The complex assembly of code transformations inside a compiler has motivated work demonstrating the limitations of the traditional approach in the context of iterative compilation, and work on producing modular, even programmable, pass managers. Other work focuses on the validity of pass composition and on the semantics that can be attached to it. Finally, some efforts address the problems raised by extending the internal representation to new targets while capitalizing on existing passes.
4. Representing an instruction set in the C language

In a talk given at the Fusion Developers Summit 2011 in Bellevue, Washington, Phil Rogers announced that

The Fusion System Architecture (FSA) is ISA-independent, for both GPUs and Central Processing Units (CPUs). This is a very important point
because we are inviting partners to join us in every sector; other hardware companies to implement FSA and to join us around this platform...

Under the hood, the Fusion System Architecture (FSA) relies on a virtual ISA. In Chapter 3, we stated that, to preserve a good level of abstraction, it is important to keep the internal representation hardware-independent. But is it possible to represent all the refinements of the target ISA in an internal representation close to the C language [ISO99]? According to Brian Kernighan [Ker03],

C is perhaps the best balance ever struck by a programming language between expressiveness and efficiency. (...) It was so close to the machine that you could see what the machine code would be (and it was not hard to write a good compiler), but it took care to stay above the instruction level, so that it could target every machine without special tricks for a particular machine.
This reminds us that the C language was designed to be close to the hardware. Thus, even if the chosen internal representation stays close to C, it can be low-level enough to express some of the characteristics specific to the target ISA. This aspect is examined in Section 4.1. We then detail the various aspects of an ISA and show that, provided certain conventions are respected and after suitable transformations, C code can be adapted to the constraints of a specific ISA. Section 4.2 examines basic types; Section 4.3 lists the different kinds of dedicated registers; Section 4.4 details the links between intrinsic functions and machine instructions; and Section 4.5 walks through the differences induced by the memory hierarchy. Problems tied to function-call boundaries are examined in Section 4.6, and calls to external libraries in Section 4.7.
4.1 C as a common denominator

A survey of the languages used by five compilers for accelerators, Handel-C, Mitrion-C, c2h, CUDA and OpenCL, shows that C dialects are often used as input languages by compilers targeting specialized hardware. Translating into intrinsic-free C the xmmintrin.h header file, which provides the intrinsics for the Streaming SIMD Extension (SSE) instruction set, shows that some hardware features can sometimes be represented directly in C.
4.2 Native data types

Some data types native to C may not be supported by the target compiler. In that case, it is sometimes possible to fall back to a supported type through a code transformation: using fixed-point arithmetic when floating point is not supported, splitting structures into as many variables as they have fields, or replacing fixed-size arrays by as many scalars.
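As a concrete illustration of the fixed-point fallback, here is a minimal sketch assuming a Q16.16 format (the choice of 16 fractional bits is an arbitrary example):

```python
# Q16.16 fixed-point sketch: the representation a transformation could
# substitute for 'float' on a target without floating-point support.
Q = 16  # number of fractional bits (assumption: Q16.16 format)

def to_fixed(x: float) -> int:
    """Encode a real value as a fixed-point integer."""
    return int(round(x * (1 << Q)))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the product must be rescaled."""
    return (a * b) >> Q

def to_float(f: int) -> float:
    """Decode a fixed-point integer back to a real value."""
    return f / (1 << Q)
```

Addition and subtraction need no rescaling; only multiplication and division do, which is why they are the operations a transformation must rewrite.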
4.3 Registers

The C language makes it possible to suggest placing a variable in a register via the register keyword. If the target machine has dedicated registers, this keyword can be repurposed, together with a dedicated naming convention, to assign particular variables to particular registers while staying at the C level.
4.4 Instructions

It is common for a given architecture to provide specific instructions that have no direct C equivalent: vector instructions, atomic operations, Fused Multiply-Add (FMA) or DMA are classic examples. The traditional approach is to represent such instructions as intrinsic functions: functions with a valid C signature that the compiler treats specially in order to emit the appropriate assembly instruction.
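A retargetable generator can keep generic intrinsic names in its internal representation and substitute the ISA-specific spelling only at code-generation time. In the sketch below, SIMD_ADD_PS and SIMD_MUL_PS are invented generic names; the names they map to are the actual SSE, AVX and NEON intrinsics.

```python
# Lowering table from invented generic intrinsics (left) to real
# ISA-specific intrinsics (right: SSE, AVX, NEON).
INTRINSICS = {
    "SIMD_ADD_PS": {"sse": "_mm_add_ps",
                    "avx": "_mm256_add_ps",
                    "neon": "vaddq_f32"},
    "SIMD_MUL_PS": {"sse": "_mm_mul_ps",
                    "avx": "_mm256_mul_ps",
                    "neon": "vmulq_f32"},
}

def lower(call: str, isa: str) -> str:
    """Rewrite a generic intrinsic call into its ISA-specific form."""
    name, args = call.split("(", 1)
    return INTRINSICS[name][isa] + "(" + args
```

Only the table is target-specific; everything upstream of it can be shared across instruction sets.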
4.5 Memory architecture

Heterogeneous computing involves several memory spaces that must communicate with one another. The C language provides no way to distinguish a variable declared in one memory space from a variable declared in another. Here again, suitable naming conventions make it possible to leave the internal representation unchanged.
4.6 Function calls

Some architectures offer no low-level mechanism for performing a function call. In that case, procedure inlining can be applied. Conversely, function calls can be used to model a call to an accelerator. One then needs to select a portion of code and extract it into a new function that will represent the call. This transformation relies on memory-effect analysis to limit the number of parameters of the new function and to restrict the use of pointers, which simulate pass-by-reference, to the cases where they are needed.
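A deliberately naive version of this outlining step can be sketched as follows, with the read and written variable sets given as inputs. The real transformation computes them from convex array regions and rewrites the internal representation rather than raw text.

```python
# Toy outliner: read-only variables are passed by value, written ones by
# pointer. The textual substitution below is only safe for this toy
# example; a real pass works on the internal representation.
def outline(name, stmt, read, written):
    params = [f"int *{v}" if v in written else f"int {v}"
              for v in sorted(read | written)]
    body = stmt
    for v in sorted(written):
        body = body.replace(v, f"(*{v})")
    fn = f"void {name}({', '.join(params)}) {{ {body} }}"
    args = ", ".join(f"&{v}" if v in written else v
                     for v in sorted(read | written))
    call = f"{name}({args});"
    return fn, call
```

Limiting the pointer parameters to the written variables is exactly what the memory-effect analysis buys: fewer aliases for later passes to reason about.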
4.7 Calls to external libraries

In the context of interprocedural analyses, one may need to know the behavior of functions for which no complete implementation is available. The function provider splits this problem into parts. It is a component able to supply several versions of a single function: one that reproduces the function's memory effects, intended for the compiler; and one per accelerator, corresponding to its actual implementation. Target-specific library calls can thus be abstracted away.
5. Exploitation du parallélisme avec des instructions multimedia<br />
Une fonctionnalité récurrente des accélérateurs matériels est l’utilisation qu’ils font du<br />
parallélisme. Ce parallélisme peut se trouver à plusieurs niveaux, et prendre plusieurs<br />
<strong>for</strong>mes, généralement un mélange de parallélisme de type simd et de parallélisme de<br />
type mimd, comme c’est le cas pour les gpgpu. Ces deux types de parallélisme ont<br />
été longuement étudiés, en se concentrant sur la parallélisation de boucle : recherche<br />
d’hyperplans [Lam74], gestion des dépendances de contrôle [AKPW83], vec<strong>to</strong>risation<br />
de boucles [AK87], extraction de parallélisme [WL91b], partitionnement en supernœuds<br />
[IT88], optimisation des communications [DUSsH93], interactions avec la mémoire<br />
cache [KK92], pavage [DSV96, AR97, YRR + 10]. David F. Bacon et al. ont passé en revue<br />
[BGS94] les trans<strong>for</strong>mations de code pour le High Per<strong>for</strong>mance Computing (hpc),<br />
trans<strong>for</strong>mations qui concernent majoritairement les boucles. Vivek Sarkar a étudié l<br />
sélection au<strong>to</strong>matique de trans<strong>for</strong>mations en se basant sur un modèle de coût [Sar97].<br />
Toutes ces techniques ont été appliquées avec succès dans des compilateurs de recherche tels<br />
SUIF [WFW + 94], Polaris [PEH + 93], pips [IJT91, AAC + 11], Rose [Qui00] ou Pocc [PBB10],<br />
et dans des compilateurs utilisés en production comme IBM XL [Sar97], Low Level Virtual<br />
Machine (llvm) [GZA + 11], gnu C Compiler (gcc) [TCE + 10], Intel C++ Compiler<br />
(icc) [DKK + 99], pgi [Wol10].<br />
In this chapter, we focus on two aspects of code parallelization: Instruction Level Parallelism (ILP) and the parallelization of reductions. The former seeks to exploit instruction-level parallelism, as found in code sequences or inside loops, using the Multimedia Instruction Sets (MIS) available on most modern processors. This aspect is described in Section 5.1. The latter is a critical issue when code involving a reduction must run on purely SIMD hardware; it is addressed in Section 5.2. Section 5.3 proposes a simple model, based on the parallelism found on distributed-memory accelerators, for deciding whether offloading a computation is profitable.
5.1 Instruction-level parallelization

Vector instructions can be generated in two ways: at the loop level using vectorization techniques [BGS94, Bik04], or at the sequence level using pattern-detection algorithms [LA00, SHC05]. We propose a hybrid algorithm able to exploit the parallelism present both in loops and in sequences. The idea is to apply high-level loop transformations such as tiling, then unroll the inner loops so as to expose sequences on which pattern detection can be applied. The algorithm operates on a generic instruction set [Roj04] characterized by the size of its vector registers, and can therefore target all the instruction sets it subsumes.

A recurring problem in vector-instruction generation is that of memory transfers. The proposed pattern-detection algorithm maintains a complete state of the contents of the vector registers and uses this knowledge to limit the number of transfers from main memory, replacing them with copy or shuffle operations between vector registers.
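The unroll-then-pack step can be illustrated on the simplest possible pattern: four isomorphic statements produced by unrolling, merged into one generic 4-wide operation (SIMD_ADD_PS is an invented generic intrinsic name, not taken from the thesis).

```python
import re

# Toy pattern detector: four isomorphic, consecutive scalar statements
# (as produced by unrolling) are packed into one generic vector op.
def pack4(stmts):
    pat = re.compile(r"(\w+)\[(\d+)\] = (\w+)\[\2\] \+ (\w+)\[\2\];")
    ms = [pat.fullmatch(s) for s in stmts]
    if len(ms) != 4 or not all(ms):
        return stmts                      # not a recognizable pattern
    arrays = {m.group(1, 3, 4) for m in ms}
    base = int(ms[0].group(2))
    idx = [int(m.group(2)) for m in ms]
    if len(arrays) != 1 or idx != list(range(base, base + 4)):
        return stmts                      # not isomorphic / not contiguous
    c, a, b = ms[0].group(1, 3, 4)
    return [f"SIMD_ADD_PS(&{c}[{base}], &{a}[{base}], &{b}[{base}]);"]
```

The real algorithm matches a full generic ISA and tracks register contents across statements; this toy shows only the recognition of one isomorphic group.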
5.2 Parallelizing reductions

From the parallelism standpoint, reductions are a bottleneck. Several well-known techniques exist for extracting parallelism from loops that carry a reduction [KRS90, Lei92]. These techniques have been extended to instruction sequences, independently of any enclosing loop. The idea is to identify each reduction variable in the sequence [JD89], and to create an array of the appropriate size so as to postpone the reduction to the end of the sequence.

Whether at loop exit or at sequence exit, parallelizing a reduction involves a postlude that is not parallel. Dedicated hardware mechanisms may exist to avoid performing this reduction sequentially. To abstract this notion, handling of the postlude is delegated to a third-party library for which we provide a sequential implementation, which can be replaced by a hardware implementation when one is available.
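On a simple sum, the scheme reads as follows; the postlude is isolated in its own function, standing in for the third-party library that a hardware primitive could replace.

```python
# Sketch of reduction parallelization on a sum: one accumulator per
# SIMD lane, plus a sequential postlude kept behind a library boundary.
def parallel_sum(xs, width=4):
    acc = [0] * width                 # accumulator array, one per lane
    for i, x in enumerate(xs):
        acc[i % width] += x           # lanes are independent (parallel)
    return reduction_postlude(acc)    # postlude: replaceable by hardware

def reduction_postlude(acc):
    """Sequential postlude implementation."""
    total = 0
    for a in acc:
        total += a
    return total
```

Keeping the postlude behind a function boundary is what lets a target-specific implementation be swapped in without touching the parallelized code.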
5.3 Estimating computational intensity

Parallelizing a loop is not always beneficial. This is especially true for accelerators with a distributed memory, because of memory-transfer times. By summing the volumes [Cla96] of the convex array regions read and written by a statement, an upper bound on the statement's memory footprint can be obtained. Likewise, for static control code, the number of instructions executed by that code can be estimated. From this estimate and the memory footprint, the ratio between computation and communication can be estimated, yielding a local, conservative decision criterion: a loop must not be parallelized if the order of magnitude of the memory transfers is not lower than the order of magnitude of the number of executed instructions.
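A toy version of this decision criterion can be written directly in terms of orders of magnitude; the one-decade threshold below is an arbitrary illustration, not a value prescribed by the criterion.

```python
import math

# Toy offloading criterion: offload only when the operation count
# dominates the transfer volume by at least `margin` orders of magnitude
# (margin=1 is an assumed, illustrative threshold).
def should_offload(ops: int, transferred_bytes: int, margin: int = 1) -> bool:
    if transferred_bytes == 0:
        return ops > 0
    return math.log10(ops) - math.log10(transferred_bytes) >= margin

# Example: an N x N matrix product moves O(N^2) data for O(N^3)
# operations, so it passes the test for large N; a vector copy does not.
```

This matches the conservative intent of the criterion: when in doubt (ratios within the same order of magnitude), keep the computation on the host.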
6. Transformations for memory size and distributed memories

Wm. A. Wulf and Sally A. McKee concluded their 1995 paper [WM95] "Hitting the Memory Wall: Implications of the Obvious" with the following sentence:

The most "convenient" resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology (...).

Fifteen years later, no such technology exists yet, and memory issues remain a critical problem for many parallel applications. In the context of heterogeneous computing, where host memory and accelerator memory are often separate, it is important to manage this hardware constraint with care. To that end, we propose three generic transformations: statement isolation, which separates the accelerator's memory space from the host's, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest such that the inner loops fit in the target memory, presented in Section 6.2; and redundant transfer elimination, presented in Section 6.3.
6.1 Statement isolation

Statement isolation is a transformation that isolates a given statement in a new memory space emulated by newly allocated variables. The idea is to take all the variables referenced by the statement and replace them with new variables of the same type. A transfer-generation step then emits copies from the old variables to the new ones, and back, so as to guarantee the consistency of the values read by the statement.

This transformation involves two kinds of optimization. Based on the array regions read and written by the statement, it can place the arrays referenced by the statement into smaller arrays, thereby limiting the size of the transfers. Based on the array regions produced and consumed by the statement, it can generate copies only when they are useful, thereby limiting the number of transfers.
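A schematic version of the generated code can be produced as follows, assuming region analysis has already reduced an array access to the interval [lo, hi); the helper is a toy illustration, not the actual transformation.

```python
# Toy statement isolator: given the sub-range of an array actually
# touched by a statement, emit the smaller replacement array and the
# in/out copies (only when the statement reads, resp. writes, it).
def isolate(array, lo, hi, reads, writes):
    n = hi - lo
    code = [f"int {array}_iso[{n}];"]
    if reads:   # copy-in only if the statement consumes old values
        code.append(f"memcpy({array}_iso, {array} + {lo}, {n} * sizeof(int));")
    code.append(f"/* statement now uses {array}_iso[i - {lo}] */")
    if writes:  # copy-out only if the statement produces new values
        code.append(f"memcpy({array} + {lo}, {array}_iso, {n} * sizeof(int));")
    return code
```

The two `if` tests are the toy counterparts of the two optimizations above: the region bounds shrink the transfer size, and the read/write flags suppress useless copies.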
6.2 Memory footprint reduction

Memory footprint reduction amounts to searching for tiling parameters that guarantee that the memory volume needed to compute one tile stays below a given bound. To this end, a parametric rectangular tiling of the loop nest under consideration is first performed. Once the nest is tiled, an upper bound on the memory footprint of the per-tile computation is computed from the volume of the convex regions read and written. This yields an expression in the tiling parameters that we seek to maximize. The parameters found are then fixed, returning to a static tiling that guarantees the memory capacity constraint is met.
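Under a deliberately crude cost model (a fixed number of arrays, each contributing one element per tile point), the parameter search degenerates into the following one-dimensional sketch:

```python
# Toy footprint solver for a 1-D tile of size t: under the assumed cost
# model, footprint(t) = arrays * t * elem_size bytes. We return the
# largest power-of-two t whose footprint fits the memory budget.
def tile_size(budget_bytes, elem_size=4, arrays=2):
    t = 1
    while arrays * (t * 2) * elem_size <= budget_bytes:
        t *= 2
    return t
```

The real transformation maximizes a symbolic footprint expression over several tile dimensions; this sketch only shows the "largest tile that fits" principle on one parameter.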
6.3 Redundant transfer elimination

This transformation extends the redundant load and store elimination known for scalar registers. It proposes to treat each array as a register, and each memory transfer generated by statement isolation as a register assignment. Based on this formalism, the redundant-transfer-elimination algorithm hoists memory transfers as high as possible in the internal representation, as long as the motion satisfies Bernstein's conditions [Ber66], and uses elimination rules to merge redundant accesses. This traversal of the internal representation can be intra- or interprocedural, depending on the characteristics of the target.
7. Compiler implementations and experiments

This thesis presents and describes a methodology for specializing compilers for different heterogeneous platforms, based on a well-stocked toolbox of source-to-source transformations, an API for programmable pass managers, and a simple description of the hardware. It would not be complete without experimental validation. The proposed methodology claims to make compiler assembly easier. To validate it, we chose five different targets: three general-purpose CPUs with different vector units, an FPGA-based processor [BLE+08] specialized in image processing, and an nVidia GPU. For each of them, we developed a prototype compiler using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these prototype compilers is measured using benchmarks or applications from the relevant domain.
7.1 A naive OpenMP compiler

This chapter starts with a simple Open Multi Processing (OpenMP) directive generator, presented in Section 7.1, to show how the principles discussed in this thesis apply to a real, practical case. The GPU compiler implemented by HPC Project on top of our work is detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from C to terasm, the assembly language used by the Terapix image-processing processor. Finally, a retargetable compiler for multimedia instruction sets is described in Section 7.4 for three targets: SSE, Advanced Vector eXtensions (AVX) and NEON.

For OpenMP directive generation, the available hardware is first characterized as using shared memory and MIMD-style parallelism, which identifies the code transformations to use, mainly parallelism extraction and detection. There is no post-processing step, since OpenMP directives are an integral part of the internal representation. The prototype compiler assembled this way is validated on the polybench benchmark suite.
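The final directive-generation step can be caricatured as follows, with the parallelism analysis replaced by an explicit set of loop positions (a toy stand-in for the detection passes, not the actual generator):

```python
# Toy OpenMP directive generator: loops that the analysis marked as
# parallel (given here as line indices) receive a directive; all other
# lines pass through untouched.
def ompify(lines, parallel_loops):
    out = []
    for i, line in enumerate(lines):
        if i in parallel_loops:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "#pragma omp parallel for")
        out.append(line)
    return out
```

Because OpenMP is expressed as pragmas in the C source itself, this is the whole back end: no separate code-generation step is needed, which is why this compiler makes a good first demonstrator.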
7.2 A compiler for GPUs

Code generation for GPUs is richer: the shared memory must be taken into account. Only SIMD-style parallelism is considered, ignoring the MIMD capabilities of the hardware. The statement isolation and redundant transfer elimination transformations are used. The compilation scheme is also more complex, since it involves a step that separates host code from accelerator code, a step that converts C into the CUDA language used by the accelerator's source-to-binary compiler, and the final generation of the accelerator binary by the latter, in addition to code generation for the host. The pass manager is used to combine these different steps, with procedure extraction generating as many compilation units as needed. The compiler assembled from these building blocks is validated on several signal-processing kernels, on which average speedups of ×25 are obtained on large data sets.
7.3 A compiler for an FPGA-based image processor

The Terapix accelerator, dedicated to image processing, poses an additional challenge: the accelerator's memory cannot hold enough data to process an image in one pass. Moreover, its ISA is very specific, in particular through its use of a Very Long Instruction Word (VLIW) instruction set. The memory footprint reduction transformation overcomes the size-related limitations. The compilation scheme adds one pass compared to GPUs: the translation of a sequential instruction stream into a VLIW instruction stream. This step is handled by a third-party tool, so the generated code must be formatted to satisfy its input constraints, using low-level transformations such as array linearization, iterator detection, or the conversion of for loops into while loops. The complete compilation chain automatically translates image-processing kernels written in C into Terapix assembly, yielding kernels whose performance is close to a hand-optimized version (cycle counts around 125% of the optimum).
7.4 A retargetable compiler for vector instruction sets

Code generation for general-purpose processors with a small vector unit, e.g. SSE-style, poses a different challenge: the memory constraints are weaker, though similar to the previous cases. Here, the key to performance is the automatic extraction of SIMD-style parallelism with short vectors. The hybrid vectorization algorithm presented in Chapter 5 lifts these constraints, and combines with redundant transfer elimination to limit the number of transfers. Since the transformations involved are generic, the compiler assembled this way is target-independent (apart from the vector register size) and is easily retargeted from one instruction set to another. In practice, this compiler generates more efficient code than GCC and slightly less efficient code than ICC, but supports more architectures than ICC, for instance ARM v7 and the NEON instruction set.
8. Conclusion

The quest for performance now goes through heterogeneous machines: even the laptop used to write this thesis can draw on the computing power of two general-purpose processors, their two associated SSE vector units, and a GPGPU. The main problem with these computing units is the difficulty of programming them. In this thesis, we chose the compilation path to automate code production for hardware accelerators. We focused on the ability to quickly produce several compilers for different targets. Since modern hardware is usually already programmable in some C dialect, we set ourselves the goal of automatically translating standard algorithms written in C into various compute kernels written in C dialects, and of generating the code that calls an accelerator from the host processor. The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the next. This reduces the cost of producing compilers, and using a source-to-source compilation infrastructure makes it possible to interact with existing tools, in particular with the compilers that generate dedicated binary code from C dialects.
8.1 Contributions

A methodology for building source-to-source compilers

We proposed modeling hardware accelerators with hardware constraint diagrams. These diagrams identify the optional and mandatory constraints associated with the hardware. Manually mapping these constraints to code transformations guides the compiler developer through the development process.
Design of a generic compilation infrastructure

The heterogeneity of hardware accelerators makes it difficult to build a single compiler able to target them all. Yet many applications already exist that address some of the problems these machines raise. In Chapter 3, we proposed a compilation scheme that combines a toolbox of source-to-source code transformations, an API for pass management, and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by HPC Project.
Transformations for ISA constraints

Hardware accelerators owe their speed to their specialization: they are more efficient over a narrower application domain. The direct consequence is a specialized ISA. This specialization shows up in the C dialects offered to program these accelerators. Chapter 4 proposes a set of source-to-source transformations for refining high-level C code toward lower-level code. This set notably includes an original outlining algorithm based on convex array regions.
A hybrid SLP algorithm

Multimedia instruction sets are now found on all general-purpose processors, and even on hybrid CPU/GPU chips. We developed an original algorithm building on the state of the art in loop and sequence vectorization. This algorithm unifies the two approaches and is parameterized by a C-level description of the ISA. It thus meets the retargetability criteria stated in Chapter 3. The algorithm was validated on three Multimedia Instruction Set families: SSE, AVX and NEON. This work was awarded the third best poster prize at PACT 2011.
Transformations for memory constraints

Memory issues are critical for many heterogeneous systems: when an accelerator does not share a memory space with its host, RPCs and DMAs are required. The programming model is far more complex than the classical ones. In Chapter 6, we presented three code transformations that take these aspects into account: statement isolation separates the accelerator's memory space from the host's; memory footprint reduction finds the tiling matrix guaranteeing that the accelerator has enough memory to execute the application tile by tile; and redundant transfer elimination removes useless data movements.
Implementation

All the transformations presented in this thesis have been developed for the C language in the PIPS source-to-source compilation infrastructure, and assembled using the PyPS pass manager. They led to the implementation of four compilers: a prototype OpenMP directive generator, a retargetable compiler for vector instruction sets, a microcode generator for Terapix, an FPGA-based processor dedicated to image processing, and a GPU code generator developed by HPC Project. This validates both the compilation infrastructure as a whole and the algorithms proposed in this manuscript. The experiments and the target-specific compilation flows are detailed in Chapter 7.
Contributions to the PIPS Community

It is difficult to separate research from development in a computer science thesis. Integrating new transformations into the chosen infrastructure and extending to the C language passes originally designed for Fortran are indispensable activities to support the research work, but they require a significant time investment. As a member of the PIPS team, I took charge of modernizing the project's compilation infrastructure and rationalized its distribution as packages.

I supervised five Télécom Bretagne students during internships around the PIPS project, and contributed to the scientific outreach of the tool through two tutorials given at international conferences.
8.2 Future Work

The HPC world is in constant evolution. A Sparc64-based supercomputer topped the June 2011 Top500, whereas nVidia GPUs were leading the pack six months earlier. In this shifting environment, nothing is settled yet, and hardware vendors keep pushing their standards to obtain a common programming model together with efficient engineering tools. This requires cooperation and numerous interactions between tools. In this context, bridging the gap between the OpenCL standard and existing VHDL generators is an interesting challenge and a still-open research topic.

However, HPC remains a niche market compared to embedded systems and smartphones. In these domains, the hardware constraints are even stronger: power consumption, weight, volume, etc. The code transformations and the approach studied in this thesis can certainly find applications there.

As multimedia instruction sets become more and more flexible, it is increasingly common to find non-SIMD instructions in them. These instructions allow more elaborate load/store patterns (e.g. non-contiguous accesses) and make it possible to reach better performance for applications bound by their memory accesses. Incrementally adding to our compiler for multimedia instruction sets code transformations able to generate these instructions is a promising topic.

We see two possible extensions of our work on pass managers. First, the operator combination we described leads to the construction of a directed graph that exposes coarse-grained parallelization opportunities at the pass-manager level; exploiting them would improve compilation times by processing passes in parallel. Second, it appears that some pass combinations are useless or redundant. Giving code transformations a precise semantics would make it possible to eliminate such call sequences, for instance in the context of iterative compilation.
Chapter 1
Introduction

Pont de l'Iroise, Brest, Finistère © lazzarello / flickr
Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by processor activity at maximum switching speeds. William J. Dally calls this "The End of Denial Architecture" [Dal09]. To overcome thermal limitations, chip makers increased the number of cores on each die, producing multicore processors, an entirely new direction of development. As transistors have continued to shrink, multiple cores have provided additional computing power by putting more in the same die area with no increase in clock frequency. A second power wall is forecast when transistors become so densely packed that silicon cannot conduct enough heat even at current, fixed clock rates. To continue improving application performance as this second wall approaches, co-processors have emerged and paved the way to heterogeneous computing, where several computation units of different types collaborate. There are many types of co-processors, from specialized ones, which are highly efficient at certain tasks, to more general ones. All share the goals of low execution time and low power consumption. In this manner, two general paths to high performance have arisen: homogeneous many-core machines and heterogeneous machines featuring various accelerator technologies. Of course, creative processor architects have produced exceptions to prove this rule. Intel's Larrabee [SCS+08] processor, for example, is a multicore design with a vector arithmetic unit on-chip. The emerging design space is complex and likely to change continuously.
Beginning some thirty years ago, engineers built a large variety of parallel machines. The Connection Machine and the MasPar MP2 are notable examples. At that time, supercomputers like the Cray-1 cost $58 M and could perform an average of 80 Mflops. By 2006, the PlayStation 3 (PS3) video game console based on the Cell BE architecture cost $500 and could achieve 230 GFLOPS (FLoating point Operations per Second) thanks to a combination of general-purpose and specialized processors. Fewer than a hundred Crays were built, while more than 50 million PS3 consoles were produced. This change of market inevitably created a need for better development tools. By 2006, parallelism was no longer the concern of specialists alone, as it had been in the Cray era.
More can be learnt from the PS3 story: although the PS3 was launched by Sony one year after Microsoft's Xbox 360, the latter had greater success, in great part due to a larger game catalog. It turned out that many game development studios found development for the PS3 and its heterogeneous architecture too difficult. Indeed, the Cell BE architecture involved separate memory spaces, manual control of data transfers, manual handling of the 128-entry vector register file in a vector-only way, and manual cache management. This complexity proved too much for average developers to master.
There are, and probably always will be, three ways to handle such hardware complexity: hire an expert engineer who has a comprehensive understanding of the machine; develop a specialized library that exposes the hardware capabilities through an Application Programming Interface (API); or build a compiler that translates high-level code into the target machine language.
The first option is versatile but costly. The second is efficient but lacks flexibility: drawing maximum performance from complex hardware using only a limited number of API calls may be impossible. The last approach combines the advantages of the first two, provided the needed compiler can be written at reasonable cost. nVidia, for example, has successfully adopted the compilation approach for its General Purpose GPU (GPGPU) technology. Most nVidia GPGPU programmers use an extension of C/C++, the Compute Unified Device Architecture (CUDA) language, and the nVidia compiler nvcc.
Indeed, heterogeneous computing places entirely new demands on the compilation process. Quoting the dragon book [ALSU06],

Definition 1.1. A compiler is a program that can read a program in one language—the source language—and translate it into an equivalent program in another language—the target language.
There is obviously no mention that a compiler may need to transform its one or more inputs into several outputs, one for each accelerator processor in a heterogeneous system, each in its own language. Previous architectures simply did not require this complex behavior. In the common case where one input language results in several target programs, the job is harder: the compiler must internally model the performance of the complete system in order to make good optimization decisions.
The most difficult case occurs when the source code is in a legacy language with no explicitly parallel constructs. Compilers failed in the 1980s because of the difficulty of extracting parallelism from sequential codes. Automatic parallelization is impossible in the general case, because parallel algorithms are completely different from sequential algorithms and a compiler cannot create new algorithms. Even in cases where parallelism detection is within the scope of an automated tool, it is made difficult when the code is obfuscated by unconventional coding methods originally intended to optimize performance on earlier machines and compilers. As parallelism becomes ubiquitous, the way algorithms are designed will also evolve and will take parallel aspects into account. For that reason, it seems reasonable to focus on the compilation of explicitly parallel programs for heterogeneous targets, not on the automatic extraction of parallelism.
Heterogeneous environments sharpen the analogy given in [ABC+06], which depicts software as a bridge 1 between hardware and applications. If software is a bridge, then the compiler is the bridge-builder, responsible for making both ends meet. Many blocks for building such bridges in heterogeneous environments can be found in thirty-year-old literature, but not all. In [Pat10], David Patterson summarizes many aspects of research in parallel computing since the 1960s. As one might expect, his conclusion is that there is no free lunch. Although there have been success stories, there exist no global solutions to the problem of parallel computing, only local solutions to local problems. It follows that there is no good reason to believe that building a compiler for any heterogeneous device will be a straightforward task.
There are glimpses of hope, however. Local solutions for particular devices can be viewed as building blocks. If these are configured for easy reuse, then creating compilers for new target devices will be simplified. The idea of building compilers by composing basic blocks to suit a particular heterogeneous system is the core of this dissertation. Many interesting questions follow. What are the building blocks relevant to heterogeneous computing? How are building blocks to be chained depending on the target? Is it possible to embody all possible hardware specifications in a single Internal Representation (IR)? Ultimately, is there a standard methodology to build compilers for heterogeneous devices?
In a survey of hardware-software co-design techniques published in 1994 [Wol94], Wayne H. Wolfe stated that

To be able to continue to make use of the ever-higher performance CPUs made possible by Moore's Law (. . . ), we must develop new design methodologies and algorithms which allow designers to predict implementation costs, incrementally refine a design over multiple levels of abstraction, and create a working first implementation.
Seventeen years later, Wolfe's challenge is largely unmet. Neither programmers' expertise nor the source code they produce has evolved as fast as hardware architectures. As a result, applications have not greatly benefited from recent alternative hardware designs, except where intense efforts could be dedicated to them. The sheer amount of legacy code does not

1. Inspired by the cover of Communications of the ACM, Vol. 52 No. 10, we illustrate each chapter of this thesis with photos of classical bridges in Brittany.
admit economically feasible solutions other than compilers, which appear to be the only way to bridge the hardware-software gap. The advice Wolfe gives to hardware designers still holds for compiler designers, who face the tremendous task of porting legacy codes, implementing sequential algorithms written in a sequential language, to ever-changing parallel co-processors: incrementally refine a design, use multiple levels of abstraction and create a working first implementation. These tasks are made more difficult by tools based on traditional compilation flows targeting a single language per tool, which are ill-suited to heterogeneous platforms.
This dissertation adopts Wolfe's three steps to assemble compilers for various heterogeneous platforms, ranging from classical multicores to a Field Programmable Gate Array (FPGA)-based image processor, via a Graphical Processing Unit (GPU) and small vector units. Our objective is not to build the best possible compiler for each target, but to realize a reasonable compilation scheme for each while reusing as many building blocks as possible.
To achieve this goal, we begin with a study of heterogeneous devices and existing programming paradigms in Chapter 2. We note three families of constraints resulting from corresponding dimensions of heterogeneity: the Instruction Set Architecture (ISA), the memory architecture, and the source of acceleration. Chapter 3 shows that traditional compiler frameworks lack the flexibility needed to compose fine-grained code transformations, which are needed to overcome these constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 further examine the three families of constraints and detail innovative target-independent transformations that form the building blocks of our approach.
This dissertation is not solely a theoretical work. Several compilers have been realized with our ideas. Their design and implementation, along with performance benchmarks, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws final conclusions.
All the transformations and compilers described in this document have been implemented using the Paralléliseur Interprocédural de Programmes Scientifiques (PIPS) compiler infrastructure developed by the Centre de Recherche en Informatique (CRI) of MINES ParisTech, with contributions from the HPC Project startup, Télécom SudParis and Télécom Bretagne. A quick review of this project is given in Appendix A.

Most of the ideas developed in this thesis are already integrated in the Par4All tool by HPC Project.
Chapter 2
Heterogeneous Computing Paradigm

Pont Fleuri, Quimperlé, Finistère © Jean Louis Lemoigne
On February 15, 2007, nVidia introduced a way to program general purpose applications on Graphical Processing Units (GPUs), a class of hardware previously confined to the manipulation of computer graphics, through the Compute Unified Device Architecture (CUDA) language and its associated Software Development Kit (SDK). Since then, General Purpose GPUs (GPGPUs) have dug their way into physical simulations, video file conversions and cryptography, and many heterogeneous devices have appeared as efficient solutions to perform specific computations. Many firms now propose dedicated hardware, such as Field Programmable Gate Arrays (FPGAs) from Xilinx, Systems on Chip (SoCs) or Multi-Processor Systems-on-Chip (MP-SoCs) from Texas Instruments, and micro-controllers from Atmel.
In November 2010, the Tianhe-1A supercomputer, a cluster powered by more than 7,000 Tesla cards, ranked first in the Top500. 1 Since then, heterogeneous computing has been assessed as a viable alternative to multicore computing for scientific computations.

1. As of June 2011, the second, third and fifth systems of the Top500 use nVidia GPUs.
However, heterogeneous computing is rather different from homogeneous computing, particularly with respect to three critical aspects: memory, parallelism and instruction sets. This leads to a complicated interleaving of concepts specific to each target, yet independent from the high-level organization of the source code. An interesting illustration of this complexity was given at the Supercomputing 2010 conference, where panelists discussed the three P's of heterogeneous computing.
The first P stands for Performance: unlike general purpose processors, which must deliver average performance on any application, the goal of hardware accelerators is to deliver high peak performance on specific applications. They achieve this goal through extensive use of parallelism, either Single Instruction stream, Multiple Data stream (SIMD), Multiple Instruction stream, Multiple Data stream (MIMD) or pipelining, or through the use of hard-coded, optimized routines.
The second P stands for Power: power consumption plays a key role both in embedded systems, to maximize usage time between two recharges, and in supercomputers, to minimize electricity costs and building size. Hardware accelerators are relevant candidates to improve the FLoating point Operations per Second (FLOPS) per Watt metric, e.g. because less power is spent in non-computational operations.
The third P stands for Programmability, the major weakness of hardware accelerators. Indeed, programming a hardware accelerator implies a paradigm shift from sequential programming: to circuit programming via the VHSIC Hardware Description Language (VHDL) [TM08] for FPGA boards; to 3D programming, e.g. via the Open Graphics Library (OpenGL) or DirectX, for GPUs; or to parallel programming, e.g. via CUDA or the Open Computing Language (OpenCL), for GPGPUs. Moreover, the basic execution model generally introduces the additional complexity of remote memory management, a time-consuming task that developers are not used to.
This chapter presents the heterogeneous computing model in detail in Section 2.1 and its consequences for the programming model in Section 2.2. The concept of hardware constraints is introduced in Section 2.3 as a way to model the interaction between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of using the C language to program specific hardware components. As an illustration of a standardized programming model, we analyze OpenCL's in Section 2.5. Other programming models are presented in Section 2.6.
2.1 Heterogeneous Computing Model

The OpenCL specification [KOWG10] illustrates heterogeneous computer organization with Figure 2.1, which shows that the main characteristic of a heterogeneous device is the presence of several computational units with different capabilities.

Figure 2.1: Heterogeneous computing model.

The key to performance is to use these different capabilities to the best of their capacity in order to achieve efficient computations, because each device is specialized in one kind of computation: e.g. a GPGPU is well suited for intensive, regular computations, while a General Purpose Processor (GPP) performs better on irregular, unpredictable algorithms. A dedicated FPGA board performs better on the particular streaming signal processing computation it was designed to handle. Even more specialized computing units, Application-Specific Integrated Circuits (ASICs), are also used to match very specific designs 2. Similarly, some hardware accelerators provide functionalities dedicated to a specific task: the ClearSpeed Advance board [GG] accelerates the Intel Math Kernel Library (MKL), and FPGA-based chips have been used to speed up Photoshop™ image processing [Sin98]. Ten years later, the Cray XD1 combined AMD Opterons with Xilinx FPGAs to speed up the Smith-Waterman algorithm [Sto08]. The same approach is used in Intel's Stellarton, which pairs an Atom E600 with an Altera FPGA to strike a balance between performance and flexibility, showing the mainstream interest in such platforms.
The devices enumerated above do not collaborate in a purely decentralized way. A host device, generally a GPP, is in charge of scheduling the computational tasks among the hardware accelerators, in a master-slave fashion. In many cases, each device has its own memory and has to communicate with the others in order to share computational tasks: this is a strong paradigm shift from sequential programming to distributed computing, and it emphasizes the first difficulty of heterogeneous computing: the platform model. It also influences the memory model.
Heterogeneity is also found in the computational units themselves, as illustrated by Figure 2.2: Figure 2.2a describes a typical von Neumann architecture, and Figure 2.2b describes the OpenCL view of a generic computational device.
In a recent article [Wol11], Michael Wolfe goes through no fewer than eight levels of parallelism that must be mastered to reach exascale performance. Heterogeneity is found at several levels, as illustrated by the French experimental grid Grid5000 [INR], a gathering of 9 clusters counting around 1,500 nodes, almost 3,000 processors and more than 7,000 cores. Heterogeneity is a fundamental characteristic of this grid: at the node level, with nodes from Altix, Bull, Carri System, Dell, HP, IBM or Sun; at the socket level, with for instance the nVidia Tesla S1070 available at the Grenoble site; at the core level

2. The main advantages over FPGAs are the ability to customize the circuit form, lower unit costs and full customization possibility.
Figure 2.2: von Neumann architecture vs. OpenCL architecture. (a) von Neumann architecture; (b) generic OpenCL node architecture.
with 17 different kinds of processors from two main families (AMD Opteron and Intel Xeon); and thus at the vector level, with different supported Streaming SIMD Extension (SSE) versions, not to mention that different instruction sets are supported from one node to another. An application that is not aware of the specificities of each node it runs on cannot achieve minimal execution times.
Where homogeneous computing assumes a single processing element per computational unit, heterogeneous computing considers multiple processing elements per device. In Flynn's taxonomy [Fly72], this means a move from the Single Instruction stream, Single Data stream (SISD) paradigm to either SIMD or MIMD. A survey [BDH+10] by André Rigland Brodtkorb et al. further refines this view and enumerates possible organizations of a modern computational device. The main concept is that acceleration is achieved through specialization, so specialized hardware with specialized instructions or organizations is used. This shows the second difficulty related to heterogeneous computing: the execution model.
Moreover, the memory described in Figure 2.2b is neither flat nor uniformly shared among processing elements; rather, a memory hierarchy is exposed inside the computational device, in addition to the memory partitioning imposed by remote execution. The potential benefits for the developer are optimized cache management and data movement handling, but this also seriously complicates the development of applications. The combination of distributed and hierarchical memory is the third difficulty posed to developers by heterogeneous computing: the memory model.
2.2 Influence on Programming Model

The heterogeneous computing model has an important impact on the programming models used to effectively program the devices. Considering the three difficulties raised in the previous section, it is no surprise that many programming models have been proposed to develop applications for heterogeneous architectures.

The memory model involves ideas from the distributed computing community. Message passing protocols have been reviewed in [McB94]. Among them, active messages [vECGS92] offer a particularly elegant and efficient solution to distributed memory management, and the Message Passing Interface (MPI) [WW94] has emerged as a leading and standardized solution. Such interfaces enable fine-grain control over data movement and potentially higher performance, at the expense of manual management of data consistency. Several approaches have been proposed to relieve developers from the manual declaration of data transfers: one alternative is to use a shared virtual memory space, as described in the survey [IS99]; another is to use Remote Procedure Calls (RPCs) and to rely on the runtime or the language to automatically transfer arguments [BN84]. Both approaches face the problem of scheduling calls over the underlying architecture. However, this topic is beyond the scope of this thesis, for we only consider heterogeneous platforms with a single host and a single accelerator. Automatic generation of data transfers and scheduling of computational tasks was an active topic in the days of High Performance Fortran [Wol96, ACIK97]
Figure 2.3: Impact of heterogeneous architecture on compilation (the sequential code is manually split into host code and per-device codes, each fed to its own compiler to produce the host and device objects).
and gpus have brought renewed interest in the topic [LVM+10, JPJ+11]. Another issue of distributed computing is the difference of data representation across architectures. It forces the use of a common representation or adds translation costs to all data transfers.

The load-work-store idiom is typical of rpc. It puts a new constraint on the host code, because it serializes the processing of a computational task by a computational device in three to five steps:

1. allocate: the host allocates memory on the remote accelerator. On embedded devices, this step may be optional, for memory allocation is managed by the user;
2. load: the host transfers the data from its memory to the allocated remote (accelerator) memory;
3. work: the accelerator performs computation on the loaded data and notifies the host of the computation end;
4. store: the host transfers the data back from the remote memory to its own memory;
5. deallocate: the host frees the memory allocated in step 1. Likewise, this step may be optional.

The main drawback of this approach is that only step 3 contributes to acceleration. All the other steps actually slow down the process.
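The five steps above can be sketched in plain C. This is only an illustrative model, not a real accelerator api: the device memory is simulated by a separate heap buffer and the transfers by memcpy, and the function names (device_kernel, offload_scale) are ours.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Step 3 (work): the kernel the accelerator would run; here a simple scaling. */
static void device_kernel(float *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.f;
}

/* Host-side driver following the load-work-store idiom. */
void offload_scale(float *host, size_t n) {
    float *dev = malloc(n * sizeof *dev);  /* 1. allocate device memory  */
    if (!dev)
        return;
    memcpy(dev, host, n * sizeof *dev);    /* 2. load: host -> device    */
    device_kernel(dev, n);                 /* 3. work on the device copy */
    memcpy(host, dev, n * sizeof *dev);    /* 4. store: device -> host   */
    free(dev);                             /* 5. deallocate              */
}
```

Only step 3 performs useful computation; steps 1, 2, 4 and 5 are pure overhead, which is why the idiom only pays off when the kernel is expensive enough.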
The architecture model implies that each device involved in a heterogeneous computation may use a different instruction set. Devices generally have to be programmed in different languages. For a collection of n different accelerators, it means n different versions of the code to maintain and to evolve simultaneously, even if we use a single accelerator at a time. Figure 2.3 illustrates this concept.

The interactions get more complex as several accelerators are combined in the same heterogeneous platform. Moreover, because accelerators are not general purpose processors, they are likely to require cross compilation. Code is strongly dependent on the device architecture combination, either in source or binary form, which is a limitation for application portability. This limitation can however be overcome by bundling all possible object
code or source code, at the expense of larger binaries, and by having a host code aware of the possible change of hardware, as supported by middlewares such as StarPU [ATNW09] or kaapi [HRF+10]. An interesting consequence of Figure 2.3 is that the host code is relatively independent of the device codes, except for the development part. Following Amdahl's law [Amd67], the host should also execute the least computation-intensive part of the code. As a result, there are very few constraints on this part of the code.
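Amdahl's law bounds the overall speedup when only a fraction p of the execution time is accelerated by a factor s; the following helper (the function name is ours, not from the thesis) makes the bound explicit:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the sequential
 * execution time is accelerated by a factor s, while the remaining
 * (1 - p), the host part, runs unchanged. */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Even with an infinitely fast accelerator, the speedup is capped at 1/(1 - p), hence the host part, however small, still bounds the whole computation.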
The execution model varies a lot across accelerators: there is little in common between a pipelined vector processor and a pure simd processor. However, most hardware accelerators get their speedup from parallelism, and many programming models have been proposed for parallel computing. Among them, we note automatic parallelization through loop nest analysis [Lam74, IT88, WL91b, IJT91, DSV96] or for irregular applications [DUSsH93], partitioned global address space languages [CDMC+05], domain-specific languages like Chunk [WC03] or libraries like opengl [SA94]. Stream processing [Ste97] takes advantage of a limited form of parallelism. Task level parallelism has also gained in popularity in the multicore era with languages like Cilk [BJK+95], as well as directive-oriented parallelization such as Open Multi Processing (openmp) [Ope11]. This small sample of the large set of approaches introduced to bridge the gap between the user and the various parallel paradigms should convince anybody that there is no holy grail in the field. An interesting note however is that most approaches listed above involve either the C or the Fortran language. Similar approaches for more recent languages like Chapel or X10, or for general purpose languages like Java, have been proposed, but they have a smaller audience, certainly due to the amount of legacy code. It also shows that the diversity of the hardware automatically leads to ad-hoc responses and a one-to-one binding between compilers and hardware platforms: the hardware vendor provides a means to program the device, either a language, a C extension or a library, and the user must cope with it. A notable success of this approach is the nVidia cuda [NVI11] language that makes it possible to program nVidia graphical devices. The host code is not subject to these considerations and the language used to develop it is not as relevant for high performance computing as for device codes, provided a binding to a lower level language, say C, is available, like the interaction between gpgpu and high level languages, as provided by the gpulib. 3
2.3 Hardware Constraints

In the previous section, we described three aspects of the heterogeneous computing model. A key aspect is that there is not a unique heterogeneous computing model, but a collection of models, one per hardware target, all of them fitting in a general and extensible model, as illustrated by the feature diagram in Figure 2.4.

[Figure 2.4: Example of hardware feature diagram, showing a Hardware Device with memory (rom/ram, shared/distributed), isa and Acceleration (Specialization, Parallelism: simd/mimd) features, each marked optional or mandatory.]

The topic of this dissertation is not to propose yet another taxonomy of parallel machines [Dun90, Che94, HP06], even with respect to the limited scope of heterogeneous machines. As a consequence, we selected a limited number of features among the existing ones, based on their presence in the following hardware:

– a desktop computer with several cores and a modern gpu board;
– a laptop with several cores with vector instruction units;
– an embedded system with a single processor and a vector instruction unit;
– an embedded device with a fpga-based accelerator.

3. gpulib [MMG08] is the evolution of the pystream project, a Python binding for cuda.

The above targets exhibit three main sources of heterogeneity that must be dealt with:

Instruction Set Architecture: The presence or absence of the following features at the instruction level requires compiler support:
– vector registers or instructions;
– complex numbers;
– maximum number of operands per instruction (generally 2 or 3);
– supported operations.
Memory: As many applications are memory-bound, taking into account the hardware specificities of memory is often critical:
– memory size;
– memory hierarchy;
– cache management;
– distributed memory;
– Read Only Memory (rom);
– Direct Memory Access (dma) flexibility;
– dma speed.
Acceleration Features: One of the motivations for heterogeneous machines is performance 4, performance that comes from one or more of the following features:
– specialized computation unit;
– mimd execution mode;
– simd execution mode.

[Figure 2.5: Multicore with vector unit feature diagram. A Multicore Device has a shared ram memory, an isa, and Acceleration through Parallelism (mimd and simd).]
Some important aspects are set aside in this dissertation, especially the memory / cache hierarchy and the availability of asynchronous dma. Both are critical to achieve high performance: cache misses and false sharing can drastically reduce performance on Central Processing Units (cpus), gpus' shared memory has a lower access latency, and asynchronous data transfers are commonly used to overlap communications and computation. Those transformations, although often critical to reach high throughput, are not mandatory to build a working compiler: we follow a conservative approach that aims at producing reasonably efficient code, not code highly specialized for a single target, following the saying "have a working code before you consider optimizing it".

Taking advantage of read-only memory, shared memory or vector types is optional, while acceleration is a mandatory feature specific to the hardware. A particular hardware device that fits in this model can be described using a restricted feature diagram. Figure 2.5 shows the feature diagram of a typical multicore device with vector units.

The difference between an optional and a mandatory feature is important for performance: for a code to be executed on a specific piece of hardware, it must be aware of all the mandatory features. A code aware of optional features can turn this extra knowledge into a performance boost, e.g. for nVidia's gpgpus, the use of shared memory is optional but is critical to the performance of some applications, while unnecessary for others.

4. Depending on the context, it can also be development costs or power consumption.
Definition 2.1. A source-to-binary compiler is a compiler that translates its input code into machine code.

Examples of source-to-binary compilers include icc, the Intel compiler for the C++ language, or nvcc, the nVidia compiler for the cuda language. Further examples of such tools and dialects include:

Mitrion-C used for C to vhdl translation;
c2h also used for C to vhdl translation;
gcc vector types used for automatic multimedia instruction set manipulation;
cuda used for nVidia gpgpu code generation;
opencl used for generic hardware accelerator code generation.
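The gcc vector types mentioned above extend C with fixed-size vector values that the compiler maps to multimedia instruction sets (sse, altivec, ...). A minimal sketch, assuming a gcc-compatible compiler; the type and function names are ours:

```c
#include <assert.h>

/* A vector of four 32-bit ints, 16 bytes wide; gcc lowers arithmetic
 * on such types to simd instructions when the target supports them. */
typedef int v4si __attribute__((vector_size(16)));

/* Element-wise a*x + y on four ints at once. */
v4si saxpy4(int a, v4si x, v4si y) {
    v4si va = {a, a, a, a};  /* broadcast the scalar into a vector */
    return va * x + y;       /* one vector multiply-add, not four scalar ones */
}
```

The same source compiles for any target: when no vector unit is available, gcc falls back to scalar code, which is precisely the kind of portability the vendor-specific dialects listed above lack.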
The goal of a compiler like c2h or nvcc is to generate hardware code or circuits from standard C code. The accepted idea concerning such high level synthesis is that what is gained in development time is lost in performance. However, a recent publication [CDL11] shows this approach can both reduce development time and increase performance.

This kind of generator must ensure an original code matches all hardware features, and may use its optional features. Because some features are in conflict with the language, or because it is easier to support a subset of it, these tools move from standard C to dialects. A dialect is a reflection of the hardware features, which we call hardware constraints. A hardware constraint is embodied by a restriction or extension of the original language and forms the core of the difficulty of programming hardware devices using vendor compilers.
2.4 Note About the C Language

As shown above, most languages designed to abstract hardware complexity are extensions or dialects of the C language. Historically, there have been three major versions of this language: K&R C (1978), C89 (1989) and C99 (1999). A Greek Athenian comic dramatist once wrote:

High thoughts must have high language.
Aristophanes, Frogs, 405 B.C.

However, many compilers still rely on C89 and do not benefit from the advantages of C99, not to mention the planned C1X. As a consequence, critical features such as native complex numbers and variable length arrays are not used, and are replaced by structures and pointers, respectively. This greatly lowers the expressiveness of the code, making it harder to maintain, and also harder to compile. Figure 2.6 illustrates this difference on a complex matrix-vector multiply.

Most available benchmarks are written in C89; there are two typical arguments in favor of this choice:

1. It is compatible with more C compilers. In particular, the C++ language is not compatible with certain features of C99, such as variable length arrays; 5

5. Contrary to the ISO/IEC 14882:2003 standard, the forthcoming C++0x standard includes most of the C99 specificities.
typedef struct { double r, i; } Complex;
void matrix_vector_multiply(int M, int N, Complex *m, Complex *v,
                            Complex *out) {
  int i, j;
  for (i = 0; i < M; i++) {
    out->r = out->i = 0.;
    for (j = 0; j < N; j++) {
      Complex *mi = m + i * N + j, *vi = v + j;
      out->r += mi->r * vi->r - mi->i * vi->i;
      out->i += mi->r * vi->i + mi->i * vi->r;
    }
    out++;
  }
}

(a) C89 version.

void matrix_vector_multiply(int M, int N, complex m[M][N],
                            complex v[N], complex out[M]) {
  for (int i = 0; i < M; i++) {
    out[i] = 0.;
    for (int j = 0; j < N; j++)
      out[i] += m[i][j] * v[j];
  }
}

(b) C99 version.

Figure 2.6: Complex matrix-vector multiply, C89 version vs. C99 version.
[Figure 2.7: Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for the cfar, ct, db, fdfir, ga, pm, qr, svd and tdfir kernels.]
2. A direct code translation into assembly favors the pointer versions.

The gnu C Compiler (gcc), the Intel C++ Compiler (icc), and ibm's, hp's and pgi's compilers all support C99, so Argument 1 is mostly wrong. Among modern industrial compilers, only Microsoft's does not provide this feature. We have carried out experiments on the High Performance Embedded Computing (hpec) Challenge benchmarks [LRW05], a benchmark suite written in C89, to verify how the switch to C99 impacts performance. The original version has been completely rewritten to take advantage of C99 features, then both the old and the new versions have been compiled with icc version 12.0.3 on a desktop computer. Each benchmark is run 100 times and the median value is picked. Figure 2.7 shows results normalized against the original version using the -O3 flag. A result greater than one means the C99 version executes faster.

Figure 2.8 shows the result of the C89 to C99 conversion on the CoreMark benchmark [Con]. The behavior of both gcc and icc is evaluated and normalized with respect to the original version. The metric used is the number of iterations per second, as returned by the benchmark. The gcc compiler flags used are -O3 -ffast-math -march=native and icc's are -O3. A desktop station running a 2.6.38-2-686 GNU/Linux on 2 Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz is used to run all the benchmarks presented above.

The same transformations have been performed on the linpack benchmark [DLP03], used to rank machines in the top500. Results are displayed in Figure 2.9. It shows a small slowdown for the readability gain displayed in Figure 2.6.
[Figure 2.8: Comparison of two versions of the Coremark benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for icc and gcc.]

We observe that using C99 can imply a small performance loss, and icc is more impacted than gcc. This is due to two causes:
– some kernels take advantage of pointer arithmetic to perform optimized iterations over two-dimensional arrays, while indexed arrays suffer from non-optimized address computations;
– using data allocated on the heap as array pointers involves complex casts from void* to, say, int (*)[n][m] that disrupt the pointer analysis.
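The heap-allocation pattern referred to above can be sketched as follows: a flat block returned by malloc is reinterpreted as a pointer to a two-dimensional C99 variable-length array, which keeps the indexed m[i][j] syntax but introduces exactly the kind of cast that disrupts pointer analyses. The function name and fill pattern are ours, chosen only for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Fill an n x m matrix with i + j, using a C99 pointer-to-VLA view
 * of a flat heap allocation; returns the bottom-right element. */
int fill(size_t n, size_t m) {
    int (*a)[n][m] = (int (*)[n][m])malloc(sizeof *a);  /* the disruptive cast */
    if (!a)
        return -1;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            (*a)[i][j] = (int)(i + j);                  /* indexed, not pointer, access */
    int corner = (*a)[n - 1][m - 1];
    free(a);
    return corner;
}
```

Compared to the manual p[i * m + j] linearization of C89, the accesses stay affine in i and j, which is what the polyhedral analyses discussed next rely on.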
However, C99 array declarations provide a readability gain. More importantly, polyhedral analyses are made harder by the use of non-linear array accesses. For instance, the Pluto [BBK+08] compiler only handles affine loop nests. The Paralléliseur Interprocédural de Programmes Scientifiques (pips) [AAC+11] compiler framework suffers from the same limitations and most of its analyses lose accuracy in the presence of non-affine subscript expressions. Transformations have been proposed to automatically delinearize array references by separating elements of a non-affine equation into affine groups [Mas92] or to recover arrays from pointers [FO03]. In this dissertation, we assume that input codes are written using the high level constructs of the language and that arrays are not linearized.
2.5 OpenCL Programming Model Analysis

opencl [KOWG10] is a recent proposal to standardize the way hardware accelerators are programmed. It provides a unified language derived from C99 to write kernels and an Application Programming Interface (api) to manage kernel calls.

It is interesting to analyse the differences between opencl and the C99 language 6 :

simpler function handling: no recursion, no function pointers, no variable number of arguments;
limited pointer support: pointers to types that are fewer than 32 bits wide cannot be dereferenced;
no variable size structures: no variable-length arrays or structures with flexible array members;
storage qualifiers: C99 storage qualifiers are forbidden and replaced by __global, __constant, __local or __private;
image support: built-in types and functions to manipulate 2D and 3D images;
math support: built-in geometric and mathematical functions such as cos 7 or dot, but no math library;
vector support: for all primary types, with up to 16 elements per vector.

6. As specified in the opencl sdk reference manual.

[Figure 2.9: Comparison of two versions of the Linpack benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for icc and gcc.]
Data transfers are managed through synchronous or asynchronous dma, plus prefetching and memory fences. It is possible to manage multiple devices in a thread-safe fashion. The api also includes facilities to use opengl buffers or textures with opencl code.

It appears clearly from these differences that opencl was designed primarily for gpu devices and signal/video processing, as shown by the built-in image support, the vector types and the api sharing with opengl. Lower level architectures suffer from the type restrictions and the absence of bit-fields. Moreover, the opencl programming model targets hierarchical arrays of processing elements with a corresponding hierarchical memory structure, as found in gpus and multicores, but not in fpgas. To partially address this shortcoming, a relaxed version of opencl, called opencl embedded profile, has been released. It lowers the requirements on data types (no 64 bit integers, no 3D images), on floating-point compliance (Inf and NaN not required, lower requirements for some function accuracy) and on hardware capacity (minimal image height/width, local memory size).

7. Note however that the standard is less restrictive than the IEEE 754 specification concerning the Unit in the Last Place (ulp).

[Figure 2.10: Compilation flow in opencl. The host code and the device code are bundled together; the host compiler produces the host object, while source-to-binary compilers 0, 1 and 2 each produce their own object (object 0, 1 and 2) from the single device code.]

So opencl proposes a generic, library-based approach to the problem of heterogeneous hardware targets. The core idea is to expose in a standard api the host calls to the different hardware devices. The device code is generated at runtime using Just In Time (jit) compilation, by a source-to-binary compiler, from code shipped within the application as a textual representation. It changes Figure 2.3 into Figure 2.10, where only one device code exists.
A unique representation of the device code is critical for source code portability. It achieves source code portability at the expense of shifting the complexity to source-to-binary compilers. Since its first release, opencl has been used to generate code for the following targets:

multiple nVidia gpgpus combined with multicores, through nVidia's opencl compiler;
multiple amd gpgpus combined with multicores, through the ATI Stream Software Development Kit;
Intel multicores, through the Intel® opencl sdk;
sse-enabled multicores, through the Fixstars opencl Cross Compiler;
the Cell engine, through the IBM opencl sdk.

To our knowledge, no company supports fpga code generation and no paper on this subject has been published yet. In addition to the complexity of the translations listed above, this may be due to the existence of several tools to perform this task [KBM07, GCB07, GNB08]. As a consequence, switching to opencl would imply short-term development costs that overtake the long-term benefit of code portability.
opencl does not relieve developers from host-side tasks: communication management is still handled manually, the api for marshalling arguments being at a rather low level 8 . As a consequence, input code has to be completely rewritten and split into two codes: the host code with opencl calls and the kernel code.

Also, opencl does not guarantee performance portability: an opencl kernel tuned for a specific platform is not guaranteed to behave as well on another platform. For instance, a performance study [KSA+10] shows that moving from one gpu to another requires adjusting kernel parameters to achieve the best results.
From an engineering point of view, the compilation scheme given in Figure 2.10 is limited in two ways. Firstly, there is no sharing of common optimizations between different opencl compilers. Actually the design does not prevent such sharing but, as shown in the above enumeration, the trend is to have each vendor develop its own version of an opencl compiler to support its hardware. Secondly, these compilers are basic compilers more than optimizing compilers, in the sense that their primary goal is to generate code for the hardware, not to optimize user code. opencl leaves this task either to application developers, or to compiler developers. Given the cost of developing a full-fledged optimizing compiler, the task is generally left to developers.

Still, opencl paves the way to heterogeneous computing normalization. In spite of its limitations, its programming model has received a good following, and the number of compilers that support it, as well as of software development tools (debuggers, profilers, etc.), is quickly growing. Chapter 3 proposes an extended approach that borrows several ideas from the opencl model but makes host-side development easier, while achieving a good level of performance and enforcing compilation pass reuse.
2.6 Other Programming Models

Heterogeneous computing was first found in clusters of machines, where different nodes had different processors, and is now making its way to desktop computers. Because the performance of heterogeneous computations is linked to the proper scheduling of the different tasks that compose the program, many papers have studied the decomposition of a program into independent tasks and their scheduling. [BSB+01] compares several static approaches to the problem, while [SSM08] uses a stochastic model. Scheduling decisions can also be dynamic, as in [ZWZD93]. Recently, frameworks such as Hadoop [Whi09] have emerged to provide file system and job scheduling integration.
However, the number of devices involved in heterogeneous clusters and in heterogeneous computers is at a different scale, and the data transfer rates of the network connections in a cluster and of the Peripheral Component Interconnect (pci) connections inside a computer introduce different factors. For this reason, heterogeneous computers tend to be simpler to use efficiently. In particular, the scheduling issues are limited to a dozen nodes (e.g. eight for the Cell Processor [PAB+06]). As a consequence, we do not focus on this topic in this thesis.

8. Close to the way arguments are pushed on the stack during a traditional function call.
Apart from the opencl model discussed in Section 2.5, other models have been proposed. The case of pgi is particularly relevant because it is an industrial compiler, thus driven by user needs and working solutions. [Wol10] proposed an accelerator model coupled to a high level programming model mostly targeted at gpus. It is based on compiler directives and thus benefits from the associated incremental development concept. The only required directive is #pragma acc, and the others can be used to refine the compiler's analyses (e.g. to specify data movement or loop scheduling), an approach to parallelization that has been shown to provide good results [KS99]. This approach relieves developers from most low level manipulations and makes it possible to think of the code in terms of kernels only. The issue of performance portability is handled by bundling several versions of the kernel, one per targeted hardware, in the same binary.

The directive approach is also taken by Hybrid Multicore Parallel Programming (hmpp) [BB09], but in that case nothing is automatic and all the decisions must be made by developers through directives. The advantage is that the user has greater control over the application's behavior, at the expense of less automation.
Hardware-software co-design [Wol94], and especially hardware-compiler co-design [ZPS+96, WGN+02], is an alternate approach where the hardware and the software evolve hand-in-hand, whereas in the current situation, the software is struggling to match hardware evolution. This approach has been taken in the Delft workbench [PBV06], which relies on the molen [PBV07] machine organization. Retargetable compilation is achieved through the use of reconfigurable hardware to provide a user-specific instruction set, and of code transformations aware of reconfigurable architectures, e.g. to hide reconfiguration latencies with a specific instruction scheduling algorithm.
A similar approach that also bypasses the limitation induced by communication overhead has been proposed: the Convey HC-1 computer [Bre10] is a hybrid system that uses Xilinx fpgas as co-processors, with the specificity that the processor and the co-processor share a virtual memory. The co-processor is accessed through user-defined instructions that rely on the concept of personalities, hardware-level descriptions of intrinsics exposed at the user level. Host and accelerator codes are consequently mixed together in a unique source code. The benefit of this approach is that developers are relieved from the management of data transfers. However, writing personalities involves writing a hardware description of each intrinsic, so part of the problem related to heterogeneous computing is not solved.
The usage of an Application-Specific Instruction set Processor (asip) is another way
to exploit heterogeneous computing. In a nutshell, instead of relying on a general-purpose
instruction set implemented on a general-purpose processor, a dedicated processor is built to
run a single soc. This approach is only viable if the process of generating the processor can
be automated: a compilation step is naturally involved to generate such processors. In such
42 CHAPTER 2. HETEROGENEOUS COMPUTING PARADIGM<br />
situations, retargetability of the compiler is a key property to be able to generate a wide
range of processors. Such a compiler is described in [GLGP06], using a compilation flow
that involves a compiler infrastructure parametrized by a processor model and a Hardware
Description Language (hdl) generator. The processor model contains an instruction set
model, and both are described in terms of functional units, custom data types,
connectivity, storage, etc.
2.7 Conclusion<br />
<strong>Heterogeneous</strong> computing and multicores are the two current most important keywords<br />
<strong>for</strong> High Per<strong>for</strong>mance Computing (hpc) and low power. The amount of available hardware<br />
implementing different programming models makes it hard <strong>for</strong> developers <strong>to</strong> adapt existing<br />
software <strong>to</strong> new architectures. Because of the fast pace of evolution, any porting ef<strong>for</strong>t or<br />
per<strong>for</strong>mance improvement may be jeopardized by a hardware change. In this situation,<br />
ef<strong>for</strong>ts have been made by the hardware community <strong>to</strong> leverage their interface, and by the<br />
software community <strong>to</strong> propose new languages or libraries <strong>to</strong> ease the port of applications.<br />
In between, the compiler community tries hard <strong>to</strong> generate efficient glue code between the<br />
hardware layer and the software layer. The difficulty lies in the choice of a combination of<br />
a programming model and a programming language that is simple enough <strong>for</strong> developers <strong>to</strong><br />
use, but sufficiently rich so that the compiler can extract enough in<strong>for</strong>mations <strong>for</strong> efficient<br />
hardware code generation. Pragmatically, a subset of the C99 language is used, <strong>to</strong> limit<br />
the cost of the technology shift, from both source code and developer points of view. The<br />
opencl standard paves the way <strong>for</strong> such an approach but it suffers from several design<br />
limitations.<br />
The contribution of this chapter is to present a state of the art of the existing alternatives
for heterogeneous computing, centered on compilation aspects. A study of existing
C dialects shows the advantages of using the C language to program hybrid architectures,
and the usual hardware limitations exposed by the language. Three different benchmarks
have been manually converted to C99-style variable-length array declarations to show the
performance impact of a higher-level description. Although current compilers generate
slightly less efficient code from C99 input, the gain in expressiveness favors maintainability
and makes it easier to apply high-level transformations such as those from the polyhedral
model.
Chapter 3

Compiler Design for Heterogeneous Architectures
Vieux Pont de Dinan, Ille-et-Vilaine © iris.din / Flickr
Until the end of the 20th century, only one kind of architecture was used to build general-purpose
computers. As a consequence, typical compilers have been built to efficiently target
those architectures: a front-end parses the input code into an Internal Representation, a
middle-end performs various optimizations (hopefully language-independent), either at the
basic block level, loop level, function level or program level, and a back-end emits
target-specific assembly code. These three components have been studied for years in the literature
and described intensively, as shown by the periodically updated Dragon Book [ALSU06].

The complexity growth seen in compiler architecture and crystallized by heterogeneous
platforms favors a more modular compiler framework. Indeed, it seems reasonable to reuse
existing compilers, software and libraries that already perform a specific task efficiently.
Combining them, instead of reinventing the wheel over and over again, should be
possible. Let us take the example of Graphical Processing Unit (gpu) code generation
for regular kernels. PLuTo [BBK+08] is an example of this approach: it combines a
C front-end, a polyhedral optimizer, a Compute Unified Device Architecture (cuda) code
generator and nVidia's cuda compiler to generate efficient gpu kernels.
Additionally, build processes are growing in complexity in order to assemble object
files generated from different languages by different compilation chains. For instance, the
Figure 3.1: A classical 3-phase retargetable compiler architecture: C, Fortran and Java front-ends feed a common optimization infrastructure, which feeds x86, ARM and MIPS back-ends.
sloccount tool reports 210 Source Lines Of Code (sloc) in common.mk, the generic
Makefile infrastructure bundled with the cuda distribution. This makes code generation
more difficult: not only must code be generated for different targets, but a way to link the
resulting objects must also be found. This task is already non-trivial in a homogeneous
environment, and it quickly becomes complex in a heterogeneous one.
This chapter studies the impact of the heterogeneous computing model, presented in
Chapter 2, on traditional compiler organization. It proposes the combination of a rich
compiler infrastructure with a flexible pass manager as a framework to match the new
architecture constraints. This approach focuses on modularity, re-usability, retargetability
and flexibility of the compiler design.

To begin with, Section 3.1 studies the adequacy between mainstream compiler infrastructures
and heterogeneous machines, and proposes a model to represent programmable
pass management, a critical aspect of compiler design for modularity. Then, Section 3.2
argues in favor of using source-to-source transformations, using source files as a common
medium between all existing tools. Finally, Section 3.3 introduces a high-level Application
Programming Interface (api) to build pass managers, the entities in charge of managing
the chaining of the compiler passes. It exposes a sufficient abstraction of the compiler
internals to developers who want to contribute at the pass level. The whole scheme is
illustrated by the Pythonic PIPS (pyps) interface developed on top of the Paralléliseur
Interprocedural de Programmes Scientifiques (pips) compiler infrastructure. Related work
is studied in Section 3.4.
3.1 Extending Compiler Infrastructures<br />
3.1.1 Existing Compiler Infrastructures<br />
A compiler is typically separated into three parts (see Figure 3.1):

The Front End is responsible for converting the input source code into the compiler's Internal
Representation (ir). A single compiler can have many front-ends. For instance, the
gnu C Compiler (gcc) offers a front-end for Ada, C, C++, Fortran, Java, Chill,
Objective-C, Pascal, etc.
The Middle End is in charge of the code optimizations that are independent of the
Figure 3.2: Improved compilation flow for heterogeneous computing: host and device code go through source-to-source compilers built on a common compiler infrastructure, then through a host compiler and per-target source-to-binary compilers, producing the host object and one object per device.
input language or of the output target. Some are parametrized (e.g. unrolling) and
their parameters are target-dependent. Others can benefit from additional assumptions
due to the input language, e.g. no aliasing between parameters in Fortran 77. There
is usually a single middle end per compiler infrastructure—this is the case for gcc,
Low Level Virtual Machine (llvm) and Open64.
The Back End generates target-specific assembly code from the ir. In a
similar manner to front ends, there can be several back ends in a single compiler
infrastructure. For instance, llvm can produce assembly code for the following targets,
as of version 2.7: x86, sparc, powerpc, alpha, arm, mips, cellspu, pic16, xcore,
msp430, systemz, blackfin, cbackend, msil, cppbackend, mblaze.
For the purpose of code generation for heterogeneous hardware from raw source files
(i.e. without directives or language extensions), only the middle end and the back end are
affected. In Open Computing Language (opencl), the host compiler is assisted by several
source-to-binary compilers, one for each targeted device, that is, a middle end and a back
end per targeted device. This scheme could be improved by merging the middle ends
of each source-to-binary compiler into a generic middle end feeding as many back-ends. This
approach, while tempting, is not possible because, as we have shown in Chapter 2, there is
a lot of diversity in hardware accelerators, so the code transformations involved can only
be partially shared. One alternative is to provide a common compilation infrastructure, on
which all middle-ends are based. Figure 3.2 illustrates this idea by splitting each
source-to-binary compiler into two parts: a source-to-source compiler, called the hardware language
translator, and a source-to-binary compiler. The hardware language translator manages
the target-specific optimization process at the source level, while the source-to-binary compiler is solely
in charge of the translation to binary code. Both are built from building blocks found in the
compiler infrastructure.
Section 3.2.1 proposes an approach to make annotated code compatible with Figure 2.3.
gcc, llvm and Open64 are examples of such compiler infrastructures. Let us now propose
a very generic definition of a compiler infrastructure.

Definition 3.1. A compiler infrastructure is a set of passes and analyses organized by a
consistency manager and made available to compiler developers through a pass manager.
<strong>Compilers</strong> & Tools<br />
p4a pipscc pypsearch sac terapyps<br />
Analyses<br />
DFG, array regions...<br />
Pass Manager<br />
pyps tpips<br />
Consistency Manager<br />
pipsmake<br />
Passes<br />
inlining, unrolling...<br />
Internal Representation<br />
Pretty Printers<br />
C, Fortran, XML...<br />
Figure 3.3: pips as a generic compiler infrastructure sample.<br />
Passes and analyses are defined formally in Section 3.1.2, so we only give informal
definitions here.

Definition 3.2. A pass is a code transformation that modifies its input source to generate
a new version.

For instance, loop unrolling, inlining or forward substitution are passes. In the following,
passes are also referred to as transformations.

Definition 3.3. An analysis produces an abstraction of the code to be used by further
passes.

A call graph, a dependence graph or a polyhedral model are results of analyses.

Definition 3.4. A consistency manager is a component in charge of providing up-to-date
analyses to the passes.

Definition 3.5. A pass manager is a component in charge of chaining the passes.
Figure 3.3 shows the hierarchical organization of these components, based on the infrastructure<br />
of pips. The infrastructures of gcc [Nov06] and llvm [LA03] follow a similar<br />
pattern.<br />
Typically, a compiler for a single target applies the same sequence of passes to each
function of its input code. As stated in gcc's manual [Wik09]:

Its [the pass manager's] job is to run all the individual passes in the correct
order, and take care of standard bookkeeping that applies to every pass.

The gcc pass manager works on a sequence of passes, as shown by the implementation
of the init_optimization_passes function. An excerpt of this function taken from the gcc
source code is given in Listing 3.1.
void init_optimization_passes (void)
{
  struct tree_opt_pass **p;

#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &PASS))

  /* Interprocedural optimization passes.  */
  p = &all_ipa_passes;
  NEXT_PASS (pass_early_ipa_inline);
  NEXT_PASS (pass_early_local_passes);
  NEXT_PASS (pass_ipa_cp);
  NEXT_PASS (pass_ipa_inline);
  [...]

Listing 3.1: gcc pass manager initialization.
# dead code elimination + constant propagation + inlining
opt input.bc -o output.bc -dce -constprop -inline

Listing 3.2: Dynamic phase ordering using the llvm pass manager command line interface.
Each pass provides a function pointer bool (*gate)(void) that potentially guards its execution,
while unsigned int (*execute)(void) runs the pass. The pass order is essentially fixed, but
it is possible to turn off some of the passes using the gate function and to sidestep some of
the issues using the plug-in mechanism. This intrinsically static pass scheduling is a severe
limitation for iterative compilation tools, and its linearity does not match heterogeneous
compilation requirements.
The pass manager used in llvm is more sophisticated: it can register several kinds of
passes, depending on the type of object they work on—the whole program, a function, a
call graph, a loop, a basic block or some machine code—while gcc's passes only work on
functions. Regarding pass management, compiler developers can either rely on the existing
one or provide their own implementation. A Command Line Interface (cli) on top of the
pass manager, called opt, makes it possible to dynamically change the phase ordering, as
presented in Listing 3.2.
Iterative compilation is also possible. However, the lack of advanced control structures
over pass scheduling hardly makes it a candidate for heterogeneous computing. A compiler
for a heterogeneous platform must apply different sequences to different parts of the code,
say one per target. As the compiler not only optimizes the code for a specific device, but
also modifies the code to meet the hardware constraints, it can be expected that unusual
combination patterns happen. The compilers built during this thesis, presented in
Chapter 7, validate this assumption.
3.1.2 A Simple Model for Code Transformations

The core of a compiler optimizer is the pass ordering. This section studies the interaction
between passes from a formal point of view, inspects the consequences of this formalism
and elaborates on several transformation composition rules.

3.1.2.1 Transformations: Definition and Compositions
Let us first define formally what a transformation is. Let P be the set of well-formed
programs 1, F the set of all possible functions 2 and T the set of code transformations.

Definition 3.6. A program is an n-tuple of functions:

∀p ∈ P, ∃n ∈ N : p ∈ Fⁿ

The first element of a program is called the entry point of the program. The cardinality
of a program is given by the | · | operator.

If p is a program, let Vin(p) denote the set of its possible input values; given
vin ∈ Vin(p), P(p, vin) denotes the result of the evaluation of p.

Definition 3.7. A code transformation is a P → P application.
The identity transformation is denoted idT. The set T, together with the function
composition ◦ and idT, is not a group because many transformations are not
injective and thus have no inverse.

Proof. The transformation that evaluates constant expressions is not injective, as it
produces int a = 4; from either int a = 3+1; or int a = 2*2;.
The <strong>for</strong>mer definition of a trans<strong>for</strong>mation ignores an important aspect of code trans<strong>for</strong>mations:<br />
they can fail. For instance loop fusion of two loops fails if the second loop<br />
carries backward dependencies on the first. A loop-carried backward dependency can prevent<br />
loop vec<strong>to</strong>rization, or the pass can even crash because of a lousy implementation 3 .<br />
As a consequence, we introduce an error state and propose:<br />
Definition 3.8. A code trans<strong>for</strong>mation is an application P → (P ∪ {error}) that either<br />
succeeds or preserves the semantics of the program or fails.<br />
And one can revert <strong>to</strong> the previous state by defining a failsafe opera<strong>to</strong>r:<br />
1. We call “well-formed programs” programs whose behavior is not undefined according to the norms
of their language.
2. The term module is used in pips instead of the more commonly used term function. As a consequence,
the term module is used in some code excerpts.
3. This unfortunately happens a lot in research compilers where many PhD students—just like me—
contribute.
Definition 3.9. The failsafe operator ˜· : T → T is defined by

∀t ∈ T, ∀p ∈ P, ˜t(p) = t(p) if t(p) ≠ error, and ˜t(p) = p otherwise
and then a failsafe composition:

Definition 3.10. The failsafe composition ˜◦ : T × T → T is defined by

∀(t0, t1) ∈ T², t1 ˜◦ t0 = ˜t1 ◦ ˜t0
Transformations can be chained using the ˜◦ operator, and most compilers use
this semantics as their primary way to compose transformations. New passes can
be defined as the composition of existing passes, promoting modularity instead of monolithic
passes. For instance, a pass that generates Open Multi Processing (openmp) code
can be written as the failsafe combination of loop fusion, reduction detection, parallelism
extraction and directive generation, instead of a single directive-generation pass. Lattner
advocates [Lat11] for this low granularity in pass design.
Still, the fact that a transformation fails carries some important information. In the
example above, if the loop vectorization succeeds, a vector instruction can be generated;
otherwise, parallelism extraction may be tried. This kind of behavior is represented by a
conditional composition:
Definition 3.11. The conditional composition ◦ : T × T × T → T is defined by

∀(t0, t1, t2) ∈ T³, ∀p ∈ P, ((t1, t2) ◦ t0)(p) = (t1 ◦ t0)(p) if t0(p) ≠ error, and t2(p) otherwise
The ◦ operator is not used in llvm or gcc, although it provides interesting perspectives.
Let us assume the existence of three transformations tgpu, tsse and tomp that convert a
sequential C code into a code with cuda calls, Streaming simd Extension (sse) intrinsic
calls and openmp directives, respectively. Then the following expression:

(idT, tsse ˜◦ tomp) ◦ tgpu

means: try to transform the code into a gpu code; if that fails, generate openmp
directives and then sse intrinsics, whether openmp directives were generated or not. It builds
a decision tree that allows complex compilation strategies.
If the intended behavior is to stop as soon as an error occurs, and to keep on applying
transformations otherwise, as in the sequence:

(t3, idT) ◦ (t2, idT) ◦ (t1, idT) ◦ t0

then writing the default skip transformation is bothersome. Thus we define an error
propagation operator:
Definition 3.12. The error propagation operator ◦ : T × T → T is defined by

∀(t0, t1) ∈ T², t1 ◦ t0 = (t1, idT) ◦ t0

which makes it possible to rewrite the above example as

t3 ◦ t2 ◦ t1 ◦ t0
For instance, let us assume the existence of a transformation topt_dma that optimizes
the usage of Direct Memory Access (dma) functions by trying to remove redundant ones,
to merge them, etc. 4 It is not relevant to apply it if no dma function was generated by
a tgen_dma transformation. Supposing that tgen_dma returns an error if it fails to generate
dma operations, this kind of interaction can be represented using the expression:
topt_dma ◦ tgen_dma.
3.1.2.2 Parametric Transformations

Many transformations are parametric. For instance, loop unrolling not only takes a
program as input: it also needs a particular loop in a particular function and an unroll
rate to be completely defined. To represent this, we introduce the concept of transformation
generator.

Definition 3.13. An application f is a parametric transformation if there exists a set A
such that f : A → T.

For instance, loop unrolling is a parametric transformation for which A = F × L × N*,
where the first argument is the function to work on, the second argument is the loop statement
to unroll and the third argument is the unroll rate.
Two particular classes of transformation generators are commonly found in compilers:
function transformations and loop transformations.

Definition 3.14. A parametric transformation g : A → T is a function transformation if
there exists a set B such that A = F × B.

Definition 3.15. A parametric transformation g : A → T is a loop transformation if
there exists a set B such that A = F × L × B.
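As an illustration, a parametric transformation can be sketched as a Python function that receives the parameters in A and returns the actual transformation as a closure. The toy program representation below (a mapping from function names to loop bodies) is an assumption made for the example, not a real intermediate representation:

```python
ERROR = object()  # the distinguished error state of Definition 3.8

def unroll(function_name, loop_index, rate):
    """Parametric loop transformation with A = F x L x N*: the three
    parameters select the target, the returned closure is the element
    of T that performs the unrolling."""
    def transformation(program):
        # toy representation: {function name: [loop bodies]}
        loops = program.get(function_name)
        if loops is None or loop_index >= len(loops):
            return ERROR  # the pass fails when its target does not exist
        new_loops = list(loops)
        new_loops[loop_index] = new_loops[loop_index] * rate  # replicate body
        new_program = dict(program)  # P -> P: never mutate the input
        new_program[function_name] = new_loops
        return new_program
    return transformation
```

Here `unroll("main", 0, 2)` is a plain transformation that can be chained with the composition operators of the previous section.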
3.1.2.3 From Model to Implementation

Moving from a formal description of pass operators to a programming language can be
tricky. Although it is tempting to design a new language that directly reflects the ˜·, ˜◦,
and ◦ operators, we make the following points:
– a transformation changes a program state and behaves like a method in Object
Oriented Programming (oop);
4. A similar transformation is described in Section 6.3.
– transformation generators are methods granted with extra parameters; in particular,
function transformation generators can be represented by methods of a hypothetical
function class and loop transformation generators by methods of a hypothetical loop
class;
– the composition operator is similar to the sequence operator found in many programming
languages;
– the failsafe operator is similar to a try ... catch block that surrounds every transformation;
– the conditional composition is similar to an if ... then ... else block;
– the error propagation operator has a similar semantics to exception propagation.

As a consequence, we choose to use a general-purpose programming language instead
of designing our own: the class hierarchy with the appropriate methods described in the next
section embodies all the concepts detailed in this section.
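These correspondences can be sketched in Python, with exceptions standing for the error state. The pass names and their failure condition are hypothetical, chosen only to mirror the loop-vectorization example above:

```python
class TransformationError(Exception):
    """Raised by a pass that fails (the error state of Definition 3.8)."""

def vectorize(code):
    # hypothetical pass: fails on loop-carried backward dependencies
    if "backward_dep" in code:
        raise TransformationError("cannot vectorize")
    return code + " [sse]"

def parallelize(code):
    # hypothetical pass that always succeeds
    return code + " [omp]"

def apply_failsafe(transformation, code):
    """A try/except block plays the role of the failsafe operator."""
    try:
        return transformation(code)
    except TransformationError:
        return code  # revert to the previous program state

def vectorize_or_parallelize(code):
    """Branching on failure plays the role of the conditional composition."""
    try:
        return vectorize(code)
    except TransformationError:
        return parallelize(code)
```

Error propagation then comes for free: an uncaught `TransformationError` simply aborts the rest of the sequence, exactly like the ◦ operator skipping downstream passes.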
3.1.3 Programmable Pass Management<br />
3.1.3.1 A Class Hierarchy for Pass Management
The hierarchy between the host and the accelerators, a consequence of the master-worker
paradigm, implies a similar hierarchy between the host code and the accelerator
code. However, it is not enough in practice: depending on the input code, a single function
can have several loops that are candidates for offloading. Moreover, opencl takes as input a
self-contained code, so the notion of compilation unit—the set of variables, types and functions
defined in the same source file—is also important. Because compilation for heterogeneous
computing usually leads to the creation of new functions, it is also important to be able
to keep track of them and adapt the processing according to their origin. This relationship
must be visible to the pass manager because different compilation schemes apply to
each part. The model presented in the previous section, confirmed by the experience gathered
during the development of several heterogeneous compilers, has led to the hierarchy
described below:
Program: Code transformations manipulate programs. Interprocedural analyses such as
the call-graph computation and transformations such as constant propagation transform
the program as a whole. This is the coarsest transformation grain.
Compilation units: Processing can be different depending on the enclosing compilation
unit. A typical example is hand-optimized runtime code: code passed to the compiler
so that it has all the definitions required for interprocedural analyses of generated
code (see Section 3.2.1). This code should only be analyzed by the compiler, but not
modified. The same situation occurs when some properties of a source file have been
certified: the certification no longer holds if the code is changed.
Functions: We have modeled a program as a tuple of functions, and the pass manager
generally works at this level of granularity; gcc works at this level. It is especially
relevant for heterogeneous computing, where some functions are executed on the host
and some are executed on an accelerator.
Figure 3.4: pyps class hierarchy (Workspace, Maker, Program, Compilation Unit, Function and Loop classes).
Loops: Many scientific programs spend a lot of time in loops, and Single Instruction
stream, Multiple Data stream (simd) parallelism is commonly found in loops. Exposing
the loop hierarchy to the pass manager is pertinent in many cases, for example
to apply loop-level transformations, to outline a loop into a new function (see Section
4.6.2), or to isolate it from the rest of the memory, as in Section 6.1. Likewise,
the loop nest hierarchy provides significant information concerning the code structure
and potential candidates for loop transformations such as loop tiling or loop fusion.

These relationships can be represented by the class hierarchy shown in Figure 3.4.
The combination of control flow and hierarchical structure makes it possible to express
constructs such as “select each loop nest of the program that does not belong to the
runtime, compare its number of operations to its number of memory accesses and, if it is
computationally intensive enough, outline the loop nest into a new function in the gpu
space; then transform this new function into a suitable kernel for a gpu device”.
Note that, using the formalism presented above, this sentence can be denoted as:

∀p ∈ P, ∀f ∈ F \ R, ∀l ∈ L,  (G(f, f′) ∘ O(f, l, f′) ∘ C(f, l))(p)

where R is the set of functions representing the runtime; C : F × L → T is a loop transformation that raises an error if the given loop is not computationally intensive; O : F × L × F → T is a loop transformation that outlines the given loop into a new function; and G : F × F → T is a function transformation that turns a function into a kernel and a kernel call.
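Read operationally, this formula is just a loop nest with error handling at the pass manager level. The sketch below mimics it with stand-in Python classes; the class, method and threshold names are invented for illustration and are not part of pips or pyps:

```python
class NotIntensiveError(Exception):
    """Raised by the intensity check, mirroring the error state of C."""
    pass

class Function:
    """Minimal stand-in for a pyps function object (illustrative only)."""
    def __init__(self, name, loops, flops, mem_accesses):
        self.name, self.loops = name, loops
        self.flops, self.mem_accesses = flops, mem_accesses

    def check_intensity(self, loop):          # plays the role of C(f, l)
        if self.flops < 2 * self.mem_accesses:
            raise NotIntensiveError(loop)

    def outline(self, loop):                  # plays the role of O(f, l, f')
        return Function(self.name + "_" + loop, [],
                        self.flops, self.mem_accesses)

    def gpuify(self):                         # plays the role of G(f, f')
        self.name += "_kernel"
        return self

runtime = {"runtime_helper"}                  # R: functions left untouched
program = [Function("saxpy", ["l0"], flops=100, mem_accesses=10),
           Function("copy", ["l0"], flops=10, mem_accesses=10),
           Function("runtime_helper", ["l0"], flops=100, mem_accesses=1)]

kernels = []
for f in program:                             # forall f in F \ R
    if f.name in runtime:
        continue
    for l in f.loops:                         # forall l in L
        try:
            f.check_intensity(l)              # C raises if not intensive enough
            kernels.append(f.outline(l).gpuify())  # G o O
        except NotIntensiveError:
            pass                              # skip loops not worth offloading
```

Only the computationally intensive `saxpy` loop survives the chain; the exception plays exactly the role of the error state T.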
Figure 3.4 also introduces a Maker class. This component represents the build process and is involved in both the source code generation process and its final compilation into machine code. Indeed, if the compilation chain is to remain independent of the targeted language (remember everything is represented in C), a specialization step is needed. This specialization step, called post-processing, takes care of the various steps needed to switch from a C representation to the targeted dialect, say cuda or opencl. The Maker class plays this role and describes the final steps required to generate the proper external representation. Additionally, it generates a Makefile to automate the complex build process resulting from the combination of different targets in the same build.
3.1.3.2 Control Flow and Pass Management

Control flow features are profitable in many situations. The use cases presented in this section are excerpts of existing compiler instances presented in Chapter 7. The examples are written using the pyps interface described in Section 3.3, but they should nevertheless be understandable.
Sequences represent the ◦ operator. They are common to all pass managers and are used each time transformation chaining occurs.
Methods represent the mathematical function definition. They are used to structure the compiler code and to enforce reuse. For instance, the generation process for multimedia instructions is common to the sse, Advanced Vector eXtensions (avx) and neon instruction sets, with minor parameter changes (e.g. the size of the vector registers). It can be packed into a function at the pass manager level, as shown in Listing 7.1.
Conditionals are extensions of the ◦ operator. They are used when the pass scheduling depends on compiler switches or on the current compilation status. Figure 3.5 shows two situations where this can occur: the pass manager has to activate or deactivate some passes, or modify the compilation scheme.
# check if if_conversion is asked for
if params.get('if_conversion'):
    module.if_conversion()
(a) Using conditionals as switches.

# optimize a module if it does not belong to the runtime
if module.cu != "myruntime":
    module.optimize()
(b) Using conditionals to change the compilation scheme.

Figure 3.5: Usage of conditionals at the pass manager level.
Do Loops represent the map over a set; they are used to perform repetitive tasks. Such tasks are often found in a pass manager, e.g. applying a pass to each loop or function of a set, or applying a transformation iteratively with varying parameters.
Exceptions are related to the error state. They can be used by a pass to signal an unexpected situation. For instance, the inlining phase raises an exception if the function to inline has no caller, and an attempt to offload a loop nest fails if generic pointers are involved and the engine does not know how to transfer them to the hardware accelerator. Listing 3.3 shows an example of such a situation.
Transformation generators are described as class methods with parameters. To feed these parameters, and more generally to feed conditionals or loop ranges, some information concerning the code being compiled is necessary. This raises the issue of the pass manager
for kernel in gpu_kernels:
    kernel.generate_communications()
(a) Iterate over sets.

for pattern in ["add", "minus", "mul"]:
    module.pattern_recognition(pattern)
(b) Iterate over parameters.

Figure 3.6: Usage of loops at the pass manager level.
try:
    module.isolate_statement()
    # if isolate_statement succeeds,
    # go on up to kernel generation
    module.outline()
    ...
except RuntimeError as re:
    print("Unable to generate GPU code: " + str(re))
    # maybe try SSE instructions instead?

Listing 3.3: Usage of exceptions at the pass manager level.
interface granularity: should all the compiler internals be exposed to the pass manager, or should it only show a subset of relevant information? The former approach is taken by Rudy et al. [RCH + 10]: they use the lua scripting language to expose the whole ir to the pass manager, on top of which the compiler performs gpu code generation, using an iterative algorithm expressed at the script level.
The benefits of such an approach are unclear from a separation-of-concerns point of view: high-level code is mixed up with lower-level code, without a clear boundary between the two concerns. We propose an alternative approach based on the following assessment: if an access to the ir is needed, then a pass should be used, to simplify the pass manager's job; otherwise the work can be done at the pass manager level. That is, high-level transformations should be managed by a high-level language with the proper abstractions, and low-level transformations should be managed by the native language that has access to all the infrastructure capabilities. Eventually these languages could be the same, but the fact that they address different needs, basically programmability vs. performance, should lead to different engineering choices.
Table 3.1: Comparison of source-to-source compilation infrastructures. [The table compares seven compilers (pips, pocc, mercurium, cetus, rose, gecos, hmpp) by front end (C, C++, Fortran) and target (openmp, cuda, sse, vhdl, mpi); a check mark denotes a feature officially supported, ≡ a feature mentioned in a paper.]
3.2 On Source-to-Source Compilers

In the previous sections, we have separated the job of the source-to-source compiler from the job of the source-to-binary compiler. The source-to-source compiler takes care of language-independent transformations and optimizations based on a common framework, and source-to-binary compilers are used as back ends.

Definition 3.16. A source-to-source compiler is a compiler that takes source code written in a high-level language as input, and outputs code in a high-level language.
3.2.1 Exploring Source-to-Source Opportunities

Historically, many transformations have been expressed at the source code level, especially parallelizing transformations. Many successful compilers are source-to-source compilers: HPFC [Coe93], cilk [FLR98], acotes [MAB + 10]; or are based on source-to-source compiler infrastructures: pips [IJT91], suif [WFW + 94], rose [Qui00], cetus [iLJE03], mercurium [AMG + 99], or GeCoS [DMM + ]. Such infrastructures provide many code transformations relevant to our problem, such as parallelism detection algorithms, variable privatization, etc.
Table 3.1 gathers the results of our study of the number of hardware targets for seven source-to-source research or industrial compilers still in use or in development. Most of them are used to generate code for more than one target.
All the compilers considered so far take a C dialect as input language, and do not provide a binary interface. Thus, a hardware translator built on them is required to generate C code as a result of its processing. As stated above, a post-processor is needed to adjust syntactic differences between plain C and the targeted dialect. Such modifications can easily be done at the source level using common techniques. The first one is regular expressions: many language adjustments, such as adding the cuda triple-chevron kernel-launch syntax, can be performed with a relevant regular expression. However, most programming languages are not regular, so
Figure 3.7: Source-to-source cooperation with external tools. [Diagram: a source-to-source compiler and an external tool exchange textual representations (tr) back and forth.]
regular expressions are not sufficient for textual substitution. When combined with the C macro processor, which has a weaker rewriting engine but is capable of diverting function calls using macro functions, they can catch more patterns, although they still do not provide a reliable tool for all situations (e.g. no pairing of brackets). This combination has been successfully used for the three compiler implementations described in Chapter 7.
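As an illustration, a hypothetical runtime call marking a kernel launch can be rewritten into the cuda triple-chevron syntax with a single substitution. The KERNEL_LAUNCH macro name and its calling convention are invented for this sketch; the point is only that the rewrite is purely local, so a regular expression suffices:

```python
import re

# Hypothetical call emitted by the source-to-source compiler.
c_code = 'KERNEL_LAUNCH(vadd, grid, block, (a, b, n));'

# One regular expression suffices because the pattern is local:
# no nested-bracket matching is required for this particular rewrite.
cuda_code = re.sub(
    r'KERNEL_LAUNCH\((\w+),\s*(\w+),\s*(\w+),\s*\((.*)\)\);',
    r'\1<<<\2, \3>>>(\4);',
    c_code)

print(cuda_code)  # vadd<<<grid, block>>>(a, b, n);
```

A construct spanning nested brackets, on the other hand, would defeat this approach, which is exactly the limitation discussed above.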
In addition to the intuitive collaboration with source-to-binary compilers, source-to-source compilers can also collaborate with each other to achieve their goal, using source files as a common medium, at the expense of extra processing for the additional switches between Textual Representation (tr) and ir. Figure 3.7 illustrates this generic behavior. For instance, the optimization of loop nests can be delegated to the Polyhedral Compiler Collection (pocc) tool, a compiler specialized in polyhedral transformations.
More traditional advantages of source-to-source compilers include their debugging features: the ir can be dumped as a tr at any time, then compiled and executed. For the same reason, they are very pedagogical tools and make it easy to illustrate the behavior of a transformation. As claimed above, many transformations, such as loop interchange or loop unrolling, are easily described as source-to-source transformations.
Figure 3.8: Heterogeneous compilation stages. [Pipeline: .c → source-to-source compiler → .c → post-processor → .c dialect → source-to-binary compiler → machine code.]
3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure

There are as many C dialects as hardware devices. As a consequence, a source-to-source compiler that aims at generating code for several targets has two possibilities: either write as many pretty-printers as there are dialects, or regenerate C code and use an external tool to perform the translation. The latter approach requires a post-processing step to fill the gap between the C code augmented with runtime functions and the hardware language, as shown in Figure 3.8.
Combining the compilation stages of Figure 3.8 with the source-to-source cooperation of Figure 3.7 results in the final compilation infrastructure diagram described in Figure 3.9. The main achievement shown by this figure is that most developments can be done in a source-to-source infrastructure using a common ir. This is great progress compared to the situation with opencl described in Figure 2.10 on page 39: it favors re-usability.
Figure 3.9: Source-to-source heterogeneous compilation scheme. [Within a common source-to-source compiler infrastructure, the host code goes to the host compiler while each target-specific part goes through its own source-to-source compiler, post-processor (PP) and source-to-binary compiler, producing one object file per target.]
3.3 pyps, a High Level Pass Manager api
The model presented in Section 3.1.2 leads to the design of an api for pass managers. All the compilers for heterogeneous devices presented in this thesis are based upon this pass manager. It uses a dynamic, object-oriented scripting language, Python, for flexibility and ease of development without much performance loss, as the compiler passes are still implemented in a compiled language, C. An object-oriented language also makes it possible to represent the class hierarchy from Figure 3.4, to enforce code reuse and to facilitate compiler composition. This section describes in detail the methods exposed at the pass manager level for each of the classes identified in the previous section: program, function, loop and maker.
3.3.1 api Description

This pass manager is implemented in Python on top of the pips compiler infrastructure 5 and named pyps. It consists of fewer than 700 sloc. In addition to the language properties mentioned above, Python has the advantage of a rich set of libraries and a dynamic community. As an example of the benefits of using a mature and feature-rich language, combining pyps with the enhanced Python interpreter ipython has led to a powerful cli for pyps at almost no development cost, unlike other scripting tools implemented on top of pips such as tpips. The integration with the C language is simple enough to allow an easy binding with the pips libraries. Note however that the api design is completely independent of the underlying compiler infrastructure, which makes it a suitable candidate for other compiler infrastructures.
In this api, two main entities are used to abstract the source-to-source compiler: a workspace and a Maker. The former represents a whole program and the transformations applied to it. The latter represents the global compilation scheme: post-processing, source-to-binary compiler calls, etc.
5. An overview of the pips compiler infrastructure is given in Appendix A.
A workspace provides the following methods:
init(sources, flags): a workspace is initialized from a set of files and preprocessor flags. Once created, it has knowledge of the full program code, and access to all the relevant runtimes;
save(dir): once all transformations have been performed, this method is used to regenerate all source files, without post-processing;
build(dir, maker): uses a Maker to save the current workspace and perform the post-processing. The default Maker performs no post-processing and generates a Makefile for a traditional compilation scheme;
compile(dir, maker): gathers all source files, saves them in dir, post-processes them with maker and uses the Makefile generated by the build method to compile them;
checkpoint(): saves the current workspace state, returning an identifier;
restore(chk_id): restores the workspace back to the state it had when chk_id was obtained.
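The checkpoint/restore pair gives transactional behavior around risky transformations. The stub below imitates only the semantics; in the real pyps the snapshot covers the pips database, not a Python list, and all names here are illustrative:

```python
class Workspace:
    """Minimal stand-in for the pyps workspace state handling."""
    def __init__(self, sources):
        self.sources = sources
        self.transformations = []
        self._checkpoints = {}

    def apply(self, name):
        # Stand-in for running a real pips transformation.
        self.transformations.append(name)

    def checkpoint(self):
        # Return an identifier for the current state.
        chk_id = len(self._checkpoints)
        self._checkpoints[chk_id] = list(self.transformations)
        return chk_id

    def restore(self, chk_id):
        # Roll the state back to the recorded snapshot.
        self.transformations = list(self._checkpoints[chk_id])

w = Workspace(["main.c"])
w.apply("inline")
chk = w.checkpoint()
w.apply("gpuify")          # suppose this turned out to be invalid
w.restore(chk)             # roll back to the pre-gpuify state
print(w.transformations)   # ['inline']
```

Listing 3.4 below uses exactly this pattern to fall back from cuda to openmp code generation.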
A typical use case is to create a workspace using init, perform some transformations, save it to check the sequential result, generate a build chain using build, and run compile to check the generated executable. The purpose of this division is to provide entry points (junction points in aspect-oriented programming terminology) where compiler developers can insert their own code. Indeed, the default workspace provides no direct facilities for heterogeneous computing. Instead, a source-to-source compiler must inherit from it and implement the target-specific transformations using method overloading. More generally, given code that contains fragments to be run on m distinct targets, translator developers must compose the m corresponding workspaces together, using multiple inheritance and existing or newly created workspaces.
Let us give a practical example: pyps ships with many workspace types, including a workspace to instrument an application for benchmarking purposes, one to generate openmp code, one to generate avx code and one to generate cuda code. To build a compiler for a heterogeneous machine made of an nVidia gpu and several Intel cores with avx support (a classical hardware configuration nowadays), one can rely on existing components and implement the compilation scheme as described in Listing 3.4. In this example, a new workspace that inherits from the existing ones is created. In effect, a new compilation scheme is implemented at a high level, relying on existing ones. Code generation is automatically forwarded to the proper base class and does not need to be specified here.
A source-to-source compiler typically inherits from the workspace class. The init method can be overridden to pass extra flags to the workspace and to provide additional source files to the compiler infrastructure. The latter can be stubs for third-party libraries, declarations of functions used as parameters for a pattern-matching engine, or runtime declarations needed by the code generation process. In a similar manner, the save method can be used for header substitution, i.e. to add extra header files and #include them 6.
6. When the ir cannot represent preprocessor symbols.
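Overriding init and save composes naturally through calls to the parent class. The stub below shows the shape of such a specialization; the base class, the file names and the runtime header are all invented for the example:

```python
class Workspace:
    """Stub base workspace: init registers sources, save emits them."""
    def __init__(self, *sources):
        self.sources = list(sources)

    def save(self, dir):
        # Pretend to regenerate one source file per registered input.
        return {s: "/* code of %s */" % s for s in self.sources}

class SimdWorkspace(Workspace):
    """Adds runtime stubs at init time and a header at save time."""
    def __init__(self, *sources):
        # Provide an extra runtime stub file to the infrastructure.
        super().__init__(*sources, "simd_runtime.c")

    def save(self, dir):
        # Header substitution: prepend an #include to each generated file.
        files = super().save(dir)
        return {name: '#include "simd.h"\n' + code
                for name, code in files.items()}

w = SimdWorkspace("kernel.c")
files = w.save("out")
print(sorted(files))  # ['kernel.c', 'simd_runtime.c']
```

The multimedia instruction generator of Section 7.4 follows this very pattern, as described below.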
# load relevant packages
import pyps, sac, openmp, cuda

# assemble workspaces, order usually matters
class my_workspace(sac, openmp, cuda):
    pass

# provide a per-function compilation scheme
def my_compilation_scheme(module):
    w = module.workspace   # recover current workspace
    chk = w.checkpoint()   # save current state
    try:                   # CUDA code generation
        module.cuda()      # raises an exception in case of failure
    except RuntimeError:
        w.restore(chk)     # restore pre-CUDA state
        try:               # OpenMP and AVX code generation
            module.openmp()
            module.avx()
        except: pass

Listing 3.4: Example of workspace composition at the pass manager level using pyps.
When the translator generates constructs that cannot be represented in C, or uses a specific compilation process, a specific Maker is fed to build to perform the post-processing steps.
For instance, the multimedia instruction generator described in Section 7.4 overrides the init method to add its own runtime files to the workspace and then forwards the call to its parent. Likewise, the save method is overridden to add a generic header file at the top of each source file, a step that cannot be performed earlier as it requires an additional preprocessor run. The Maker class is extended to add special processing and, depending on the maker the build method is given as argument, it generates different code. With the default maker, the sequential version of the generated vector instructions is used. With an sse-enabled maker, the proper compiler flags and post-processing are activated.
The composition of workspaces relies on two assumptions:
1. all classes inheriting from workspace forward calls to their parent;
2. the compiler developer guarantees that the composition makes sense.
Assumption 1 makes it possible to compose workspaces and follows the idea that a workspace takes care of its target-specific processing and delegates the parts it does not know how to handle to its parent; in the end, the default workspace manages the leftovers. Assumption 2 guarantees that the composition leads to error-free code: it does not make sense to generate avx calls inside a cuda kernel, but it does make sense to do so in a loop annotated with openmp directives.
A program is no more than a set of compilation units. All methods of programs are available at the workspace level.
A compilationUnit does not provide any additional method but is used as a structuring element by some passes. Indeed, an important characteristic of heterogeneous computing is that different compilation units may have different targets, and thus use different source-to-binary compilers.
A function provides the following methods:
get_code():string: the fundamental feature of a source-to-source compiler is the capability of switching between the ir and the tr. This method builds the current tr as a string;
set_code(code): this method replaces the current code by a new version given as a string. Combined with the previous method, it makes it possible to call an external tool on the tr and use the output to build a new ir;
callers, callees:functions: make it possible to navigate the (static 7) call graph;
passXYZ(params): all compiler transformations are exposed as function methods.
Sub-classing of a function is used to provide new code transformations as a complex chaining of existing transformations.
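The get_code/set_code pair makes any text-based tool a potential compiler pass. The round trip below stands in for the external tool with a plain Python function; in practice it would be a subprocess call to, say, pocc, and the Function stub here is invented for illustration:

```python
def external_tool(tr):
    """Stand-in for an external source-to-source tool working on the TR."""
    return tr.replace("float", "double")

class Function:
    """Stub holding a textual representation, mimicking get_code/set_code."""
    def __init__(self, code):
        self._code = code

    def get_code(self):
        return self._code    # IR -> TR

    def set_code(self, code):
        self._code = code    # TR -> IR (a real reparse in pyps)

f = Function("float dot(float *a, float *b, int n);")
f.set_code(external_tool(f.get_code()))
print(f.get_code())  # double dot(double *a, double *b, int n);
```

The extra cost of this pattern is exactly the tr/ir switch overhead discussed in Section 3.2.1.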
The Loop class provides the members below, mainly used for inter-pass communication:
label: a read-only value used to uniquely identify the loop and the statement holding it;
pragma: a list of directives attached to the loop;
loops: a list of all loops directly contained in this loop.
The Maker class provides a unique method:
generate(dir, sources): post-processes the files given as sources and found in dir if needed, and generates a Makefile to compile them.
The generate method can be overridden to change the generated Makefile and to add post-processing steps. For instance, the Maker found in the openmp package adds the proper -fopenmp compilation flag to the Makefile.
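Specializing generate then amounts to ordinary subclassing. The stub below fakes the Makefile generation so the shape of such an override stands out; the class names and the Makefile content are invented:

```python
class Maker:
    """Stub of the pyps Maker: emits a Makefile for the given sources."""
    cflags = ""

    def generate(self, dir, sources):
        # Post-processing would happen here; the default maker does none.
        objs = " ".join(s.replace(".c", ".o") for s in sources)
        return "CFLAGS = %s\nall: %s\n" % (self.cflags, objs)

class OpenmpMaker(Maker):
    """Specialization adding the compiler flag that openmp code requires."""
    cflags = "-fopenmp"

makefile = OpenmpMaker().generate("out", ["kernel.c", "main.c"])
print(makefile)
```

A cuda or sse maker would extend generate the same way, adding its own post-processing step before delegating to the parent class.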
Let us illustrate the relevance of this architecture in a practical situation: sse code generation from plain C code. The technical parts are detailed in Section 7.4; only the compiler architecture is described here.
Source-to-Source Compiler: this compiler is a classical vectorizer that generates C intrinsics to represent vector operations. Intrinsics are used for both data movement and vector operations and have a sequential version written in C. An example of such generated code is given in Listing 3.5, and its sequential implementation is given in Listing 3.6.
Post-Processing: because the generated code is still C, it can be executed, although not efficiently, on a sequential processor. The only post-processing step is to add the relevant #include to all source files.
7. The call graph is static because we do not consider the case of function pointers.
void vadd_l99999(float a[4], float b[4])
{
    // PIPS: SAC generated v4sf vector(s)
    v4sf vec00, vec10;
    SIMD_LOAD_V4SF(vec00, &a[1-1]);
    SIMD_LOAD_V4SF(vec10, &b[1-1]);
l99999: ;
    SIMD_ADDPS(vec00, vec00, vec10);
    SIMD_STORE_V4SF(vec00, &a[1-1]);
}

Listing 3.5: sse C intrinsics generated for a vector addition.
void SIMD_ADDPS(float *dst, ...)
{
    int i;
    va_list ap;
    va_start(ap, dst);
    float *src0 = va_arg(ap, float *);
    float *src1 = va_arg(ap, float *);
    for (i = 0; i < 4; i++)
        dst[i] = src0[i] + src1[i];
    va_end(ap);
}

Listing 3.6: Sequential implementation of the SIMD_ADDPS intrinsic.
3.3.2 Usage Example

The five compilers presented in this document are based on the pyps api. The fact that we were able to write them is a first step toward the validation of the api.
As another example of the api validity, we have used pyps to perform fuzz testing on the pips compiler infrastructure. Fuzz testing is a software testing technique that injects random input into a piece of software to test its behavior. The technique used is described in Algorithm 1, and the equivalent pyps code is given in Listing 3.8.
Data: p ← a program
Data: g ← a function transformation generator
binary ← compile(p);
repeat
    f ← random_function(p);
    p′ ← g(f)(p);
    binary′ ← compile(p′);
until exec(binary) ≠ exec(binary′);
Algorithm 1: Fuzz testing at the pass manager level.
This simple program assumes that the input code has a reproducible output printed on standard output. We have used it in conjunction with a random C code generator by Eric Eide and John Regehr [ER08]. This generator produces random C programs with deep call graphs that print a hash value representing their execution on standard output. We tested 10 transformations with this fuzzer. For each of them, an erroneous instance was found. 8
3.4 Related Work

Compilers have been built for many years. The first complete Fortran compiler was released by an IBM team in 1956 [Bac57]. The first beta release of gcc by Richard M. Stallman dates back to the 22nd of March, 1987. Compilers are known to be complex pieces of software that evolve slowly, while hardware keeps evolving at a steady rate. We have run David A. Wheeler's sloccount [Whe01] on two leading open source compilation projects, gcc and llvm, and reproduce its output in Table 3.2. It shows that a compiler project involves a lot of development skills: many languages are used, and the total number of sloc gets over 2 · 10^6 for gcc and 5 · 10^5 for llvm. These projects present several difficulties for newcomers: low-level languages, a large code base, and the diversity of the languages used.
To tackle this problem, several approaches have been proposed by the research community. M. Zenger and M. Odersky proposed in [ZO01] a compiler framework to quickly
8. As a courtesy to the pips development team, I only tested transformations I contributed to. And I fixed most of the bugs found.
import pyps
import random, sys

while True:  # loop as long as no error is found
    # instantiate a compiler from the source in first argument
    w = pyps.workspace(sys.argv[1])
    # compile it using the default source-to-binary compiler
    b = w.compile()
    # keep output as a reference
    (r_ref, o_ref, e_ref) = w.run(b)
    # pick a random function in the input code
    f = random.choice(w.all_functions)
    # select the transformation given in second argument
    # and apply it
    getattr(f, sys.argv[2])()
    # compile the transformed code
    b = w.compile()
    # get its output
    (r, o, e) = w.run(b)
    # close the compiler instance
    w.close()
    # check output versus reference and eventually raise an error
    if r != r_ref or o != o_ref or e != e_ref:
        sys.exit(1)

Listing 3.8: Fuzz testing with pyps.
3.4. RELATED WORK 65
ansic 2100307 (48.66%)
java 681858 (15.80%)
ada 680664 (15.77%)
cpp 594473 (13.77%)
f90 79927 (1.85%)
sh 47006 (1.09%)
asm 44318 (1.03%)
xml 29271 (0.68%)
exp 18422 (0.43%)
objc 15086 (0.35%)
fortran 9849 (0.23%)
perl 4462 (0.10%)
ml 2814 (0.07%)
pascal 2176 (0.05%)
awk 1706 (0.04%)
python 1486 (0.03%)
yacc 977 (0.02%)
cs 879 (0.02%)
tcl 392 (0.01%)
lex 192 (0.00%)
haskell 109 (0.00%)
(a) gcc sloccount report.
cpp 453835 (88.82%)
ansic 16764 (3.28%)
asm 13711 (2.68%)
sh 12828 (2.51%)
python 4322 (0.85%)
ml 4274 (0.84%)
perl 2093 (0.41%)
pascal 1489 (0.29%)
exp 431 (0.08%)
objc 334 (0.07%)
xml 283 (0.06%)
ada 235 (0.05%)
lisp 187 (0.04%)
csh 117 (0.02%)
f90 36 (0.01%)
(b) llvm sloccount report.
Table 3.2: sloccount reports for the gcc and llvm compilers.
experiment with new language features. They focus on two aspects: the extension of the internal
representation and the composition of compiler components. The former is achieved
through the use of extensible algebraic types, to extend simultaneously the Abstract Syntax
Tree (ast) and the existing phases, and the latter relies on an original design pattern
called Context Component, to provide an extensible and hierarchical component system.
In a keynote talk [Coo04], K. D. Cooper emphasizes the limitations of existing compiler
architectures for iterative compilation at the pass level and favors the use of complex patterns,
beyond list scheduling, supported by a flexible architecture to explore several phase
orderings.
Given the difficulty of mastering existing compiler implementations, several recent papers
have focused on the interaction between enlightened compiler developers and compiler
infrastructures. Lattner and Adve introduced in [LA03] a modular pass manager for
llvm. Later on, gcc overcame its own limitations thanks to a plug-in mechanism described
in the "Plugins" chapter of [Wik09]. Likewise, the extensible micro-architectural optimizer
described in [HRTV11] relies on the flexibility of its pass manager to load additional
passes at run time using a plug-in mechanism.
Rudy et al. presented in [RCH+10] an interactive pass manager based on the lua
language [Ier06]. The pass manager is used to scan various parameters of polyhedral
transformations, such as the loop unrolling rate or the blocking size, and to selectively apply
transformations such as loop interchange. Each resulting code is turned into a problem-specific
cuda kernel, and the most efficient one is selected, together with the associated set of transformations.
The advantage of the approach is that the compiler configuration for the specific
target is stored as the pass-manager script itself, which can be reused without re-evaluation
for further re-compilation.
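The parameter sweep just described can be sketched in a few lines of Python. The `run_time` callback and the candidate unroll and blocking values below are hypothetical stand-ins for the actual measurement harness, not part of [RCH+10]:

```python
import itertools

# Sketch of an autotuning sweep: enumerate candidate (unroll, block)
# pairs and keep the pair with the lowest measured execution time.
# run_time is an assumed callback that builds, runs and times one
# variant of the kernel for the given parameters.
def autotune(run_time, unrolls=(1, 2, 4, 8), blocks=(16, 32, 64)):
    return min(itertools.product(unrolls, blocks),
               key=lambda p: run_time(*p))
```

The selected pair, like the lua script in [RCH+10], records the compiler configuration and can be replayed on later recompilations without re-running the sweep.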
In [Yi11], separation between the analyses and the transformations is enforced at the
pass-manager level: a compiler is used to generate a valid sequence of passes, as a script
written in a dedicated language; this sequence is then executed by the pass manager. This
approach tries to decouple analyses and transformations, but it cannot verify the validity
of the generated script (otherwise it would require the generator to effectively execute
the transformations) and does not favor reuse of the compiler infrastructure, as the pass
manager executes passes written in another compiler.
Our approach basically turns a compiler into a program transformation system, which
is an active research area. FermaT [War99] focuses on program refinement, and composition
is limited to sequences. cil [NMRW02] only provides a flag-based composition
system, that is, the activation or deactivation of passes in a predefined sequence, without taking
into account any feedback from the processing. The stratego [OOVV05] software
transformation framework does not separate concepts as clearly as we do, but it uses the
concept of strategies to describe the chaining of transformations using dedicated operators,
an approach similar to ours (see Section 3.1.2). However, the effective implementation relies
on a new language, while we choose to map the concepts to existing constructs. Work
on optimix by Aßmann [Aß96] proposes considering asts as graphs and passes as graph
transformations, and using a graph rewriting system to specify transformations. All the
transformations and their composition are done at the ast level.
In parallel to the pass-management topic, a growing number of studies have been conducted
over the past few years to tackle heterogeneous platforms. In [ABCR10], a scheme
to couple Just In Time (jit) compilation for multiple targets is described. They designed
a new Object Oriented (oo) language named lime which has the property of being convertible
into two representations: one targets regular Central Processing Units (cpus) and
one targets Field Programmable Gate Arrays (fpgas) through a complex compilation flow
that involves the verilog language. An originality of the approach is that they provide
a runtime that plays a bridging role between the two representations, allowing a so-called
"mixed mode execution" where the best representation is selected at runtime. In fact, the
jit approach is orthogonal to our concern. It tackles a slightly different problem: code
portability. Once a code has been compiled for a target, is it possible to retarget it for
another hardware, without generating another binary? In the static compilation
model, it is not possible 9 to address new hardware, that is, hardware that is not known at
compilation time, while, theoretically, an update of the jit compiler does.
9. pgi's compiler works around this limitation by bundling several binaries, one per target, in the same
executable. It does provide a kind of retargetability, but it does not address the issue of supporting new
architectures.
Ocelot [DKYC10] is a similar project that translates Parallel Thread eXecution (ptx)
code into x86 emulation code, amd gpu code or llvm code for parallel execution on multicores,
thanks to a jit compilation infrastructure. Choosing ptx as a front-end language
provides the advantage of having a direct description of the parallelism, but it does not
address legacy code.
An obvious approach to tackle heterogeneity while taking advantage of existing transformations
is to extend traditional compilers to support heterogeneous targets. For instance,
gomet [BL10] is a gcc-based compiler that proposes the following compilation
flow: first, take advantage of gcc to parse the input code, generate the gimple representation
and apply high-level passes such as Static Single Assignment (ssa) construction or polyhedral
loop transformations; then build an interprocedural hierarchical dependence graph that
is combined with a high-level target description to generate source files to be compiled for
the target. The target architecture description takes into account three characteristics: a
cost model to decide offloading profitability, the parallelism description for simd and/or
mimd code generation, and the current load for runtime scheduling.
This approach shares several aspects with our methodology, but code generation is
performed under the assumption that the target hardware compiler accepts plain C code as
input, which is indeed the case for openmp and the cell engine, but not for gpus or fpga-based
processors. This explains the lack of an Instruction Set Architecture (isa) description
in the architecture model. Moreover, the architecture description consists of a set of C
functions that implement the correct behavior, which lacks abstraction and flexibility.
The compilation process is hard-coded for each target and not retargetable.
3.5 Conclusion
In this chapter, we have presented the impact of heterogeneous computing on compiler
infrastructures. We have pointed out that the use of the C language with the proper
conventions is a good choice of ir. We have described a compilation infrastructure that
takes advantage of this choice, combined with source-to-source capabilities, to enforce code
reuse and make it easier to interact with third-party tools. Based on a new model for
the combination of code transformations, we have specified an api for pass managers,
called pyps, and combined it with a generic programming language to end up with a
flexible way to design compilers for heterogeneous devices. The pyps api and its Python
implementation are publicly available as part of the pips project.
This api is used to combine the code transformations needed to match the hardware
constraints identified in Chapter 2; these transformations are presented in the next three chapters.
Chapter 4
Representing the Instruction Set
Architecture in C
Pont de Sainte Catherine, Plounevezel, Finistère © Jean-Claude Even
In a keynote given at the Fusion Developers Summit 2011, in Bellevue, Washington,
Phil Rogers announced that
The Fusion System Architecture (fsa) is Instruction Set Architecture (isa)
agnostic for both Central Processing Units (cpus) and Graphical Processing
Units (gpus). This is very important because we're inviting partners to join
us in all areas; other hardware companies to implement fsa and join in the
platform. . .
Under the hood, the Fusion System Architecture (fsa) relies on a virtual isa. In Chapter
3, we state that keeping the Internal Representation (ir) independent from the targeted
hardware is a requirement to reach a good level of abstraction. But is it possible to represent
all the refinements of a targeted isa in an ir that stays close to the C language [ISO99]?
Quoting Brian Kernighan [Ker03],
C is perhaps the best balance of expressiveness and efficiency that has ever
been seen in programming languages. (. . . ) It was so close to the machine
that you could see what the code would be (and it wasn't hard to write a good
compiler), but it still was safely above the instruction level and a good enough
match to all machines that one didn't think about specific tricks for specific
machines.
This reminds us that the C language was designed to be "close-to-the-metal". So even if
the chosen ir remains as close as possible to the C language, it can be sufficiently low level
to express some of the specificities of the targeted isa. This is examined in Section 4.1. We
then go through all the aspects of an isa and show that, provided some conventions and
minor transformations, it is possible to adapt a C code to meet isa constraints. Section 4.2
examines native data types; Section 4.3 reviews the use of specific registers; Section 4.4 details
the link between intrinsics and instructions; and Section 4.5 goes through the differences
related to the memory architecture. Issues related to function boundaries are examined in
Section 4.6, and external libraries are examined in Section 4.7.
4.1 C as a Common Denominator
The C language is the de facto standard to program low-level devices. In this section,
we study the relationships between the standard language and the dialects used to program
hardware accelerators, and argue that the C language can be used to embody some aspects of
the isa.
4.1.1 C Dialects and Heterogeneous Computing
A problem raised by heterogeneous architectures from a compiler point of view is the
choice of a suitable ir. Representing all hardware specificities in a unique language is
not a feasible task. Although there has been recent work [PBdD11] to express hardware
constraints at the ir level, representing all of them in a common ir is not easy. Some
compilers have however chosen to extend their ir to represent target-specific features: Low
Level Virtual Machine (llvm) integrates vector types into its basic types. Alternatively,
a target-specific language can be used as a basis for other architectures: fcuda [PGS+09]
translates Compute Unified Device Architecture (cuda) kernels into Field Programmable
Gate Array (fpga) circuits and swan [HF11] translates cuda codes into Open Computing
Language (opencl) ones.
An opposite approach is to use the versatility of the C language to represent both high-level
concepts and low-level concepts, using an ir that matches the initial language without
target-specific extensions. In essence, this is similar to the concept of language virtualization.
The definition of language virtualization given by Hassan Chafi et al. [CDM+10] is
the following:
A programming language is virtualizable with respect to a class of embedded
languages if and only if it can provide an environment to these embedded
languages that makes the embedded implementations essentially identical to
corresponding stand-alone language implementations in terms of expressiveness,
performance and safety—with only modestly more effort than implementing the
simplest possible complete embeddings.
We propose a relaxed definition for a programming language that can be derived from
another programming language through minor rewriting while maintaining equivalent
semantics.
Definition 4.1. A programming language L0 targeting a hardware H0 is embodied by
another programming language L1 targeting a hardware H1 if there exists a code transformation
τ : L0 → L1 such that the execution on H0 of a program P written in L0 yields the
same operational result as the execution of τ(P) on H1.
This definition is deliberately imprecise about "yields the same operational
result", because changes in execution order or in type sizes introduce output changes that may
not be significant to the end user.
Table 4.1 lists a number of languages used to program hardware accelerators and their
target platforms. It clearly shows that C dialects are often chosen as the interface between
the programmer and the hardware.

C dialect           parent language   target
Handel-C [LWFK02]   C                 fpga
Mitrion-C           C++               fpga
c2h                 C                 fpga
cuda                C++               gpu
opencl              C99               gpu & manycore
Table 4.1: C dialects and targeted hardware.
4.1.2 From the ISA to C
In [JRR99], Jones et al. presented a language called C-- that aims to be
The interface between high-level compilers and retargetable, optimizing
code generators.
This language restricts the possibilities of the C language to make it easier to manipulate.
It is quite low level and does not match the requirements of a high-level language to abstract
concepts, but it paves the way for the idea of using a subset of C as an ir.
C extensions such as Embedded-C [ISO08] have also been proposed to handle some
specificities of heterogeneous computing, like dedicated registers, fixed-point arithmetic
and, most importantly for us, the availability of multiple address spaces. In the ISO/IEC
TR 18037:2008 specification, a global address space is assumed; additional ones can
be declared, in a nested fashion, and used as qualifiers. However, operations on disjoint
address spaces are not allowed, thus data communication must be handled through
overlapping address spaces or with intrinsic functions.
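The nesting rule can be made concrete with a toy Python model. The space names and the `may_mix` predicate below are illustrative, not taken from the TR:

```python
# Toy model of Embedded-C's nested address spaces: two spaces may be
# mixed in one operation only when one transitively encloses the other.
# The space names below are assumed for illustration.
PARENT = {"global": None, "device": "global",
          "scratch": "device", "io": "global"}

def ancestors(space):
    # yield the space itself and every enclosing space up to the root
    while space is not None:
        yield space
        space = PARENT[space]

def may_mix(s0, s1):
    # operations on disjoint (non-nested) address spaces are rejected
    return s1 in ancestors(s0) or s0 in ancestors(s1)
```

Here "scratch" may be mixed with "global" (it is nested inside it), but not with the sibling space "io"; such a pair would require an explicit copy through an overlapping space or an intrinsic.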
In the context of heterogeneous computing, an ir must achieve two goals:
1. make it easy for compiler developers to write passes and analyses;
2. make it easy for compiler developers to write back-ends.

__m128 _mm_set1_ps(float);
Listing 4.1: Broadcast a single value in sse.

typedef struct {
    float data[4];
} __m128;
Listing 4.2: Vector type emulation in C.
The Hierarchical Control Flow Graph (hcfg) used in Paralléliseur Interprocedural
de Programmes Scientifiques (pips) [CJIA11] somehow matches the first point under the
constraint of source-to-source compilation, the idea being that the code hierarchy maps
the hardware hierarchy. For instance, if you target Instruction Level Parallelism (ilp),
you can focus on loops and ignore the rest of the program; for task parallelism, you can
focus on functions, etc. The second point raises the question: how to model a hardware-specific
isa at the ir level? As described in [PH09], each piece of hardware has its own
isa, and it is not sustainable to extend the ir to match each new hardware specificity, or
each time a new feature appears. This is however the path taken by some compilers such
as rose [Qui00, SQ03], where Parallel Thread eXecution (ptx) concepts are exhibited at
the ir level. A direct consequence of this approach is that all existing algorithms must be
extended to take these new constructs into account. In this section, we propose an alternate
approach that consists in modeling enough of the isa at the C level to leverage existing
analyses.
Let us take a short example from the C to Streaming simd Extension (sse)
compiler described in Chapter 7. To duplicate a single-precision floating-point value
in all 4 slots of an sse register, an intrinsic is available with the signature given in Listing 4.1.
Instead of adding vector type support to the ir, one could emulate its behavior using
an array type, encapsulated in a structure to be able to use it as a value returned from a
function, as shown in Listing 4.2, and provide a sequential implementation that performs
the same computations and has the same memory effects. That way the same result is
produced and the data dependencies are still correct. In this case, it leads to the code in
Listing 4.3.
To validate our approach, we have written a replacement of the header file xmmintrin.h
that does not make use of vector extensions. In other words, we embody the sse extension
using the C language. An excerpt of this file is shown in Appendix D. It has been
successfully tested on an sse-based implementation of the SHA-1 algorithm: the project
is compiled twice, once using the default configuration and once using the same configuration
plus an additional flag that tells the compiler to use our sequential implementation
of xmmintrin.h. The same inputs are processed, and we verify that we get the same checksum.
__m128 _mm_set1_ps(float v) {
    __m128 res = { { v, v, v, v } };
    return res;
}
Listing 4.3: Sequential implementation of _mm_set1_ps.
Additionally, we measure that the sequential implementation runs 25 times slower than
the optimized version, mainly due to repeated copies between union data types 1, plus the
absence of parallelism. 2
4.2 Native Data Types
As a result of the specialization of hardware devices, it is more common to find a device
that does not support a data type than a device that supports a data type not supported
by the C language. This section examines how to handle languages that do not support
all the type constructions allowed by the C language.
4.2.1 Scalar Types
A common case is the absence of a floating-point type. The usual fall-back is
fixed-point arithmetic. This approach is favored when a Floating-Point Unit (fpu) would
be too expensive, in area or energy consumption, or when the accuracy loss is not a problem.
Kum et al. [KKS00] propose an approach to convert C programs with floating-point types
into equivalent programs with fixed-point arithmetic.
Some hardware supports standard types with unusual sizes. For instance, the terapix
architecture [BLE+08] uses 36-bit integer registers. This kind of situation is easily
emulated using a type definition, much as the C99 standard headers define the int32_t and
int64_t types.
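As a sketch of what such an emulated type must preserve, the wrap-around behaviour of a hypothetical 36-bit signed integer can be modelled as follows (the function name is ours, not terapix's):

```python
# Model of the two's-complement semantics a 36-bit integer type
# definition must preserve when emulated on a wider host type.
BITS = 36

def trunc36(v):
    v &= (1 << BITS) - 1          # keep the low 36 bits
    if v & (1 << (BITS - 1)):     # sign bit set: value is negative
        v -= 1 << BITS
    return v
```

Every arithmetic result is passed through this truncation, so that overflow wraps exactly as it would in a genuine 36-bit register.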
4.2.2 Records
Handling the absence of structures requires more care. The obvious solution is to split
each variable that has a structure type into as many variables as the number of fields.
The first step builds the set of all final types involved in a type definition. To do so we
introduce a function final_types : T × I → P(T × I) characterized by the following
induction rules, defined on the LuC language (see Appendix B):
1. These copies could be avoided by using the C++ language, constant references and return-value
optimization, at the expense of losing C compatibility.
2. Neither gnu C Compiler (gcc), Intel C++ Compiler (icc) nor pips are capable of automatically
revectorizing the sequential code because of the additional structure copies.
final_types(int, id) = {〈int, id〉}
final_types(float, id) = {〈float, id〉}
final_types(complex, id) = {〈complex, id〉}
final_types(type[expr], id) = ⋃_{〈t,i〉 ∈ final_types(type, id)} {〈t[expr], i〉}
final_types(struct id_s { fields }, id) = ⋃_{〈t,i〉 ∈ fields} final_types(t, new_id(id, i))
where new_id is a function that constructs a new identifier unique to the program p,
using a prefix pre such that no variable declared in p is prefixed by "pre":
new_id : I × I → I
(i0, i1) ↦ prei0_i1
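A minimal Python sketch of these rules, over an assumed tuple encoding of LuC types, could read:

```python
# Assumed encoding of LuC types: ("int",), ("float",), ("complex",),
# ("array", elem_type, size) and ("struct", [(field_type, field_name)]).
def new_id(i0, i1, pre="__"):
    # fresh identifier; pre is assumed not to prefix any program variable
    return f"{pre}{i0}_{i1}"

def final_types(ty, ident):
    kind = ty[0]
    if kind in ("int", "float", "complex"):
        return {(ty, ident)}
    if kind == "array":
        _, elem, size = ty
        # an array of a split type becomes one array per final element
        return {(("array", t, size), i) for t, i in final_types(elem, ident)}
    if kind == "struct":
        _, fields = ty
        return {pair for t, i in fields
                     for pair in final_types(t, new_id(ident, i))}
    raise ValueError(f"unknown type kind: {kind}")
```

On the type of Listing 4.4, this yields the pairs 〈int, __var_f0〉, 〈float, __var_f1〉 and 〈float[3], __var_f2〉.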
Then all references are rewritten using the rename : R → R function defined by the
induction rules:
rename(id) = id if id is a scalar
rename(ref[expr]) = rename(ref)[renamee(expr)]
rename(ref.id) = new_id(rename(ref), id)
where renamee : E → E is defined by:
renamee(cst) = cst
renamee(ref) = rename(ref)
renamee(expr0 op expr1) = renamee(expr0) op renamee(expr1)
This transformation is illustrated by Listing 4.4, under the assumption pre = __.
typedef struct { int f0; float f1; float f2[3]; } my;
my var;
var.f0 = 2;
var.f2[var.f0] = 4.2;
⇓
int __var_f0; float __var_f1;
float __var_f2[3];
__var_f0 = 2;
__var_f2[__var_f0] = 4.2;
Listing 4.4: Example of structure removal.
This transformation is problematic in three cases:
– when a variable of structure type is passed as a function parameter;
– when the address of a variable involved in a structure is taken;
– when the sizeof operator is used.
These situations cannot be handled by our toy language but arise in practice. In the
first case, it is still possible to extend the function definition to take one parameter per
final_types(t) element, as illustrated by Listing 4.5. In the second case we cannot do much,
because the memory layout is completely changed by the transformation and we cannot
assume any previous pointer arithmetic is still correct. The third case is easy to handle,
by replacing the sizeof expression with the actual type size. 3
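For the third case, the folding of sizeof to a constant can be sketched as follows; the per-type sizes are assumed for illustration and padding is ignored:

```python
# Sketch of folding sizeof once the target is fixed. Uses the same
# assumed tuple encoding of types as above; the base sizes are
# illustrative, not those of any particular target.
SIZEOF = {"int": 4, "float": 4, "complex": 16}

def type_size(ty):
    kind = ty[0]
    if kind in SIZEOF:
        return SIZEOF[kind]
    if kind == "array":
        _, elem, size = ty
        return size * type_size(elem)
    if kind == "struct":
        _, fields = ty
        return sum(type_size(t) for t, _ in fields)  # padding ignored
    raise ValueError(f"unknown type kind: {kind}")
```

As noted in footnote 3, at this point the code is already specialized for one target, so baking in concrete sizes is harmless.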
4.2.3 Arrays
It is also possible to handle the absence of array types under two conditions: all arrays
must have a fixed size, and access functions must use constant indices. 4 This is simply done by creating
as many variables as the array size and then using a renaming convention for all constant
array indices. The declaration transformation d : T × I → (T × I)+ is given by:
d(int, id) = {〈int, id〉}
d(float, id) = {〈float, id〉}
d(complex, id) = {〈complex, id〉}
d(type[cst], id) = ⋃_{i=1}^{cst} d(type, new_id(id, i))
3. At that point we are already at the specialization step, so portability across hardware platforms is no
longer an issue, and considerations like "the size of a type is platform dependent" are no longer relevant.
4. This situation was found in code automatically generated from the Faust programming language
[GO03].
typedef struct { double re, im; } complex;
void cmul(complex *res, const complex *c0, const complex *c1) {
    res->re = c0->re*c1->re - c0->im*c1->im;
    res->im = c0->re*c1->im + c0->im*c1->re;
}
/* ... */
complex a = {1, 0}, b = {0, 1}, c;
cmul(&c, &a, &b);
⇓
void cmul(double *res_re, double *res_im,
          const double *c0_re, const double *c0_im,
          const double *c1_re, const double *c1_im) {
    *res_re = *c0_re * *c1_re - *c0_im * *c1_im;
    *res_im = *c0_re * *c1_im + *c0_im * *c1_re;
}
/* ... */
double a_re = 1, a_im = 0, b_re = 0, b_im = 1, c_re, c_im;
cmul(&c_re, &c_im, &a_re, &a_im, &b_re, &b_im);
Listing 4.5: Structure removal in the presence of function call.
and the reference renaming r : R → R is similarly given by
r(id) = id if id is a scalar
r(ref[cst]) = new_id(r(ref), cst)
r(ref.id) = r(ref).id
which is similar to the structure case and suffers the same limitations.
A common case is the absence of array types but support for pointers. The transformation
from arrays to pointers requires two steps:
1. a linearization step, where multi-dimensional arrays are converted to uni-dimensional
ones;
2. an array-to-pointer conversion step, using the equivalence between a[i] and *(a+i).
Listing 4.6 illustrates the two steps of this transformation.
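The linearization step amounts to computing a row-major flat index from the access indices and the array dimensions; a minimal sketch:

```python
# Row-major linearization: for float a[n][3], the access a[i][j]
# becomes a[i*3 + j] on the flattened array float a[3*n].
def linearize(indices, dims):
    flat = 0
    for idx, dim in zip(indices, dims):
        flat = flat * dim + idx
    return flat
```

For the access a[2*i][1] on float a[n][3] with i = 2, this computes 4*3 + 1 = 13, matching the rewritten expression a[2*i*3+1] in Listing 4.6.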
4.3 Registers
Depending on the isa, some registers can only be used for particular operations. For
instance, the x86 architecture distinguishes between general-purpose registers and floating
// initial code
float a[n][3];
a[2*i][1]++;
⇓
// after linearization
float a[3*n];
a[2*i*3+1]++;
⇓
// after pointer conversion
float *a = alloca(sizeof(*a)*n*3);
(*(a+2*i*3+1))++;
Listing 4.6: Two-step transformation from multi-dimensional arrays to pointers.
point registers. In the ptx [NVI10], special registers are dedicated to thread identifiers
(%tid) and warp identifiers (%warpid).
According to the compilation scheme proposed in Figure 3.9, low-level transformations
are taken care of by the vendor compiler. However, there are situations where no such
compiler is available and only an assembler is provided; we found ourselves in this situation
for the terapix machine.
In that case, a specific naming scheme is used to distinguish specific registers from
others, and hardware-specific heuristics are used to distinguish general-purpose registers
from others.
Let us take the example of the terapix architecture [BLE+08], which uses three kinds of
registers with three prefixes: im for image pointers, ma for mask pointers and re for scalar
variables. For each variable, depending on its type (array, pointer or scalar), it is possible to
determine whether it needs the re prefix. Then, to distinguish between mask data, stored
in a small read-only memory, and image data, stored in a bigger read-write memory, we
use a heuristic: any array that is written at least once goes to the image memory, and
an array that is only read goes to the mask memory if its memory footprint is statically
known to be less than a given constant. Listing 4.7 illustrates this transformation for a
kernel that lightens up an image by a constant, given in a mask.
4.4 Instructions
Simple instructions usually have a C equivalent: set and move can be represented as
assignments, though more complex functions may be needed to represent loads from remote
memory, as is the case for vector registers. Likewise, reads and writes from external devices
can be abstracted as memcpy using the proper memory location abstractions, as discussed
in Section 4.5.

CHAPTER 4. REPRESENTING THE INSTRUCTION SET ARCHITECTURE IN C

void launcher_0_microcode(int I, int *in, int *out, int *cst)
{
    int j;
    int *out0, *in0;
    in0 = in;
    out0 = out;
    for (j = 0; j < I; j += 1) {
        *out0 = *in0 + *cst;
        in0 = in0 + 1;
        out0 = out0 + 1;
    }
}
⇓
void launcher_0_microcode(int *FIFO2, int *FIFO1, int *FIFO0, int N0)
{
    int *im0, *im1, *im2, *im3, *ma4;
    int re0;
    ma4 = FIFO2;
    im3 = FIFO1;
    im2 = FIFO0;
    im0 = im2;
    im1 = im3;
    for (re0 = 0; re0 < N0; re0 += 1) {
        *im1 = *im0 + *ma4;
        im0 = im0 + 1;
        im1 = im1 + 1;
    }
}
Listing 4.7: Using a naming convention to distinguish registers for the terapix architecture.
Basic arithmetic and logic operations are available in C; those that are not (e.g. Fused
Multiply-Add (fma) or the min and max operators) can be represented by an equivalent function.
Control flow instructions such as branching, looping, conditional branches, indirect
branches, jumps, etc. all have their C equivalent. If a complex control-flow construct is
missing in the targeted language, it is generally possible to lower it to the desired level.
Such transformations include the conversion of for loops to while loops, of repeat until
to while, or even of while to gotos. if then else tests can be split into two if then blocks, etc.
The maximum number of operands can be restricted by the isa. In that case, complex
C expressions can be split to take this constraint into account.
Parallel instructions, as found in Advanced Vector eXtensions (avx) instruction sets,
can be emulated by their sequential counterparts, which correspond to one of the possible
ways of scheduling the parallel operations.
4.4.1 Instruction Selection
Specialized instruction sets are used to speed up execution. Taken to the extreme, this
has led to Complex Instruction Set Computers (cisc). This specialization is often seen in
dsps: fma, saturated arithmetic, min/max, etc. While not mandatory to obtain correct code,
taking advantage of these instructions is mandatory for performance.
The process of mapping the target-independent ir to a target-specific instruction set
is called instruction selection and is one of the last steps performed in a traditional
compiler. The problem is generally solved using dynamic programming on expression
trees [AJ75] or by sub-graph partitioning [API03].
In our case, this process should be delegated to the source-to-binary compiler rather than
to the source-to-source compiler. However, it may be necessary to perform this step:
– before Single Instruction stream, Multiple Data stream (simd) instruction generation.
For instance the neon instruction set supports a vectorized fma, so the fma pattern
must be found prior to simdization;
– when the source-to-binary compiler does not perform complex instruction selection.
Basically, a few instructions have to be converted to function calls at the source level. As
the expression trees of these instructions either do not overlap (e.g. fma and maximum)
or are subsets of one another (e.g. maximum and saturated add), a greedy algorithm
performs well enough.
An instruction is described by its name, the number and type of its operands, and its
expression tree. In a very source-to-source manner, we represent an instruction by a regular
C function, using the function name as the instruction identifier, the parameters as operands
and the body as the expression tree. 5 From a list of patterns, the algorithm iteratively performs a
5. This approach is quite similar to the way C intrinsics move assembly operations to the C function
level. For instance gcc defines the intrinsic __sync_bool_compare_and_swap to issue an atomic
compare-and-swap. Compiling the call with gcc 4.6.1 on an x86 computer translates into the assembly
instruction lock cmpxchgl, which can only be represented in C through an asm(...) statement.
pattern-matching pass for each element of the sequence.
4.4.2 N-Address Code Generation
When a program contains long expressions, it is sometimes necessary to break them
into smaller ones. This transformation is called N-address code generation. It
is generally used for assembly code generation. It can however be useful at the source
level too, either because the back-end compiler takes assembly code as input, or to create
opportunities for further transformations like invariant code motion.
The problem can first be stated as follows: given a statement of the form id = expr,
where id is an identifier, and an integer n ∈ N⁺, transform it into a sequence of assignments
such that no expression on the right-hand side of an assignment involves more than
n references.
Let us define the transformation a : N⁺ × S → S:

    a(n, id = cst) = id = cst                                                   (4.1)
    a(n, id = ref) = id = ref                                                   (4.2)
    a(n, id = e0 op e1) =
        id = e0 op e1                                     if depth(e0) + depth(e1) < n
        a(n, id = e0) ; a(n, id0 = e1) ; id = id op id0   otherwise             (4.3)

where id0 is a new identifier unique to the program and depth : E → N is defined by

    depth(cst) = 1
    depth(id) = 1
    depth(id op expr) = 1 + depth(expr)
    depth(expr0 op expr1) = depth(expr0) + depth(expr1)
Let us prove that this transformation is correct, that is, given n ∈ N⁺, a memory state
σ ∈ Σ and a statement s ∈ S of the form s = (id = expr), we have:

    S(s, σ) = S(a(n, s), σ)

Proof. We proceed by induction. The input statement is unchanged by Equations (4.1)
and (4.2), so the equality holds in these two cases.
Case (4.3), when depth(expr0) + depth(expr1) ≥ n, leads to

    S(a(n, s), σ)
    = S(a(n, id = expr0) ; a(n, id0 = expr1) ; id = id op id0, σ)
    = S(id = id op id0, S(a(n, id0 = expr1), S(a(n, id = expr0), σ)))
    = S(id = id op id0, S(id0 = expr1, S(id = expr0, σ)))             by induction hypothesis
    = S(id = id op id0, S(id0 = expr1, σ[R(id, σ) → E(expr0, σ)]))    writing σ′ for this state
    = S(id = id op id0, σ′[R(id0, σ′) → E(expr1, σ′)])
    = S(id = id op id0, σ′[R(id0, σ) → E(expr1, σ)])                  writing σ″ for this state
    = σ″[R(id, σ″) → E(id op id0, σ″)]
    = σ″[R(id, σ″) → E(id, σ″) op E(id0, σ″)]
    = σ″[R(id, σ″) → σ″(R(id, σ″)) op σ″(R(id0, σ″))]

Because id0 is an identifier, ∀(σ0, σ1) ∈ Σ², R(id0, σ0) = R(id0, σ1), and similarly for id, so

    = σ″[R(id, σ) → σ″(R(id, σ)) op σ″(R(id0, σ))]
    = σ″[R(id, σ) → E(expr0, σ) op E(expr1, σ)]
    = S(id = expr0 op expr1, σ)[R(id0, σ) → E(expr1, σ)]

The update of location R(id0, σ) can be safely ignored as id0 is a new unique identifier.
In the definition of the problem, we stated that an assignment must not contain more
than n references. However, a reference can itself contain expressions. Let us define an
auxiliary function ar : N⁺ × R → (R × S) with the following syntactic rules:

    ar(n, id) = ⟨id, ;⟩
    ar(n, ref[expr]) = ⟨id0[id1], sr ; id0 = rr ; id1 = expr⟩ where ⟨rr, sr⟩ = ar(n, ref)
    ar(n, ref.id) = ⟨id0.id, sr ; id0 = rr⟩ where ⟨rr, sr⟩ = ar(n, ref)

Given n ∈ N⁺, a memory state σ ∈ Σ and a reference r ∈ R, let us prove that, for
⟨rr, sr⟩ = ar(n, r),

    R(r, σ) = R(rr, S(sr, σ))
Proof. We use an inductive reasoning over r. The property is true for the case id, as the
denoted reference is unchanged.
Consider the case r = ref[expr], that is

    ar(n, r) = ⟨rr, sr⟩ = ⟨id0[id1], s′r ; id0 = r′r ; id1 = expr⟩ where ⟨r′r, s′r⟩ = ar(n, ref)

We have

    R(rr, S(sr, σ))
    = R(rr, S(s′r ; id0 = r′r ; id1 = expr, σ))
    = R(rr, S(id1 = expr, S(id0 = r′r, S(s′r, σ))))           writing σ′ for S(s′r, σ)
    = R(rr, S(id1 = expr, σ′[R(id0, σ′) → σ′(R(r′r, σ′))]))   writing σ″ for this state
    = R(rr, σ″[R(id1, σ″) → E(expr, σ″)])                     writing σ‴ for this state
    = R(id0[id1], σ‴)
    = R(id0, σ‴)[E(id1, σ‴)]
    = R(id0, σ‴)[σ‴(R(id1, σ‴))]
    = R(id0, σ‴)[E(expr, σ″)]
    = R(id0, σ‴)[E(expr, σ)]       since id0 and id1 are fresh, expr does not refer to them
    = R(r′r, σ′)[E(expr, σ)]       since id0 holds the location R(r′r, σ′) in σ‴
    = R(r′r, S(s′r, σ))[E(expr, σ)]
    = R(ref, σ)[E(expr, σ)]        by induction hypothesis
    = R(ref[expr], σ)
Next, consider the case r = ref.id, that is

    ar(n, r) = ⟨rr, sr⟩ = ⟨id0.id, s′r ; id0 = r′r⟩ where ⟨r′r, s′r⟩ = ar(n, ref)

We have

    R(rr, S(sr, σ))
    = R(rr, S(s′r ; id0 = r′r, σ))
    = R(rr, S(id0 = r′r, S(s′r, σ)))                writing σ′ for S(s′r, σ)
    = R(rr, σ′[R(id0, σ′) → σ′(R(r′r, σ′))])
    = R(id0.id, σ′[R(id0, σ′) → σ′(R(ref, σ))])     by induction hypothesis
    = R(id0, σ′[R(id0, σ′) → σ′(R(ref, σ))]).id
    = R(ref, σ).id
    = R(ref.id, σ)
    = R(r, σ)
The application of ar generates valid input for a, so that code containing no more
than n identifiers per statement can be generated.
4.5 Memory Architecture
The logical memory model of the C language is flat. Qualifiers can be used to change
some storage properties: const, volatile, register, etc. The heap and stack concepts
are not part of the C language itself.
However, heterogeneous machines have several separate memories. It is still possible
to emulate these different memories by using a different address space for each memory,
much like user space is separated from kernel space. To distinguish between two memory
spaces, we use a naming convention on the variable type name. For instance, variables
allocated on a gpu have their type name prefixed by __gpu_. Another option is the
qualifier extension proposed for Embedded C [ISO08], but it implies an extension of the ir.
4.6 Function calls
4.6.1 Removing Function Calls
Some isas simply do not support function calls. At the source level, inlining [CH89,
JM99] is a convenient way to sidestep the issue without breaking the code structure,
as stack emulation or indirect gotos would, as long as recursive calls are not involved.
Although inlining seems a straightforward transformation, several details must be
taken care of in the context of source-to-source compilation, because the generated
code must still be correct:
1. if the inlined function contains static declarations, these declarations must be made
global;
2. if the inlined function contains references to global variables, they must be declared
with external storage at the call site; 6
3. if the inlined function contains references to enumerations or structure fields that
are not visible from the caller's compilation unit, they must be redeclared in this
compilation unit; 7
4. most naming conflicts can be dealt with using an extra block declaration, but not
when static variables are promoted to globals, nor for label names or effective
parameter names;
5. return instructions must be replaced by a goto, possibly preceded by an assignment.
A basic inlining simulates by-copy parameter passing through additional assignments
and generates a lot of gotos. Yet it is possible to use forward substitution [Muc97] to
remove the spurious assignments and goto elimination [Ero95] to restructure the control
flow.
4.6.2 Outlining
Most languages dedicated to hardware accelerators use functions to separate the host
code from the accelerator code, generally using a new qualifier (e.g. __kernel__) to identify
accelerator functions. However, the code fragment that is to be promoted as a kernel is
usually part of another statement, so it is necessary to extract a function from a code
fragment, that is, a statement. This transformation is called outlining.
4.6.2.1 Outlining Algorithm
There exist three main ways of passing parameters:
by copy: during the function call, the formal parameter is bound to a copy of the actual
parameter. It holds the same value but has a (possibly) different name and a different
location. This is the passing mode used in C.
by reference: during the function call, the formal parameter is directly replaced by the
actual parameter. It holds the same value and has the same location. It is emulated in C
by passing the address of a variable instead of the variable itself. This is the
passing mode used in Fortran.
6. This also includes function calls.
7. This also includes calls to static functions.
by constant reference: during the function call, the formal parameter is directly replaced
by the actual parameter, as in by-reference parameter passing, but it is guaranteed
that the corresponding memory locations are not written in the function body.
Passing a parameter by copy generates a copy of the full variable. A common optimization
for a variable with a non-trivial memory footprint is to pass it by reference
in order to avoid the extra copy. To make it clear that the variable is read-only,
it is common practice to pass it as a constant reference. 8
The goal when designing the outlining transformation is to pass as few parameters
as possible to the generated functions, with the most restrictive and efficient parameter
passing mode.
Let outline : S × Σ → P(R) × P(R) × P(R) be a function that maps a statement
in a memory state to a triplet containing the parameters passed by copy, the parameters
passed by reference and the parameters passed by constant reference. It is defined by:

    outline(s, σ) = ⟨ {r ∈ Ri(s, σ) − Ro(s, σ) | typeof(r) ∈ Tscalar},
                      Ro(s, σ),
                      {r ∈ Ri(s, σ) − Ro(s, σ) | typeof(r) ∉ Tscalar} ⟩

where Tscalar is the set of all scalar types.
Each reference gathered by the function is textually used as an effective parameter
and replaced in s by a new identifier of the corresponding type. Note that Ri(s, σ) and
Ro(s, σ) automatically filter out private variables and locally declared variables.
Listing 4.8 illustrates the outlining process, in which the internal loop is outlined as a
new function kernel.
The example from Listing 4.8 can be further improved using a variant of common
subexpression elimination: the variable i is only used to compute the sub-arrays in[i]
and out[i]. It is possible to detect this situation and generate code as in Listing 4.9.
The basic idea behind this transformation is to scan all statement references and
look for a constant prefix. Let us first introduce the concept of reference prefix.

Definition 4.2. Given (r, r′) ∈ R², r′ is a prefix of r, denoted r′ ≺ r, if and only if
∃n ∈ N⁺, ∃(e1, . . . , en) ∈ Eⁿ : r = r′[e1][. . . ][en].

For instance a[2*i] is a prefix of a[2*i][k+1].
Given a set of references R ∈ outline(s, σ) and a reference r ∈ R, if ∃r′ ∈ R such that
r′ ≺ r, and if ∀k ∈ 1..n, no reference x appearing in ek is written in s (x ∉ Rw(s, σ)),
then the prefix is constant with respect to statement s and a constant prefix reference is
found. r′ is then used as the effective parameter instead of r and the substitution in s is
performed accordingly.
8. This practice is extensively used in C++.
void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i < n; i++)
        ...
}
Listing 4.8: The internal loop of erode is outlined into a new function kernel.
4.6.2.2 Using Outlining to Reduce Compilation Time
Outlining can also be used to reduce compilation time. Indeed, many compiler algorithms
have polynomial if not exponential complexity. 9 In that case it is beneficial to split
a complex function into several parts that can be considered independently: if the complexity
c(s) of a statement transformation satisfies c(s0; s1) > c(s0) + c(s1),
then it is profitable to split computations.
For instance a sequence of two loops can be outlined into two distinct functions and
compiled separately. Of course this decision can kill some optimization opportunities, and
thus must be used with care. We have however found it especially useful in some situations,
like the generation of vector instructions in a function that contains two loop nests that
cannot be merged.
As many transformations focus on loop nests, we propose to isolate each loop nest in
a single function using outlining and to apply further transformations on these functions.
Note that loop fusion is tried before outlining, because it would be difficult to apply it after
outlining. This process is described in Algorithm 2.
Data: f ← a function
Result: L, a list of functions
loop_fusion(f);
k ← 0;
L ← ∅;
for s ∈ outer_loops(f) do
    outline(s, fk);
    L ← L ∪ {fk};
    k ← k + 1;
end
return L;
Algorithm 2: Compilation complexity reduction with outlining.
A variant of Algorithm 2 is used for Multimedia Instruction Set (mis) generation: after
some loop tiling to improve locality, the code that performs the computations on a tile
is outlined to a new function, because each tile can then be considered independently of the
others.
We carried out the following experiment on the pips convex array region analysis: starting
from two consecutive matrix multiplications, each loop nest is outlined to a new function
and its innermost loop is unrolled by a given rate. This results in several code versions with
an increasing number of statements. We then performed the computation of the convex
array regions on each version.
Figure 4.1 shows that the inter-procedural analysis introduces an overhead, but
when the loop body holds enough statements, it is beneficial to apply the analysis on
separate functions.
9. Some cases of loop fusion are solvable in polynomial time and others are NP-complete [Dar99].
Figure 4.1: Using outlining to reduce analysis time on an unrolled sequence of matrix
multiplications. The plot reports the analysis time in seconds against the unroll rate,
with and without outlining.
4.7 Library Calls
The C language is minimalist: little functionality is present in the language itself and many
concepts are implemented in libraries (e.g. threading support, logging, etc.).
As a consequence, the use of libraries is very common, and we cannot assume that the
input code contains no library calls.
The problem with libraries is that their source code may not be available on the targeted
platform, or a similar library exists but with a slightly different Application
Programming Interface (api). This raises two issues:
1. How do we perform an inter-procedural analysis on external calls?
2. How can accelerator code call an external library?
To handle this problem, we introduce the concept of a stub broker. A stub broker is a
runtime library that interacts with the compiler to manage a collection of functions,
representing each function as a 3-tuple ⟨stub, seq, {⟨archi, impl⟩}⟩, where stub is a C stub of
the function that has the same memory effects and is analyzable by a compiler, and seq is a
sequential version of the function that has the same semantics as the original function. 10 A
couple ⟨archi, impl⟩ stores the implementation of the function for a particular architecture.
The compiler infrastructure performs requests to the function broker, asking for a function
10. It may be similar to stub but can also forward the call to external libraries, something that a stub
cannot do because it must be analyzable by the compiler infrastructure, and thus self-contained.
and either a stub, a sequential version or a particular architecture version. The broker
answers with the appropriate code, or with an error to notify the infrastructure that the
request cannot be fulfilled.
During parsing and analysis of its input, the compiler infrastructure asks for stubs.
During translation from the ir to the Textual Representation (tr), it asks for a sequential
version, and during specialization, i.e. during the post-processing step described in
Figure 3.9, the target-specific implementation is used.
A particular case of external calls is I/O. Depending on the hardware, I/O may be
limited to data transfers through a Direct Memory Access (dma) call, or it can be done
through different devices (screen/printer/socket/. . . ). As the input code is not hardware-specific,
it cannot be aware of these facilities. However, the stub broker abstraction can
benefit from them to provide working implementations of external calls using
hardware-specific calls.
4.8 Conclusion
In this chapter, we have enumerated the different aspects of an isa and we have shown
that most characteristics can be represented at the C level using proper conventions.
To this end, we have listed basic transformations that can be incrementally used to lower
C code down to assembly-like code while remaining compatible with a C compiler:
conversion from arrays to pointers, structure removal, constant array scalarization,
n-address code generation, inlining, outlining, instruction selection, etc. have been detailed
as source-to-source transformations. This set of fine-grain transformations enforces reuse
and adaptability to the target.
This approach enforces the principle of "C as an Internal Representation" and is the
key to using a source-to-source compiler as a bridge between regular C code and the
C dialects used to program many hardware accelerators.
Experimental results are given in Chapter 7, Sections 7.2, 7.3 and 7.4. The validity
of the proposed transformations is illustrated in Chapter 7. The next chapter discusses the
impact of parallelism constraints on compilers for heterogeneous platforms.
Chapter 5
Parallelism with Multimedia Instructions
A recurring feature of hardware accelerators is their use of parallelism to provide
speedup. This parallelism can take various forms and levels, generally a mixture of Single
Instruction stream, Multiple Data stream (simd) and Multiple Instruction stream, Multiple
Data stream (mimd) parallelism, as found in General Purpose gpus (gpgpu). Both
kinds of parallelism have been studied for a long time, with a focus on loop parallelization:
hyperplane loop transformation [Lam74], handling of control dependence [AKPW83], loop
vectorization [AK87], parallelism extraction [WL91b], supernode partitioning [IT88],
communication optimizations [DUSsH93], interaction with caching [KK92] and tiling [DSV96,
AR97, YRR+10]. David F. Bacon et al. wrote an interesting survey [BGS94] on compiler
transformations for High Performance Computing (hpc) that includes many loop
transformations. Vivek Sarkar studied the automatic selection of transformations based on a
cost model [Sar97]. These techniques have been applied successfully both in research compilers
(SUIF [WFW+94], Polaris [PEH+93], Paralléliseur Interprocédural de Programmes
Scientifiques (pips) [IJT91, AAC+11], Rose [Qui00], Pocc [PBB10]) and in production
compilers (IBM XL [Sar97], Low Level Virtual Machine (llvm) [GZA+11], gnu C Compiler
(gcc) [TCE+10], Intel C++ Compiler (icc) [DKK+99], pgi [Wol10]), where they are used to
detect or extract parallelism and overcome some parallelization issues.
In this chapter, we focus on two aspects of code parallelization: Instruction Level
Parallelism (ilp) and reduction parallelization. The former takes advantage of intra-loop or
intra-sequence parallelization opportunities using the Multimedia Instruction Sets (miss)
available in most modern processors, and is detailed in Section 5.1. The latter is a critical
concern when a code involving a reduction must be mapped on pure simd hardware,
and is addressed in Section 5.2. Section 5.3 proposes a simple model based on the
parallelism found in remote accelerators to decide whether or not it is profitable to offload a
computation.
5.1 Super-word Level Parallelization
Many processors now have a small vector unit, used by the main processor as a small
accelerator to speed up regular computations, typically multimedia applications. Modern
Central Processing Units (cpus) have 128-bit (e.g. arm Cortex-A), 256-bit (e.g. Intel
Sandy Bridge) or even 512-bit (e.g. Intel Larrabee) vector units. To automatically take
advantage of this extra computing power, there are two approaches: vector parallelism, at
the loop level, and Super-word Level Parallelism, at the block level. This section presents
an algorithm that combines both approaches at the source level while maintaining
retargetability: the proposed algorithm is parametrized by the targeted mis.
5.1.1 Related Work
Several approaches are able to take advantage of miss such as those found in Intel,
amd and arm processors. Writing inline assembly code remains the best option for those
who seek speedup, but prohibitive development costs, difficulty of maintenance and
limited portability all restrict this approach to critical code segments only. For instance, the
source code of the open-source project mplayer contains many multimedia kernels that
use manually tuned assembly code, such as the excerpt in Listing 5.1.
Figure 5.1 illustrates several abstractions that can improve code portability beyond
plain assembly:
intrinsics: C functions that map directly to a sequence of one or more assembly
instructions. This option remains a low-level one, but it is portable across compilers;
vector types: syntactic sugar has been added to gcc with predefined vector types, but it
is not portable to other compilers. Moreover it only deals with arithmetic operators,
so it exposes a limited set of operations. The ArBB library [NSL+11] uses a
similar approach based on C++ templates and operator overloading;
auto-vectorization: for simple cases, an alternative is to let the compiler automatically
vectorize the sequential version. It is the only approach that does not change the
development cost, but it offers few guarantees of performance.
5.1. SUPER-WORD LEVEL PARALLELIZATION 93<br />
__asm__ volatile (
    "movd            %4, %%xmm5 \n"
    "pxor        %%xmm7, %%xmm7 \n"
    "pshuflw $0, %%xmm5, %%xmm5 \n"
    "movdqa          %6, %%xmm6 \n"
    "punpcklqdq  %%xmm5, %%xmm5 \n"
    "movdqa          %5, %%xmm4 \n"
    "1: \n"
    "movq       (%2,%0), %%xmm0 \n"
    "movq       (%3,%0), %%xmm1 \n"
    "punpcklbw   %%xmm7, %%xmm0 \n"

Listing 5.1: Excerpt from the libmpcodecs/vf_gradfun.c file from the mplayer source tree.
Nonetheless, most developers of non-time-critical code rely on automatic vectorization. In that field, proprietary compilers such as icc still outperform open-source ones like gcc or llvm on their processors. This is checked using the linpack [DLP03] benchmark to compare the llvm, gcc and icc vectorization engines. Figure 5.2 shows the score of icc, gcc and llvm on a desktop station running a 2.6.38-2-686 GNU/Linux kernel on an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz. icc version 12.0.3 is run using the -O3 flag, llvm version 2.7 is run using the -O3 -march=native -ffast-math flags, and gcc version 4.6.1 is run using the -O3 -march=native -ffast-math flags.

Of course gcc and, to a lesser extent, llvm support a wider variety of targets than icc, including arm processors. On the other hand, icc achieves better performance. Nonetheless, from the application developer's point of view, auto-vectorization is the way to go, provided the generated code is efficiently vectorized.
In fact, from a compiler developer's point of view, the following points are strong constraints:
– instruction sets are in constant evolution: Figure 5.3 summarizes their evolution over the past ten years and shows, for example, a steady evolution from Matrix Math eXtension (mmx) to Advanced Vector eXtensions (avx) in x86 processors;
– debugging generated code (or intermediate code) is difficult;
– integrating new code transformations is a long-term task;
– dealing with code written in a legacy instruction set is difficult.
This means that in addition to the constraints of auto-vectorization and efficiency of generated code, compilers must also be retargetable to keep up with the hardware design pace.
Several solutions have been proposed to tackle the challenge of efficient simd code generation: the detailed view of icc internals given in [Bik04] shows that performance is reached through the intensive use of loop vectorization techniques [BGS94] and that it relies on re-rolling techniques to vectorize already manually unrolled loops. For obvious economic reasons, it does not focus on retargetability issues, whereas llvm [LA04, CSY10] and gcc [GS04, RNZ07] do. The latter two both provide auto-vectorizers, at an early stage
94 CHAPTER 5. PARALLELISM WITH MULTIMEDIA INSTRUCTIONS<br />
movaps -24(%ebp), %xmm0
movaps -40(%ebp), %xmm1
addps   %xmm1, %xmm0
movaps  %xmm0, -56(%ebp)

#include <xmmintrin.h>
foo() {
    __m128 v0, v1, v2;
    v0 = _mm_add_ps(v1, v2);
}

#include <xmmintrin.h>
foo() {
    __v4sf v0, v1, v2;
    v0 = v1 + v2;
}

foo() {
    float v0[4], v1[4], v2[4];
    for (int i = 0; i < 4; i++)
        v0[i] = v1[i] + v2[i];
}
Figure 5.2: Comparison of llvm, gcc and icc vectorizers using linpack.
Figure 5.3: Multimedia Instruction Set history for x86 processors.
for llvm [CSY10], more advanced for gcc, with two approaches. One [RNZ07] is based on the combination of loop vectorization techniques and Super-word Level Parallelism (slp) [LA00, SCH03, SHC05]; the other relies on the strength of the polyhedral model [TNC+09, GZA+11].

Indeed, the historic approach to simd instruction generation is inherited from loop vectorization techniques, where loop nests are first optimized (e.g. using loop fusion, interchange, skewing, distribution. . . ) then strip-mined by the register width. Larsen et al. introduced in [LA00] the slp algorithm, which uses a pattern matching algorithm to find vector instructions in code sequences, thus capturing potentially more parallelism than the previous method. However, it cannot optimize loop nests with the same efficiency as polyhedron-based approaches. A good illustration of this assessment is the introduction of the unroll-and-jam transformation in the context of slp: it is a particular case of loop tiling combined with loop unrolling [WL91a].
Some papers [ZC98, BGGT02, JMH+05] focus on the discovery of instruction-set-specific patterns, e.g. Fused Multiply-Add (fma) or horizontal add. Finding such patterns provides significant improvement for some kernels, depending on the target architecture.

The growing number of mis and the steady pace at which they evolve have led to several attempts to build a retargetable vectorizer: swarp [PBSB04] used annotated C code to describe simd instruction patterns and combined this representation with a generic pattern matching engine. This approach offers a very flexible way to describe mis, but as register widths grow (as one can expect, considering the forthcoming Larrabee, which contains a 512-bit vector processing unit), pattern matching gets slower and less practical. Moreover, according to its author, the pattern description proved "too hard to maintain". The approach of [HEL+09] achieves retargetability through a detailed description of each instruction at the source level. Pre-processing phases are in charge of applying unrolling and scalar expansion before its vectorization engine runs.
Recently, joint work by Nuzman et al. [NRD+11] has proposed an interesting new approach to the problem. The vectorization engine generates calls to an abstract mis parametrized by the vector length. A Just In Time (jit) compiler is in charge of generating target-dependent code at execution time. Likewise, optimizations related to data alignment can be deferred. That way, there is no need to recompile an application to benefit from the vector instruction unit. But because vectorization is performed in a target-independent way at compile time, this approach misses some optimization opportunities: intra-loop vectorization or slp, cycle shrinking [BGS94], etc. It is also difficult to perform a vectorization profitability estimation without information about the target.

In spite of the many efforts of the research community, production compilers either target a single architecture (relatively) efficiently, e.g. icc, or multiple architectures inefficiently, e.g. gcc.
typedef float v4sf[4];

Listing 5.2: Sample representation of a vector register using a C type.
5.1.2 A Meta-Multimedia Instruction Set

To achieve retargetability, decoupling the code transformations from the targets is necessary: we propose a generic and parametric mis as a single target for the whole transformation process. This kind of meta-mis has already been proposed in the past [Roj04, Sch09]. Our instruction set is parametrized by the vector size but, unlike their approach, this size is set at compile time and not at execution time, which provides more vectorization opportunities. Another difference with existing approaches is that all instructions and vector types are described in C, following the principle given in Section 4.1, so that the compiler infrastructure can analyse them and perform transformations that are correct with respect to the sequential implementation.

The supported simd types are vectors of scalars: single/double precision floating point, and 8- to 128-bit integers. The length of these vectors is a parameter. They are represented internally as plain arrays of the corresponding types and sizes. A typedef is used to wrap these vector types to ease code generation. Listing 5.2 illustrates this approach for a vector of 4 single precision floats; the typedef naming convention is gcc's. Complex numbers are treated as arrays of two elements to be compatible with the vector representation.
The vector operations supported by the meta-mis are taken from the union of the avx and neon mis, restricted to pure simd instructions:
– common mathematical operators: addition, subtraction, multiplication and division;
– trigonometric functions: sine, cosine;
– comparisons: equal, greater/lesser than;
– the multiply-add operation, implemented in neon and proposed in avx, but not yet implemented in the Sandy Bridge architecture;
– logical operations;
– data movement: packed and unpacked loads and stores;
– memory reorganization operations: shuffle and broadcast.
Non-simd instructions such as haddps from Streaming simd Extension (sse) 4.2, a horizontal add on single precision floats, are more difficult to deal with because the pattern matching algorithm has a non-linear complexity. 1
The meta-mis is parametric in the vector length: at code generation time and depending on the targeted mis, all operations are generated for the given vector length. Figure 5.4 shows a sample usage of this mis for a vector size of 128 bits.

An n-instance of the meta-mis, denoted n-mis, is the set of all types and operations for a vector of n bits. For instance, Figure 5.4 is an example of the 128-mis. Given an n-mis, it is possible to generate a set of patterns that fully describes the mis in a form easier to

1. The authors of the paper on swarp gave up on the generic approach when avx was released because the pattern matching algorithm did not scale well for 256-bit vector registers.
v4sf vec0, vec1, vec2;
/*...*/
for (i0 = 0; i0
void SIMD_MULADD_PS(float w[4], float x[4], float y[4], float z[4]) {
    for (int i = 0; i < 4; i++)
        w[i] = x[i]*y[i] + z[i];
}
for (i0 = 0; i0
a[0] = a[0] + b[0] * c[0]; (s0)
a[1] = a[1] + b[1] * c[1]; (s1)
a[1] = a[1] + b[1] * c[2]; (s2)

Listing 5.7: C excerpt to illustrate statement closeness.
5.1.4 Generation of Optimized simd Instructions

5.1.4.1 Statement Closeness

An important aspect of the generation of efficient simd code is the usage of packed data. To create such packs, we introduce the notion of closeness between two statements that match the same pattern. It represents the likelihood of forming a perfectly packed operation from those statements. Intuitively, two statements are close if they share the same pattern and each array reference involved in one statement is close to an array reference in the other statement. For instance, in Listing 5.7, statement s1 is closer to s0 than s2 is, because it has more references close to those of s0.
Let $s_0 = \langle p, r^0_0, \ldots, r^0_{n-1} \rangle$ and $s_1 = \langle p, r^1_0, \ldots, r^1_{n-1} \rangle$ be two statements that match the same pattern $p$. The statement closeness $c(s_0, s_1)$ is given by

$$c(s_0, s_1) = \sum_{k=0}^{n-1} \bar{d}(r^0_k, r^1_k)^2$$

where

$$\bar{d}(r^0_i, r^1_i) = \begin{cases} c_{\max} & : r^0_i = r^1_i \\ d(r^0_i, r^1_i) & : \text{otherwise} \end{cases}$$

and $c_{\max} \in \mathbb{N}$ is chosen so that:

$$\forall r^0_i, r^1_j,\; d(r^0_i, r^1_j) \neq \infty \Rightarrow d(r^0_i, r^1_j) < c_{\max}$$

Given a statement $s_o$, a set of statements $\{s_0, \ldots, s_n\}$ that share the same pattern as $s_o$ can be ordered using the following comparison function:

$$\mathrm{cmp}(s_i, s_j) = \begin{cases} -1 & : c(s_o, s_i) < c(s_o, s_j) \\ 0 & : c(s_o, s_i) = c(s_o, s_j) \\ 1 & : c(s_o, s_i) > c(s_o, s_j) \end{cases}$$

which ensures that the statements with the most and closest memory references to $s_o$ are ranked first. This method is used in the "select_closest" function below.
5.1.4.2 Parametric Vector Instruction Generation Algorithm

The algorithm presented in this section generates optimized vector instructions given a register width w, a basic block denoted b and a set of patterns denoted patterns. It is inspired by the preliminary work of François Ferrand [Fra03]. The originality of this algorithm, with respect to the original version of Samuel Larsen and Saman P. Amarasinghe [LA00], lies in the generation of load, store and shuffle operations. It is presented in Algorithm 3 and makes extensive use of the "statement closeness" between two statements that share the same pattern (see Section 5.1.4.1).

A block of statements, b, is processed statement by statement. For each statement s that matches a simd pattern from the input patterns, the statements that can be moved right after it, according to the dependence graph, are extracted by the function extract_no_conflict. Among them, those that match the same pattern are selected by the function extract_isomorphics. They are then ordered using the comparison function introduced in Section 5.1.4.1 and the first w − 1 elements are extracted to form a pack. A set of loaded vectors and stored vectors is then derived from this pack.

Let $v_i$ denote the i-th vector of the pack, that is

$$v_i = \langle r^i_0, \ldots, r^i_{w-1} \rangle$$

If the corresponding memory locations are written, then a store from the vector to the memory locations is unconditionally generated, a binding between the vector and the memory locations is added to live_registers, and all the previous vectors referencing these locations are removed from live_registers. If the memory locations are read, live_registers is scanned for an existing vector that already holds their values in the same order. If there is one, no load is generated and the previous vector is reused. If $\forall r \in v_i, r = r^i_0$, a broadcast operation is generated. Otherwise all permutations of the memory locations are checked: if a binding exists, then a shuffle operation, i.e. an operation that performs a permutation of the register content, is generated instead of a load. In all cases, the association between the vector and the memory references is stored in live_registers.
5.1.5 Pattern Discovery

The vectorization algorithm proposed in Section 5.1.4.2 is only efficient for sequences that contain enough statements to reveal patterns. This is sometimes the case for manually unrolled loops, such as the one found in the linpack benchmark. 3 To obtain more patterns, we successively apply well-known loop vectorization techniques: loop interchange to improve locality, loop tiling to favor data packing, and finally loop unrolling. Data dependences inherited from reductions are removed through expansion of the reduction variables.

5.1.6 Loop Tiling

The loop tiling strategy used for vectorization is rather simple: given a loop nest L of depth n with an innermost loop body B_L, the tiling matrix is chosen as a diagonal matrix,

3. In that case, the unroll rate, 5, is not a power of 2 and offers very poor intra-loop parallelism.
Data: w ← width of vector register
Data: patterns ← set of patterns characterizing the instruction set
Data: b ← list of statements
Result: list of potentially vectorized statements
visited ← ∅;
new_b ← ∅;
live_registers ← ∅;
while b ≠ ∅ do
    s ← head(b);
    if s ∉ visited then
        visited ← visited ∪ {s};
        if match(s, patterns) then
            nconflict ← extract_no_conflict(tail(b), s);
            iso_stats ← extract_isomorphics(nconflict, s);
            if iso_stats ≠ ∅ then
                simd_s ← select_closest(iso_stats, s, w);
                load_s ← gen_load(simd_s, live_registers);
                store_s ← gen_store(simd_s, live_registers);
                update_live_registers(simd_s, live_registers);
                new_b ← new_b ; load_s;
                new_b ← new_b ; simd_s;
                new_b ← new_b ; store_s;
                for s′ ∈ simd_s do
                    visited ← visited ∪ {s′};
                end
            else
                new_b ← new_b ; s;
                update_live_registers(s, live_registers);
            end
        else
            new_b ← new_b ; s;
            update_live_registers(s, live_registers);
        end
    end
    b ← tail(b);
end
Algorithm 3: Parametric vector instruction generation algorithm.
for (it = 0; it 4*it+3)
for (jt1 = 0; jt1 4*jt1+3)
for (i11 = 4*it; i11
Data: w ← width of vector register
Data: prog ← whole program
Result: vectorized program
for f ∈ functions(prog) do
    if_conversion(f) ([AKPW83])
    n_address_code_generation(3, f)
    for l ∈ loops(f) do
        loop_interchange(f, l) (if profitable)
        loop_tiling(f, l) (see § 5.1.6)
    end
    for l ∈ innermost_loops(f) do
        unroll(f, l, w)
    end
    reduction_parallelization(f) (see § 5.2)
    for b ∈ basic_blocks(f) do
        scalar_renaming(f, b) ([ASU86])
        slp(f, b, w) (see § 5.1.4.2)
    end
    dead_code_elimination(f)
    redundant_load_store_elimination(f) (see § 6.3)
end
Algorithm 4: Hybrid vectorization at the pass manager level.
5.2 Reduction Parallelization

A reduction is informally defined as the processing of a data structure of n elements to compute n − 1 or fewer values. For instance the computation of a histogram over a dataset of n elements split into k < n categories is a reduction. Formally, a reduction operation occurs when an associative operator, say ⊗, operates on a variable x as in x = x ⊗ expression and x is not referenced in expression.
Code with reductions is not parallel and is indeed a bottleneck for many scientific applications. For instance [GPZ+01] reported the presence of reductions in several hpc benchmarks and measured that the parallelization of those reductions led to an average speedup of ×2.7 on 16 processors. For that reason, techniques to parallelize reductions have been developed.
Parallelization of the reduction algorithms themselves has been well studied and efficient algorithms are available [KRS90, Lei92]. The challenge is first to detect the reduction, then to parallelize it depending on the hardware features. The former is a well-known subject [JD89, ZC91] and involves the detection of the reduction pattern, a check of the reduction operator properties and a check of the data dependencies with the loop body: reduction parallelization is only valid if the reduction variable is not used outside of the reduction. The latter requires more attention. A common way to parallelize reductions is to place them into a critical section, as proposed in Open Multi Processing (openmp) [Ope11] for non-atomic reductions, or to use atomic versions of the reduction operators when they are available, but this puts all the contention in a single place. To overcome this, parallel prefixes [LF80, Ble89] are generally used. They rely on the associativity of the reduction operator to perform partial reductions in parallel.

However, these generic algorithms do not take advantage of the specific hardware features that may optimize the parallel reduction. In [GPZ+01], María Jesús Garzarán proposed a hardware design that makes it possible to perform reductions efficiently thanks to the delegation of the merging phase to the hardware. Field Programmable Gate Array (fpga) designs for such algorithms exist [Zim97] and take into account the speedup/area ratio. More recently, versions of parallel prefix have been implemented for nVidia Graphical Processing Units (gpus) [SHG08], while the Brook language [BFH+04] provides built-in support for reductions on gpus.

This leads to the idea that performing a reduction efficiently on specific hardware requires taking the hardware specificities into account. However, it is difficult for a compiler to automatically generate an optimized, target-dependent reduction algorithm for non-trivial cases. It is more practical to call a generic routine or use a pre-defined stub instead. Two strategies are explored: Section 5.2.1 details a template-based approach and Section 5.2.2 details how to delegate reduction handling to a third-party function.
5.2.1 Reduction Detection Inside a Sequence

The slp algorithm presented in Section 5.1 only works on sequences. Because reductions introduce data dependencies that prevent vectorization, they need to be removed,
5.2. REDUCTION PARALLELIZATION 107<br />
something typically done in sse using a partial sum vector on for loops. We have extended this approach to process sequences, as shown in Figure 5.5.
int a, b, c, d;
int r = 0;
r += a;
r += b;
r += c;
r += d;

(a) Reduction in a sequence before parallelization.

int a, b, c, d;
// PIPS generated variable
int RED0[4];
int r = 0;
RED0[0] = 0;
RED0[1] = 0;
RED0[2] = 0;
RED0[3] = 0;
RED0[0] = RED0[0] + a;
RED0[1] = RED0[1] + b;
RED0[2] = RED0[2] + c;
RED0[3] = RED0[3] + d;
r = RED0[3] + RED0[2] + RED0[1] + RED0[0] + r;

(b) Reduction in a sequence after parallelization.

Figure 5.5: Parallelizing reductions in a sequence.
To achieve this goal, we first use the reduction analysis presented in [JD89] to perform a semantic detection of reduction statements, which associates to each statement a set of couples ⟨reduction, operator⟩. Once all statements holding a reduction are flagged, these reductions are aggregated at the sequence level to form a set of pairs {⟨⟨reduction_i, operator_i⟩, n_i⟩} where n_i is the number of times reduction_i is performed in the sequence. During the aggregation process, any reduction that is referenced by a non-reduction statement is pruned out. Then for each reduction_i, an array ared_i of n_i elements is created to hold the intermediate values. A prelude fills the array with the neutral value of the reduction operator operator_i, and a postlude performs the reduction using the same operator. They are added before and after the sequence, respectively.

If the statement block is the body of a loop and there is no data dependency between the loop index and the reduction variable, then the prelude and postlude can be moved out of the surrounding loops.
This behavior is shown in Figure 5.6, starting from an extract of the ddot_r function
for (LU_IND0 = 0; LU_IND0
__m128d xres = _mm_setzero_pd();
for (i = 0; i
void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i
5.4. CONCLUSION 111<br />
– an execution time on the accelerator, given as the ratio between the sequential execution time on the host processor, t_h, and the average relative speedup provided by the accelerator, a_th.

This model is based on two assumptions: the data transfer time is a linear function of the amount of data transferred, and the accelerator execution time is proportional to the host execution time. These assumptions are discussed in Section 5.3.2.

The profitability of the offloading can be expressed as Inequality (5.2), which turns Equation (5.1) into Inequality (5.3).

$$t_a(s, \sigma) < t_h(s, \sigma) \quad (5.2)$$

$$\tau_0 + \frac{V(s, \sigma)}{B} < t_h(s, \sigma) \times \frac{a_{th} - 1}{a_{th}} \quad (5.3)$$

τ_0, B, a_th and t_h(s, σ) are parameters that depend on the hardware and host target. On the other hand V(s, σ) is program-dependent and can be computed at compile time. As a consequence, the offloading decision is postponed to runtime. For instance, in the case of a matrix multiplication between two n × n matrices, t_h = O(n³), while V(s, σ) = O(n²); in the absence of constraints on n, an off-line asymptotic decision would unconditionally offload the kernel to an accelerator, whereas a high τ_0 value should prevent the offloading for small values of n.
5.3.2 Limitations of the Model

The model described in this section does not take into account several aspects of data transfers. Data transfer time cannot solely be represented by τ_0 + V(s, σ)/B. In the case of gpu boards, data alignment has a significant impact on performance, and zero-copy mechanisms can be used for data that are read/written only once, but these aspects are ignored. Asynchronous transfers are often used to overlap communications and hide data transfer cost, which makes our approach over-pessimistic. In a similar manner, a succession of kernels can result in redundant data transfers. Our method only takes local information into account.
5.4 Conclusion

In this section, we have focused on Super-word Level Parallelism and proposed an original algorithm that combines the traditional loop-based approach with the more recent sequence-based pattern matching, parametrized by the Multimedia Instruction Set description and without the need for loop re-rolling. This combination makes it possible to discover parallelism outside of loops or in manually unrolled loops, while still benefiting from the research led over the past decades in loop parallelization. Its validity is examined on several linpack kernels in Chapter 7.
In a similar manner, we have extended reduction parallelization to sequences, where it previously only held for loops, which leads to more parallelization opportunities when combined with the slp algorithm. We also propose a methodology to ignore hardware-specific mechanisms for reductions at compilation time.
In the next chapter, we examine one final category of hardware constraints: distributed memory.
Chapter 6<br />
Trans<strong>for</strong>mations <strong>for</strong> Memory Size and<br />
Distribution<br />
Pont de Pacé, Ille-et-Vilaine © Pymouss / Wikipedia
Wm. A. Wulf and Sally A. McKee concluded their article [WM95] “Hitting the Memory Wall: Implications of the Obvious”, published in 1995, with the following sentence:
The most “convenient” resolution <strong>to</strong> the problem would be the discovery of a<br />
cool, dense memory technology whose speed scales with that of processors. We<br />
are not aware of any such technology (. . . ).<br />
Fifteen years later, we are still not aware of any such technology, and memory remains a critical issue for many parallel applications. In the context of heterogeneous computing, where host and accelerator memory spaces are often separate, it is important to handle this hardware constraint with care. To this end, we introduce three generic transformations: statement isolation, which separates the accelerator memory space from the host memory space, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest so that the inner loops fit into the target memory, presented in Section 6.2; and redundant load-store elimination, presented in Section 6.3.
void foo(int i) {
  int j;
  j = i * i; //
where i′ are new identifiers unique to the program, idss is a function that collects all identifiers syntactically used by a given statement and si→i′ is a statement where the identifier i is syntactically changed into i′.

Definition 6.1. The idss : S → P(I) function is defined by the syntactic rules:

idss(;) = ∅
idss(stat0 ; stat1) = idss(stat0) ∪ idss(stat1)
idss({ type id ; stat }) = idst(type) ∪ idss(stat)
idss(ref = expr) = idsr(ref) ∪ idse(expr)
idss(ref = read) = idsr(ref) ∪ {istdin}
idss(write expr) = {istdout} ∪ idse(expr)
idss(f(ref)) = idsr(ref)
idss(if( expr ) { stat0 } else { stat1 }) = idse(expr) ∪ idss(stat0) ∪ idss(stat1)
idss(while( expr ) { stat }) = idse(expr) ∪ idss(stat)

where idst : T → P(I) is given by:

idst(int | float | complex) = ∅
idst(struct id { fields }) = ∅
idst(type [ expr ]) = idst(type) ∪ idse(expr)

where idse : E → P(I) is given by:

idse(cst) = ∅
idse(ref) = idsr(ref)
idse(expr0 op expr1) = idse(expr0) ∪ idse(expr1)

and idsr : R → P(I) is given by:

idsr(id) = {id}
idsr(ref [ expr ]) = idsr(ref) ∪ idse(expr)
idsr(ref . fieldname) = idsr(ref)

Definition 6.2. We denote si→i′ the statement where the identifier i is syntactically changed into i′, where i′ ∈ I \ idss(s) ∧ ∀σ ∈ Σ, σ(i′) = unbound. ei→i′ and ri→i′ have a similar meaning in the context of expressions and references.
A lesser version of Theorem (6.1) is given in Theorem (6.2).

Theorem 6.2. The evaluation of a statement where one identifier has been isolated yields the same memory state as the evaluation of the original statement.
Given a statement s ∈ S, a memory state σ ∈ Σ and i ∈ idss(s) s.t. ∀j ∈ idss(s), I(j) ∉ {I(istdin), I(istdout)},

S({typeof(i) i′ ; i′ = i ; si→i′ ; i = i′ ; }, σ) = S(s, σ)

Theorem (6.1) results from the iterative application of Theorem (6.2) on all variables referenced by s. The remainder of the section is dedicated to the proof of Theorem (6.2).
6.1.1.1 Expression Renaming<br />
Lemma 6.3. The evaluation of an expression e ∈ E in state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idse(e), ∀i′ ∉ idse(e) s.t. typeof(i′) = typeof(i),

E(e, σ) = E(ei→i′, σ[I(i′) → σ(I(i))])
Proof. Let us prove this lemma by induction on the syntactic elements of the expression domain. Let e be an expression, and σ a memory state. We choose i ∈ idse(e) and i′ ∉ idse(e), s.t. typeof(i) = typeof(i′).

Constants If e = cst, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(cst i→i′, σ[I(i′) → σ(I(i))])
= E(cst, σ[I(i′) → σ(I(i))])
= cst
= E(e, σ)
Identifiers If e = id, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(id i→i′, σ[I(i′) → σ(I(i))])
if i = id,

= E(i′, σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(I(i′))
= σ(I(i))
= E(e, σ)

otherwise we have id i→i′ = id and

= E(id, σ[I(i′) → σ(I(i))])
= E(id, σ)

which terminates the induction proof for the initial elements of E.

References We now consider non-initial elements. If e = ref . fieldname, we have:

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(ref i→i′ . fieldname, σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))])).fieldname
= σ(R(ref, σ)).fieldname   from induction hypothesis
= σ(R(ref, σ).fieldname)
= E(e, σ)
If e = ref [ expr ], we have:

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(ref i→i′ [ expr i→i′ ], σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))]))[E(expr i→i′, σ[I(i′) → σ(I(i))])]
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))]))[E(expr, σ)]   from induction hypothesis
= (σ(R(ref, σ)))[E(expr, σ)]   from induction hypothesis
= E(e, σ)
Arithmetic Operations In the case of e = expr0 op expr1, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(expr0 i→i′ op expr1 i→i′, σ[I(i′) → σ(I(i))])
= E(expr0 i→i′, σ[I(i′) → σ(I(i))]) op E(expr1 i→i′, σ[I(i′) → σ(I(i))])
= E(expr0, σ) op E(expr1, σ)   from induction hypothesis
= E(e, σ)
6.1.1.2 Type Renaming<br />
We state a similar lemma for type evaluation:

Lemma 6.4. The evaluation of a type t ∈ T in state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idst(t), ∀i′ ∉ idst(t) s.t. typeof(i) = typeof(i′),

T(t, σ) = T(ti→i′, σ[I(i′) → σ(I(i))])
Proof. We use an induction proof on the type definition.

Scalar Types The equality is direct for int, float and complex, which are independent from the memory state and unchanged by the renaming rule.

Structures In the case of t = struct id { fields }, we have

T(ti→i′, σ[I(i′) → σ(I(i))])
= ∏⟨f,i⟩∈fields T(fi→i′, σ[I(i′) → σ(I(i))])
= ∏⟨f,i⟩∈fields T(f, σ)   from induction hypothesis
= T(t, σ)

Arrays In the case of t = type [ expr ], we get

T(ti→i′, σ[I(i′) → σ(I(i))])
= T(type i→i′, σ[I(i′) → σ(I(i))]) × E(expr i→i′, σ[I(i′) → σ(I(i))])
= T(type, σ) × E(expr, σ)   from induction hypothesis and Lemma 6.3
= T(t, σ)
6.1.1.3 Statement Renaming

Lemma (6.3) and Lemma (6.4) can be extended to the statement domain:

Lemma 6.5. The evaluation of a statement s ∈ S in memory state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idss(s) s.t. I(i) ∉ {I(istdin), I(istdout)}, ∀i′ ∉ idss(s) s.t. typeof(i) = typeof(i′),

S(si→i′, σ[I(i′) → σ(I(i))]) = S(s, σ)[I(i) → σ(I(i)), I(i′) → (S(s, σ))(I(i))]
We were not able to prove this lemma formally. Informally, it describes the result of the evaluation of a statement where variable i is syntactically changed into i′, in a memory state where each memory location associated to i′ holds the value of the corresponding memory location of i. It states that this new memory state is the same as the one resulting from the evaluation of the initial statement in the initial memory state, with the memory locations associated to i unchanged, and the memory locations associated to i′ holding the updated values.
6.1.1.4 Restricted Statement Isolation

We can now prove Theorem (6.2).

Proof.

S({typeof(i) i′ ; i′ = i ; si→i′ ; i = i′ ; }, σ)
= unbind(S(i′ = i ; si→i′ ; i = i′, σ′), i′)   with σ′ = loc(i′, T(typeof(i), σ), σ)
= unbind(S(i = i′, S(si→i′, S(i′ = i, σ′))), i′)
= unbind(S(i = i′, S(si→i′, σ′[I(i′) → σ′(I(i))])), i′)

We can apply Lemma 6.5 to the evaluation of si→i′ to get

= unbind(S(i = i′, S(s, σ′)[I(i) → σ′(I(i)), I(i′) → (S(s, σ′))(I(i))]), i′)
= unbind(S(s, σ′)[I(i) → σ′(I(i)), I(i′) → (S(s, σ′))(I(i))][I(i) → (S(s, σ′))(I(i))], i′)
= unbind(S(s, σ′)[I(i′) → (S(s, σ′))(I(i)), I(i) → (S(s, σ′))(I(i))], i′)
= S(s, σ)
6.1.2 Statement Isolation and Convex Array Regions<br />
The application of Theorem (6.1) leads to correct but very inefficient code. Indeed, any variable referenced in the isolated statement is transferred back and forth, even if it is only used, or only defined, by the statement. In a similar manner, arrays are transferred as a whole, while only a sub-array may be needed. This section studies the interaction between convex array regions and statement isolation, using the former to reduce the data transfers generated by the latter. 1
In a nutshell, the approach is similar to using Theorem 6.1 for all variables, then calling an enhanced version of dead code elimination to remove all the “dead” data transfers.
Given a statement s, it is possible to compute an estimate of the array regions imported or exported by this statement for each array reference r referenced by s. These regions are denoted Ri(s, σ)[r] and Ro(s, σ)[r], respectively. Depending on the accuracy of the analysis, these regions are either exact, denoted R=, or over-estimated. There is a strong relationship between these array regions and the data to be transferred. Considering a statement s ∈ S:

Transfers from the accelerator All data that may be exported by s must be copied back to the host from the accelerator:

TH←A : S × Σ → P(R) = (s, σ) ↦ Ro(s, σ)   (6.1)

Transfers to the accelerator All data that may be imported by s must be copied from the host to the accelerator:

TH→A : S × Σ → P(R) = (s, σ) ↦ Ri(s, σ)

Indeed, all data for which we have no guarantee of a preliminary write by s must be copied in. Otherwise, uninitialized data may be transferred back to the host. So the extended formula is:

TH→A : S × Σ → P(R) = (s, σ) ↦ Ri(s, σ) ∪ (Ro(s, σ) − R=o(s, σ))   (6.2)
Based on Equations (6.1) and (6.2), it is possible to allocate new variables on the accelerator, to generate copy operations from the old variables to the newly allocated ones and to perform the required change of frame on s. Listing 6.2 illustrates this transformation on the running example from Listing 5.9. It presents the variable replacement, the data allocation and the 2D data transfers. Thanks to region analysis, in0 is not copied out and out0 is not copied in. The generated data transfers are target-independent and their implementation is specialized depending on the targeted accelerator.
1. Statement isolation can also be used <strong>to</strong> generate thread local s<strong>to</strong>rage, or improve cache behavior.
void erode(int n, int m, int in[n][m], int out[n][m]) {
  int (*out0)[n][m] = 0, (*in0)[n][m+1] = 0;
  P4A_accel_malloc((void **) &in0, sizeof(int)*n*(m+1));
  P4A_accel_malloc((void **) &out0, sizeof(int)*n*m);
  P4A_copy_to_accel_2d(sizeof(int), n, m, n, m+1, 0, 0, &in[0][0], *in0);
  P4A_copy_to_accel_2d(sizeof(int), n, m, n, m, 0, 0, &out[0][0], *out0);
  for (int i = 0; i
To compute Vl we gather all the identifiers syntactically found in s and sum up their type sizes.

Vl : S → E = s ↦ Σ i∈decls(s) sizeof(i)

where decls : S → P(I) is a function similar to idss : S → P(I) that collects all identifiers declared in a statement.
This approach would be too naive for variables declared outside of s, as it includes all array elements, even those that are never written or read. Convex array regions are useful here, and we can use the formulae:
Vo : S × Σ → E = (s, σ) ↦ |Rr(s, σ) ∪ Rw(s, σ)|

where the cardinal operator counts the number of elements in the resulting set. It can be split in a per-variable form:

Vo(s, σ) = Σ i∈decls(s) |Rr(s, σ)[i] ∪ Rw(s, σ)[i]|

where the [·] operator selects all the references prefixed by the given identifier. If we compute the convex hull of the region union, it is possible to count its cardinal symbolically using Ehrhart polynomials [Cla96].

Vo(s, σ) ≤ Σ i∈decls(s) |Rr(s, σ)[i] ∪̄ Rw(s, σ)[i]|

where · ∪̄ · is the convex union.
Because arrays in C have rectangular shapes 3, it is more realistic to consider the rectangular hull of the regions, which leads to Equation 6.3.

V(s, σ) ≤ Vl(s) + Σ i∈decls(s) |⌈Rr(s, σ)[i] ∪̄ Rw(s, σ)[i]⌉|   (6.3)
where ⌈·⌉ is the rectangular hull. When s is surrounded by loops and the above expression<br />
depends on the loop indices, it is possible <strong>to</strong> trans<strong>for</strong>m these loops <strong>to</strong> change the memory<br />
footprint.<br />
6.2.2 Symbolic Rectangular Tiling

Let us consider a perfectly nested loop s of depth n. In order to work out the tiling parameters, a two-step process is used: n symbolic values denoted p1, . . . , pn are introduced to represent the computational blocks, and a symbolic tiling, parameterized by these values, is performed. It generates n outer loops and n inner loops. The statement carrying the inner loops is denoted sinner and the memory state before its execution is denoted σinner.

2. … compute their exact value. It is however possible to compute an over-estimation of this volume thanks to statement preconditions, but this dissertation does not dive into these details.
3. It is possible to allocate non-rectangular convex shapes using pointer arrays. . .
The idea is to run the inner loops on the accelerator once the pk are chosen so that the memory footprint of sinner does not exceed a threshold defined by the hardware. To this end, the memory footprint V(sinner, σinner) is computed and one of the solutions satisfying Condition (6.4) is searched for.

V(sinner, σinner) ≤ Vmaxa   (6.4)

Vmaxa is the memory size of the considered accelerator a. This gives one inequality over the pk. Other constraints are derived from the accelerator model specified in Section 2.1: e.g. a vector accelerator requires p1 to be set to the vector size. The algorithm is given in a synthetic form in Algorithm 5:
Data: ln ← a perfect loop nest of depth n
Data: Vmax a maximum memory footprint
Data: c an additional system of linear inequalities
Result: a statement that matches c and the volume constraint
l2n ← rectangular_tiling(ln, ⟨x1, . . . , xn⟩);
l′ ← inner_loop(l2n, n);
p ← memory_footprint(l′);
⟨p1, . . . , pn⟩ ← solve(c ∧ (p ≤ Vmax), ⟨x1, . . . , xn⟩);
return int x1 = p1 ; . . . ; int xn = pn ; l2n;
Algorithm 5: Memory footprint reduction algorithm.
Listing 6.3 shows the effect of symbolic tiling and the resulting array region analysis on the running example. As a result, the memory footprint of sinner is given as a function of p1, p2 in Equation (6.5).

V(sinner, σinner) = 2 × p1 × p2   (6.5)

For terapix, the constraint system is

x1 ≤ 128
2 × x2 ≤ 1024

and the tuple ⟨128, 512⟩ is the maximal solution.
6.3 Redundant Load Store Optimization
At every parallelism level, be it the node, cpu or instruction level, data transfers are often the performance bottleneck. The time spent transferring data does not contribute directly to the computation. There are two complementary approaches to limit this loss:
void erode(int n, int m, int in[n][m], int out[n][m]) {
  int p_1, p_2;
  for (int it = 0; it
We also introduce a function that checks whether two statements satisfy Bernstein's conditions [Ber66], B : S × S → {true, false}.
Characterizations of Direct Memory Access (dma) are used in the form of load and store statements.
Definition 6.3. A statement s ∈ S in memory state σ ∈ Σ is a dma statement if it verifies the following properties:
1. s is a function call;
2. s writes a single convex array region: ∃i ∈ I s.t. Rw(σ, s) = {i[φ0, . . . , φk]}

A dma statement is a function call statement that writes data to a single location. As such, the assignment operator “=” is a form of dma. loads and stores are distinguished by their name; the Internal Representation (ir) does not distinguish between host and remote memory.
Given a dma statement, we define its reciprocal as follows.

Definition 6.4. The reciprocal of a dma statement d is a statement denoted d−1 that verifies the following property:

∀σ ∈ Σ, ∀l ∈ (L \ R(Rw(σ, d), σ)), S(d ; d−1, σ)(l) = S(d, σ)(l)

For instance, the statement denoted by “memcpy(a,b,10*sizeof(int));” is a dma and its reciprocal is denoted by “memcpy(b,a,10*sizeof(int));”. The idea is that in the sequence memcpy(a,b,10*sizeof(int)); memcpy(b,a,10*sizeof(int));, the second call is useless.
6.3.1 Redundant Load Elimination

The algorithm used to move load statements upward is based on a simple idea: step by step, move load operations upward in the hcfg so that they are executed as soon as possible. Combined with the redundant store elimination transformation described in Section 6.3.2, it can lead to two optimizations:
– Move load operations outside of loops, leading to an optimization related to invariant code motion;
– Remove load and store operations when they meet.
The next sections define the legality conditions for moving a statement in the three most common control flow constructs (sequences, tests and loops) and how this can be done interprocedurally.
6.3.1.1 Sequences<br />
Let us consider a statement sequence where sl is a load statement:

s = s0 ; sl
Bernstein's conditions give us a condition under which it is valid to swap them, as shown in Equation (6.6).

Rl(s) = { sl ; s0   if Bern(s0, sl)
        { s         otherwise          (6.6)
6.3.1.2 Tests

Let us consider a branch statement:

s = if(ec) { s0 ; st } else { s1 ; sf }

Depending on the nature of s0 and s1, it may be possible and profitable to move them before the condition. If s0 and s1 are similar, there is an opportunity to merge both statements into a single one.
Let σc denote the memory state after the evaluation of ec. Both s0 and s1 are evaluated in the same memory state σc. If they are both load statements, satisfy Bernstein's conditions and are textually equal, it is possible to move them upward as a single statement, as summarized by Equation (6.7).

Rl(s) = { s0 ; if(ec) { st } else { sf }   if Bern(s0, ec) ∧ s0 =t s1
        { s                                otherwise                    (6.7)

where =t denotes textual equality. Similarly, if only s0 or s1 is a load statement, and it satisfies Bernstein's conditions, then it can be moved outside the test.
6.3.1.3 Loops

Let us consider a loop statement:

s = do { sl ; s0 } while(ec);

A sufficient condition to move sl out of the loop is that sl satisfies Bernstein's conditions with s0 and ec, and is idempotent, leading to Equation (6.8).

Rl(s) = { sl ; do { s0 } while(ec)   if Bern(s0, sl) ∧ Bern(sl, ec) ∧ S(sl ; sl) = S(sl)
        { s                          otherwise                                          (6.8)
Proof. We use a recursive proof on the number of iterations of loop s. Let s^n denote s when the loop body executes n times. The property to prove is that Equation (6.8) holds ∀s^n, n ∈ N∗.
For n = 1,

s^1 = sl ; s0 ; ec
    = Rl(s^1)
Assume the property is true for n ∈ N∗:

s^{n+1} = do { sl ; s0 } while(ec)
        = sl ; s0 ; ec ; s^n
        = sl ; s0 ; ec ; Rl(s^n)   from induction hypothesis
        = sl ; s0 ; ec ; sl ; do { s0 } while(ec)   from definition
        = sl ; sl ; s0 ; ec ; do { s0 } while(ec)   from Bern(sl, s0) ∧ Bern(sl, ec)
        = sl ; s0 ; ec ; do { s0 } while(ec)   since sl is idempotent
        = Rl(s^{n+1})
6.3.1.4 Interprocedurally<br />
As a result of moving load statements upward in the hcfg, a load can end up at the entry point of a function. In that case it may be interesting to move the load to the call sites. To do so, one must first ensure that the memory state before the call site is the same as the memory state at the function entry point. This is the case if there is no write effect on the function parameters. In that situation, the load statement can be moved before the call site after backward translation from formal parameters to effective parameters.
6.3.2 Redundant Store Elimination

This section describes the conditions for moving store statements downward in the hcfg. The equations are similar to redundant load elimination's.

6.3.3 Sequences

This problem is quite similar to its load counterpart from Section 6.3.1.1.

s = ss ; s0

Bernstein's conditions give us a condition under which it is valid to swap them, as shown in Equation (6.9).

Rs(s) = { s0 ; ss   if Bern(s0, ss)
        { s         otherwise          (6.9)

6.3.4 Tests

Let us consider a branch statement:
s = if (ec) { st ; s0 } else { sf ; s1 }

We get an equation that mirrors Equation (6.7), except for the condition over ec.

Rs(s) = { if (ec) { st } else { sf } ; s0   if s0 =t s1
        { s                                 otherwise     (6.10)

6.3.5 Loops

Let us consider a loop statement:

s = do { s0 ; ss } while (ec)

The store version is given by Equation (6.11).
Rs(s) = { do { s0 } while (ec) ; ss   if Bern(ss, s0) ∧ Bern(ss, ec) ∧ S(ss ; ss) = S(ss)
        { s                           otherwise                                            (6.11)

The proof follows the same idea as for Equation (6.8).
6.3.6 Interprocedurally

If the same store statement is found at each exit point of a function, it may be possible to move it past its call sites. To do so, one must ensure that the store statement only depends on formal parameters and that these parameters are not written by the function. If this is the case, the store statement can be removed from the function body and added after each call site, after backward parameter translation.
6.3.7 Combining Load and Store Elimination

This section examines the interaction between loads and stores in two situations: in a sequence, when a load is followed by a store, and in loops, when the loop body is surrounded by a load and a store. These two situations may be produced by the upward motion of dma statements in the hcfg.

6.3.7.1 Sequence

Let us consider a simple sequence of two statements:

s = s0 ; s1
By definition, if s0 is a dma and s1 its reciprocal, then we have:
s = s0 ; s0−1
  = s0

which eliminates the second call and may make it possible to continue the upward propagation.
6.3.7.2 Loops

Let us consider a loop statement whose body is surrounded by dma calls:

s = do { sl ; s0 ; ss } while (ec)

It can be rewritten into Equation (6.12)

R(s) = sl ; do { s0 } while (ec) ; ss   (6.12)

under the following conditions:

sl = ss−1   (6.13)
Bern(ss, s0)   (6.14)
Bern(ss, ec)   (6.15)
Proof. We use a recursive proof on the number of iterations of loop s. Let s^n denote s when the loop body executes n times.
Equation (6.12) is true when n = 1:

s^1 = sl ; s0 ; ss ; ec
    = sl ; s0 ; ec ; ss   from hypothesis 6.15
    = R(s^1)

Let us assume it is true if the loop iterates n times. In that case a loop that iterates n + 1 times can be decomposed as follows:

s^{n+1} = s^n ; sl ; s0 ; ss ; ec
        = sl ; do { s0 } while (ec) ; ss ; sl ; s0 ; ss ; ec   from the recursion hypothesis
        = sl ; do { s0 } while (ec) ; ss ; s0 ; ss ; ec   from hypothesis 6.13
        = sl ; do { s0 } while (ec) ; s0 ; ss ; ec   from hypothesis 6.14
        = sl ; do { s0 } while (ec) ; s0 ; ec ; ss   from hypothesis 6.15
        = R(s^{n+1})
6.3.8 Main Algorithm

Iteratively applying redundant load elimination, redundant store elimination and load-store combination may lead to fewer data communications. This process is detailed in Algorithm 6.

Data: p ← a program
repeat
    p′ ← p;
    p ← redundant_load_elimination(p);
    p ← redundant_store_elimination(p);
    p ← combine_load_store(p);
    p ← dead_code_elimination(p);
until p = p′;
Algorithm 6: Redundant load store elimination algorithm at the pass manager level.
Listing 6.4 illustrates the result of this algorithm on an example taken from the Paralléliseur Interprocédural de Programmes Scientifiques (pips) validation suite. It demonstrates the interprocedural elimination of data communications represented by the load and store functions. These functions are first moved outside of the loop, then outside of the function a; then redundant loads are eliminated.
6.4 Conclusion<br />
In this chapter, we have presented and proved Theorem (6.1) to completely isolate a statement from its original memory. This transformation is the basic building block for many transformations related to heterogeneous computing, as such targets usually use a separate memory space.
The generated data transfers are not optimized globally. Hence we have proposed Algorithm 6 to iteratively merge these transfers in order to suppress redundant ones. This algorithm is independent from the previous one and also works with the dma operations generated by Algorithm 3.
We have also developed Algorithm 5 to take into account the limited memory size of the targeted hardware, based on loop tiling and memory footprint estimation.
The experiments related to the usage of these transformations are presented with the compiler implementations in Chapter 7.
void a(int i, int j[2], int k[2]) {
  while (i-- >= 0) {
    load(k, j); //
Chapter 7<br />
Compiler Implementations and<br />
Experiments<br />
Pont de Bruz, Ille-et-Vilaine © Pymouss / Wikipedia
This thesis introduces and describes a methodology to customize compilers for different heterogeneous platforms, building on a rich toolbox of source-to-source transformations, a programmable pass-manager Application Programming Interface (api) and a simple hardware description. It would not be complete without an experimental validation.
The methodology claims to make it easier to assemble compilers. To validate it, we have chosen five different targets: three general purpose Central Processing Units (cpus) with different vector instruction units, a Field Programmable Gate Array (fpga)-based image processor [BLE+08] and an nVidia Graphical Processing Unit (gpu). For each of them, we have developed a compiler prototype using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these research compilers is measured using benchmarks or applications from the relevant domain.
This chapter begins with a simple Open Multi Processing (openmp) directive genera<strong>to</strong>r<br />
in Section 7.1 <strong>to</strong> show how <strong>to</strong> apply the principles discussed in this thesis <strong>to</strong> a simple, yet<br />
real, example. The compiler <strong>for</strong> gpus implemented by hpc project based on our work is<br />
detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from C <strong>to</strong> terasm, the<br />
assembly language <strong>for</strong> the terapix image processor. Finally, a retargetable compiler <strong>for</strong><br />
Multimedia Instruction Set (mis) is described in Section 7.4 <strong>for</strong> three targets: Streaming<br />
simd Extension (sse), Advanced Vec<strong>to</strong>r eXtensions (avx) and neon.<br />
[Feature diagram: a Multicore device with the features memory (ram, shared), isa, Acceleration and Parallelism (mimd); mimd parallelism is the only optional feature, the others are mandatory.]
Figure 7.1: Multicore hardware feature diagram.
7.1 A Simple OpenMP Compiler
The goal of this section is to illustrate the ideas developed in this thesis on a simple example: a multicore machine.
7.1.1 Architecture Description
The first step is to list the hardware constraints of the target machine. The (simple) hardware feature diagram is given in Figure 7.1. The only constraint is Multiple Instruction stream, Multiple Data stream (mimd) parallelism, and it is optional. As a consequence, the only required transformation is mimd parallelism detection/extraction. Optional features and optimizations are not taken into account.
7.1.2 Compiler Implementation
The input language is C and the output language is C with openmp directives. As directives can be represented in the Internal Representation (ir), no post-processor is needed. We thus have a very classical source-to-source compilation flow, detailed in Figure 7.2.
Algorithm 7 is used by the source-to-source compiler. It involves privatization, parallelism detection, reduction detection and directive generation. Additionally, loop fusion is used to improve locality. If parallelism detection fails, the loops are distributed using the Allen & Kennedy algorithm [AK87], and the detection is tried again.
For reference, the Pythonic PIPS (pyps) script executed by the pass manager is given in Listing 7.1.
[Diagram: Sequential Code → Translator → Sequential Code + directives → openmp Compiler → Binary.]
Figure 7.2: Source-to-source compilation scheme for openmp.
Data: s ← a statement
Result: a statement with openmp directives
s ← loop_fusion(s);
privatization(s);
reduction_detection(s);
if parallelism_detection(s) then
    s ← directive_generation(s);
else
    s ← loop_distribution(s);
    if parallelism_detection(s) then
        s ← directive_generation(s);
    end
end
return s
Algorithm 7: Parallel loop generation algorithm for openmp.
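To make the target of Algorithm 7 concrete, here is a hand-written sketch of the kind of annotated C it aims to produce on a dot-product kernel. The directive placement is an illustration of privatization and reduction detection, not verbatim PIPS output.

```c
#include <stddef.h>

/* Hypothetical output of Algorithm 7 on a dot product: the scalar
   `tmp` is privatized (declared inside the loop body) and the
   accumulation into `sum` is recognized as a reduction, so the
   loop can carry an openmp directive. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        double tmp = x[i] * y[i];  /* privatized scalar */
        sum += tmp;                /* detected reduction */
    }
    return sum;
}

/* Small sequential driver used to check the kernel. */
double dot_demo(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    return dot(x, y, 3);           /* 1*4 + 2*5 + 3*6 = 32 */
}
```

Without the `-fopenmp` flag the pragma is ignored and the code still compiles and runs sequentially, which is precisely why the directive-based output format is convenient.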
def openmp(m, verbose=False, **props):
    """ parallelize function with openmp """
    ...  # initialization stuff
    m.loop_fusion()
    # some analyses perform better after this
    m.split_initializations(**props)
    # privatize scalar variables
    m.privatize_module(**props)
    # try coarse grain openmp parallelization
    # (custom functions)
    try:
        m.coarse_grain_parallelization(**props)
    except:
        m.internalize_parallel_code(**props)
    # directive generation
    m.ompify_code(**props)
    m.omp_merge_pragma(**props)
    # eventually print the resulting code
    if verbose:
        m.display(**props)
Listing 7.1: Original PyPS script for openmp code generation.
CFLAGS += -fopenmp
LIBS += -fopenmp
## pipsrules ##
Listing 7.2: Makefile stub for openmp compilation.
       translator   post-processor   maker   #passes involved
SLOC   41           0                2       8
Table 7.1: sloccount report for an openmp directive generator prototype written in pyps.
The build process is mostly unchanged, except that an additional flag is needed to tell the compiler to interpret openmp directives. The makefile stub is given in Listing 7.2. No additional rules are provided, but the compiler and linker flags are changed.
7.1.3 Experiments & Validation
The aim of this section is not to build an efficient openmp code generator, but to provide a sample example as an introduction to the next sections. As a consequence, we do not focus on getting impressive speedups on real-world applications, but rather on giving evidence that a compiler prototype can achieve reasonable results in spite of the little amount of work dedicated to its construction.
The benchmark suite used is polybench. Although this benchmark is intended for testing polyhedral transformations, it contains numerous kernels that are easily automatically parallelized, so they do not stress our naïve implementation, while showing the relevancy of the approach. Figure 7.3 shows the speedup of the accelerated version, measured as the median over 100 executions with the default benchmark sizes.
The reference timings are obtained on a laptop running a 2.6.38-2-686 GNU/Linux kernel. It has an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz (2 cores). The code is compiled with the gnu C Compiler (gcc) version 4.6.1 and the -O3 -ffast-math flags. The accelerated code is obtained by running Algorithm 7 on each function of the program. It is compiled with the same compiler and the same flags, plus the openmp flag.
To obtain this result, we have directly scripted the compiler with pyps. This script is a good illustration of pyps flexibility. It is reproduced in Appendix C.
For each compiler we have implemented, we issue a small report that states the number of SLOC for the source-to-source compiler, the post-processor and the maker. We also compute the number of passes and analyses involved in the whole compilation process. The result for the openmp prototype is given in Table 7.1. It shows that the prototype is simple and really capitalizes on existing transformations. The assembling is done in a very lightweight way.
[Bar chart: relative execution time of the generated OpenMP code (y-axis from 0 to 2.5) for each polybench kernel: lu, covariance, correlation, jacobi-2d-imper, fdtd-2d, jacobi-1d-imper, adi, seidel, fdtd-apml, gauss-filter, reg-detect, durbin, symm, symm.exp, gemm, mvt, bicg, trisolv, trmm, 2mm, syrk, gemver, cholesky, atax, syr2k, doitgen, gesummv, 3mm, ludcmp, dynprog, gramschmidt.]
Figure 7.3: Performance of an openmp directive generator prototype on the polybench benchmark.
7.2 A GPU Compiler
This section describes a prototype compiler for machines with nVidia gpus. It is not an optimizing compiler: it does not take advantage of some hardware capabilities, but it still generates speedups for computationally intensive applications.
7.2.1 Architecture Description
A gpu is an accelerator that does not share a memory space with its host. It couples mimd parallelism at coarse grain with Single Instruction stream, Multiple Data stream (simd) parallelism at fine grain. Two characteristics are important: a huge number of cores that provide an important theoretical speedup, and a constrained memory: a shared memory of limited size, and a low transfer rate between the gpu and the cpu compared with the transfer rate between the cpu and main memory, not to mention that coalesced accesses are critical to reach high throughput.
An nVidia gpu board is a set of multiprocessors. For instance, a GTX580 board has 16 multiprocessors containing up to 32 thread processors each, that is 512 Compute Unified Device Architecture (cuda) cores as a whole. It can run a maximum of 1536 threads per multiprocessor. The blocks of threads are scheduled transparently by the hardware, which assumes each multiprocessor is independent.
The memory hierarchy is quite complex:
– gpus have registers for fast 1-cycle thread-local access, so they are to be privileged for computations. Unfortunately, with only 32 KB of registers per multiprocessor and thousands of threads on an nVidia GTX580, this is a scarce resource;
– each thread has a local memory;
– the shared memory is local to a multiprocessor and global for each thread of the multiprocessor. It is 16 or 48 KB large but is accessed as fast as registers. It is often used as a scratchpad memory to drastically increase performance;
– the extended memory, called global memory (typically 1 to 6 GB), can be accessed by each thread but with a far bigger latency (800–1000 cycles). Accesses must be coalesced by the use of a large number of threads per block;
– the texture cache uses a small portion of the global memory. With a 50-cycle access time, it must be privileged over the global memory when possible. It is read-only, and coherence with the global memory is not ensured; moreover, it is only 8 KB per multiprocessor;
– the 64 KB constant memory.
The different memory levels and the Processing Element (pe) layout are visible on the Fermi chip shown in Figure 7.4. They are synthesized in the form of a hardware feature diagram in Figure 7.5. In this diagram, many features are flagged as optional. This allows an incremental compiler development: first, mandatory constraints are taken into account; then, optional constraints are integrated to enhance performance. In this thesis, we focus on building a prototype that works for the mandatory features. Building a compiler that takes into account all gpu features is a PhD subject on its own!
Figure 7.4: nVidia Fermi architecture.<br />
[Feature diagram: a gpu device with the features memory (rom, ram; distributed, shared), isa, Acceleration and Parallelism (simd, mimd); most features are optional, few are mandatory.]
Figure 7.5: gpu hardware feature diagram.
7.2.2 Compiler Implementation
The hardware feature diagram of a gpu has some similarities with the hardware feature diagram of terapix (see Figure 7.10): both have their own memory and benefit from simd acceleration. As a consequence, the two compilers roughly use the same algorithm to turn the input code into a host part and an accelerator part. In a similar manner, Direct Memory Access (dma) generation uses the same analyses, although the api naturally differs.
The main difference lies in the kernel Instruction Set Architecture (isa). Firstly, unlike for terapix, a compiler from a C++ dialect, cuda, to the gpu isa, Parallel Thread eXecution (ptx), already exists and takes care of low-level transformations. However, some constraints, such as the lack of support for variable-length arrays or the normalization of the iteration spaces, require additional transformations.
Secondly, cuda extends the C89 syntax with function qualifiers (e.g. __global__) and kernel calls, using the triple chevron syntax (<<< >>>). We do not extend the Paralléliseur Interprocédural de Programmes Scientifiques (pips) ir to cover these extensions but use a macro-based compatibility header.
The compilation scheme for gpu code generation is given in Figure 7.6. The new parts are the compatibility headers, the source-to-source compiler and the cuda translator. Other modules are simply reused.
Compatibility header A compatibility header is a set of macro functions that performs a translation from C syntax to cuda syntax. For instance, the return type of a kernel can be written __global__void, a name that is correct C, and that is defined as #define __global__void __global__ void in the compatibility header. 1
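The trick can be sketched as follows. The `KERNEL_CALL` macro and the sequential fallback branch are illustrative assumptions, not the actual header shipped with the prototype; only the `__global__void` definition is taken from the text. The point is that the same generated source is valid cuda under nvcc and valid plain C under a host compiler.

```c
/* Sketch of a C/cuda compatibility header (hypothetical names).
   Under nvcc (__CUDACC__ defined) the macros expand to real cuda
   syntax; under a host C compiler they vanish, so the generated
   code remains compilable and testable sequentially. */
#ifdef __CUDACC__
#define __global__void __global__ void
#define KERNEL_CALL(k, grid, block, args) k<<<grid, block>>> args
#else
#define __global__void void
#define KERNEL_CALL(k, grid, block, args) k args /* sequential fallback */
#endif

__global__void add1(int *p) { *p += 1; }

/* Host-side code written once against the macros. */
int launch(int v) {
    KERNEL_CALL(add1, 1, 1, (&v));
    return v;
}
```

Compiled as plain C, `KERNEL_CALL(add1, 1, 1, (&v))` degenerates to an ordinary call `add1(&v)`, which is exactly what makes the generated code debuggable without a gpu.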
Initial translator The initial translator takes a sequential code, detects parallel loops and splits each of them into two parts: a sequential C code that contains a call to a kernel, and a sequential code that embodies the kernel. A loop proxy code applies the kernel to each element of the loop iteration space. This loop proxy code is needed for the sequential semantics but appears neither in the final code nor in Figure 7.6. Let us take a simple example, the sum of two arrays: the two arrays are initialized, then a loop over the array elements performs the addition and stores the result in a third array. The initial translator splits this code into three parts: the host code that performs the initializations and calls a kernel; the loop proxy code that iterates over all the elements and calls the kernel code for each; the kernel code that performs the addition. Figure 7.7 illustrates this structure.
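For the array-sum example, the three-way split can be sketched in plain C as follows. The function names are illustrative; in the final cuda code the loop proxy disappears, its iteration space being carried by the thread grid.

```c
#include <stddef.h>

/* Kernel: one point of the iteration space. */
static void kernel_add(const double *in0, const double *in1,
                       double *out, size_t i) {
    out[i] = in0[i] + in1[i];
}

/* Loop proxy: applies the kernel to every point; kept only for
   the sequential semantics, absent from the generated cuda. */
static void proxy_add(const double *in0, const double *in1,
                      double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        kernel_add(in0, in1, out, i);
}

/* Host side: initializations followed by a single kernel "call";
   statement isolation later inserts the data transfers here. */
void host_add(double *in0, double *in1, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) { in0[i] = (double)i; in1[i] = 1.0; }
    proxy_add(in0, in1, out, n);
}

/* Check: after host_add, out[i] == i + 1. */
double host_demo(void) {
    double in0[4], in1[4], out[4];
    host_add(in0, in1, out, 4);
    return out[3];
}
```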
Once the code is split into host and kernel parts, statement isolation is used to generate data transfers in the host part.
cuda translator The cuda dialect is very close to the C language. The cuda translator takes care of the following syntactic changes: convert variable-length arrays to pointers
1. Sometimes, the C preprocessor is not sufficient. In that case, regular expressions or (better) C++ templates with type inference can come in handy. The reader may doubt the relevancy of using third-party tools instead of a large and monolithic ir. However, we are confident that using a wide range of specialized tools is more flexible.
[Diagram: the sequential code goes through the initial translator, producing (i) a sequential C code with a kernel call that, preprocessed with the compatibility header, becomes cuda host code compiled by the cuda compiler into the host binary, and (ii) a sequential C code embodying the kernel, which the cuda translator turns into C code plus the compatibility layer, preprocessed in turn and compiled as the cuda kernel code.]
Figure 7.6: Source-to-source compilation scheme for gpu.
double in0[n], in1[n], out[n];
for (int i = 0; i < n; i++)
    out[i] = in0[i] + in1[i];
       translator   post-processor   maker   #passes involved
SLOC   3836         0                N/A     20
Table 7.2: sloccount report for a cuda generator prototype written in pyps.
using array linearization, normalize the iteration space using loop normalization, make sure no additional iterations are performed using iteration clamping, convert C99 complex types to cuda complex types and finally take care of the cuda-specific syntax. Language differences are handled by the compatibility header.
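The combined effect of array linearization and iteration clamping can be sketched in plain C. The function below is a hand-written approximation of the translator's output on a 2D scaling kernel, with the thread grid modeled by an outer loop; it is an illustration, not actual generated code.

```c
#include <stddef.h>

/* After the cuda translator (sketched sequentially): the 2D
   variable-length array a[h][w] is linearized into a flat pointer
   (a[i][j] becomes a[i*w + j]), the loop is normalized to start at
   0 with unit stride, and a clamp guards against the extra
   iterations that a fixed-size thread grid would introduce. */
void scale_after(size_t h, size_t w, double *a, size_t nthreads) {
    for (size_t t = 0; t < nthreads; t++) {  /* stands for the thread grid */
        if (t < h * w)                       /* iteration clamping */
            a[t] = 2.0 * a[t];
    }
}

double clamp_demo(void) {
    double a[6] = {1, 2, 3, 4, 5, 6};        /* a 2x3 array, flattened */
    scale_after(2, 3, a, 8);                 /* 8 "threads" for 6 points */
    return a[5];                             /* last element, doubled */
}
```

Without the clamp, the two surplus "threads" would write out of bounds, which is exactly the situation the translator must prevent when the grid size is rounded up.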
7.2.3 Experiments & Validation
We have validated the tool on a set of image processing kernels: a convolution, with a window size of 5 × 5, and a finite impulse response filter, with a window size of n/1000. The erode used as a running example so far does not pass the computational intensity test (see Section 5.3) on the considered machine and is not included in the benchmark.
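A computational intensity test of this kind can be sketched as a simple flop-per-byte ratio check. The threshold and the cost model below are illustrative assumptions, not the actual criterion of Section 5.3.

```c
#include <stddef.h>

/* Crude offloading criterion: offload only if the kernel performs
   enough operations per byte moved over the host/gpu link to
   amortize the transfer cost. The threshold is an assumed constant,
   to be calibrated per machine. */
int worth_offloading(size_t flops, size_t bytes_transferred) {
    const double min_intensity = 8.0;  /* assumed flop/byte threshold */
    return (double)flops / (double)bytes_transferred >= min_intensity;
}
```

Under such a test, a 5 × 5 convolution (tens of operations per transferred pixel) passes, while a 3 × 3 erode (a handful of comparisons per pixel) may not, matching the observation above.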
Measurements have been made using a desktop station hosting a 64-bit Debian/testing with gcc 4.3.5 and a 2-core 2.4 GHz Intel Core2 cpu. The cuda 3.2 compiler is used and the generated code is executed on a Quadro FX 2700M card. Compilation is fully automatic. The whole run is measured, i.e. timings include gpu initialization, data transfers, kernel calls, etc. The median over 100 runs is taken. Figure 7.8 shows additional results for digital signal processing kernels extracted from [Orf95] and available on the website http://www.ece.rutgers.edu/~orfanidi/intro2sp: an N-Discrete Time Fourier Transform and a sample cross-correlation.
The sloccount report for each part of the prototype is given in Table 7.2.
7.3 An FPGA Image Processor Accelerator Compiler
Heterogeneous computing is all about balancing the hardware and the associated costs, say intellectual property rights, energy consumption, volume, throughput, maintenance or development costs. For embedded devices, the balance is all the more difficult to find as the constraints are tighter. As a consequence, the hardware is likely to be highly specialized, which often means difficult to program. The terapix platform is a good illustration of this phenomenon: it is a low-power, high-throughput device specialized for image processing, based on an fpga, and developed by thales. There are two main motivations for this machine:
1. to be able to process a stream of images directly on the camera that generates them, the so-called "intelligent camera". In the context of event recognition, if the events are scarce, it is too expensive to transfer all data to a remote processing engine. Performing the detection in place allows transferring only valuable data;
2. to be independent of a circuit provider. For long-term maintenance, it is not acceptable to depend on third-party, closed-source hardware. Choosing an fpga-based circuit unties the machine from the hardware.
[Four plots of execution time (s) versus input size, each comparing "on GPU" and "on CPU": (a) Convolution, (b) fir, (c) N-Discrete Time Fourier Transform, (d) Correlation.]
Figure 7.8: Median execution time on a gpu for dsp kernels.
[Figure omitted: block diagram of the terapix architecture.]
Figure 7.9: terapix architecture.
This section presents the compilation chain for this hardware and its results on a few benchmarks. Section 7.3.1 describes the architecture and models it as a feature diagram. Section 7.3.2 proposes a compilation flow based on this model and the transformations presented in this thesis. Section 7.3.3 validates the approach on various image processing algorithms and compares manually compiled code to automatically generated code.
7.3.1 Architecture Description
In this section, we give a quick summary of the terapix architecture and emphasize the hardware constraints. The reader interested in more details is referred to [BLE+08].
The terapix architecture is an fpga-based circuit implemented on a Virtex-4 SX-55 from Xilinx. A general-purpose softcore microprocessor, the µP, implements the control part, and a simd Processing Unit (pu) is used for the image kernels that require high processing power. This pu consists of 128 pes that run at 150 MHz. The interconnect between pes follows a ring topology, so that each pe has access to its neighbours' memory and to its local Random Access Memory (ram) of 512 × 36b used for registers. A Read Only Memory (rom) of limited size can be accessed by all pes.
The isa is dedicated to image processing: it uses a Very Long Instruction Word (vliw) instruction set that provides arithmetic operations over integers and a conditional assignment. Neither division nor floating-point operations are available. Direct and indirect addressing modes are supported, as well as a special pattern addressing mode to describe complex memory access patterns. Using a vertical pattern, a pe that accesses a[i] retrieves the a[#PE][i] element of the 2D global memory, while using a diag2 pattern, it retrieves the a[#PE][i+#PE] element. The sequencer only provides three control operations: a counter-based loop, a continue and a return.
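The pattern addressing modes can be described functionally: given a pe index, a pattern maps a local reference a[i] to a location in the 2D global memory. The following small simulation is an illustration of the text above, not terasm semantics verbatim.

```c
#include <stddef.h>

/* Simulated terapix pattern addressing for pe `p` referencing a[i].
   Returns the linearized index into the 2D global memory (row-major,
   `width` columns): vertical -> a[p][i], diag2 -> a[p][i + p]. */
typedef enum { VERTIC, DIAG2 } pattern_t;

size_t pattern_addr(pattern_t pat, size_t p, size_t i, size_t width) {
    size_t col = (pat == DIAG2) ? i + p : i;
    return p * width + col;
}
```

For instance, pe 3 referencing a[5] in a 100-column memory reads element (3, 5) under a vertical pattern and element (3, 8) under diag2.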
A vliw instruction consists of 5 fields, given in Table 7.3. The image field manipulates
32b     26b    32b        13b   5b
Image   Mask   Register   Alu   Sequence
Table 7.3: Description of a terapix microinstruction.
pointers to the global ram and to neighbouring pes, the mask field manipulates pointers to the global rom, the register field manipulates pointers to the local ram, the alu field selects arithmetic operations and operators, and the sequence field is used for the program counter. An example of vliw assembly code is given in Listing 7.3.
The set of hardware features is described in Figure 7.10, using the methodology proposed in Chapter 2.
This hardware is currently only programmed by hand: the developer writes the C code for the host side and the microcode, i.e. the assembly code for the accelerator. Three tools are provided to the developer: a compiler from the assembly code that generates a microcode image in the form of an array of bytes defined in a C header for inclusion on the host side, a cycle-accurate simulator to test the resulting code, and a code compactor to pack vliw instructions when possible.
In addition to the restricted isa, programming such a machine is difficult because of the combination of a large simd unit and a limited dma. For instance, to perform a point-to-point operation on a vector of 130 elements, one must load the first 128 elements, perform the computation and copy back the result, then load the elements from the third to the 130th, perform the computation and copy the result back, leading to 126 computations being performed twice. Figure 7.11 illustrates this troublesome behavior.
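The amount of recomputation follows directly from the tile size: when an n-element vector with pe < n ≤ 2·pe is processed in two full-width simd steps, with the second load anchored at the vector's tail, the two loads overlap on 2·pe − n elements. A one-line check of the 130-element example:

```c
#include <stddef.h>

/* Elements computed twice when an n-element point-to-point operation
   (pe < n <= 2*pe) is done in two full-width simd steps: the second
   load covers the last pe elements, so it overlaps the first load on
   2*pe - n elements. */
size_t redundant_elements(size_t n, size_t pe) {
    return 2 * pe - n;
}
```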
7.3.2 terapix Compiler Implementation
Compiling for terapix requires two steps: the input code is scanned for parallel loops and each of them is split into a host part and an accelerator part; then the accelerator part is translated into terapix assembly code.
A compilation scheme that takes into account the terapix specificities is given in Figure 7.12.
7.3.2.1 Input Code Splitting
The separation of the kernel from its caller is performed by Algorithm 8, based on the transformations presented in Sections 6.1 and 6.2.
Once a kernel has been extracted, it must be converted to meet the hardware constraints. However, no C-to-assembly compiler is available, and we are left with an assembler and a code compactor. As a consequence, we first perform as many refinements as possible at the source level, using the ideas developed in Chapter 4. Then we use an ad hoc C-to-terasm tool developed for this purpose. It generates uncompacted code and pipes it through the code compactor to generate the final assembly code.
prog convol<br />
sub convol<br />
pattern vertic || || || ||<br />
im ,i1= FIFO1 +NN ||ma ,m1= FIFO3 || || ||<br />
im ,i2= FIFO2 +NN || || || do_N1 ||<br />
im ,i1=i1+SS || || || ||<br />
im ,i2=i2+SS || || || ||<br />
im ,i3=i1+W || || || ||<br />
im ,i4=i2+W || || || do_N2 ||<br />
im ,i3=i3+E || ma=m1 ||P=im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im=im+S || ma=ma+S ||P=P || ||<br />
|| ||P=N+im*ma || ||<br />
im=im+S || ma=ma+S ||P=P || ||<br />
|| ||P=N+im*ma || ||<br />
im=im+W || ma=ma+W ||P=P+im*ma || ||<br />
im=im+W || ma=ma+W ||P=P+im*ma || ||<br />
im=im+N || ma=ma+N ||P=P || ||<br />
|| ||P=S+im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im ,i4=i4+E || ||P,im=P || loop ||<br />
|| || || loop ||<br />
|| || || || return<br />
endsub<br />
endprog<br />
Listing 7.3: terapix assembly <strong>for</strong> a 3 × 3 convolution kernel.<br />
[Feature diagram: terapix with the mandatory features memory (rom, ram; distributed), isa, Acceleration and Parallelism (simd).]
Figure 7.10: terapix hardware feature diagram.
[(a) First step: no redundant computations. (b) Second step: redundant computations.]
Figure 7.11: terapix redundant computations.
[Diagram: the sequential code goes through the translator, producing a sequential C code with a kernel call, compiled by a C compiler into the host binary, and a sequential C code embodying the kernel, which the terapix post-processor turns into uncompacted assembly; the compactor then produces compacted microcode assembly, which the assembler turns into the accelerator binary.]
Figure 7.12: Source-to-source compilation scheme for terapix.
Data: s ← a statement
Data: pe ← the number of Processing Elements
Data: m ← the accelerator memory size
Result: k, a set of kernel codes
for l ∈ loops(s) do
    if depth(l) = 2 then
        declare_variable(s, size_t height);
        declare_variable(s, size_t width);
        l′ ← symbolic_tiling(l, 〈height, width〉);
        solve_linear_system(l′, pe, m);
        generate_rom(l′);
        s′ ← isolate_statement(s, l′);
        k ← k ∪ {outline(s, s′)};
    end
end
return k
Algorithm 8: terapix kernel extraction algorithm at the pass manager level.
Algorithm 9 details the steps involved in assembly code generation. It first processes each loop to normalize its iteration space and converts each do-loop into its while-loop counterpart. Declaration blocks are removed by flatten code and all array references are replaced by their pointer equivalents using array linearization. Strength reduction transforms pointers into iterators whenever possible. The granularity of the C code is then lowered by split update operator and n-address code generation.
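On a toy kernel, the effect of these source-level refinements can be sketched as the following before/after pair. This is hand-written C approximating what the passes produce, not actual terapyps output.

```c
/* Before: array reference, update operator, compound expression. */
void before(int *a, int n, int k) {
    for (int i = 0; i < n; i++)
        a[i] += 2 * k + 1;
}

/* After (approximately): linearize_array and strength_reduction
   replace the indexed access by a moving pointer; the do-loop
   becomes a while-loop; split_update_operator and
   n_address_code_generation lower each statement to at most two
   operands per operation. */
void after(int *a, int n, int k) {
    int *p = a;            /* strength-reduced iterator */
    int t0 = 2 * k;        /* two-address temporaries */
    int t1 = t0 + 1;
    int i = 0;
    while (i < n) {        /* do_loop_to_while_loop */
        int t2 = *p;       /* split update: load... */
        t2 = t2 + t1;      /* ...add... */
        *p = t2;           /* ...store */
        p = p + 1;
        i = i + 1;
    }
}

/* Check that the lowered version computes the same result. */
int lower_demo(void) {
    int a[4] = {0, 1, 2, 3}, b[4] = {0, 1, 2, 3};
    before(a, 4, 3);
    after(b, 4, 3);
    return a[2] == b[2] && a[2] == 9;  /* 2 + (2*3 + 1) = 9 */
}
```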
7.3.3 Experiments & Validation
We categorize the image operators found in terapix's application domain as either point-to-point, vertical, horizontal or stencil operators. This leaves aside operators such as histograms, which are not covered by our compilation scheme because of their more complex parallelization scheme. For each category, we choose a specific operator, namely brightness, vertical erode, horizontal convolution and convolution. A terapix expert manually wrote an optimized assembly version of these kernels, and we wrote the textbook version of these algorithms in C and piped it through our automatic compiler. Table 7.4 gives the ratio between microcode cycle counts for automatic and manual code generation. It shows that the automatically generated code's execution time is close to the manual one. The slowdown of the vertical erode is due to a naïve register allocation scheme that fails to exploit the low-latency terapix registers.
The sloccount report for each part of the prototype is given in Table 7.5. This prototype is far more complex than the previous ones, but so is the target.
Listing 7.4 illustrates the behavior of Algorithm 9 on a horizontal erosion for the host side. Listing 7.5 illustrates the accelerator side. Listing 7.6 shows the generated assembly
7.3. AN FPGA IMAGE PROCESSOR ACCELERATOR COMPILER 151<br />
Data: k ← a kernel from Algorithm 8
Data: I ← {f | f is an instruction in Terasm}
Result: k formatted as a terapix microcode
for l ∈ loops(k) do
    k ← loop_normalize(l, lower_bound=0);
    k ← do_loop_to_while_loop(l);
end
k ← flatten_code(k);
k ← linearize_array(k, pointer_conversion=True);
k ← strength_reduction(k);
k ← split_update_operator(k);
k ← n_address_code_generation(k, 2);
k ← normalize_terapix_microcode(k);
k ← dead_code_elimination(k);
for i ∈ I do
    k ← instruction_selection(k, i);
end
return k

Algorithm 9: C-to-terapix translation algorithm at the pass manager level.
                     brightness   horizontal convolution   vertical erode   convolution
automatic / manual       ×1               ×1.31                ×2.12           ×1.31

Table 7.4: Ratio between terapix microcode cycle counts for automatic and manual code generation.
          translator   post-processor   maker   #pass involved
SLOC          211            218          18          32

Table 7.5: sloccount report for a terapix assembly generator prototype written in pyps.
152 CHAPTER 7. COMPILER IMPLEMENTATIONS AND EXPERIMENTS<br />
code after compaction. The comparison with the initial listing demonstrates the need <strong>for</strong><br />
an au<strong>to</strong>matic generation <strong>to</strong>ol.<br />
7.4 A Retargetable Multimedia Instruction Set Compiler<br />
This section presents a retargetable compiler for mis built on pips. It relies on the work on Super-word Level Parallelism (slp) presented in Section 5.1 and on the communication optimization from Section 6.3. It targets three different instruction sets: sse, avx and neon.
7.4.1 Architecture Description<br />
mis rely on the vector units found in most modern processors. The hardware feature set of such processors is given in Figure 7.13. Each mis has a specific isa, but we have already shown in Section 5.1.2 how to represent them using a generic instruction set. As a consequence, an instruction set is mainly characterized by the number of bits per vector register.
7.4.2 Compiler Implementation<br />
The input language is C and the output language is C with mis intrinsics. As intrinsics<br />
are C functions, there is no need <strong>for</strong> post-processing. However, header substitution is<br />
required <strong>to</strong> specialize the generic mis. Figure 7.14 summarizes the compilation flow. The<br />
source-<strong>to</strong>-source transla<strong>to</strong>r relies on Algorithm 3 from Chapter 5 <strong>to</strong> generate vec<strong>to</strong>r code.<br />
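The specialization step can be sketched as follows (a Python model of ours, not the actual pyps headers): the translator emits code against generic names only, and a per-target "header" fixes the vector width, which is the main parameter distinguishing the instruction sets.

```python
# Generic SIMD operations, as emitted by the translator: the generated
# code only ever calls these names; a target header binds them.
def simd_load(mem, i, vlen):
    return mem[i:i + vlen]

def simd_add(u, v):
    return [a + b for a, b in zip(u, v)]

def simd_store(mem, i, vec):
    mem[i:i + len(vec)] = vec

# Vector widths in 32-bit floats: 4 for 128-bit sse, 8 for 256-bit avx.
# These dictionary entries stand in for the real headers that would map
# the generic calls onto target intrinsics.
TARGETS = {"sse": 4, "avx": 8}

def vector_add(dst, a, b, target):
    vlen = TARGETS[target]
    n = len(a) - len(a) % vlen          # vectorizable prefix
    for i in range(0, n, vlen):
        simd_store(dst, i, simd_add(simd_load(a, i, vlen),
                                    simd_load(b, i, vlen)))
    for i in range(n, len(a)):          # scalar epilogue
        dst[i] = a[i] + b[i]
```

Retargeting amounts to swapping the header, not regenerating the code.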
7.4.3 Multimedia Instruction Set on Desk<strong>to</strong>p and Embedded Processors<br />
Three sets of experiments have been carried out. They all use the same set of C source files. The sse mis was tested on a Core2 Duo running at 2.2 GHz with a 2.6.34 Linux kernel. A board with an ARMv7 processor and a 2.6.28 Linux kernel was used for the neon mis. A machine with a 2.6.32 Linux kernel and an Intel SandyBridge (running at 2.6 GHz) executed the avx tests.
Applications have been chosen to point out limitations of compilers (including ours). daxpy_u?r.c, ddot_u?r.c and dscal_u?r.c are taken from the linpack [DLP03] benchmark and illustrate the impact of manual unrolling on vectorization. matrix_*.c are taken from the Coremark [Con] benchmark and show the impact of tiling. stencil.c is a typical stencil application and a good candidate for vectorization.

Other benchmarks are textbook versions of well-known computation kernels (Finite Impulse Response filter, average power, alpha-blending, convolution with a 3 × 3 kernel) taken from a dsp manual [Orf95].
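For reference, the structure of the rolled and unrolled linpack variants looks as follows (a Python paraphrase of ours; the benchmarks themselves are C, with names following the daxpy_u?r.c convention). The unrolled body exposes the isomorphic statement sequence that the slp approach packs into vectors.

```python
def daxpy_r(n, da, dx, dy):
    """Rolled textbook daxpy: dy += da * dx."""
    for i in range(n):
        dy[i] += da * dx[i]

def daxpy_ur(n, da, dx, dy):
    """Manually 4-way unrolled variant, in the style of linpack:
    a scalar prologue handles n % 4 elements, then the unrolled loop
    processes 4 isomorphic statements per iteration."""
    m = n % 4
    for i in range(m):
        dy[i] += da * dx[i]
    for i in range(m, n, 4):
        dy[i]     += da * dx[i]
        dy[i + 1] += da * dx[i + 1]
        dy[i + 2] += da * dx[i + 2]
        dy[i + 3] += da * dx[i + 3]
```

Both variants compute the same result; they differ only in how much instruction-level parallelism is visible to a sequence-based vectorizer.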
7.4. A RETARGETABLE MULTIMEDIA INSTRUCTION SET COMPILER 153<br />
void runner(int n, int img_out[n][n-4], int img[n][n]) {
    for (int y = 0; y < n; y++)
        /* ... remainder of the listing ... */

Listing 7.4: Host-side code for a horizontal erosion (excerpt).
void launcher_0_microcode(int I_29, int img00[258], int img_out00[254]) {
    for (int x = 0; x /* ... remainder of the listing ... */

Listing 7.5: Accelerator-side kernel for a horizontal erosion (excerpt).
prog launcher_0_microcode<br />
sub launcher_0_microcode<br />
im , i12 = FIFO2 ||||P,re (0)=1 || ||<br />
im , i11 = FIFO1 |||| P=P || ||<br />
im , i10 = i11 +1* E |||| P=P || ||<br />
im ,i9=i11 +2* E |||| P=P || ||<br />
im ,i8=i11 +3* E |||| || ||<br />
im ,i7=i11 +4* E |||| || ||<br />
im ,i1=i7 |||| || ||<br />
im ,i2=i8 |||| || ||<br />
im ,i3=i9 |||| || ||<br />
im ,i4=i10 |||| || ||<br />
im ,i5=i11 |||| || ||<br />
im ,i6=i12 |||| || ||<br />
im ,i6=i6 +1* W |||| || ||<br />
im ,i5=i5 +1* W |||| || ||<br />
im ,i4=i4 +1* W |||| || ||<br />
im ,i3=i3 +1* W |||| || ||<br />
im ,i2=i2 +1* W |||| || ||<br />
im ,i1=i1 +1* W |||| || do_N1 ||<br />
im=i5 +1* E ||||P,re (1)= im*re (0) || ||<br />
im=i4 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=P-im*re (0) || ||<br />
||||P,re (2)= im*re (0) || ||<br />
im=i3 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (1) || ||<br />
||||P,re (8)= if(As =1 ,P,re (2))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (8) - im*re (0) || ||<br />
||||P,re (7)= im*re (0) || ||<br />
im=i2 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (8) || ||<br />
||||P,re (7)= if(As =1 ,P,re (7))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (7) - im*re (0) || ||<br />
||||P,re (6)= im*re (0) || ||<br />
im=i1 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (7) || ||<br />
||||P,re (6)= if(As =1 ,P,re (6))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (6) - im*re (0) || ||<br />
||||P,re (2)= im*re (0) || ||<br />
im=i6 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (6) || ||<br />
||||P,im=if(As =1 ,P,re (2)) || loop ||<br />
|||| || || return<br />
endsub<br />
endprog<br />
Listing 7.6: Illustration of terapix compacted assembly.
[Diagram omitted: mis node linked by "mandatory feature" edges to memory (ram), isa (simd) and Acceleration (Parallelism).]

Figure 7.13: mis hardware feature diagram.
[Diagram omitted: Sequential Code → Translator → Sequential Code + Intrinsics → Source-to-Binary Compiler → Binary, with a specialization header feeding the source-to-binary compiler.]

Figure 7.14: Source-to-source compilation scheme for mis.
The experiments consist of measuring the execution time of each kernel, using either the initial version or the optimized version generated by our compiler. We used gcc 4.4.5 for both the i386 and ARM architectures, with the -O3 -ffast-math flags, and measured the median over 150 runs of each program.

The challenge is to reach the same level of performance as the Intel C++ Compiler (icc) for sse and avx, while supporting another architecture, arm, in the same unified infrastructure, pips, to provide performance portability.
7.4.4 Results & Analyses<br />
Figures 7.15a, 7.15b and 7.15c show the results of the experiments, giving the speedup of the vectorized version compared to the reference sequential version. The reference run is the original source compiled with gcc with -O3 -ffast-math and -fno-tree-vectorize. These experiments lead to the following assessments:

1. gcc's vectorization engine hardly achieves any speedup: only rather simple kernels get a 2× speedup, and fir even suffers a significant slowdown, while icc gets very good speedups for all kernels;

2. using our vectorization engine and running gcc on the output is almost always beneficial (green bars are above red bars). This is especially visible on the arm processor;

3. we outperform icc (pink bars above blue bars) for matrix-mul-* on sse, thanks to the combination of tiling and vectorization, but always lose to it for the same kernels on avx;

4. the Fused Multiply-Add (fma) operation is available in the neon mis and gcc does not use it. This explains the super-ideal speedups and shows the benefit of using target-specific instructions;

5. the unrolled versions of the linpack kernels are better vectorized by pips, thanks to the slp approach;

6. pips output behaves better when compiled by icc than when compiled by gcc. This illustrates the source-to-source approach, which hooks into the compilation flow to add a feature, here vectorization, and delegates the remaining work to other compilers. In this case, icc performs additional optimizations on the vector code that gcc is not aware of.

These experiments validate the approach: performance within the reach of icc is achieved, and this performance is portable across architectures.
The sloccount report for each part of the prototype is given in Table 7.6. The SLOC for the post-processor and the maker are given for the avx driver; the sse and neon drivers have similar values.
[Bar charts omitted: speedup vs. sequential execution for gcc+nopips, gcc+pips, icc+nopips and icc+pips (gcc and gcc+pips only for neon) over the kernels fir, matrix-mul-matrix, corr, convol3x3, ddot-r, dscal-r, matrix-mul-vect, ddot-ur, alphablending, stencil, matrix-mul-const, dscal-ur, daxpy-r and daxpy-ur.]

(a) Vectorization using the sse mis with 32-bit OS.
(b) Vectorization using the avx mis with 64-bit OS.
(c) Vectorization using the neon mis.
7.5. CONCLUSION 159<br />
          translator   post-processor   maker   #pass involved
SLOC          223            166           1          30

Table 7.6: sloccount report for an avx intrinsic generator prototype written in pyps.
7.5 Conclusion

[Chart omitted: pass count (0 to 30) against the number of compilers (1 to 4) in which a pass is found.]

Figure 7.15: Pass reuse among 4 pyps-based compilers.
We claim in Chapter 3 that an important characteristic of a compiler infrastructure is pass reuse among compilers. To verify that our framework matches this expectation, we ran the following experiment over the six compilers presented in this chapter: for each pass and analysis available in pips, we count the number of compilers it is used in. The mis compiler is counted only once. The result of this analysis is given in Figure 7.15. This chart does not take into account parsers and pretty-printers. Although the targets are very different (a multicore device, a mis, a gpu and an embedded accelerator), we still have good pass reuse, as more than 50% of the passes are used in at least two compilers, if they are used at all. The high number of passes used in only one compiler is linked to the fact that each target is subject to very different hardware constraints, especially for the isa. As a consequence, each compiler has many small passes to take target-specific aspects into account. The passes that are reused the most are also the most complex ones: symbolic tiling, outlining, statement isolation, etc. The addition of an Open Computing Language (opencl) compiler would certainly share a lot with the existing cuda compiler.

Table 7.7 summarizes the information presented for each compiler prototype. It shows that each compiler was assembled with a reasonable development effort, but also that the
          translator   post-processor   maker   #pass involved
openmp         41             0            2           8
terapix       211           218           18          32
mis           223           166            1          30
cuda          597             0          N/A          20

Table 7.7: Summary of the sloccount reports for the compiler prototypes written in pyps.
more complex the target is, the more complex the pass manager is. However, pass reuse<br />
makes it possible <strong>to</strong> keep this complexity low.<br />
Another point to consider about pass reuse is compiler composition. Some compilers may not share many passes, but we have provided a way to compose them using multiple inheritance. The consequence of this modular design is that each compiler focuses on a specific task rather than serving an all-around purpose.
This chapter presents the design and implementation of four compilers for heterogeneous devices, based on the heterogeneous model analyses from Chapter 2, the compiler infrastructure described in Chapter 3 and the transformations described in Chapters 4, 5 and 6.

It first describes how to build a basic openmp compiler using the ideas developed in this thesis: model the architecture, identify the output language and reuse existing transformations.

Using the same methodology, we present three other compiler prototypes built during this PhD: a retargetable compiler for mis that targets sse, avx and neon, a compiler for the terapix image processor that goes from C down to assembly, and a C-to-cuda compiler for nVidia gpus.

Each prototype is validated on a set of benchmarks to ensure that it generates valid and reasonably efficient code. We also provide an analysis of each compiler in terms of Source Lines Of Code (sloc), pass usage and pass reuse.
Chapter 8<br />
Conclusion<br />
Vieux pont du Bono, Morbihan © Pierre Yves Sabas
The path toward performance is led by heterogeneous devices: even the laptop used to write this dissertation can use the processing power of two gpps, the two associated sse vector units, and a gpgpu. The main concern with such devices is programmability. In this thesis, we have taken the path of compilation to automate the production of code for hardware accelerators. We focused on the ability to produce different compilers for different hardware at low cost. As modern hardware is usually programmable in a C dialect, we set the goal of automatically translating textbook algorithms written in C into several target-dependent kernels written in a C dialect, and of generating the glue code between the host and the accelerator.

The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the other. It reduces the cost of producing compilers as new targets become available. Moreover, using a source-to-source infrastructure makes it possible to interact with existing tools, especially compilers that generate binary code from C dialects.
162 CHAPTER 8. CONCLUSION<br />
To avoid the pitfall of the parallel compilers designed in the 1980s, we deliberately chose to take as inputs programs for which the sequential and parallel algorithms are similar. This made it possible to focus on the translation task rather than on parallelism extraction.
Contributions<br />
Methodology <strong>to</strong> Build <strong>Source</strong>-<strong>to</strong>-<strong>Source</strong> <strong>Compilers</strong><br />
We have proposed <strong>to</strong> model hardware devices with a hardware constraint diagram. This<br />
diagram identifies the manda<strong>to</strong>ry and optional features of the hardware, and the manual<br />
association between constraints and code trans<strong>for</strong>mations guides the compiler developer<br />
through the compiler development process.<br />
Generic Compiler Infrastructure Design<br />
The heterogeneity of accelerating devices makes it difficult to build a unique compiler to target them all. Moreover, pieces of software already exist, at different levels, to program these machines. In Chapter 3, we have proposed a compilation flow that combines a comprehensive source-to-source transformation toolbox, an api for pass management and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by hpc project.
Trans<strong>for</strong>mations <strong>for</strong> isa Constraints<br />
<strong>Heterogeneous</strong> devices provide acceleration through specialization: basically, they per<strong>for</strong>m<br />
better on a narrower set of applications. The direct consequence is a specialization of<br />
the isa. This specialization is visible in the C dialects proposed <strong>to</strong> program these devices.<br />
Chapter 4 proposes a set of source-<strong>to</strong>-source trans<strong>for</strong>mations <strong>to</strong> lower the level of the input<br />
language, including an original algorithm <strong>for</strong> outlining based on convex array regions.<br />
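The idea of outlining guided by convex array regions can be sketched as follows (our Python illustration, not the pips algorithm): the outlined kernel receives only the region its body accesses, here a[y-1 .. y+1], rather than the whole array.

```python
def kernel(window):
    """Outlined body: works only on the 3-element region it was given,
    which is the convex array region read by the original statement."""
    return window[0] + window[1] + window[2]

def smooth(a):
    """Caller: for each point, extract the accessed region and hand it
    to the outlined kernel (the region bounds come from the analysis)."""
    out = []
    for y in range(1, len(a) - 1):
        region = a[y - 1:y + 2]   # convex region [y-1, y+1] of a
        out.append(kernel(region))
    return out
```

Because the kernel's interface names exactly the data it touches, the same mechanism later drives the host/accelerator data transfers.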
Hybrid slp Algorithm<br />
Multimedia instructions are now commonly found in gpps and even provide the basis for acceleration in hybrid cpu/gpu chips. We have developed an original algorithm, based on existing work on loop vectorization and Super-word Level Parallelism, that combines the benefits of loop-based and sequence-based approaches in a unified algorithm. It is parametrized by a C description of the isa and thus respects the retargetability criterion raised in Chapter 3. The algorithm has been tested on three Multimedia Instruction Sets: sse, avx and neon. This work received the third best poster award at PACT 2011.
Trans<strong>for</strong>mations <strong>to</strong> Meet Memory Constraints<br />
Memory is critical for many heterogeneous systems: when the accelerator does not share memory with its host, rpc and dma are needed, and the programming model is much more complex than classical ones. We have presented in Chapter 6 three transformations to take this into account: statement isolation, which separates accelerator memory from host memory; memory footprint reduction, which finds a tiling matrix that ensures there is enough memory on the accelerator to run the tiled code; and redundant load-store elimination, which removes redundant data transfers.
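Statement isolation can be mimicked on a toy example (a Python sketch of ours; real targets use dma instead of list copies): the statement is rewritten so that it only touches its own buffers, and explicit copy-in/copy-out operations stand for the generated transfers. Redundant load-store elimination would then remove any transfer whose data is already present on the accelerator.

```python
def isolate_and_run(host_img, lo, hi):
    """Run `out[i] = 2 * in[i]` on the region [lo, hi) as an isolated
    statement: the loop below never reads or writes host memory."""
    accel_in = host_img[lo:hi]        # copy-in: host -> accelerator
    accel_out = [0] * (hi - lo)       # accelerator-local output buffer
    for i in range(hi - lo):          # isolated statement
        accel_out[i] = 2 * accel_in[i]
    host_out = host_img[:]
    host_out[lo:hi] = accel_out       # copy-out: accelerator -> host
    return host_out
```

The kernel body is oblivious to the host layout; only the copy-in/copy-out code knows the region bounds, which is what makes footprint reduction by tiling possible.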
Implementation<br />
All the transformations presented in this thesis have been developed in the pips source-to-source compiler infrastructure for the C language and assembled using the pyps pass manager. They have led to the implementation of four compilers: a prototype Open Multi Processing (openmp) directive generator, a retargetable compiler for mis, a kernel generator for an fpga-based image processor, terapix, and a gpu code generator developed by hpc project. These prototypes validate both the overall compiler infrastructure design and the algorithms proposed in the thesis. Experiments and compilation flows are detailed in Chapter 7.
Contributions <strong>to</strong> the pips community<br />
It is difficult to untie a PhD in applied computer science from the development task. Developing new passes, and extending some of the existing passes originally designed for Fortran to the C language, has taken a significant amount of time. As a member of the pips team, I have managed the modernization of the build system and the rationalization of the software packaging.

I have supervised five trainees at Télécom Bretagne during internships related to the pips project and contributed to the scientific dissemination of our work through two tutorials.
Future Work<br />
hpc is steadily changing. Sparc64 ranked first in the Top500 of June 2011, while nVidia's gpus led the race six months before. In this moving environment, nothing is ever settled, and hardware vendors keep pushing their standards to obtain a common programming model supported by efficient engineering tools. This requires cooperation and interoperability between tools. To that extent, bridging the gap between opencl and existing vhdl generators is an interesting challenge and remains an open research field. However, hpc is still a niche market compared to embedded systems and smartphones. In these fields, the hardware constraints are all the more important due to limited battery
capacity, weight, space, etc. The transformations and the approach studied during this PhD can certainly be used in those fields.
mis are becoming more and more flexible, and it is common to find non-simd instructions in their C apis. These instructions allow more load/store patterns (e.g. strided loads) and help achieve higher throughput in memory-constrained applications. The incremental addition of transformations that handle these instructions in our mis compiler is a promising subject.
We see two possible extensions to the work on pass managers. First, the combination of operators creates a directed graph that presents interesting parallelization opportunities at the pass manager level. It would make the compilation process itself parallel and improve compilation time. Second, some phase combinations are known to be redundant, meaningless, etc. Adding semantics to code transformations would enable interesting graph pruning, for instance in the context of iterative compilation.
Appendix A<br />
The PIPS Compiler Infrastructure<br />
Pont de Saint Goustan, Morbihan © Gwenael AB / flickr
Paralléliseur Interprocédural de Programmes Scientifiques (pips) [IJT91, ISCKG10, AAC+11] is a source-to-source compiler infrastructure started in 1988 at MINES ParisTech, when parallel architectures were preeminent. Since then it has been successfully used to analyse, check or parallelize industrial Fortran codes. C support started ten years ago, bringing both new challenges and new applications. The key ideas of the framework that make it still relevant in 2011 are: a minimalistic Internal Representation (ir), interprocedural analyses, and abstract interpretation on polyhedral lattices. The compiler infrastructure for heterogeneous targets, the algorithms and the passes described in this thesis have been fully implemented on top of pips. Finally, the compilers described in Chapter 7 are also based on pips. The short overview given in this appendix should help readers not accustomed to pips understand some technical parts. A more detailed overview is given in [AAC+11], and the interested reader is advised to refer to the theses of Nga Nguyen [Ngu02] and Béatrice Creusillet [Cre96] for a detailed description of the underlying mathematical framework.
166 APPENDIX A. THE PIPS COMPILER INFRASTRUCTURE<br />
void foo(int n, int threshold, int a[n], int b[n]) {
    int k = 0;
    for (int h = 0; h < n; h++) {
        if (a[h] > threshold)
            b[k++] = a[h];
    }
}

Listing A.1: A simple loop to illustrate pips analyses.

Available Analyses
In addition <strong>to</strong> classical compiler analyses such as use-def chains, dependence graph or<br />
read-write effects, pips provides accurate interprocedural analyses. We illustrate each of<br />
them based on the loop from Listing A.1.<br />
Preconditions and Postconditions are affine predicates over scalar variables that are<br />
proved <strong>to</strong> hold be<strong>for</strong>e or after the execution of a given statement, respectively (see<br />
Listing A.2).<br />
// P() {}
void foo(int n, int threshold, int a[n], int b[n]) {
   // P() {}
   int k = 0;
   {
      // P(k) {k==0}
      // P(h,k) {k==0}
      for (int h = 0; h < n; h++)
         /* ... remainder of the listing ... */

Listing A.2: Example of precondition analysis (excerpt).
// T() {}
void foo(int n, int threshold, int a[n], int b[n]) {
   // T(k) {k==0}
   int k = 0;
   {
      // T(h) {}
      // T(h,k) {0 /* ... remainder of the listing ... */

Listing A.3: Example of transformer analysis (excerpt).
//  <may be read   >: a[*] threshold
//  <may be written>: b[*]
//  <    is read   >: n
void foo(int n, int threshold, int a[n], int b[n]) {
   //  <    is written>: k
   int k = 0;
   {
      //  <may be read   >: a[*] h k threshold
      //  <may be written>: b[*] k
      //  <    is read   >: n
      //  <    is written>: h
      for (int h = 0; h < n; h++)
         //  <    is read   >: h n threshold
         if (a[h] > threshold)
            //  <may be read   >: a[*]
            //  <may be written>: b[*]
            //  <    is read   >: h k n
            //  <    is written>: k
            b[k++] = a[h];
   }
}

Listing A.4: Example of cumulated memory effects analysis.
Appendix B<br />
The LuC language<br />
The LuC language is used in some proofs of this dissertation. This language is similar<br />
<strong>to</strong> Fortran with a C syntax. Redundant constructs such as the += opera<strong>to</strong>r are not<br />
represented <strong>to</strong> keep proofs simple. The main differences with the C language are the<br />
removal of recursive calls, unions and pointers and the addition of reference passing mode<br />
<strong>for</strong> function parameters. Global variables are not allowed. A short reference of its syntax<br />
is given here, using typing conventions <strong>to</strong> differentiate non-terminal symbols and terminals<br />
from language constructs.<br />
B.1 Syntactic Clauses<br />
prog : fdecls<br />
fdecls : ∅ | fdecl fdecls<br />
fdecl : void id ( param ) { stat }<br />
type : int | float | complex | struct id { fields } | type [ expr ]<br />
param : type id<br />
fields : ∅ | field fields<br />
field : type id<br />
expr : cst | ref | expr op expr<br />
ref : id | ref [ expr ] | ref . fieldname<br />
stat : ∅ | ; | { type id ; stat }<br />
| ref =expr ; | ref =read ;<br />
| write expr ;<br />
| ref (ref ) ;<br />
| if( expr ) { stat } else { stat }<br />
| while( expr ) { stat }<br />
| stat ; stat<br />
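As an illustration of this syntax, here is a small LuC program (our own example, not taken from the thesis) that reads a count followed by that many values and writes their sum; it uses the declaration-block, read/write, and while constructs of the grammar above:

```
void sum(int unused) {
    { int n ;
        { int s ;
            n = read ;
            s = 0 ;
            while (n) {
                { int x ;
                    x = read ;
                    s = s + x ;
                    n = n - 1 ;
                }
            }
            write s ;
        }
    }
}
```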
170 APPENDIX B. THE LUC LANGUAGE<br />
B.2 Semantic Clauses<br />
T(int | float, σ) = 1
T(complex, σ) = 2
T(struct id { fields }, σ) = Σ_{〈t,id〉 ∈ fields} T(t, σ)
T(type [ expr ], σ) = T(type, σ) × E(expr, σ)

R(id, σ) = I(id)
R(ref [ expr ], σ) = R(ref, σ)[E(expr, σ)]
R(ref . fieldname, σ) = R(ref, σ).fieldname

E(cst, σ) = cst
E(ref, σ) = σ(R(ref, σ))
E(expr₁ op expr₂, σ) = E(expr₁, σ) op E(expr₂, σ)

S(∅, σ) = σ
S(;, σ) = σ
S({ type id ; stat }, σ) = unbind(S(stat, loc(id, T(type, σ), σ)), id)
S(ref = expr, σ) = σ[R(ref, σ) → E(expr, σ)]
S(write expr, σ) = push(σ(istdout), E(expr, σ)); σ
S(ref = read, σ) = σ[R(ref, σ) → pop(σ(istdin))]
S(id(ref), σ) = σ[lₖ → vₖ | lₖ ∈ R(ref, σ) ∧ vₖ = S(body(id), {I(formal(id)) → E(ref, σ)})(lₖ)]
S(if(expr){ stat₀ }else{ stat₁ }, σ) = if E(expr, σ) then S(stat₀, σ) else S(stat₁, σ)
S(while(expr){ stat }, σ) = if E(expr, σ) then S(while(expr){ stat }, S(stat, σ)) else σ
S(stat₀ ; stat₁, σ) = S(stat₁, S(stat₀, σ))
Appendix C

Using PyPS to Drive a Compilation Benchmark

This verbatim copy of the script used to benchmark our Open Multi Processing
(openmp) translator on the PolyBench suite is a good illustration of the flexibility of Pythonic
PIPS (pyps). It instantiates a compiler for each application found in the PolyBench source
tree, turns each sequential kernel into a parallel kernel, and instruments it to gather
execution time information. This is achieved by composing the openmp compiler with
an instrumentation compiler and an abstraction—pyrops.pworkspace—that runs each
compiler in a new process.
import pyrops
import workspace_gettime
import openmp
from glob import glob
import shutil
from os.path import basename

map(shutil.rmtree, glob("PYPS*"))
map(shutil.rmtree, glob(".*.tmp"))

class workspace(workspace_gettime.workspace, pyrops.pworkspace):
    pass

ITER = 10
result = list()
for src in glob("polybench-2.0/*/*/*.c") + glob("polybench-2.0/*/*/*/*.c"):
    if src[-6:] != "pocc.c":
        name = basename(src).replace("_", "-")
        # workspace.delete(name)
        w = workspace(src, cppflags="-Ipolybench-2.0/utilities/", verbose=False)
        w.fun.main.benchmark_module()
        times0 = w.benchmark(iterations=ITER, LDFLAGS="-lm", CFLAGS="-O3 -ffast-math")
        w.fun.main.openmp(internalize_parallel_code=False)
        times1 = w.benchmark(openmp.ompMaker(), iterations=ITER, LDFLAGS="-lm", CFLAGS="-O3 -ffast-math")
        count = 0
        for line in w.fun.main.code.split('\n'):
            if line.find("#pragma omp ") != -1:
                count += 1
        result.append((name, count, times0['main'][0], times1['main'][0]))
        w.close()

fout = file("polybench-openmp.dat", "w")
for r in result:
    print >> fout, r[0], r[1], r[2], r[3]
fout.close()
Appendix D

Using C to Emulate sse Intrinsics

An excerpt of the header used as a sequential replacement for the Streaming simd
Extension (sse) header xmmintrin.h is reproduced here for the interested reader. It shows
how the sse Instruction Set Architecture (isa) can be emulated using pure C code. This
header was used to compile sse-enabled applications on processors that do not have sse
vector units.
#include <stdint.h>
#include <stddef.h>

/* Some macros from xmmintrin.h */
#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* the 128-bit vector type, emulated as a union of scalar arrays */
typedef union {
    uint64_t u64[2];
    uint32_t u32[4];
    uint16_t u16[8];
} __m128i;
/* data reorganization */
inline __m128i _mm_unpacklo_epi64(__m128i v0, __m128i v1) {
    __m128i ov = { .u64 = { v0.u64[0], v1.u64[0] } };
    return ov;
}

inline __m128i _mm_shufflehi_epi16(__m128i v, int mask) {
    __m128i ov = { .u16 = { v.u16[0], v.u16[1], v.u16[2], v.u16[3],
                            v.u16[4 + ((mask >> 0) & 3)], v.u16[4 + ((mask >> 2) & 3)],
                            v.u16[4 + ((mask >> 4) & 3)], v.u16[4 + ((mask >> 6) & 3)] } };
    return ov;
}

inline __m128i _mm_shufflelo_epi16(__m128i v, int mask) {
    __m128i ov = { .u16 = { v.u16[(mask >> 0) & 3], v.u16[(mask >> 2) & 3],
                            v.u16[(mask >> 4) & 3], v.u16[(mask >> 6) & 3],
                            v.u16[4], v.u16[5], v.u16[6], v.u16[7] } };
    return ov;
}

inline __m128i _mm_shuffle_epi32(__m128i v, int mask) {
    __m128i ov = { .u32 = { v.u32[(mask >> 0) & 3], v.u32[(mask >> 2) & 3],
                            v.u32[(mask >> 4) & 3], v.u32[(mask >> 6) & 3] } };
    return ov;
}

inline __m128i _mm_unpackhi_epi64(__m128i v0, __m128i v1) {
    __m128i ov = { .u64 = { v0.u64[1], v1.u64[1] } };
    return ov;
}
/* pure vector operations */
inline __m128i _mm_or_si128(__m128i v0, __m128i v1) {
    __m128i ov;
    for (size_t i = 0; i < 2; i++)
        ov.u64[i] = v0.u64[i] | v1.u64[i];
    return ov;
}

inline __m128i _mm_add_epi32(__m128i v0, __m128i v1) {
    __m128i ov;
    for (size_t i = 0; i < 4; i++)
        ov.u32[i] = v0.u32[i] + v1.u32[i];
    return ov;
}

/* bit operations on the full vector */
inline __m128i _mm_slli_si128(__m128i v, int count) {
    __m128i ov;
    count *= 8; /* the byte count is expressed in bits */
    if (count == 0) return v;
    if (count >= 64) {
        ov.u64[1] = v.u64[0] << (count - 64);
        ov.u64[0] = 0;
    } else {
        ov.u64[1] = (v.u64[1] << count) | (v.u64[0] >> (64 - count));
        ov.u64[0] = v.u64[0] << count;
    }
    return ov;
}

inline __m128i _mm_srli_si128(__m128i v, int count) {
    __m128i ov;
    count *= 8;
    if (count == 0) return v;
    if (count >= 64) {
        ov.u64[0] = v.u64[1] >> (count - 64);
        ov.u64[1] = 0;
    } else {
        ov.u64[0] = (v.u64[0] >> count) | (v.u64[1] << (64 - count));
        ov.u64[1] = v.u64[1] >> count;
    }
    return ov;
}
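As a quick sanity check of the emulation approach, the shuffle logic above can be exercised on any host. The following self-contained sketch re-declares the union and one intrinsic under local stand-in names (emu_m128i, emu_shuffle_epi32 — hypothetical names, chosen to avoid clashing with the real intrinsics) and verifies that _MM_SHUFFLE selects the expected 32-bit lanes:

```c
#include <assert.h>
#include <stdint.h>

#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* minimal stand-in for the emulated 128-bit vector type */
typedef union {
    uint64_t u64[2];
    uint32_t u32[4];
    uint16_t u16[8];
} emu_m128i;

/* same lane-selection logic as the emulated _mm_shuffle_epi32 */
static emu_m128i emu_shuffle_epi32(emu_m128i v, int mask) {
    emu_m128i ov = { .u32 = { v.u32[(mask >> 0) & 3], v.u32[(mask >> 2) & 3],
                              v.u32[(mask >> 4) & 3], v.u32[(mask >> 6) & 3] } };
    return ov;
}
```

With v = {10, 11, 12, 13}, calling emu_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)) reverses the four lanes.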
Code Transformation Glossary

Transformations marked with a † are passes implemented in pips during the PhD, or
passes I made significant contributions to.

array linearization † is the process of converting multidimensional arrays into unidimensional
arrays, possibly with a conversion from arrays to pointers.
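A hand-written before/after sketch of this transformation on a toy kernel (not pips output; names are illustrative):

```c
#include <assert.h>
#define N 4

/* before: multidimensional access */
static int sum2d(int a[3][N]) {
    int s = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* after array linearization: a single dimension and an explicit
   row-major index computation, with the array decayed to a pointer */
static int sum2d_linear(int *a) {
    int s = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}
```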
common subexpression elimination † is the process of replacing identical expressions by
a variable that holds the result of their evaluation.
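For instance (an illustrative sketch, not pips output), the repeated subexpression a + b is computed once and reused:

```c
#include <assert.h>

/* before: (a + b) is evaluated twice */
static int f(int a, int b, int c) {
    return (a + b) * c + (a + b);
}

/* after common subexpression elimination: evaluated once */
static int f_cse(int a, int b, int c) {
    int t = a + b;
    return t * c + t;
}
```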
constant propagation is a pass that replaces a variable by its value when that value is
known at compile time.

dead code elimination is the process of pruning from a function all the statements whose
results are never used.
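The two passes often combine: in this hand-written sketch (not pips output), propagating the known value of debug makes the guarded statement unreachable, and dead code elimination then removes both it and the unused computation.

```c
#include <assert.h>

/* before the two passes */
static int g(int x) {
    int debug = 0;
    int unused = x * x;      /* result never used once the branch is gone */
    if (debug)               /* debug is known to be 0 at compile time */
        x += unused;
    return x + 1;
}

/* after constant propagation and dead code elimination */
static int g_opt(int x) {
    return x + 1;
}
```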
directive generation is a common name for code transformations that annotate the code
with directives.

flatten code is the process of pruning declaration blocks from a function body so that all
declarations are made at the top level.

forward substitution † is the process of replacing a reference read in an expression by
the latest expression assigned to it.
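A minimal hand-written illustration of forward substitution (not pips output): the defining expression of t replaces its read.

```c
#include <assert.h>

/* before: t is assigned, then read in the next expression */
static int h(int a, int b) {
    int t = a * 2;
    return t + b;
}

/* after forward substitution: the defining expression replaces the read */
static int h_fs(int a, int b) {
    return a * 2 + b;
}
```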
goto elimination is the process of replacing goto instructions by a hierarchical control
flow graph.

inlining † is a function transformation. Inlining a function foo in its caller bar consists
in substituting the calls to foo in bar by the function body, after replacement
of the formal parameters by their effective parameters.

instruction selection † is the process of mapping parts of the Internal Representation to
machine instructions.

invariant code motion is a loop transformation that moves out of the loop the code
from its body that is independent of the iteration.

iteration clamping is a loop transformation that extends the loop range but guards the
loop body with the former range.

loop fusion is a loop transformation that replaces two loops by a single loop whose body
is the concatenation of the bodies of the two initial loops.
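A hand-written sketch of loop fusion (not pips output); fusion is legal here because the second body only reads values the first body produced in the same iteration:

```c
#include <assert.h>
#define N 8

/* before: two loops over the same range */
static void scale_then_shift(int *a, int *b) {
    for (int i = 0; i < N; i++) a[i] = 2 * a[i];
    for (int i = 0; i < N; i++) b[i] = a[i] + 1;
}

/* after loop fusion: one loop whose body concatenates both bodies */
static void scale_then_shift_fused(int *a, int *b) {
    for (int i = 0; i < N; i++) {
        a[i] = 2 * a[i];
        b[i] = a[i] + 1;
    }
}
```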
loop interchange is a loop transformation that permutes two loops of a loop nest.

loop normalization is a loop transformation that changes the loop initial value, increment
or range to enforce certain values, generally an increment of 1.

loop rerolling finds manually unrolled loops and replaces them by their non-unrolled
version.

loop tiling is a loop nest transformation that changes the loop execution order through a
partition of the iteration space into chunks, so that iteration is performed over
the chunks and inside each chunk.
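A one-dimensional hand-written sketch of loop tiling (not pips output): the outer loop enumerates tiles, the inner loop iterates inside each tile.

```c
#include <assert.h>
#define N 16
#define T 4   /* tile size; N is assumed to be a multiple of T */

/* before: a single loop over the iteration space */
static long total(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

/* after loop tiling: iterate over tiles, then inside each tile */
static long total_tiled(const int *a) {
    long s = 0;
    for (int it = 0; it < N; it += T)
        for (int i = it; i < it + T; i++)
            s += a[i];
    return s;
}
```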
loop unrolling is a loop transformation. Unrolling a loop by a factor of n consists in
substituting the loop body by itself, replicated n times. A prelude and/or postlude
are added to preserve the number of iterations.
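A hand-written sketch of unrolling by a factor of 2 (not pips output), with a postlude that preserves the iteration count when n is odd:

```c
#include <assert.h>

/* before unrolling */
static int dot(const int *a, const int *b, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* after unrolling by a factor of 2 */
static int dot_unrolled(const int *a, const int *b, int n) {
    int s = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s += a[i] * b[i];
        s += a[i + 1] * b[i + 1];
    }
    for (; i < n; i++)  /* postlude preserves the number of iterations */
        s += a[i] * b[i];
    return s;
}
```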
loop unswitching † is a loop transformation that replaces a loop containing a test independent
of the loop execution by a test containing the loop, without the test, in
both the true and false branches.
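In this hand-written sketch (not pips output), the test on offset does not depend on the loop, so it is hoisted and the loop duplicated in each branch:

```c
#include <assert.h>
#define N 8

/* before: the test on 'offset' is independent of the loop */
static void add(int *a, int offset) {
    for (int i = 0; i < N; i++) {
        if (offset)
            a[i] += offset;
        else
            a[i] += i;
    }
}

/* after loop unswitching: the loop appears, without the test,
   in both branches */
static void add_unswitched(int *a, int offset) {
    if (offset)
        for (int i = 0; i < N; i++) a[i] += offset;
    else
        for (int i = 0; i < N; i++) a[i] += i;
}
```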
memory footprint reduction † is the process of tiling a loop to make sure the iteration
over each tile has a memory footprint bounded by a given value.

n-address code generation † is the process of splitting complex expressions into simpler
ones that take at most n operands.

outlining † is the process of extracting part of a function body into a new function and
replacing it in the initial function by a function call.

parallelism detection is a common name for analyses that detect whether a loop can be run in
parallel.

parallelism extraction is a common name for code transformations that modify loop
nests to make it legal to run them in parallel.

privatization is the process of detecting variables that are private to a loop body, i.e.
written first, then read.

reduction detection is an analysis that identifies statements that perform a reduction
over a variable.

redundant load-store elimination † is an interprocedural transformation that optimizes
data transfers by delaying and merging them.

scalar renaming † is the process of renaming scalar variables to suppress false data dependencies.
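A hand-written sketch of scalar renaming (not pips output): the reuse of t creates a false dependence between two otherwise independent computations, which renaming removes.

```c
#include <assert.h>

/* before: t is reused, creating a false (output) dependence
   between the two computations */
static int twice(int a, int b) {
    int t = a * 2;
    int x = t + 1;
    t = b * 3;        /* same scalar reused */
    int y = t + 1;
    return x + y;
}

/* after scalar renaming: each value gets its own scalar,
   so the two computations are independent */
static int twice_renamed(int a, int b) {
    int t0 = a * 2;
    int x = t0 + 1;
    int t1 = b * 3;
    int y = t1 + 1;
    return x + y;
}
```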
split update operator † is the process of replacing an update operator by its expanded
form.

statement isolation † is the process of replacing all variables referenced in a statement by
newly declared variables. A prologue and an epilogue are added to copy old variable
values to the new variables, back and forth.

strength reduction † is the process of replacing an operation by an operation of lower
cost.
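A classic hand-written instance of strength reduction (not pips output): the per-iteration multiplication 8 * i is replaced by a cheaper running addition.

```c
#include <assert.h>

/* before: one multiplication per iteration */
static long ramp(int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += 8 * i;
    return s;
}

/* after strength reduction: the multiplication becomes
   an induction variable updated by addition */
static long ramp_sr(int n) {
    long s = 0, t = 0;
    for (int i = 0; i < n; i++) {
        s += t;
        t += 8;
    }
    return s;
}
```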
Acronyms<br />
api Application Programming Interface

asic Application-Specific Integrated Circuit

asip Application-Specific Instruction set Processor

ast Abstract Syntax Tree

avx Advanced Vector eXtensions

cisc Complex Instruction Set Computer

cli Command Line Interface

cpu Central Processing Unit

cri Centre de Recherche en Informatique

cuda Compute Unified Device Architecture

dma Direct Memory Access

dsp Digital Signal Processing

flops FLoating point Operations per Second

fma Fused Multiply-Add

fpga Field Programmable Gate Array

fpu Floating-Point Unit

fsa Fusion System Architecture

gcc gnu C Compiler

gpgpu General Purpose gpu

gpp General Purpose Processor
gpu Graphical Processing Unit

hcfg Hierarchical Control Flow Graph

hdl Hardware Description Language

hmpp Hybrid Multicore Parallel Programming

hpc High Performance Computing

hpec High Performance Embedded Computing

icc Intel C++ Compiler

ilp Instruction Level Parallelism

ir Internal Representation

isa Instruction Set Architecture

jit Just In Time

llvm Low Level Virtual Machine

mimd Multiple Instruction stream, Multiple Data stream

mis Multimedia Instruction Set

mkl Math Kernel Library

mmx Matrix Math eXtension

mp-soc MultiProcessor System-on-Chip

mpi Message Passing Interface

oop Object Oriented Programming

oo Object Oriented

opencl Open Computing Language

opengl Open Graphics Library

openmp Open Multi Processing

pci Peripheral Component Interconnect

pe Processing Element

pips Paralléliseur Interprocédural de Programmes Scientifiques
ps3 PlayStation 3

ptx Parallel Thread eXecution

pu Processing Unit

pocc Polyhedral Compiler Collection

pyps Pythonic PIPS

ram Random Access Memory

rom Read Only Memory

rpc Remote Procedure Call

sdk Software Development Kit

simd Single Instruction stream, Multiple Data stream

sisd Single Instruction stream, Single Data stream

sloc Source Lines Of Code

slp Super-word Level Parallelism

soc System on Chip

ssa Static Single Assignment

sse Streaming simd Extension

tr Textual Representation

ulp Unit in the Last Place

vhdl vhsic Hardware Description Language

vliw Very Long Instruction Word
Bibliography<br />
[AAC + 11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge<br />
Guel<strong>to</strong>n, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon.<br />
PIPS Is not (only) Polyhedral Software. In First International Workshop on<br />
Polyhedral Compilation Techniques, IMPACT, April 2011.<br />
[ABC + 06] Krste Asanovic, Ras Bodik, Bryan Chris<strong>to</strong>pher Catanzaro, Joseph James<br />
Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester<br />
Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The<br />
landscape of parallel computing research: A view from berkeley. Technical<br />
Report UCB/EECS-2006-183, EECS Department, University of Cali<strong>for</strong>nia,<br />
Berkeley, 2006.<br />
[ABCR10] Joshua S. Auerbach, David F. Bacon, Perry Cheng, and Rodric M. Rabbah.<br />
Lime: a Java-compatible and synthesizable language <strong>for</strong> heterogeneous<br />
architectures. In Proceedings of the 25th Annual SIGPLAN Conference on<br />
Object-Oriented Programming, Systems, Languages, and Applications, OOP-<br />
SLA, pages 89–108, New York, NY, USA, Oc<strong>to</strong>ber 2010. ACM.<br />
[ACIK97] Corinne Ancourt, Fabien Coelho, François Irigoin, and Ronan Keryell. A linear<br />
algebra framework <strong>for</strong> static High Per<strong>for</strong>mance Fortran code distribution.<br />
Scientific Programming, 6(1):3–27, 1997.<br />
[AJ75] Alfred V. Aho and Stephen C. Johnson. Optimal code generation <strong>for</strong> expression<br />
trees. In Proceedings of seventh annual symposium on Theory of<br />
computing, STOC, pages 207–217, New York, NY, USA, 1975. ACM.<br />
[AK87] Randy Allen and Ken Kennedy. Au<strong>to</strong>matic translation of FORTRAN programs<br />
<strong>to</strong> vec<strong>to</strong>r <strong>for</strong>m. Transactions on Programming Languages and Systems,<br />
9:491–542, 1987.<br />
[AKPW83] John R. Allen, Ken Kennedy, Carrie Porterfield, and Joe D. Warren. Conversion<br />
of control dependence <strong>to</strong> data dependence. In Principles of Programming<br />
Languages, POPL, pages 177–189, 1983.<br />
[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. <strong>Compilers</strong>:<br />
Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006.<br />
[Amd67] Gene M. Amdahl. Validity of the single processor approach <strong>to</strong> achieving<br />
large scale computing capabilities. In Proceedings of the spring joint computer<br />
conference, AFIPS, pages 483–485, New York, NY, USA, April 1967. ACM.<br />
183
184 BIBLIOGRAPHY<br />
[AMG + 99] Eduard Ayguade, MarcGonzalez, Marc Gonzalez, Jesus Labarta, Xavier Mar<strong>to</strong>rell,<br />
Nacho Navarro, and Jose Oliver. NanosCompiler: A research plat<strong>for</strong>m<br />
<strong>for</strong> OpenMP extensions. In In First European Workshop on OpenMP, pages<br />
27–31, 1999.<br />
[API03] Kubilay Atasu, Laura Pozzi, and Paolo Ienne. Au<strong>to</strong>matic application-specific<br />
instruction-set extensions under microarchitectural constraints. International<br />
Journal of Parallel Programming, 31(6):411–428, 2003.<br />
[AR97] Rumen Andonov and Sanjay V. Rajopadhye. Optimal orthogonal tiling of 2-d<br />
iterations. Journal of Parallel Distributed Computing, 45(2):159–165, September<br />
1997.<br />
[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. <strong>Compilers</strong>: Princiles,<br />
Techniques, and Tools. Addison-Wesley, 1986.<br />
[ATNW09] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André<br />
Wacrenier. StarPU: A unified plat<strong>for</strong>m <strong>for</strong> task scheduling on heterogeneous<br />
multicore architectures. In European Conference on Parallel Processing, Euro-<br />
Par, pages 863–874, 2009.<br />
[Aß96] Uwe Aßmann. How <strong>to</strong> uni<strong>for</strong>mly specify program analysis and trans<strong>for</strong>mation<br />
with graph rewrite systems. In Proceedings of the 6th International Conference<br />
on Compiler Construction, CC, pages 121–135, London, UK, 1996.<br />
Springer-Verlag.<br />
[Bac57] John W. Backus. The FORTRAN Au<strong>to</strong>matic Coding System <strong>for</strong> the IBM 704<br />
EDPM. International Business Machines Corporation (IBM), 1957.<br />
[Bas04] Cédric Bas<strong>to</strong>ul. Code generation in the polyhedral model is easier than you<br />
think. In International Conference on Parallel Architecture and Compilation<br />
Techniques, PACT, pages 7–16, Juan-les-Pins, France, 2004. IEEE Computer<br />
Society Press.<br />
[BB09] Francois Bodin and Stephane Bihan. <strong>Heterogeneous</strong> multicore parallel programming<br />
<strong>for</strong> graphics processing units. Scientific Programming, 17:325–336,<br />
December 2009.<br />
[BBK + 08] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy,<br />
J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework<br />
<strong>for</strong> optimization of affine loop nests <strong>for</strong> GPGPUs. In Proceedings of the 22nd<br />
annual international conference on Supercomputing, ICS, pages 225–234, New<br />
York, NY, USA, 2008. ACM.<br />
[BDH + 10] André Rigland Brodtkorb, Chris<strong>to</strong>pher Dyken, Trond Runar Hagen, Jon M.<br />
Hjelmervik, and Olaf O. S<strong>to</strong>raasli. State-of-the-art in heterogeneous computing.<br />
Scientific Programming, 18(1):1–33, 2010.<br />
[Ber66] Arthur J. Bernstein. Analysis of programs <strong>for</strong> parallel processing. Transactions<br />
on Electronic Computers, pages 757 –762, 1966.
BIBLIOGRAPHY 185<br />
[BFH + 04] Ian Buck, Tim Foley, Daniel Reiter Horn, Jeremy Sugerman, Kayvon Fatahalian,<br />
Mike Hous<strong>to</strong>n, and Pat Hanrahan. Brook <strong>for</strong> GPUs: stream computing<br />
on graphics hardware. Transactions on Graphics, 23(3):777–786, 2004.<br />
[BGGT02] Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Au<strong>to</strong>matic detection<br />
of saturation and clipping idioms. In International Workshop on Languages<br />
and <strong>Compilers</strong> <strong>for</strong> Parallel Computing, LNCS, pages 61–74. Springer-<br />
Verlag, 2002.<br />
[BGS94] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler trans<strong>for</strong>mations<br />
<strong>for</strong> high-per<strong>for</strong>mance computing. Computing Surveys, 26(4):345–420,<br />
1994.<br />
[Bik04] Aart J. C. Bik. The Software Vec<strong>to</strong>rization Handbook: Applying Intel Multimedia<br />
Extensions <strong>for</strong> Maximum Per<strong>for</strong>mance. Intel Press, 2004.<br />
[BJK + 95] Robert D. Blumofe, Chris<strong>to</strong>pher F. Joerg, Bradley C. Kuszmaul, Charles E.<br />
Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded<br />
runtime system. In Journal of Parallel and Distributed Computing, JPDC,<br />
pages 207–216, New York, NY, USA, 1995. ACM.<br />
[BL10] Nicolas Benoit and Stéphane Louise. Extending GCC with a multi-grain<br />
parallelism adaptation framework <strong>for</strong> MPSoCs. In GCC <strong>for</strong> Research Opportunities<br />
Workshop, January 2010.<br />
[Ble89] Guy E. Blelloch. Scans as primitive parallel operations. Transactions on<br />
Computers, 38(11):1526–1538, November 1989.<br />
[BLE + 08] Philippe Bonnot, Fabrice Lemonnier, Gilbert Edelin, Gérard Gaillat, Olivier<br />
Ruch, and Pascal Gauget. Definition and SIMD implementation of a multiprocessing<br />
architecture approach on FPGA. In Design Au<strong>to</strong>mation and Test<br />
in Europe, DATE, pages 610–615. IEEE Computer Society Press, 2008.<br />
[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure<br />
calls. Transactions on Computer Systems, 2:39–59, 1984.<br />
[Bre10] Tony M. Brewer. Instruction set innovations <strong>for</strong> the Convey HC-1 computer.<br />
IEEE Micro, 30(2):70–79, 2010.<br />
[BSB + 01] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau L. Boloni, Muthucumaru<br />
Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D.<br />
Theys, Bin Yao, Debra Hensgen, and Richard F. Freund. A comparison of<br />
eleven static heuristics <strong>for</strong> mapping a class of independent tasks on<strong>to</strong> heterogeneous<br />
distributed computing systems. Journal of Parallel Distributed<br />
Computing, 61:810–837, June 2001.<br />
[CDL11] Alexandre Cornu, Steven Derrien, and Dominique Lavenier. HLS <strong>to</strong>ols <strong>for</strong><br />
FPGA: Faster development with better per<strong>for</strong>mance. In Reconfigurable Computing:<br />
Architectures, Tools and Applications - 7th International Symposium,<br />
volume 6578 of LNCS, pages 67–78, Belfast, UK, March 2011. Springer.
186 BIBLIOGRAPHY<br />
[CDM + 10] Hassan Chafi, Zach DeVi<strong>to</strong>, Adriaan Moors, Tiark Rompf, Arvind K. Sujeeth,<br />
Pat Hanrahan, Martin Odersky, and Kunle Olukotun. Language virtualization<br />
<strong>for</strong> heterogeneous parallel computing. In Proceedings of the international<br />
conference on Object oriented programming systems languages and applications,<br />
OOPSLA, pages 835–847, New York, NY, USA, 2010. ACM.<br />
[CDMC + 05] Cristian Coarfa, Yuri Dotsenko, John M. Mellor-Crummey, François Can<strong>to</strong>nnet,<br />
Tarek A. El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel G.<br />
Chavarría-Miranda. An evaluation of global address space languages: coarray<br />
Fortran and unified parallel C. In SIGPLAN Annual Symposium on<br />
Principles and Practice of Parallel Programming, PPOPP, pages 36–47, New<br />
York, NY, USA, 2005. ACM.<br />
[CGO11] Proceedings of the International Symposium on Code Generation and Optimization<br />
(CGO), New York, NY, USA, April 2011. ACM.<br />
[CH89] Pohua P. Chang and W.-W. Hwu. Inline function expansion <strong>for</strong> compiling<br />
C programs. In Proceedings of the SIGPLAN Conference on Programming<br />
language design and implementation, PLDI, pages 246–257, New York, NY,<br />
USA, June 1989. ACM.<br />
[Che94] Wasel Chemij. Parallel Computer Taxonomy. PhD thesis, Aberystwyth University,<br />
1994.<br />
[CJIA11] Fabien Coelho, Pierre Jouvelot, François Irigoin, and Corinne Ancourt. Data<br />
and process abstraction in PIPS internal representation. In Workshop on<br />
Internal Representations, WIR, Chamonix, France, April 2011.<br />
[Cla96] Philippe Clauss. Counting solutions <strong>to</strong> linear and nonlinear constraints<br />
through ehrhart polynomials: applications <strong>to</strong> analyze and trans<strong>for</strong>m scientific<br />
programs. In Proceedings of the 10th international conference on Supercomputing,<br />
ICS, pages 278–285, New York, NY, USA, May 1996. ACM.<br />
[Coe93] Fabien Coehlo. Étude de la Compilation du High Per<strong>for</strong>mance Fortran. PhD<br />
thesis, Université Paris VI, 1993.<br />
[Con] Embedded Microprocessor Benchmark Consortium. Coremark. http://www.<br />
coremark.org.<br />
[Coo04] Keith D. Cooper. Evolving the next generation of compilers. keynote talk at<br />
CGO’04, 2004.<br />
[Cre96] Béatrice Creusillet. Array Region Analyses and Applications. PhD thesis,<br />
MINES ParisTech, 1996.<br />
[CSY10] Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang. An au<strong>to</strong>matic superword<br />
vec<strong>to</strong>rization in LLVM. In 16th Workshop on Compiler Techniques <strong>for</strong> High-<br />
Per<strong>for</strong>mance and Embedded Computing, pages 19–27, Taipei, 2010.<br />
[Dal09] William J. Dally. The end of denial architecture and the rise of throughput<br />
computing. In Design Au<strong>to</strong>mation Conference, San Francisco, CA, USA, July<br />
2009.
BIBLIOGRAPHY 187<br />
[Dar99] Alain Darte. On the complexity of loop fusion. In Proceedings of the 1999 International<br />
Conference on Parallel Architectures and Compilation Techniques,<br />
PACT, pages 149–, Washing<strong>to</strong>n, DC, USA, September 1999. IEEE Computer<br />
Society Press.<br />
[DKK + 99] Carole Dulong, Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei<br />
Li, John Ng, and David Sehr. An overview of the Intel IA-64 compiler. Intel<br />
Technology Journal, 1999.<br />
[DKYC10] Gregory Frederick Diamos, Andrew Kerr, Sudhakar Yalamanchili, and<br />
Nathan Clark. Ocelot: a dynamic optimization framework <strong>for</strong> bulksynchronous<br />
applications in heterogeneous systems. In 19th International<br />
Conference on Parallel Architecture and Compilation Techniques, PACT,<br />
pages 353–364. ACM, September 2010.<br />
[DLP03] Jack Dongarra, Piotr Luszczek, and An<strong>to</strong>ine Petitet. The LINPACK benchmark:<br />
past, present and future. Concurrency and Computation: Practice and<br />
Experience, 15(9):803–820, 2003.<br />
[DMM + ] Steven Derrien, Daniel Ménard, Kevin Martin, An<strong>to</strong>ine Floch, An<strong>to</strong>ine Morvan,<br />
Adeel Pasha, Patrice Quin<strong>to</strong>n, Amit Kumar, and Loïc Cloatre. GeCoS:<br />
Generic compiler suite. http://gecos.g<strong>for</strong>ge.inria.fr.<br />
[DSV96] Alain Darte, Georges-André Silber, and Frédéric Vivien. Combining retiming<br />
and scheduling techniques <strong>for</strong> loop parallelization and loop tiling, 1996.<br />
[Dun90] R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5–<br />
16, February 1990.<br />
[DUSsH93] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan shin Hwang. Communication<br />
optimizations <strong>for</strong> irregular scientific computations on distributed memory architectures.<br />
Journal of Parallel and Distributed Computing, 22:462–479, 1993.<br />
[ER08] Eric Eide and John Regehr. Volatiles are miscompiled, and what <strong>to</strong> do about<br />
it. In International Workshop on Embedded Systems, pages 255–264, 2008.<br />
[Ero95] Ana Maria Erosa. A go<strong>to</strong>-elimination method and its implementation <strong>for</strong> the<br />
McCat C compiler. Thesis (m.s.), McGill University, Montreal, Canada, May<br />
1995.<br />
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation<br />
of the Cilk-5 multithreaded language. In Proceedings of the SIGPLAN<br />
Conference on Program Language Design and Implementation, PLDI, pages<br />
212–223, 1998.<br />
[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness.<br />
Transactions on Computers, C-21(9):948–960, 1972.<br />
[FO03] Björn Franke and Michael F. P. O’Boyle. Array recovery and high-level trans<strong>for</strong>mations<br />
<strong>for</strong> DSP applications. Transactions in Embedded Computing Systems,<br />
2(2):132–162, 2003.
188 BIBLIOGRAPHY<br />
[Fra03] cois Ferrand Fran˙ Optimization and code parallelization <strong>for</strong> processors with<br />
multimedia SIMD instructions. Technical report, Télécom Bretagne, August<br />
2003. master thesis report.<br />
[GCB07] Gildas Genest, Richard Chamberlain, and Robin J. Bruce. Programming an<br />
FPGA-based super computer using a C-<strong>to</strong>-VHDL compiler: DIME-C. In<br />
Adaptive Hardware and Systems (AHS), pages 280–286, 2007.<br />
[GG] J. L. Gustafson and B. S. Greer. ClearSpeed whitepaper: accelerating the<br />
Intel Math Kernel Library. http://www.clearspeed.com/docs/resources/<br />
ClearSpeedIntelWhitepaperFeb07.pdf.<br />
[GLGP06] Gert Goossens, Dirk Lanneer, Werner Geurts, and Johan Van Praet. Design<br />
of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable <strong>to</strong>ol<br />
suite. International Symposium on System-on-Chip, 2006.<br />
[GNB08] Zhi Guo, Walid Najjar, and Betul Buyukkurt. Efficient hardware code generation<br />
<strong>for</strong> FPGAs. Transactions on Architecture and Code Optimization,<br />
5(1):1–26, 2008.<br />
[GO03] Etienne Gaudrain and Yann Orlarey. A Faust tutorial. Technical report,
GRAME, September 2003.<br />
[GPZ+01] María Jesús Garzarán, Milos Prvulovic, Ye Zhang, Josep Torrellas, Alin Jula, Hao Yu, and Lawrence Rauchwerger. Architectural support for parallel reductions in scalable shared-memory multiprocessors. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT, pages 243–, Washington, DC, USA, 2001. IEEE Computer Society Press.
[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC. Network
Theory Ltd., 2004.<br />
[GZA + 11] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin<br />
Größlinger, and Louis-Noël Pouchet. Polly - polyhedral optimization in<br />
LLVM. In First International Workshop on Polyhedral Compilation Techniques,<br />
IMPACT, 2011.<br />
[HEL + 09] Manuel Hohenauer, Felix Engel, Rainer Leupers, Gerd Ascheid, and Heinrich<br />
Meyr. A SIMD optimization framework for retargetable compilers. Transactions
on Architecture and Code Optimization, 6(1), 2009.<br />
[HF11] Matt J. Harvey and Gianni De Fabritiis. Swan: A tool for porting CUDA programs to OpenCL. Computer Physics Communications, 182(4):1093–1099,
2011.<br />
[HP06] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth<br />
Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San<br />
Francisco, CA, USA, 2006.<br />
[HRF + 10] Ever<strong>to</strong>n Hermann, Bruno Raffin, François Faure, Thierry Gautier, and<br />
Jérémie Allard. Multi-GPU and multi-CPU parallelization <strong>for</strong> interactive
physics simulations. In Euro-Par, volume 6272 of LNCS, pages 235–246.<br />
Springer, 2010.<br />
[HRTV11] Robert Hundt, Easwaran Raman, Martin Thuresson, and Neil Vachharajani.<br />
MAO – an extensible micro-architectural optimizer. In CGO [CGO11].<br />
[Ier06] Rober<strong>to</strong> Ierusalimschy. Programming in Lua, Second Edition. Lua.Org, 2006.<br />
[IJT91] François Irigoin, Pierre Jouvelot, and Rémi Triolet. Semantical interprocedural<br />
parallelization: an overview of the PIPS project. In International<br />
Conference on Supercomputing, ICS, pages 244–251, 1991.<br />
[iLJE03] Sang-Ik Lee, Troy A. Johnson, and Rudolf Eigenmann. Cetus - an extensible compiler infrastructure for source-to-source transformation. In 16th International Workshop on Languages and Compilers for Parallel Computing, volume 2958 of LNCS, pages 539–553, College Station, TX, USA, 2003.
[INR] INRIA. Aladdin-g5k. https://www.grid5000.fr.<br />
[IS99] Liviu If<strong>to</strong>de and Jaswinder Pal Singh. Shared Virtual Memory: Progress<br />
and Challenges. In Proceedings of the IEEE, pages 498–507. IEEE Computer<br />
Society Press, 1999.<br />
[ISCKG10] François Irigoin, Frédérique Silber-Chaussumier, Ronan Keryell, and Serge Guelton. PIPS tutorial at PPoPP 2010. http://pips4u.org/doc/tutorial, 2010.
[ISO99] ISO. ISO/IEC 9899 Programming languages — C. ISO, 1999.
[ISO08] ISO. ISO/IEC TR 18037:2008 Programming languages — C — Extensions to support embedded processors. ISO, 2008.
[IT88] François Irigoin and Rémi Triolet. Supernode partitioning. In Proceedings<br />
of the 15th SIGPLAN-SIGACT symposium on Principles of programming<br />
languages, POPL, pages 319–329, New York, NY, USA, 1988. ACM.<br />
[JD89] Pierre Jouvelot and Babak Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In International
Conference on Supercomputing, ICS, pages 186–194, 1989.<br />
[JM99] Simon Pey<strong>to</strong>n Jones and Simon Marlow. Secrets of the Glasgow Haskell<br />
compiler inliner. In Journal of Functional Programming, page 2002, 1999.<br />
[JMH + 05] Weihua Jiang, Chao Mei, Bo Huang, Jianhui Li, Jiahua Zhu, Binyu Zang,<br />
and Chuanqi Zhu. Boosting the performance of multimedia applications using
SIMD instructions. In International Conference on Compiler Construction,<br />
CC, pages 59–75, 2005.<br />
[JPJ + 11] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson,<br />
Stephen R. Beard, and David I. August. Automatic CPU–GPU communication
management and optimization. In Proceedings of the 32nd SIGPLAN<br />
conference on Programming language design and implementation, PLDI,<br />
pages 142–151, New York, NY, USA, June 2011. ACM.
[JRR99] Simon L. Peyton Jones, Norman Ramsey, and Fermin Reig. C--: A portable
assembly language that supports garbage collection. In Principles and Practice<br />
of Declarative Programming, International Conference, LNCS, pages 1–<br />
28, Paris, France, September 1999. Springer.<br />
[KBM07] Volodymyr V. Kindratenko, Robert J. Brunner, and Adam D. Myers. Mitrion-<br />
C application development on SGI Altix 350/RC100. In International Symposium<br />
on Field-Programmable Custom Computing Machines, FCCM, pages
239–250. IEEE Computer Society Press, 2007.<br />
[Ker03] Brian Kernighan. Interview with Brian Kernighan. Linux Journal, July 2003.
http://www.linuxjournal.com/article/7035.<br />
[KK92] Ken Kennedy and Kathryn S. McKinley. Optimizing for parallelism and data
locality. In International Conference on Supercomputing, ICS, pages 323–334,<br />
New York, NY, USA, 1992. ACM.<br />
[KKS00] Ki-Il Kum, Jiyang Kang, and Wonyong Sung. AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital
signal processors. Circuits and Systems II: Analog and Digital Signal<br />
Processing, 47(9):840–848, September 2000.<br />
[KOWG10] Khronos OpenCL Working Group. The OpenCL Specification, version 1.1,<br />
2010.<br />
[KRS90] Clyde Kruskal, Larry Rudolph, and Marc Snir. Efficient parallel algorithms<br />
<strong>for</strong> graph problems. Algorithmica, 5:43–64, 1990.<br />
[KS99] Kazuhiro Kusano and Mitsuhisa Sato. A comparison of automatic parallelizing compiler and improvements by compiler directives. In International Symposium on High Performance Computing, ISHPC, pages 95–108, London,
UK, 1999. Springer-Verlag.<br />
[KSA+10] Kazuhiko Komatsu, Katsuto Sato, Yusuke Arai, Kentaro Koyama, Hiroyuki Takizawa, and Hiroaki Kobayashi. Evaluating performance and portability of OpenCL programs. In The Fifth International Workshop on Automatic Performance Tuning, June 2010.
[LA00] Samuel Larsen and Saman P. Amarasinghe. Exploiting superword level parallelism<br />
with multimedia instruction sets. In Programming Language Design<br />
and Implementation, PLDI, pages 145–156, 2000.<br />
[LA03] Chris Lattner and Vikram Adve. Architecture for a next-generation GCC.
In Proceedings of First Annual GCC Developers’ Summit, Ottawa, Canada,<br />
May 2003.<br />
[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, CGO, Palo Alto, California, 2004.
[Lam74] Leslie Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, 1974.
[Lat11] Chris Lattner. LLVM. In Amy Brown and Greg Wilson, editors, The Architecture of Open Source Applications, chapter 11. 2011. http://www.aosabook.org.
[Lei92] F. Thomson Leighton. Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[LF80] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal<br />
of the ACM, 27:831–838, 1980.<br />
[LRW05] James Lebak, Albert Reuther, and Edmund Wong. Polymorphous computing<br />
architecture kernel-level benchmarks. Technical Report Project Report PCA-<br />
KERNEL-1, MIT Lincoln Laboratory, Lexington, MA, 2005.
[LVM+10] Allen Leung, Nicolas Vasilache, Benoît Meister, Muthu Baskaran, David Wohlford, Cédric Bastoul, and Richard Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation
on Graphics Processing Units, GPGPU, pages 51–61, New York, NY, USA,<br />
2010. ACM.<br />
[LWFK02] S.M. Loo, B.E. Wells, N. Freije, and J. Kulick. Handel-C for rapid prototyping
of VLSI coprocessors <strong>for</strong> real time systems. In Proceedings of the Thirty-<br />
Fourth Southeastern Symposium on System Theory, pages 6–10, 2002.<br />
[MAB+10] Harm Munk, Eduard Ayguadé, Cédric Bastoul, Paul Carpenter, Zbigniew Chamski, Albert Cohen, Marco Cornero, Philippe Dumont, Marc Duranton, Mohammed Fellahi, Roger Ferrer, Razya Ladelsky, Menno Lindwer, Xavier Martorell, Cupertino Miranda, Dorit Nuzman, Andrea Ornstein, Antoniu Pop, Sebastian Pop, Louis-Noël Pouchet, Alex Ramírez, David Ródenas, Erven Rohou, Ira Rosen, Uzi Shvadron, Konrad Trifunović, and Ayal Zaks. Acotes project: Advanced compiler technologies for embedded streaming. International
Journal of Parallel Programming, 2010. Special issue on European<br />
HiPEAC network of excellence members projects. To appear.<br />
[Mas92] Vadim Maslov. Delinearization: an efficient way to break multiloop dependence equations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pages 152–161, 1992.
[McB94] Oliver A. McBryan. An overview of message passing environments. Parallel<br />
Computing, 20:417–444, 1994.<br />
[MMG08] Peter Messmer, Paul J. Mullowney, and Brian E. Granger. GPULib: GPU<br />
computing in high-level languages. Computing in Science and Engineering,<br />
10:70–73, 2008.<br />
[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits.
Electronics, 38(8), April 1965.<br />
[Muc97] Steven S. Muchnick. Advanced compiler design and implementation, chapter<br />
13. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[Ngu02] Thi Viet Nga Nguyen. Efficient and Effective Software Verifications for Scientific
Applications using Static Analysis and Code Instrumentation. PhD<br />
thesis, MINES ParisTech, 2002.<br />
[NMRW02] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer.<br />
CIL: Intermediate language and tools for analysis and transformation of C
programs. In Compiler Construction, volume 2304 of LNCS, pages 213–228.<br />
Springer, April 2002.<br />
[Nov06] Diego Novillo. GCC - an architectural overview, current status and future<br />
directions. Ottawa Linux Symposium, July 2006.<br />
[NRD + 11] Dorit Nuzman, Ira Rosen, Sergei Dyshel, Ayal Zaks, Erven Rohou, Kevin<br />
Williams, Albert Cohen, and David Yuste. Vapor SIMD: Auto-vectorize once,
run everywhere. In CGO [CGO11].<br />
[NSL + 11] Chris J. Newburn, Byoungro So, Zhenying Liu, Michael D. McCool, Anwar M.<br />
Ghuloum, Stefanus Du Toit, Zhi-Gang Wang, Zhaohui Du, Yongjian Chen,<br />
Gansha Wu, Peng Guo, Zhanglin Liu, and Dan Zhang. Intel's Array Building
Blocks: A retargetable, dynamic compiler and embedded language. In CGO<br />
[CGO11], pages 224–235.<br />
[NVI10] NVIDIA. PTX: Parallel Thread Execution ISA Version 2.1, NVIDIA compute<br />
edition, April 2010.<br />
[NVI11] NVIDIA. NVIDIA CUDA Reference Manual 3.2. http://www.nvidia.com/<br />
object/cuda_develop.html, 2011.<br />
[OOVV05] Karina Olmos and Eelco Visser. Composing source-to-source data-flow transformations with rewriting strategies and dependent
dynamic rewrite rules. In 14th International Conference on Compiler<br />
Construction, volume 3443 of LNCS, pages 204–220. Springer-Verlag, 2005.<br />
[Ope11] OpenMP Architecture Review Board. OpenMP Application Program Interface,<br />
2011.<br />
[Orf95] Sophocles J. Orfanidis. Introduction to signal processing. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1995.<br />
[PAB + 06] Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay<br />
Gupta, H. Peter Hofstee, Paul E. Harvey, Charles R. Johns, James A. Kahle,<br />
Atsushi Kameyama, John M. Keaty, Bob Le, Sang Lee, Tuyen V. Nguyen,<br />
John G. Petrovick, Mydung Pham, Juergen Pille, Stephen D. Posluszny,<br />
Mack W. Riley, Joseph Verock, James D. Warnock, Steve Weitzel, and Dieter<br />
F. Wendel. Key features of the design methodology enabling a multi-core<br />
SoC implementation of a first-generation CELL processor. In Asia and South<br />
Pacific Design Au<strong>to</strong>mation Conference, ASP-DAC, pages 871–878, 2006.<br />
[Pat10] David A. Patterson. The trouble with multicore. IEEE Spectrum, 2010.<br />
[PBB10] Louis-Noël Pouchet, Cédric Bastoul, and Uday Bondhugula. PoCC: the Polyhedral
Compiler Collection, 2010. http://pocc.sf.net.
[PBdD11] Artur Pietrek, Florent Bouchez, and Benoît Dupont de Dinechin. Tirex: A<br />
textual target-level intermediate representation for compiler exchange. In
Workshop on Intermediate Representations, WIR, Chamonix, France, April<br />
2011.<br />
[PBSB04] Gilles Pokam, Stéphane Bihan, Julien Simonnet, and François Bodin.<br />
SWARP: a retargetable preprocessor for multimedia instructions. Concurrency
and Computation: Practice and Experience, 16(2-3):303–318, 2004.<br />
[PBV06] E. Moscu Panainte, K.L.M. Bertels, and S. Vassiliadis. Interprocedural compiler optimization for partial run-time reconfiguration. Journal of VLSI Signal Processing, pages 161–172, 2006.
[PBV07] Elena Moscu Panainte, Koen Bertels, and Stamatis Vassiliadis. The Molen<br />
compiler <strong>for</strong> reconfigurable processors. Transactions in Embedded Computing<br />
Systems, 2007.<br />
[PEH + 93] David A. Padua, Rudolf Eigenmann, Jay Hoeflinger, Paul Petersen, Peng Tu,<br />
Stephen Weather<strong>for</strong>d, and Keith Faigin. Polaris: A new-generation parallelizing<br />
compiler <strong>for</strong> MPPs. Technical Report 1306, University of Illinois, Center<br />
<strong>for</strong> Supercomputing Research and Development, 1993.<br />
[PGS+09] Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-mei W. Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the 7th Symposium on
Application Specific Processors, pages 35–42. IEEE Computer Society Press,<br />
July 2009.<br />
[PH09] David A. Patterson and John L. Hennessy. Computer organization and design:<br />
the hardware/software interface. Morgan Kaufmann Publishers, 2009.<br />
[Qui00] Daniel J. Quinlan. ROSE: Compiler support for object-oriented frameworks.
Parallel Processing Letters, 10(2/3):215–226, 2000.<br />
[RBR + 05] Noam Rinetzky, Jörg Bauer, Thomas W. Reps, Shmuel Sagiv, and Reinhard<br />
Wilhelm. A semantics for procedure local heaps and its abstractions.
In Proceedings of the 32nd SIGPLAN-SIGACT Symposium on Principles of<br />
Programming Languages, pages 296–309, New York, NY, USA, January 2005.<br />
ACM.<br />
[RCH + 10] Gabe Rudy, Chun Chen, Mary Hall, Malick Murtaza Khan, and Jacqueline<br />
Chame. A programming language interface to describe transformations and code generation. In The 23rd International Workshop on Languages and Compilers for Parallel Computing, LCPC, pages 136–150, Berlin, Heidelberg,
2010. Springer-Verlag.<br />
[RNZ07] Ira Rosen, Dorit Nuzman, and Ayal Zaks. Loop-aware SLP in GCC - two<br />
years later. In GCC summit, 2007.<br />
[Roj04] Juan Rojas. Multimedia Macros for Portable Optimized Programs. PhD
thesis, Northeastern University, 2004.
[SA94] Mark Segal and Kurt Akeley. The OpenGL graphics interface. Technical<br />
report, Silicon Graphics Computer Systems, 1994.<br />
[Sar97] Vivek Sarkar. Automatic selection of high-order transformations in the
IBM XL FORTRAN compilers. IBM Journal of Research and Development,<br />
41(3):233–264, 1997.<br />
[SCH03] Jaewook Shin, Jacqueline Chame, and Mary W. Hall. Exploiting superword-level
locality in multimedia extension architectures. Journal of Instruction-<br />
Level Parallelism, 5, 2003.<br />
[Sch09] David Schleef. Oil Runtime Compiler. http://code.entropywave.com/<br />
projects/orc, 2009.<br />
[SCS + 08] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash,<br />
Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert<br />
Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan.<br />
Larrabee: a many-core x86 architecture <strong>for</strong> visual computing. Transactions<br />
on Graphics, 27:18:1–18:15, August 2008.<br />
[SHC05] Jaewook Shin, Mary W. Hall, and Jacqueline Chame. Superword-level parallelism<br />
in the presence of control flow. In International Symposium on Code<br />
Generation and Optimization, CGO, pages 165–175, 2005.<br />
[SHG08] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient parallel<br />
scan algorithms <strong>for</strong> GPUs. Technical report, NVIDIA, 2008.<br />
[Sin98] Satnam Singh. Accelerating Adobe Photoshop with reconfigurable logic. In Symposium on FPGA Custom Computing Machines, pages 18–26. IEEE Computer
Society Press, 1998.<br />
[SQ03] Markus Schordan and Daniel J. Quinlan. A source-to-source architecture for user-defined optimizations. In László Böszörményi and Peter Schojer, editors, Modular Programming Languages, volume 2789 of LNCS, pages 214–223, 2003.
[SSM08] Jay Smith, Howard Jay Siegel, and Anthony A. Maciejewski. A stochastic
model <strong>for</strong> robust resource allocation in heterogeneous parallel and distributed<br />
computing systems. In International Parallel & Distributed Processing Symposium,<br />
IPDPS, pages 1–5. IEEE Computer Society Press, 2008.<br />
[Ste97] Robert Stephens. A survey of stream processing. Acta Informatica, 34:491–
541, 1997.<br />
[S<strong>to</strong>08] O. Olaf S<strong>to</strong>raasli. Accelerating genome sequencing 100–1000x with FPGAs.<br />
Many-core and Reconfigurable Supercomputing Conference, April 2008.<br />
[TCE + 10] Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser,<br />
Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sjödin, and Ramakrishna<br />
Upadrasta. GRAPHITE two years after: First lessons learned from<br />
real-world polyhedral compilation. In GCC Research Opportunities Workshop,<br />
GROW, Pisa, Italy, 2010.
[TM08] Donald Thomas and Philip Moorby. The Verilog Hardware Description Language. Springer Publishing Company, Incorporated, 5th edition, 2008.
[TNC + 09] Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen.<br />
Polyhedral-model guided loop-nest auto-vectorization. International Conference
on Parallel Architectures and Compilation Techniques, pages 327–337,<br />
2009.<br />
[vECGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik<br />
Schauser. Active messages: a mechanism for integrated communication and
computation. SIGARCH Computer Architecture News, 20:256–266, 1992.<br />
[War99] Martin P. Ward. Assembler to C migration using the FermaT transformation
system. In International Conference on Software Maintenance, ICSM, pages<br />
67–76, 1999.<br />
[WC03] Ge Wang and Perry R. Cook. ChucK: a concurrent, on-the-fly audio programming
language. In International Computer Music Conference, ICMC, pages<br />
219–226, 2003.<br />
[WFW+94] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN
Notices, 29:31–37, 1994.<br />
[WGN + 02] Oliver Wahlen, Tilman Glökler, Achim Nohl, Andreas Hoffmann, Rainer Leupers,<br />
and Heinrich Meyr. Application specific compiler/architecture codesign:<br />
a case study. SIGPLAN Notices, 37:185–193, 2002.<br />
[Whe01] David A. Wheeler. SLOCCount. http://www.dwheeler.com/sloccount,<br />
January 2001.<br />
[Whi09] Tom White. Hadoop: The Definitive Guide. O’Reilly, June 2009.<br />
[Wik09] Wikibooks, editor. GNU C Compiler Internals. http://en.wikibooks.org/
wiki/GNU_C_Compiler_Internals, 2006-2009.<br />
[WL91a] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In<br />
SIGPLAN conference on Programming Language Design and Implementation,<br />
PLDI, pages 30–44, New York, NY, USA, 1991. ACM.<br />
[WL91b] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. Transactions on Parallel and Distributed
Systems, 2(4):452–471, 1991.<br />
[WM95] Wm. A. Wulf and Sally A. Mckee. Hitting the memory wall: Implications of<br />
the obvious. Computer Architecture News, 23:20–24, March 1995.<br />
[Wol94] Wayne H. Wolf. Hardware-software co-design of embedded systems. In Proceedings
of the IEEE, pages 967–989, 1994.<br />
[Wol96] Michael Wolfe. High performance compilers for parallel computing. Addison-
Wesley, 1996.
[Wol10] Michael Wolfe. Implementing the PGI accelerator model. In Proceedings of
the 3rd Workshop on General-Purpose Computation on Graphics Processing<br />
Units, GPGPU, pages 43–50, New York, NY, USA, 2010. ACM.<br />
[Wol11] Michael Wolfe. Compilers and more: Programming at exascale. HPC Wire,
March 2011. http://www.hpcwire.com/hpcwire/2011-03-08/compilers_<br />
and_more_programming_at_exascale.html.<br />
[WW94] David W. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing,
20:657–673, 1994.<br />
[Yi11] Qing Yi. Automated programmable control and parameterization of compiler
optimizations. In CGO [CGO11].<br />
[YRR + 10] Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay V. Rajopadhye,<br />
Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. Automatic creation of tile size selection models. In International Symposium on Code Generation and Optimization, CGO, pages 190–199, New York, NY,
USA, April 2010. ACM.<br />
[ZC91] Hans Zima and Barbara Chapman. Supercompilers for parallel and vector
computers. ACM, New York, NY, USA, 1991.<br />
[ZC98] Julien Zory and Fabien Coelho. Using algebraic transformations to optimize expression evaluation in scientific codes. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT, pages
376–384, 1998.<br />
[Zim97] Re<strong>to</strong> Zimmermann. Binary adder architectures <strong>for</strong> cell-based VLSI and their<br />
synthesis. PhD thesis, Swiss Federal Institute of Technology (ETH), Zürich ,<br />
Switzerland, 1997.<br />
[ZO01] Matthias Zenger and Martin Odersky. Implementing extensible compilers. In<br />
ECOOP workshop on multiparadigm programming with object-oriented languages,<br />
pages 61–80, 2001.<br />
[ZPS + 96] V. Zivojnovic, S. Pees, C. Schlager, M. Willems, R. Schoenen, and H. Meyr.<br />
DSP processor/compiler co-design: a quantitative approach. In Proceedings<br />
of the 9th international symposium on System synthesis, ISSS, pages 108–,<br />
Washing<strong>to</strong>n, DC, USA, 1996. IEEE Computer Society Press.<br />
[ZWZD93] Songnian Zhou, Jingwen Wang, Xiaohu Zheng, and Pierre Delisle. Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software – Practice and Experience, 23:1305–1336, December 1993.
Personal Bibliography<br />
[AAC+11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge Guelton, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon.
PIPS Is not (only) Polyhedral Software. In First International Workshop on<br />
Polyhedral Compilation Techniques, IMPACT, Chamonix, France, April 2011.<br />
[ACSGK11] Corinne Ancourt, Frédérique Chaussumier-Silber, Serge Guelton, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at International
Symposium on Code Generation and Optimization, April 2011.<br />
Chamonix, France.<br />
[CSGIK10] Frédérique Chaussumier-Silber, Serge Guelton, François Irigoin, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at Principles
and Practice of Parallel Programming, January 2010. Bangalore, India.<br />
[DGG+07] Vincent Danjean, Roland Gillard, Serge Guelton, Jean-Louis Roch, and Thomas Roche. Adaptive loops with KAAPI on multicore and grid: applications in symmetric cryptography. In Parallel Symbolic Computation, PASCO,
pages 33–42, 2007.<br />
[GAKC11] Serge Guelton, Mehdi Amini, Ronan Keryell, and Béatrice Creusillet. PyPS, a programmable pass manager. Poster at International Workshop on Languages and Compilers for Parallel Computing, September 2011. Fort Collins,
Colorado, USA.<br />
[GGK11] Serge Guelton, Adrien Guinet, and Ronan Keryell. Building retargetable and efficient compilers for multimedia instruction sets. Poster at Parallel Architectures and Compilation Techniques, October 2011. Galveston, Texas,
USA.<br />
[GGPV09] Serge Guelton, Thierry Gautier, Jean-Louis Pazat, and Sébastien Varrette. Dynamic Adaptation Applied to Sabotage Tolerance. In Proceedings of
the 17th Euromicro International Conference on Parallel, Distributed and<br />
Network-Based Processing, PDP, pages 237–244, Weimar, Germany, February<br />
2009.<br />
[GIK10] Serge Guelton, François Irigoin, and Ronan Keryell. Automatic and source-to-source code generation for vector hardware accelerators. Poster at Colloque
National du GDR SOC-SIP, June 2010. Cergy, France.<br />
[GKI11] Serge Guelton, Ronan Keryell, and François Irigoin. Compilation pour cibles hétérogènes: automatisation des analyses, transformations et décisions nécessaires.
In 20ème Rencontres Françaises du Parallélisme, Renpar, Saint Malo,<br />
France, May 2011.<br />
[Gue09] Serge Guelton. A genetic and source-to-source approach to iterative compilation. Poster at ACM Student Research Competition Posters, Parallel Architectures
and Compilation Techniques, September 2009. Raleigh, North<br />
Carolina, USA.<br />
[Gue10] Serge Guelton. Automatic source-to-source code generation for vector hardware accelerators. Poster at International Workshop on Languages and Compilers for Parallel Computing, October 2010. Houston, Texas, USA.
[Gue11] Serge Guelton. Building Source-to-Source Compilers for Heterogeneous Targets. PhD thesis, Télécom Bretagne, 2011.
[GV09] Serge Guelton and Sébastien Varrette. Une approche génétique et source
à source de l’optimisation de code. In 19ème Rencontres francophones du<br />
parallélisme, Renpar, Toulouse, France, September 2009.<br />
[PVG + 08a] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Sandro Bimonte,<br />
Hélène Mathian, Serge Guelton, Joël Boulier, and Jérôme Gensel. Hypersmooth: A system for interactive spatial analysis via potential maps. In The 8th International Symposium on Web and Wireless Geographical Information
Systems, W2GIS, pages 4–16, 2008.<br />
[PVG + 08b] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Jérôme Gensel,<br />
Hélène Mathian, Serge Guelton, and Joël Boulier. Hypersmooth : calcul et
visualisation de cartes de potentiel interactives. CoRR, abs/0802.4191, 2008.<br />
[TVa+12] Massimo Torquati, Marco Vanneschi, Mehdi Amini, Serge Guelton, Ronan
Keryell, Vincent Lanore, François-Xavier Pasquier, Michel Barreteau, Rémi<br />
Barrère, Claudia-Teodora Petrisor, Éric Lenormand, Claudia Cantini, and<br />
Filippo De Stefani. An innovative compilation tool-chain for embedded multicore
architectures. In Embedded World Conference, February 2012.
Index<br />
terapix, 16, 17, 19, 73, 77, 78, 123, 133,<br />
141, 144, 146–151, 153–155, 160, 163<br />
array linearization, 144, 150<br />
C99, 34<br />
common subexpression elimination, 85, 167<br />
compilation flow, 39, 42, 45, 67, 134, 152
constant propagation, 51<br />
convex array regions, 87, 120, 168
data transfers, 29, 104, 114, 120, 123, 141<br />
dead code elimination, 120, 167<br />
directive generation, 49, 134<br />
distributed memory, 29, 114<br />
flatten code, 150<br />
<strong>for</strong>ward substitution, 46, 84, 167<br />
fuzz testing, 63<br />
go<strong>to</strong> elimination, 84<br />
header substitution, 59, 99, 108, 152<br />
inlining, 46, 53, 83, 84<br />
instruction selection, 79<br />
invariant code motion, 167<br />
iteration clamping, 144<br />
loop fusion, 49, 52, 87, 134, 167<br />
loop interchange, 102, 167<br />
loop normalization, 144<br />
loop rerolling, 108<br />
loop tiling, 52, 102, 167<br />
loop unrolling, 46, 50, 102<br />
loop unswitching, 104<br />
memory footprint reduction, 121, 163<br />
n-address code generation, 150
outlining, 18, 84, 87, 159, 162<br />
parallelism detection, 134<br />
parallelism extraction, 49<br />
pass manager, 46, 47, 134<br />
privatization, 134<br />
reduction detection, 49, 134<br />
redundant load-s<strong>to</strong>re elimination, 113, 124,<br />
163<br />
scalar renaming, 100<br />
split update opera<strong>to</strong>r, 150<br />
statement isolation, 114, 120, 141, 159, 163<br />
strength reduction, 150<br />
symbolic tiling, 122, 159<br />
variable length array, 34, 141