


Order number: 2011telb0203

Under the seal of the Université européenne de Bretagne

Télécom Bretagne
In joint accreditation with the Université de Bretagne Occidentale
École Doctorale – SICMA

Building Source-to-Source Compilers for Heterogeneous Targets

Doctoral Thesis
Specialization: Information and Communication Sciences and Technologies

Presented by Serge Guelton
Laboratory: TB/INFO
Thesis director: François Irigoin
Thesis advisor: Ronan Keryell

Defended on October 7, 2011

Jury:

Albert Cohen, Professor, INRIA
François Irigoin, Professor, MINES ParisTech
Ronan Keryell, Lecturer-researcher, Télécom Bretagne & HPC Project
Fabrice Lemonnier, Project Manager, Thales Research & Technology
Bernard Pottier, Professor, Université de Bretagne Occidentale
Patrice Quinton, Professor, École Normale Supérieure de Cachan-Bretagne
Sanjay Rajopadhye, Professor, Colorado State University
Eugene Ressler, Professor, United States Military Academy


Abstract

Heterogeneous computers—platforms that make use of multiple specialized devices to achieve high throughput or low energy consumption—are difficult to program. Hardware vendors usually provide compilers from a C dialect to their machines, but complete application rewriting is frequently required to take advantage of them.

In this thesis, we propose a new approach to building bridges between regular applications written in C and these dialects. From an analysis of the hardware constraints, we propose original code transformations that can meet those constraints, and we combine them using a completely programmable pass manager that can handle the complex compilation flows required by the code generation process for multiple targets. It makes it possible to build a collection of compilers based on the same infrastructure while targeting different architectures.

All the code transformations are done at the source level using the PIPS source-to-source compiler framework. New transformations are detailed using denotational semantics on a simplified language and have been combined to build four different compilers: an OpenMP directive generator, a retargetable multimedia instruction generator for SSE, AVX and NEON, an assembly code generator for an FPGA-based image processor and a CUDA generator.

Résumé

Heterogeneous machines, computers that rely on a combination of specialized computing units to achieve high performance and lower energy consumption, are difficult to program. Hardware vendors usually provide a compiler for a dialect of C for their machines, but the target application must then be rewritten completely to take advantage of it.

In this thesis, we propose a new approach to bridging the gap between regular applications written in C and these dialects. Starting from an analysis of the constraints imposed by the hardware, we propose a set of original code transformations that makes it possible to satisfy these constraints, and we combine them using a fully programmable pass manager that can handle the complex compilation flows involved in multi-target code generation. This yields a set of compilers that reuse the same building blocks while targeting different architectures.

All the transformations are applied at the source level, building on the PIPS compilation infrastructure. The proposed transformations are specified using denotational semantics and a simplified target language. They are used to assemble four different compilers: an OpenMP directive generator, a retargetable multimedia instruction generator for SSE, AVX and NEON, an assembly code generator for an FPGA-based machine specialized in image processing, and a CUDA code generator.



Remerciements

Acknowledgments are something better done in one's native tongue, for it is difficult to express the subtlety of feelings in another language. . .

This thesis was funded by the Agence Nationale de la Recherche as part of the FREIA project. It was carried out in collaboration with Thales TRT and MINES ParisTech.

Five years ago, following to Grenoble the smiling student who would become my wife, I made my debut in the working world in a research team named MESCAL, under the supervision of a certain Jean-Marc Vincent. This passionate and warm researcher, in the most innocent way in the world (1), made me dive into a wonderful world, a dive from which I only resurface after the writing of this manuscript. It was these Grenoble folks, Thierry, Jean-Louis, Vincent, Frédéric, Arnaud, Bruno and their merry band of PhD students Xavier, Sébastien, Maxime, Jean-Noël, Swann, who tipped me toward the bright side.

Another Jean-Louis, this one from Rennes, knew how to take advantage of this and to balance with reason the madness that looms from the top of the Grenoble mountains. He nonetheless committed an irreparable fault by throwing me into the nets of one of my former Breton professors, the Machiavellian Ronan Keryell, thereby stealing me away from the holy intentions of a former Grenoble PhD student newly settled in Luxembourg. Promising me the perfect balance between research and development, and drawing inspiration from the Grenoble demons to better lure me in, he effortlessly made me cross the line separating engineering from research and begin the thesis of which this manuscript is the fruit.

What would have become of me in the hands of this character, had he not had the lapse of placing me under the direction of François Irigoin? These three years would surely have had a very different flavor, and it is still hard for me to measure the benefits of his presence, always prompt (though sometimes ineffective) to counterbalance the excesses of enthusiasm brought on by my work.

This was an itinerant thesis, which started in Rennes, surrounded by the merry symbionts Dominique, Rayan, Jacques, Pierre and the others, and ended in Brest with the delightful Armelle, the friendly Eliya, Zhe, Xu and Jiayi, the immovable inhabitants of D3 128, the cheerful Grégoire, Frédéric, Adrien, Sébastien and the faithful members of the TKD club, interspersed with stays in Fontainebleau with Corinne, Fabien, Pierre, Laurent, Amira. . . Not forgetting the #pipsien channel and Mehdi, Pierre, Béatrice.

The writing of this manuscript benefited most particularly from the advice and careful proofreading of François Irigoin, Ronan Keryell, Béatrice Creusillet, Pierre Jouvelot, Fabien Dagnat and Adrien Guinet. A big thanks to Aimée Johansen for her English advice and for spotting my numerous mistakes. My reviewers Albert Cohen, Sanjay Rajopadhye and Eugene Ressler did not hesitate to point out the numerous flaws of the version submitted to them (2) and greatly contributed to its improvement. The typographic care taken over this document owes much to Yannis Haralambous and his Chicago Manual of Style; the remaining atrocities are the sole fruit of my laziness. The pleasure of writing owes much to the LaTeX typesetting system, the TikZ package and the vim text editor.

My parents always pushed me to study so as to keep as many doors open as possible, and sent me off to school saying "have fun". I tried to follow the first of these pieces of advice, and it was not very hard to follow the second.

And for the difficult moments, the dips in morale, the long nights of paper submissions, the far longer months of writing, for the radiant smiles, the daily attentions: Grey Wolf is deeply grateful to the great Red Wolf and the wolf cub.

1. With hindsight, there may have been a mischievous intent to convert me on his part.
2. May they be thanked for this extra load of work :-)


Contents

Remerciements
Acknowledgments.

Résumé en français
Dissertation summary.

1 Introduction

2 Heterogeneous Computing Paradigm
2.1 Heterogeneous Computing Model
2.2 Influence on Programming Model
2.3 Hardware Constraints
2.4 Note About the C Language
2.5 OpenCL Programming Model Analysis
2.6 Other Programming Models
2.7 Conclusion

3 Compiler Design for Heterogeneous Architectures
3.1 Extending Compiler Infrastructures
3.1.1 Existing Compiler Infrastructures
3.1.2 A Simple Model for Code Transformations
3.1.2.1 Transformations: Definition and Compositions
3.1.2.2 Parametric Transformations
3.1.2.3 From Model to Implementation
3.1.3 Programmable Pass Management
3.1.3.1 A Class Hierarchy for Pass Management
3.1.3.2 Control Flow and Pass Management
3.2 On Source-to-Source Compilers
3.2.1 Exploring Source-to-Source Opportunities
3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure
3.3 PyPS, a High Level Pass Manager API
3.3.1 API Description
3.3.2 Usage Example
3.4 Related Work
3.5 Conclusion

4 Representing the Instruction Set Architecture in C
4.1 C as a Common Denominator
4.1.1 C Dialects and Heterogeneous Computing
4.1.2 From the ISA to C
4.2 Native Data Types
4.2.1 Scalar Types
4.2.2 Records
4.2.3 Arrays
4.3 Registers
4.4 Instructions
4.4.1 Instruction Selection
4.4.2 N-Address Code Generation
4.5 Memory Architecture
4.6 Function Calls
4.6.1 Removing Function Calls
4.6.2 Outlining
4.6.2.1 Outlining Algorithm
4.6.2.2 Using Outlining to Reduce Compilation Time
4.7 Library Calls
4.8 Conclusion

5 Parallelism with Multimedia Instructions
5.1 Super-word Level Parallelization
5.1.1 Related Work
5.1.2 A Meta-Multimedia Instruction Set
5.1.2.1 Sequential Implementation
5.1.2.2 Target-Specific Implementation
5.1.2.3 Conclusion
5.1.3 Notations
5.1.4 Generation of Optimized SIMD Instructions
5.1.4.1 Statement Closeness
5.1.4.2 Parametric Vector Instruction Generation Algorithm
5.1.5 Pattern Discovery
5.1.6 Loop Tiling
5.1.7 Combining Loop Vectorization and Super-word Level Parallelism
5.2 Reduction Parallelization
5.2.1 Reduction Detection Inside a Sequence
5.2.2 Delegating to Library
5.3 Computational Intensity
5.3.1 Execution Time versus Transfer Time Estimation
5.3.2 Limitations of the Model
5.4 Conclusion

6 Transformations for Memory Size and Distribution
6.1 Statement Isolation
6.1.1 Formulation of Statement Isolation
6.1.1.1 Expression Renaming
6.1.1.2 Type Renaming
6.1.1.3 Statement Renaming
6.1.1.4 Restricted Statement Isolation
6.1.2 Statement Isolation and Convex Array Regions
6.2 Memory Footprint Reduction
6.2.1 Memory Footprint Estimate
6.2.2 Symbolic Rectangular Tiling
6.3 Redundant Load Store Optimization
6.3.1 Redundant Load Elimination
6.3.1.1 Sequences
6.3.1.2 Tests
6.3.1.3 Loops
6.3.1.4 Interprocedurally
6.3.2 Redundant Store Elimination
6.3.3 Sequences
6.3.4 Tests
6.3.5 Loops
6.3.6 Interprocedurally
6.3.7 Combining Load and Store Elimination
6.3.7.1 Sequence
6.3.7.2 Loops
6.3.8 Main Algorithm
6.4 Conclusion

7 Compiler Implementations and Experiments
7.1 A Simple OpenMP Compiler
7.1.1 Architecture Description
7.1.2 Compiler Implementation
7.1.3 Experiments & Validation
7.2 A GPU Compiler
7.2.1 Architecture Description
7.2.2 Compiler Implementation
7.2.3 Experiments & Validation
7.3 An FPGA Image Processor Accelerator Compiler
7.3.1 Architecture Description
7.3.2 TERAPIX Compiler Implementation
7.3.2.1 Input Code Splitting
7.3.3 Experiments & Validation
7.4 A Retargetable Multimedia Instruction Set Compiler
7.4.1 Architecture Description
7.4.2 Compiler Implementation
7.4.3 Multimedia Instruction Set on Desktop and Embedded Processors
7.4.4 Results & Analyses
7.5 Conclusion

8 Conclusion

A The PIPS Compiler Infrastructure

B The LuC Language
B.1 Syntactic Clauses
B.2 Semantic Clauses

C Using PyPS to Drive a Compilation Benchmark

D Using C to Emulate SSE Intrinsics

Glossary

Acronyms

Bibliography

Personal Bibliography

Index


List of Listings

3.1 GCC pass manager initialization.
3.2 Dynamic phase ordering using the LLVM pass manager command line interface.
3.3 Usage of exceptions at the pass manager level.
3.4 Example of workspace composition at the pass manager level using PyPS.
3.5 Streaming SIMD Extension (SSE) C intrinsics generated for a scalar product.
3.6 Sequential intrinsic implementation of an SSE scalar product.
3.7 Native intrinsic implementation of an SSE scalar product.
3.8 Fuzz testing with PyPS.
4.1 Broadcast of a single value in SSE.
4.2 Vector type emulation in C.
4.3 Sequential implementation of _mm_set1_ps.
4.4 Example of structure removal.
4.5 Structure removal in the presence of a function call.
4.6 Two-step transformation from multi-dimensional arrays to pointers.
4.7 Using a naming convention to distinguish registers for the TERAPIX architecture.
4.8 Outlining of the inner loop of an erosion kernel.
4.9 Enhanced outlining of the inner loop of an erosion kernel.
5.1 Excerpt from the libmpcodecs/vf_gradfun.c file from the MPlayer source tree.
5.2 Sample representation of a vector register using a C type.
5.3 Sample tree pattern in Polish notation used for SLP.
5.4 FMA sequential version for a vector of 4 floats.
5.5 FMA generic operation implemented for the NEON instruction set.
5.6 Vectorized output for a matrix multiply.
5.8 Conditional loop tiling on a matrix-vector product.
5.9 Horizontal erosion code sample.
6.1 Illustration of statement isolation on a scalar assignment.
6.2 Code after statement isolation.
6.3 Symbolic tiling of the outermost loop of a horizontal erosion with information about in and out regions.
6.4 Illustration of the redundant load store elimination algorithm.
7.1 Original PyPS script for OpenMP code generation.
7.2 Makefile stub for OpenMP compilation.
7.3 TERAPIX assembly for a 3 × 3 convolution kernel.
7.4 Illustration of TERAPIX code generation, host part.
7.5 Illustration of TERAPIX code generation, accelerator part.
7.6 Illustration of TERAPIX compacted assembly.
A.1 A simple loop to illustrate PIPS analyses.
A.2 Example of precondition analysis.
A.3 Example of transformers analysis.
A.4 Example of cumulated memory effects analysis.
A.5 Example of convex array regions analysis.


List of Figures

2.1 Heterogeneous computing model.
2.2 von Neumann architecture vs. OpenCL architecture.
2.3 Impact of heterogeneous architecture on compilation.
2.4 Example of hardware feature diagram.
2.5 Multicore with vector unit feature diagram.
2.6 Comparison of C89 and C99 syntax on a complex matrix vector multiply.
2.7 Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99.
2.8 Comparison of two versions of the Coremark benchmark: C89 vs. C99.
2.9 Comparison of two versions of the Linpack benchmark: C89 vs. C99.
2.10 Compilation flow in OpenCL.
3.1 A classical 3-phase retargetable compiler architecture.
3.2 Improved compilation flow for heterogeneous computing.
3.3 PIPS as a generic compiler infrastructure sample.
3.4 PyPS class hierarchy.
3.5 Usage of conditionals at the pass manager level.
3.6 Usage of loops at the pass manager level.
3.7 Source-to-source cooperation with external tools.
3.8 Heterogeneous compilation stages.
3.9 Source-to-source heterogeneous compilation scheme.
4.1 Using outlining to reduce analysis time on an unrolled sequence of matrix multiplications.
5.1 Various ways to program for the SSE MIS.
5.2 Comparison of LLVM, GCC and ICC vectorizers using Linpack.
5.3 Multimedia Instruction Set history for x86 processors.
5.4 Vectorized implementation of a float vector multiply-addition operation (epilogue omitted).
5.5 Parallelizing reductions in a sequence.
5.6 Effect of reduction parallelization on an unrolled loop.
5.7 Manually vectorizing an inner product vs. using a library.
7.1 Multicore hardware feature diagram.
7.2 Source-to-source compilation scheme for OpenMP.
7.3 Performance of an OpenMP directive generator prototype on the PolyBench benchmark.
7.4 nVidia Fermi architecture.
7.5 GPU hardware feature diagram.
7.6 Source-to-source compilation scheme for GPU.
7.7 Splitting an array sum example code into host, loop proxy and accelerator parts.
7.8 Median execution time on a GPU for DSP kernels.
7.9 TERAPIX architecture.
7.10 TERAPIX hardware feature diagram.
7.11 TERAPIX redundant computations.
7.12 Source-to-source compilation scheme for TERAPIX.
7.13 MIS hardware feature diagram.
7.14 Source-to-source compilation scheme for MIS.
7.15 Pass reuse among 4 PyPS-based compilers.


List of Algorithms

1 Fuzz testing at the pass manager level.
2 Compilation complexity reduction with outlining.
3 Parametric vector instruction generation algorithm.
4 Hybrid vectorization at the pass manager level.
5 Memory footprint reduction algorithm.
6 Redundant load store elimination algorithm at the pass manager level.
7 Parallel loop generation algorithm for OpenMP.
8 TERAPIX kernel extraction algorithm at the pass manager level.
9 C-to-TERAPIX translation algorithm at the pass manager level.


List of Tables

3.1 Comparison of source-to-source compilation infrastructures.
3.2 SLOCCount reports for the GCC and LLVM compilers.
4.1 C dialects and targeted hardware.
7.1 SLOCCount report for an OpenMP directive generator prototype written in PyPS.
7.2 SLOCCount report for a CUDA generator prototype written in PyPS.
7.3 Description of a TERAPIX microinstruction.
7.4 Ratio between TERAPIX microcode cycle counts for automatic and manual code generation.
7.5 SLOCCount report for a TERAPIX assembly generator prototype written in PyPS.
7.6 SLOCCount report for an AVX intrinsic generator prototype written in PyPS.
7.7 Summary of the SLOCCount reports for the compiler prototypes written in PyPS.


Notations Quick Reference Sheet

This thesis makes use of various notations from different computer science fields. They are concisely and informally summarized here. Occasionally, a symbol may be used with a different meaning than shown here. Context should always be sufficient to make these cases clear. For example, P is used to denote both the power set of a set and also—as indicated here—the domain of all programs.

Sets and domains are in general denoted with capital letters in a cursive font, for example S. Set members are lowercase, for example s. Semantic functions are denoted in bold capitals, for example S.

Where notations given here are unclear, the reader is urged to continue, referring back to this sheet as needed.

Set Operators

P(A)      The power set of A.
|A|       The number of elements in the set A.
Ā         The convex hull of set A.
⌈A⌉       The rectangular hull of set A.
A ∪ B     The union of sets A and B.
A ∪̄ B     The convex union of sets A and B.
A^k       The set of all k-tuples of A.
A^*       ∪_{k∈N} A^k.
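As a quick worked instance of the hull operators (our example, for illustration only): for A = {(0, 0), (1, 2)} ⊆ N², the convex hull Ā is the segment joining (0, 0) to (1, 2), while the rectangular hull ⌈A⌉ = {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 2} is the smallest axis-aligned box containing A; hence Ā ⊆ ⌈A⌉.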


Syntactic Domains

P    The domain of programs p.
F    The domain of functions f.
S    The domain of statements s.
E    The domain of expressions e.
I    The domain of identifiers id.
L    The domain of memory locations l.
R    The domain of references r.
C    The domain of constants cst.
Op   The domain of operators op.
D    The domain of declarations d.
T    The domain of types t.

Semantic Domains

V                          The domain of denotable values v.
Σ : L → (V ∪ {unbound})    The domain of stores σ.

Semantic Functions

S : S × Σ → Σ      S evaluates statement s in store σ0 to produce store σ1.
P : P × V* → V*    P evaluates the body of function main from the program p in the store {⟨istdin, [v0, . . . , vn−1]⟩} and returns the list of values accumulated in istdout, where istdin is an identifier reserved for standard input and istdout is an identifier reserved for standard output.
E : E × Σ → V      E evaluates expression e in store σ to produce value v.
D : D × Σ → Σ      D evaluates declaration d in store σ0 to produce store σ1.
R : R × Σ → P(L)   R evaluates reference r in store σ to produce locations l.
T : T × Σ → N      T returns the number of memory cells occupied by type t.
I : I → P(L)       I returns the locations l associated to the identifier i.
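As a minimal illustration of how these functions compose, here is a sketch assuming the conventional denotational treatment of sequences and assignments (not clauses quoted from the thesis body):

    S(s1 ; s2, σ) = S(s2, S(s1, σ))
    S(id = e, σ)  = σ[l → E(e, σ) | l ∈ I(id)]

Evaluating x = 1; y = x in a store σ thus first updates the locations of x with the value E(1, σ) = 1, and then evaluates the reference to x in the resulting store to update the locations of y.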

Syntactic Operators

if E then E1 else E2    The statement denoted by the syntactic clause “if E then E1 else E2”.

Miscellaneous Operators

f ◦ g                  The function f composed with function g.
f[x → y]               A function identical to f except for the element x, where it returns y.
loc : I × V × Σ → Σ    loc produces v new locations from an identifier i, adds them to the store σ and returns this new store.
unbind : Σ × I → Σ     unbind(σ, id) = σ[l → unbound | l ∈ I(id)], i.e. the function that removes all the locations accessible by an identifier from memory.
formal : I → I         formal(id) is the formal parameter of function id.
body : I → S           body(id) is the body of function id.
refs : S → P(R)        refs(s) is the set of all references syntactically found in statement s.

Array Regions

An array region is a mapping that associates with a statement and a memory state a set of references valid in this memory state.

R=_r : S × Σ → P(R)   The exact array region of the references read by statement s evaluated in store σ.
R_r : S × Σ → P(R)    An over-approximated array region of the references read by statement s evaluated in store σ.
R=_i : S × Σ → P(R)   The exact array region of the references imported by statement s evaluated in store σ.
R_i : S × Σ → P(R)    An over-approximated array region of the references imported by statement s evaluated in store σ.
R=_w : S × Σ → P(R)   The exact array region of the references written by statement s evaluated in store σ.
R_w : S × Σ → P(R)    An over-approximated array region of the references written by statement s evaluated in store σ.
R=_o : S × Σ → P(R)   The exact array region of the references exported by statement s evaluated in store σ.
R_o : S × Σ → P(R)    An over-approximated array region of the references exported by statement s evaluated in store σ.
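For intuition, here is a small worked instance in the spirit of the convex array regions computed by PIPS (our own illustration, not an example reproduced from the thesis). For the loop statement s:

    for (i = 0; i < n; i++) a[i] = a[i] + b[i+1];

evaluated in a store σ where n is bound to 10,

    R=_r(s, σ) = {a[φ] | 0 ≤ φ ≤ 9} ∪ {b[φ] | 1 ≤ φ ≤ 10}
    R=_w(s, σ) = {a[φ] | 0 ≤ φ ≤ 9}

When subscripts or bounds are not affine and exact regions cannot be computed, the over-approximated regions R_r and R_w are used instead.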




Résumé en français

This section contains, for each chapter of the thesis, a translation of its introduction and a summary of each of its sections. The interested reader is invited to refer to the corresponding chapter for details on a particular topic.

Building Source-to-Source Compilers for Heterogeneous Targets

1. Introduction

Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by processors running at their maximum frequency. William J. Dally has called this phenomenon "The End of Denial Architecture" [Dal09]. To get past these physical limitations, chip manufacturers increased the number of computing cores per processor, leading to multi-core architectures. Using several computing cores makes it possible to increase the overall available computing power without raising the frequency. The same heat dissipation problem will nevertheless arise again when the integration density of these computing nodes is such that silicon can no longer dissipate the generated heat, even at lower frequencies. To keep improving application performance, coprocessors, or accelerators, appeared and paved the way for heterogeneous computing, where several computing units with different capabilities collaborate. There are many kinds of coprocessors, ranging from specialized processors, very efficient on a small number of applications, to more general-purpose ones. Two paths toward high-performance computing are thus now open: the use of homogeneous multi-cores, and the use of a combination of specialized accelerators. The border between these two categories is blurred, as Intel's Larrabee [SCS+08] shows, where several homogeneous computing units are each fitted with a 512-bit vector computing unit.

Thirty years ago, many parallel machines were built; the Connection Machine and the MasPar MP2 are two notable examples. At that time, supercomputers such as the Cray-1 cost $58M and delivered an average performance of 80 Mflops. Released in 2006, the PlayStation 3 (PS3) game console, based on a Cell BE architecture, cost $500 and could reach 230 GFLoating point Operations per Second (flops) thanks to the combination of a general-purpose processor and several specialized processors. Fewer than 100 Cray-1 machines were built, whereas more than 50 million PS3s were manufactured. This market shift inevitably created a need for development tools: in 2006, parallelism was no longer the business of a few specialists, as it had been in the Cray era.

There is more to learn from the PS3 experience: Microsoft's Xbox 360, launched one year before Sony's PS3, was more successful, partly because of a richer game catalog. It turned out that many video game studios found the PS3 and its heterogeneous architecture too hard to program. Indeed, the Cell BE architecture uses separate memory spaces and requires manual control of memory transfers, handling of 128-bit vector registers and manual cache management. This complexity was perhaps too much for the average developer.

There are three ways to cope with such complexity: hire an expert engineer with complete mastery of the target architecture; develop a specialized library that hides the machine's behavior behind a simplified Application Programming Interface (API); or build a compiler that translates a high-level language into machine language.

The first option is the most flexible, but also the most expensive. The second is effective but lacks flexibility, all the more so as extracting the maximum performance from complex hardware can rarely be done through a simple API. The last approach combines the advantages of the two previous ones, provided that a compiler can actually be built at a reasonable cost. nVidia, for instance, has adopted this approach successfully for its General Purpose GPUs (GPGPUs): most programmers targeting nVidia GPGPUs use an extension of the C/C++ language, the Compute Unified Device Architecture (CUDA) language, and the nvcc compiler provided by nVidia.

Indeed, computing on heterogeneous machines places entirely new constraints on the compilation flow. According to the dragon book [ALSU06],

Definition. A compiler is a program that can read a program in one language, the source language, and translate it into an equivalent program in another language, the target language.

Nothing is said about a compiler possibly needing to transform its input program into several output programs, one for each accelerator present on the targeted heterogeneous machine, each written in a different language. Earlier machine architectures did not call for such processing. When one source file turns into several output files, the compiler needs to model the system as a whole in order to make the right decisions, for instance with respect to overall performance.

The most complex case is certainly that of existing code written without any explicit parallelism primitive. Parallelizing compilers failed in the 1980s because of the difficulty of extracting parallelism from sequential code. Automatic parallelization is impossible in the general case, because parallel algorithms can be completely different from sequential ones. Even in the cases where the parallel algorithm is identical to the sequential one, cases where parallelism could be detected automatically by a tool, the task can be made difficult by code transformations intended to improve performance in the sequential setting. As parallelism becomes more and more widespread, the way algorithms are designed evolves as well and begins to take the parallel dimension into account. For this reason, it seems reasonable to focus on compiling explicitly parallel programs rather than on extracting parallelism.

Heterogeneous machines sharpen the analogy given in [ABC+06], where software stands as a bridge between hardware and applications. One approach to building such bridges is compilation, and it is the compiler that brings applications and hardware together. Many of the components needed to build such bridges in a heterogeneous environment already exist in a literature spanning more than thirty years, but not all of them. In [Pat10], David Patterson reviews many aspects of parallelism research since the 1960s. As one might expect, his conclusion is that there is no free lunch in this field. Although isolated cases have met with some success, no global solution to the parallelism problem has emerged; there have been particular solutions to particular cases. One might therefore think that writing a compiler able to handle the heterogeneous computing problem is an impossible task.

There are, however, encouraging signs. These particular solutions can be seen as foundations on which to build more complex ones. If they are designed to be reusable, then creating new compilers for new targets becomes simpler. The idea of building compilers by composing basic blocks is at the heart of this thesis. Many questions follow from it: What are the relevant basic blocks in the context of heterogeneous computing? How should these basic blocks be composed depending on the target? Is it possible to capture all hardware targets in a single internal representation? Finally, is there a methodology for building compilers for a heterogeneous target?

In a survey of hardware/software co-design techniques published in 1994 [Wol94], Wayne H. Wolf stated that

    To be able to continue exploiting the increase in CPU performance made possible by Moore's law (. . .), we must develop new design methods and new algorithms that allow designers to predict implementation costs, incrementally refine machine models across several abstraction levels, and create a first functional implementation.

Seventeen years later, the challenge posed by Wolf has still not been met. Neither source code nor developers have evolved as fast as hardware. As a consequence, applications have not benefited from the new hardware architectures, except where intense efforts were invested. The amount of existing code is such that it does not seem economically viable to use anything but a compiler to bridge the widening gap between hardware and software. The advice Wolf gives to hardware designers holds just as well for compiler designers, the very people who must build tools able to translate the existing code base into code that exploits the capabilities of parallel coprocessors: incrementally refine compiler designs across several abstraction levels and create a first functional implementation. This task is made all the harder by the fact that the tools used in a classical compilation flow are often specific to one programming language and are not always suited to heterogeneous compilation.

This thesis follows the three-step approach proposed by Wolf to assemble compilers targeting a range of heterogeneous machines, from the classical multi-core machine to a processor based on Field Programmable Gate Arrays (FPGAs) specialized in image processing, through Graphical Processing Units (GPUs) and the vector instruction units integrated into General Purpose Processors (GPPs). The goal is not to offer the best compiler for each target, but rather to build reasonably efficient compilers by reusing as many basic blocks as possible.

To reach this goal, we start with a study of heterogeneous machines and existing programming models in Chapter 2. Three families of hardware constraints stand out from this study, corresponding to as many sources of heterogeneity: the Instruction Set Architecture (ISA), the memory organization, and the source of acceleration, namely parallelism. Chapter 3 shows that existing compilation infrastructures lack the flexibility required to compose the fine-grained code transformations needed to meet hardware constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 examine the three identified families of constraints and detail several original, target-independent code transformations; these form the building blocks of our approach.

This thesis is not merely a theoretical exercise. Building on the ideas developed in this manuscript, several compilers have been implemented. Their design and implementation, together with performance reports, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws several conclusions.

All the transformations and compilers described in this document have been implemented in the Paralléliseur Interprocédural de Programmes Scientifiques (PIPS) compilation infrastructure developed by the Centre de Recherche en Informatique (CRI) of MINES ParisTech with contributions from the HPC Project start-up, Télécom SudParis and Télécom Bretagne. A more detailed overview of the project is given in Appendix A. A large part of the ideas developed in this thesis is already integrated into the Par4All tool developed by HPC Project.


RÉSUMÉ EN FRANÇAIS 5<br />

2. Calcul sur machines hétérogènes<br />

Le 15 février 2007, nVidia a proposé, à travers le langage cuda et le Software Development<br />

Kit (sdk) associé, une nouvelle manière de développer des applications généralistes<br />

sur gpus, un type de matériel généralement limité au calcul graphique. Depuis, les gpgpus<br />

sont utilisés pour simuler des phénomènes physiques, pour faire du traitement vidéo ou encore<br />

du chiffrement, et de nombreuses machines hétérogènes offrent désormais des solutions<br />

efficaces pour le calcul scientifique : fpgas de Xilinx, Software On Chip (soc) ou MultiProcessor<br />

System-on-Chip (mp-soc) de Texas Instruments ou micro contrôleurs de<br />

Atmel.<br />

In November 2010, the Tianhe-1A supercomputer, a cluster containing more than 7K Tesla cards, reached the top of the top500. Since then, heterogeneous computing has stood as a viable alternative to computing on multicore machines. However, heterogeneous computing differs greatly from homogeneous computing, especially regarding the following three critical aspects: memory, parallelism and instruction sets. This leads to an interleaving of concepts that are specific to each target but independent of the high-level organization of the code. This complexity was illustrated in an interesting way at the Supercomputing conference in 2010: during a session, a panel of experts discussed the three Ps of heterogeneous computing.

The first P stands for Performance: unlike general-purpose processors, which must deliver average performance on any kind of application, the goal of a hardware accelerator is to deliver peak performance on specific applications. This goal is reached through parallelism, of the Single Instruction stream, Multiple Data stream (simd), Multiple Instruction stream, Multiple Data stream (mimd) or pipeline kind, or through hard-wired optimized routines.

The second P stands for Power: energy consumption is a critical constraint both for embedded systems, to maximize usage between two battery charges, and for supercomputers, to minimize electricity costs. Hardware accelerators are good candidates for improving the flops per Watt ratio, e.g. because less energy is spent on operations unrelated to the computation.

The third P stands for Programmability, the main weakness of hardware accelerators. Indeed, developing for a hardware accelerator implies a shift from sequential programming to circuit programming through the vhsic Hardware Description Language (vhdl) [TM08] for fpga boards, to 3D programming through the Open Graphics Library (opengl) or directX for gpus, or to parallel programming, e.g. through cuda or the Open Computing Language (opencl). In addition, the execution model introduces the extra complexity of shared memory management.

Chapter 2 presents the heterogeneous computing model in detail in Section 2.1, and its consequences on the programming model in Section 2.2. The concept of hardware constraint is introduced in Section 2.3 as a way of modeling the interactions between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of using the C language as an input language for programming these machines. Section 2.5 gives an analysis of a standardized programming model, opencl. Other models are presented in Section 2.6.

2.1 Heterogeneous computation models

There are many kinds of heterogeneous machines, depending on the type of accelerators involved: gpgpus, fpga boards, Application-Specific Integrated Circuits (asics), etc. Scheduling among these different components is usually handled by a master processor, typically a gpp, which dispatches work according to their capabilities. This implies a shift to a distributed computation model. Each accelerator draws its acceleration from architectural specificities that imply a new architectural model. Accelerators may also have a memory separate from the others, with its own hierarchy, which adds the difficulty of a new memory model.

2.2 Influence on the programming model

The complexity of the memory model has a strong impact on the programming model: distributed memory must either be abstracted away or managed through remote calls. Memory size constraints can also be blocking, notably in the field of embedded processors. Likewise, the specificities of the architectural model mean that the same code must be adapted to several isas, often implying that several versions of the same code be written, one per target. Finally, the execution model, closely tied to various forms of parallelism, brings back the many difficulties related to expressing parallelism at the programming model level.

2.3 Hardware constraints

We propose to model the difficulties related to heterogeneous computing as hardware constraints. A constraint can be related to the isa, to the memory or to the acceleration. It is either mandatory, in which case it must be satisfied to use the target at all, or optional, in which case better behavior of the accelerator can be expected when it is satisfied. An accelerator is then described by a constraint diagram that helps the developer comprehend the target.

2.4 Notes on the C language

Historically, compilers targeting architectures that depart from the von Neumann model have often been based on the C language, which leads us to consider its use for heterogeneous machines. In this document, we deal with high-level codes written in C99 that use variable-length multidimensional arrays, which degrades performance only slightly compared to C89 code written with pointers, but noticeably improves the quality of code analysis results.
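As an illustration, here is a minimal sketch of the coding style assumed throughout this document, contrasted with its C89 counterpart (the function names are ours):

    #include <stddef.h>

    /* C99 style: the array dimensions are part of the type, so an
       analyzer can compute exact array regions for each access. */
    void scale(size_t n, size_t m, double a[n][m], double k)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < m; j++)
                a[i][j] *= k;
    }

    /* Equivalent C89 style: the same accesses are hidden behind
       linearized pointer arithmetic, which is harder to analyze. */
    void scale89(size_t n, size_t m, double *a, double k)
    {
        size_t i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < m; j++)
                a[i * m + j] *= k;
    }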

2.5 Analysis of the OpenCL programming model

opencl is a standard offering a model and a C api for programming in a heterogeneous environment. The various hardware constraints we have mentioned are all found there, exposed at the developer level. Parallelism exists in several forms: vector instructions, across kernels and across contexts; some combinations of types and qualifiers are forbidden so as to reach a larger number of architectures; and the developer manually manages memory transfers, at several levels, through Direct Memory Accesses (dmas). The code specific to an accelerator is derived at run time from a generic description, so every hardware vendor must provide its own derivation, as well as an implementation of the user api, to comply with the standard.

2.6 Other programming models

There is no standard other than opencl for programming heterogeneous machines. However, many approaches have been proposed for particular targets: compilation from a subset of the C language, for instance for vhdl generation; extension of an existing language with target-specific concepts, e.g. cuda; or use of sequential code annotated with directives, e.g. Hybrid Multicore Parallel Programming (hmpp).

3. Designing compilers for heterogeneous machines

Until the end of the 20th century, one kind of hardware architecture dominated the general-purpose machine market: the von Neumann model. Consequently, the major compilers were built to target machines implementing this model efficiently: a front end parses the input code and translates it into an intermediate representation, a middle end applies various optimizations at the basic block, loop, function or program level, and a back end generates target-specific assembly code. These three components have been widely studied and are well understood, as illustrated by the regular updates of the Dragon Book [ALSU06].

The growing complexity of compiler architecture itself, crystallized by the difficulty of adapting to heterogeneous targets, favors a more modular approach. Indeed, it seems reasonable to reuse existing compilers, applications and libraries that can perform a specific task efficiently. Combining them instead of reinventing the wheel should be possible. PLuTo [BBK+08] is a good example of such an approach: the tool combines a C front end, a polyhedral optimizer, a cuda code generator and nVidia's cuda compiler to generate code that runs efficiently on gpus.

Application build schemes also grow in complexity when it comes to assembling object files generated from different languages by different compilation chains. For instance, the sloccount tool counts 210 Source Lines Of Code (sloc) in common.mk, a generic Makefile shipped with the cuda sdk. This makes code generation harder: not only must code be generated for different targets, but a way must also be found to link the results within a single application. This task, which can already be non-trivial in a homogeneous environment, can become very complex in a heterogeneous one.

Chapter 3 studies the impact of the heterogeneous machine model, presented in Chapter 2, on compiler construction. It proposes combining a generic compilation infrastructure with a programmable pass manager to tackle the aforementioned problem. This approach focuses on modularity, reusability, retargetability and the flexibility of the scheme as a whole.

First, Section 3.1 studies the adequacy of production compilation infrastructures to the new target that heterogeneous machines represent, and proposes a model to support programmable pass management, a critical aspect of compiler modularity. Then, Section 3.2 argues in favor of using source-to-source transformations, with C source files as a means of communication between tools. Finally, Section 3.3 introduces a high-level api for building pass managers. It exposes a sufficient abstraction of the compiler's internal behavior to the developer, who can thus reason efficiently at the pass level. The complete scheme is illustrated by the Pythonic PIPS (pyps) interface developed on top of the pips compilation infrastructure. Related works are reviewed in Section 3.4.

3.1 Extending compilation infrastructures

Classical compilation infrastructures rely on a three-tier architecture: one or more front ends, a common middle end and one or more back ends. The diversity of the accelerators found in a heterogeneous machine makes the use of a single middle end difficult: as many middle ends as targets are needed, which also limits the possibility of having several back ends. The problem that arises is that of code reuse across middle ends, and of the composition of transformations into complex schemes. We propose a code transformation model that guarantees the validity of their composition and of their application. This model is used by the pass manager, the entity responsible for scheduling code transformations within the middle end. A class hierarchy is derived from this model, leading to an api usable by the pass manager.


3.2 About source-to-source compilers

A source-to-source compiler is a compiler that takes as input code written in a high-level language and produces code written in a high-level language. This approach is often used by parallelizing compilers, which thus avoid having to handle binary code generation. It is particularly relevant for heterogeneous computing, since target-specific source-to-binary compilers often exist, but generate code of average quality. A source-to-source compiler can interface with such tools, or with other source-to-source compilers, to generate better code. Put differently, it seems beneficial to use varied, complementary tools to tackle varied targets.

3.3 pyps, a high-level api for pass management

We implemented the api mentioned above in the Python language, on top of the pips source-to-source compilation infrastructure. Using a scripting language shortens development cycles and makes it possible to quickly prototype complex chains of code transformations, working only at the pass manager level. All the compilers assembled during this thesis are based on this implementation, named pyps.

3.4 Related works

The complex assembly of code transformations within a compiler has given rise to works aiming at demonstrating the limitations of the traditional approach in the context of iterative compilation, and at producing modular, or even programmable, pass managers. Other works focus on the validity of pass composition and on the semantics that can be attached to it. Finally, some address the problems related to extending the internal representation to model new targets while capitalizing on existing passes.

4. Representing an instruction set in the C language

During a talk given at the Fusion Developers Summit 2011 in Bellevue, Washington, Phil Rogers announced that

The Fusion System Architecture (fsa) is isa-agnostic, both for gpus and Central Processing Units (cpus). This is a very important point, because we are inviting partners to join us across all sectors; other hardware companies to implement fsa and to join us around this platform. . .

Under the hood, the Fusion System Architecture (fsa) relies on a virtual isa. In Chapter 3, we stated that, in order to keep a good level of abstraction, it is important to keep the internal representation hardware-independent. But is it possible to represent all the refinements of the target isa in an internal representation close to the C language [ISO99]? According to Brian Kernighan [Ker03],

C is perhaps the best balance between expressiveness and efficiency ever struck by a programming language. (. . .) It was so close to the machine that you could see what the machine code would be (and it was not hard to write a good compiler), but it was careful to stay above the instruction level, so as to be able to target any machine without having to think about tricks for one particular machine.

This reminds us that the C language was designed to be close to the hardware. Thus, even if the chosen internal representation stays close to C, it can be low-level enough to express some of the characteristics specific to the targeted isa. This aspect is examined in Section 4.1. We then go through all the aspects of an isa and show that, provided certain conventions are followed and after suitable transformations, C code can be adapted to the constraints of a specific isa. Section 4.2 examines basic data types; Section 4.3 lists the various kinds of specific registers; Section 4.4 details the links between intrinsic functions and machine instructions; and Section 4.5 surveys the differences induced by the memory hierarchy. Problems related to function call boundaries are examined in Section 4.6, and calls to external libraries in Section 4.7.

4.1 C as a common denominator

A study of the languages used by five compilers for accelerators, Handel-C, Mitrion-C, c2h, cuda and opencl, shows that dialects of the C language are often used as input languages by compilers targeting specialized hardware. The translation into intrinsic-free C of the xmmintrin.h header file, which provides the intrinsics for the Streaming simd Extension (sse) instruction set, shows that some hardware features can sometimes be represented directly in C.
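To fix ideas, here is a hedged sketch of what such a translation looks like for one sse operation; the struct layout and the helper name are ours, not part of xmmintrin.h:

    #include <stddef.h>

    /* Plain-C stand-in for the __m128 type: four packed floats. */
    typedef struct { float f[4]; } v4sf;

    /* Reference semantics of the _mm_add_ps intrinsic: element-wise
       addition; a vectorizing back end can map this loop back to a
       single packed addition. */
    static v4sf v4sf_add(v4sf a, v4sf b)
    {
        v4sf r;
        for (size_t i = 0; i < 4; i++)
            r.f[i] = a.f[i] + b.f[i];
        return r;
    }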

4.2 Native data types

Some native C data types may not be supported by the target compiler. In that case, it is sometimes possible to fall back to a supported type through a code transformation: using fixed-point arithmetic in the absence of floating-point support, splitting structures into as many variables as they have fields, or replacing fixed-size arrays by as many scalars.
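A minimal before/after sketch of the last two transformations (the names are illustrative):

    /* Before: a structure and a fixed-size array. */
    struct cplx { float re, im; };

    float norm2_before(void)
    {
        struct cplx c = { 1.f, 2.f };
        float buf[2] = { c.re, c.im };
        return buf[0] * buf[0] + buf[1] * buf[1];
    }

    /* After: the structure is split into one variable per field and
       the array into one scalar per element; no aggregate remains. */
    float norm2_after(void)
    {
        float c_re = 1.f, c_im = 2.f;
        float buf_0 = c_re, buf_1 = c_im;
        return buf_0 * buf_0 + buf_1 * buf_1;
    }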

4.3 Registers

The C language makes it possible to suggest the placement of a variable in a register through the register keyword. If the target machine has dedicated registers, this keyword can be repurposed, together with a specific naming convention, so as to assign particular variables to particular registers while staying at the C level.
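A sketch of such a convention, assuming (for illustration only) that the prefix vreg_ pins a variable to a dedicated register:

    /* Hypothetical convention: a register variable whose name starts
       with vreg_ is pinned by the back end to a dedicated register;
       any other C compiler just sees ordinary, valid C. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        register float vreg_acc;   /* mapped to, e.g., register v0 */
        for (int i = 0; i < n; i++) {
            vreg_acc = a * x[i] + y[i];
            y[i] = vreg_acc;
        }
    }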

4.4 Instructions

It is common for a given architecture to have specific instructions with no direct C equivalent: vector instructions, atomic operations, Fused Multiply-Add (fma) or dma are classical examples. The traditional approach in this case is to represent these instructions as intrinsic functions, i.e. functions with a valid C signature that are treated specially by the compiler so as to generate the appropriate assembly instruction.
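A minimal sketch for fma; the function name is illustrative, not a real compiler intrinsic:

    /* Illustrative intrinsic for a Fused Multiply-Add instruction:
       an ordinary C function for every tool in the chain, replaced
       by a single fma instruction by the final back end. */
    static float fma_intrinsic(float a, float b, float c)
    {
        return a * b + c;   /* reference semantics */
    }

    float dot3(const float u[3], const float v[3])
    {
        float acc = 0.f;
        acc = fma_intrinsic(u[0], v[0], acc);
        acc = fma_intrinsic(u[1], v[1], acc);
        acc = fma_intrinsic(u[2], v[2], acc);
        return acc;
    }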

4.5 Memory architecture

Heterogeneous computing brings several memory spaces into play and makes them communicate. The C language provides no way to distinguish a variable declared in one memory space from a variable declared in another. Here again, using appropriate naming conventions makes it possible to leave the internal representation unchanged.
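A sketch of one such convention; the prefixes are assumptions made for this example, not a fixed scheme:

    /* Hypothetical convention: the prefix of a variable name encodes
       the memory space it lives in. The internal representation stays
       plain C; a later pretty-printing step maps prefixes onto target
       qualifiers, e.g. accel_ onto cuda's __device__. */
    float host_input[1024];    /* host memory        */
    float accel_input[1024];   /* accelerator memory */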

4.6 Function calls

Some architectures provide no low-level mechanism for performing a function call. In that case, procedure inlining can be performed. Conversely, function calls can be used to simulate a call to an accelerator. In that case, a portion of code must be selected and extracted into a new function that will represent the call. This transformation relies on memory effect analysis to limit the number of parameters of the new function, and to restrict the use of pointers simulating pass-by-reference to the cases where it is actually needed.
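A minimal before/after sketch of this outlining transformation (function names are ours):

    /* Before: the loop selected for offloading sits in its caller. */
    void smooth(int n, float img[n])
    {
        for (int i = 1; i < n - 1; i++)
            img[i] = 0.5f * (img[i - 1] + img[i + 1]);
    }

    /* After: the loop is outlined into a function standing for the
       accelerator call. Memory effect analysis shows that only n and
       img are referenced, so they are the only parameters; img is
       passed by reference (as an array) because it is written. */
    static void smooth_kernel(int n, float img[n])
    {
        for (int i = 1; i < n - 1; i++)
            img[i] = 0.5f * (img[i - 1] + img[i + 1]);
    }

    void smooth_outlined(int n, float img[n])
    {
        smooth_kernel(n, img);   /* stands for the accelerator launch */
    }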


4.7 Calls to external libraries

In the context of interprocedural analyses, one may need to know the behavior of functions for which no complete implementation is available. The function provider splits this problem into several parts. It is a component able to supply several versions of the same function: one that reproduces the memory effects of the function, intended for the compiler; and one per accelerator, corresponding to its actual implementation. Target-specific library calls can thus be abstracted away.
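A sketch of the compiler-facing version for a hypothetical library routine fir (the routine and its signature are assumptions made for this example):

    /* Version handed to the compiler: no useful computation, but the
       same memory effects as the real routine (reads in[0..n-1],
       writes out[0..n-1]), which is all the interprocedural analyses
       need. The versions linked into the final binaries are the
       accelerator-specific implementations, one per target. */
    void fir(int n, const float in[n], float out[n])
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i];
    }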

5. Exploiting parallelism with multimedia instructions

A recurring feature of hardware accelerators is the use they make of parallelism. This parallelism can be found at several levels and can take several forms, usually a mix of simd-style and mimd-style parallelism, as is the case for gpgpus. These two kinds of parallelism have been studied at length, with a focus on loop parallelization: hyperplane search [Lam74], handling of control dependencies [AKPW83], loop vectorization [AK87], parallelism extraction [WL91b], supernode partitioning [IT88], communication optimization [DUSsH93], interactions with cache memory [KK92], tiling [DSV96, AR97, YRR+10]. David F. Bacon et al. surveyed [BGS94] code transformations for High Performance Computing (hpc), transformations that mostly deal with loops. Vivek Sarkar studied the automatic selection of transformations based on a cost model [Sar97]. All these techniques have been successfully applied in research compilers such as SUIF [WFW+94], Polaris [PEH+93], pips [IJT91, AAC+11], Rose [Qui00] or Pocc [PBB10], and in production compilers such as IBM XL [Sar97], Low Level Virtual Machine (llvm) [GZA+11], gnu C Compiler (gcc) [TCE+10], Intel C++ Compiler (icc) [DKK+99], pgi [Wol10].

In this chapter, we focus on two aspects of code parallelization: Instruction Level Parallelism (ilp) and the parallelization of reductions. The first aims at taking advantage of instruction-level parallelism opportunities, as found in code sequences or inside loops, using the Multimedia Instruction Sets (miss) available on most modern processors. This aspect is described in Section 5.1. The second aspect is a critical problem when code involving a reduction must be executed on hardware in pure simd mode. It is addressed in Section 5.2. Section 5.3 proposes a simple model, based on the parallelism found on distributed-memory accelerators, for deciding whether or not offloading a computation is profitable.


5.1 Instruction-level parallelization

Vector instructions can be generated in two ways: at the loop level, using vectorization techniques [BGS94, Bik04], or at the sequence level, using pattern detection algorithms [LA00, SHC05]. We propose a hybrid algorithm able to exploit the parallelism present both in loops and in sequences. The idea is to apply high-level loop transformations such as tiling, then to unroll the inner loops so as to expose sequences to which pattern detection can be applied. This algorithm operates on a generic instruction set [Roj04] characterized by the size of its vector registers, and can thus target all the instruction sets it subsumes.

A recurring problem related to the generation of vector instructions is that of memory transfers. The proposed pattern detection algorithm maintains a complete state of the contents of the vector registers, and uses this knowledge to limit the number of transfers from main memory by replacing them with copy or shuffle operations between vector registers.
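The following sketch shows the shape of the code produced, not the detection algorithm itself; the SIMD_* operations are illustrative placeholders for the generic instruction set:

    /* Illustrative generic 4-wide operations, standing for the generic
       instruction set; a back end maps them to sse, avx or neon. */
    #define SIMD_LOAD(v, p)   for (int k_ = 0; k_ < 4; k_++) (v)[k_] = (p)[k_]
    #define SIMD_ADD(r, x, y) for (int k_ = 0; k_ < 4; k_++) (r)[k_] = (x)[k_] + (y)[k_]
    #define SIMD_STORE(p, v)  for (int k_ = 0; k_ < 4; k_++) (p)[k_] = (v)[k_]

    /* After strip-mining by the register width and fully unrolling
       the inner loop, the straight-line body matches vector patterns
       (n is assumed to be a multiple of 4 to keep the sketch short). */
    void vadd(int n, const float *a, const float *b, float *c)
    {
        float va[4], vb[4], vc[4];
        for (int i = 0; i < n; i += 4) {
            SIMD_LOAD(va, &a[i]);
            SIMD_LOAD(vb, &b[i]);
            SIMD_ADD(vc, va, vb);
            SIMD_STORE(&c[i], vc);
        }
    }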

5.2 Parallelization of reductions

From the parallelism point of view, reductions are a bottleneck. Several well-known techniques exist for extracting parallelism from loops carrying a reduction [KRS90, Lei92]. These techniques have been extended to instruction sequences, independently of any enclosing loop. The idea is to identify each reduction variable of the sequence [JD89], and to create an array of the appropriate size so as to postpone the reduction to the end of the sequence.

Whether at loop exit or at sequence exit, reduction parallelization involves a postlude that is not parallel. To avoid having to perform this reduction sequentially, dedicated hardware mechanisms may exist. To abstract this concept, the handling of the postlude is delegated to a third-party library, of which we provide a sequential implementation; it can be replaced by a hardware implementation when one is available.
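A minimal sketch of the transformation on a summation (names and the vector width of 4 are assumptions for the example):

    /* Postlude helper: a sequential fallback that a target library
       may replace with a dedicated hardware reduction. */
    static float reduce_add(int w, const float p[w])
    {
        float s = 0.f;
        for (int k = 0; k < w; k++)
            s += p[k];
        return s;
    }

    /* The reduction variable is expanded into four independent
       accumulators, so the main loop runs in pure simd mode; the
       non-parallel part is postponed to the postlude (n is assumed
       to be a multiple of 4 to keep the sketch short). */
    float sum(int n, const float a[n])
    {
        float partial[4] = { 0.f, 0.f, 0.f, 0.f };
        for (int i = 0; i < n; i += 4)
            for (int k = 0; k < 4; k++)
                partial[k] += a[i + k];
        return reduce_add(4, partial);
    }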

5.3 Estimating computational intensity

Parallelizing a loop is not always beneficial. This is especially true for accelerators with a distributed memory, because of memory transfer times. By computing the sum of the volumes [Cla96] of the convex array regions read and written by a statement, an upper bound on the memory footprint of that statement can be obtained. Similarly, for static control code, the number of instructions executed by the code can be estimated. Based on this estimate and on the memory footprint, the ratio between computation and communication can be estimated, yielding a local, conservative decision criterion: a loop should not be parallelized if the order of magnitude of the memory transfers is not lower than the order of magnitude of the number of executed instructions.
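As a worked illustration of this criterion (using the usual textbook operation counts, assumed here rather than measured): for the product of two N x N matrices, the convex array regions read and written are bounded by 3N^2 elements, while about 2N^3 instructions are executed, so the transfer/computation ratio is about 3N^2 / 2N^3 = 3/(2N); for large N the transfers are an order of magnitude below the computation and offloading passes the criterion, whereas an element-wise array addition (about 3N^2 transfers for N^2 operations) fails it.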

6. Transformations for memory size and distributed memories

Wm. A. Wulf and Sally A. McKee concluded their 1995 paper [WM95] "Hitting the Memory Wall: Implications of the Obvious" with the following sentence:

The most "convenient" resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology (. . .).

Fifteen years later, such a technology still does not exist, and memory aspects remain a critical problem for many parallel applications. In the context of heterogeneous computing, where the host memory and the accelerator memory are often separate, this hardware constraint must be handled with care. In this setting, we propose three generic transformations: statement isolation, which separates the accelerator memory space from the host memory space, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest such that the inner loops fit in the target memory, presented in Section 6.2; and redundant transfer elimination, presented in Section 6.3.

6.1 Statement isolation

Statement isolation is a transformation that isolates a specific statement in a new memory space emulated by newly allocated variables. The idea is to take all the variables referenced by the statement and replace them with new variables of the same type. A transfer generation step then produces the copies from the old variables to the new ones, and back, so as to guarantee the consistency of the values read by the statement.

This transformation involves two forms of optimization: based on the array regions read and written by the statement, it is able to place the arrays referenced by the statement into smaller arrays, thereby limiting the size of the transfers; and based on the array regions produced and consumed by the statement, it is able to generate copies only when they are useful, thereby limiting the number of transfers.
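A minimal before/after sketch (function names are ours; memcpy stands for the generated transfers):

    #include <string.h>

    /* Before: the loop reads and writes a[10..19] in host memory. */
    void incr(float a[100])
    {
        for (int i = 10; i < 20; i++)
            a[i] += 1.f;
    }

    /* After: the statement is isolated in a fresh array emulating
       the accelerator memory. The read and written regions both
       cover exactly a[10..19], so a 10-element array suffices, and
       both the copy-in and the copy-out are generated because the
       region is read and written. */
    void incr_isolated(float a[100])
    {
        float a_iso[10];
        memcpy(a_iso, &a[10], sizeof a_iso);   /* host -> accelerator */
        for (int i = 0; i < 10; i++)
            a_iso[i] += 1.f;
        memcpy(&a[10], a_iso, sizeof a_iso);   /* accelerator -> host */
    }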


6.2 Memory footprint reduction

Memory footprint reduction amounts to searching for tiling parameters that guarantee that the memory volume needed to perform the computation of one tile stays below a given bound. To this end, a parametric rectangular tiling of the loop nest under consideration is first performed. Once the nest is tiled, an upper bound on the memory footprint of the per-tile computation is derived from the volume of the convex regions read and written. This yields an expression in the tiling parameters, which we then seek to maximize. The parameters found are frozen, going back to a static tiling that guarantees that the memory capacity constraint is met.
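A one-dimensional sketch of the result, under an assumed accelerator capacity of 3072 floats:

    /* For one tile of size T, the regions read and written cover T
       elements of each of a, b and c, so the footprint bound is 3*T
       floats. Maximizing T under 3*T <= 3072 freezes the tiling at
       T = 1024. */
    #define T 1024
    void vadd_tiled(int n, const float *a, const float *b, float *c)
    {
        for (int t = 0; t < n; t += T)                 /* tile loop */
            for (int i = t; i < t + T && i < n; i++)   /* fits in memory */
                c[i] = a[i] + b[i];
    }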

6.3 Redundant transfer elimination

This transformation extends the elimination of redundant loads and stores, well known for scalar registers. It proposes to consider each array as a register, and each memory transfer generated by statement isolation as a register assignment. Based on this formalism, the redundant transfer elimination algorithm hoists memory transfers as high as possible in the internal representation, as long as the hoisting satisfies Bernstein's conditions [Ber66], and uses elimination rules to merge some redundant accesses. This traversal of the internal representation can be done intra- or interprocedurally, depending on the characteristics of the target.
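A minimal sketch of the effect on a time loop (the copy and kernel helpers are illustrative stand-ins for generated transfers and offloaded code):

    #include <string.h>
    #define N 256
    #define STEPS 100

    /* Illustrative transfer primitive and kernel. */
    static void copy(float *dst, const float *src)
    { memcpy(dst, src, N * sizeof *dst); }

    static void kernel(float img[N], const float a[N])
    { for (int i = 0; i < N; i++) img[i] += a[i]; }

    /* Before: isolation left one transfer pair inside the time loop. */
    void run(float img[N], const float a[N])
    {
        float accel_img[N], accel_a[N];
        for (int t = 0; t < STEPS; t++) {
            copy(accel_img, img);    /* redundant after iteration 0 */
            copy(accel_a, a);        /* a never changes on the host */
            kernel(accel_img, accel_a);
            copy(img, accel_img);    /* only needed after last step */
        }
    }

    /* After: Bernstein's conditions allow both loads to be hoisted
       above the loop and the store to be sunk below it. */
    void run_optimized(float img[N], const float a[N])
    {
        float accel_img[N], accel_a[N];
        copy(accel_img, img);
        copy(accel_a, a);
        for (int t = 0; t < STEPS; t++)
            kernel(accel_img, accel_a);
        copy(img, accel_img);
    }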

7. Compiler implementations and experiments

This thesis presents and describes a methodology for specializing compilers for various heterogeneous platforms, based on a well-stocked toolbox of source-to-source transformations, an api for programmable pass managers and a simple description of the hardware. It would not be complete without experimental validation. The proposed methodology claims to make compiler assembly easier. To validate it, we chose five different targets: three general-purpose cpus with different vector units, an fpga-based processor [BLE+08] specialized in image processing, and an nVidia gpu. For each of them, we developed a prototype compiler using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these prototype compilers is measured on benchmarks or on applications from the relevant domain.


7.1 A naive OpenMP compiler

The chapter begins with a simple Open Multi Processing (openmp) directive generator, presented in Section 7.1, to show how to apply the principles discussed in this thesis to a real, practical case. The compiler for gpus implemented by hpc project based on our work is detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from the C language to terasm, the assembly language used by the terapix image processing processor. Finally, a retargetable compiler for multimedia instruction sets is described in Section 7.4 for three targets: sse, Advanced Vector eXtensions (avx) and neon.

For openmp directive generation, we begin by characterizing the available hardware as using a shared memory and mimd-style parallelism, which identifies the code transformations to use, namely mainly parallelism extraction and detection. There is no post-processing step, because openmp directives are an integral part of the internal representation. The prototype compiler assembled this way is validated on the polybench benchmark suite.
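A hand-written sketch of the kind of output such a generator produces (not actual tool output):

    /* Input: a kernel in the polybench style. */
    void scale(int n, double A[n][n], double k)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] *= k;
    }

    /* Output: the outer loop carries no dependence, so the generator
       annotates it with an openmp directive. */
    void scale_omp(int n, double A[n][n], double k)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A[i][j] *= k;
    }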

7.2 A compiler for GPUs

Code generation for gpus is richer: the non-shared memory must be taken into account. Only simd-style parallelism is considered, ignoring the mimd capabilities of the hardware. The statement isolation and redundant transfer elimination transformations are used. The compilation scheme is also more complex, since it involves a step separating host code from accelerator code, a step converting C into the cuda language used by the accelerator's source-to-binary compiler, and the final generation of the accelerator binary by the latter, in addition to code generation for the host. The pass manager is used to combine these different steps, procedure extraction making it possible to generate as many compilation units as needed. The compiler assembled from these building blocks is validated on several signal processing kernels, on which average speedups of x25 are obtained on large data sets.

7.3 A compiler for an FPGA-based image processor

The terapix accelerator, dedicated to image processing, poses an additional challenge, because the accelerator memory cannot hold enough data to process an image in a single pass. Moreover, its isa is very specific, in particular because of its Very Long Instruction Word (vliw) instruction set. The memory footprint reduction transformation lifts the size-related limitations. The compilation scheme used adds one pass compared to the gpu case: the translation of a sequential instruction stream into a vliw instruction stream. This step is handled by a third-party tool, so the generated code must be formatted to satisfy its input constraints, using low-level transformations such as array linearization, iterator detection or the conversion of for loops into while loops. The complete compilation chain automatically translates image processing kernels written in C into terapix assembly, yielding kernels whose performance is close to a manually optimized version (cycle counts around 125% of the optimal cycle count).

7.4 A retargetable compiler for vector instruction sets

Code generation for general-purpose processors with a small vector unit, e.g. of the sse kind, poses a different challenge: the memory constraints are weaker, although similar to the previous cases. Here, the key to performance is the automatic extraction of simd-style parallelism over small vectors. The hybrid vectorization algorithm presented in Chapter 5 addresses these constraints, and combines with redundant transfer elimination to limit the number of transfers. Since the transformations involved are generic, the compiler assembled this way is target independent (apart from the size of the vector registers) and is easily retargeted from one instruction set to another. In practice, this compiler generates more efficient code than gcc and slightly less efficient code than icc, but supports more architectures than icc, for instance ARMv7 and the neon instruction set.

8. Conclusion

The quest for performance now goes through heterogeneous machines: even the laptop used to write this thesis can use the computing capabilities of two general-purpose processors, of their two associated sse vector units, and of a gpgpu. The main problem with these computation units is the difficulty of programming them.

In this thesis, we chose the compilation path to automate code production for hardware accelerators. We focused on the ability to quickly produce several compilers for different targets. Since modern hardware is usually already programmable in some C dialect, we set ourselves the goal of automatically translating standard algorithms written in C into various computation kernels written in C dialects, and of generating the code that performs the calls to an accelerator from the host processor.

The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the next. This lowers compiler production costs, and using a source-to-source compilation infrastructure makes it possible to interact with existing tools, in particular with the compilers that generate dedicated binary code from C dialects.

8.1 Contributions

A methodology for building source-to-source compilers

We proposed to model hardware accelerators with hardware constraint diagrams. These diagrams identify the optional and mandatory constraints associated with the hardware. Manually associating these constraints with code transformations guides the compiler developer through the development process.

Design of a generic compilation infrastructure

The heterogeneity of hardware accelerators makes it hard to build a single compiler able to target them all. However, many applications already exist that can handle some of the problems raised by these machines. In Chapter 3, we proposed a compilation scheme that combines a toolbox of source-to-source code transformations, an api for pass management and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by hpc project.

Transformations for isa constraints

Hardware accelerators owe their acceleration to their specialization: one could say that they are more efficient over a narrower application domain. The direct consequence is a specialization of the isa. This specialization shows up in the C dialects offered for programming these accelerators. Chapter 4 proposes a set of source-to-source transformations for refining high-level C code into lower-level code. This set notably includes an original outlining algorithm based on convex array regions.

A hybrid slp algorithm

Multimedia instruction sets are now found on all general-purpose processors, and even on hybrid cpu/gpu chips. We developed an original algorithm based on the state of the art in loop and sequence vectorization. This algorithm unifies the two approaches and is parametrized by a C-level description of the isa. It thus satisfies the retargetability criteria set out in Chapter 3. The algorithm has been validated on three Multimedia Instruction Set families: sse, avx and neon. This work was awarded the third best poster prize at PACT 2011.


Transformations for memory constraints

Memory aspects are critical for many heterogeneous systems: when an accelerator does not share a memory space with its host, rpcs and dmas are required. The programming model is much more complex than the classical ones. We presented in Chapter 6 three code transformations that take these aspects into account: statement isolation separates the accelerator memory space from the host one; memory footprint reduction finds a tiling matrix guaranteeing that the accelerator has enough memory to execute the application tile by tile; and redundant transfer elimination removes useless data movements.

Implementation

All the transformations presented in this thesis have been developed within the pips source-to-source compilation infrastructure for the C language, and have been assembled using the pyps pass manager. They led to the implementation of four compilers: a prototype openmp directive generator, a retargetable compiler for vector instruction sets, a microcode generator for terapix, the fpga-based processor dedicated to image processing, and a code generator for gpus developed by hpc project. This validates both the compilation infrastructure as a whole and the algorithms proposed in this manuscript. The experiments and the specific compilation flows are detailed in Chapter 7.

Contributions to the pips community

It is hard to separate research activity from development activity in a computer science thesis. Integrating new transformations into the chosen infrastructure, and extending to the C language passes designed for Fortran, are indispensable activities for supporting the research work, but they require a significant time investment. As a member of the pips team, I took charge of modernizing the project's compilation infrastructure and rationalized its packaged distribution.

I supervised five students at Télécom Bretagne during internships around the pips project, and I contributed to the scientific outreach of the tool through two tutorials at international conferences.

8.2 Future work

The hpc world is constantly evolving. A Sparc64-based supercomputer reached the top of the June 2011 top500, while nVidia gpus were leading the dance six months earlier. In this shifting environment, nothing is settled yet, and hardware vendors keep proposing their own standards in the hope of a common programming model backed by efficient engineering tools. This requires cooperation and many interactions between tools. In this context, bridging the gap between the opencl standard and existing vhdl generators is an interesting challenge and a still-open research topic.

However, hpc remains a niche market compared to embedded systems and smartphones. In these domains, hardware constraints matter even more: energy consumption, weight, volume, etc. The code transformations and the approach studied in this thesis can certainly find applications there.

As miss become more and more flexible, it is increasingly common to find non-simd instructions in them. These instructions allow more elaborate load/store patterns (e.g. non-contiguous accesses) and make it possible to reach better performance for applications bound by their memory accesses. Incrementally adding to our compiler for miss code transformations able to generate these instructions is a promising topic.

We see two possible extensions of our work on pass managers. First, the operator combination we described leads to the construction of a directed graph that offers coarse-grain parallelization opportunities at the pass manager level. This would improve compilation times by making the processing parallel. Second, it appears that some pass combinations are useless or redundant. Attaching precise semantics to code transformations would make it possible to eliminate some call sequences, for instance in the context of iterative compilation.


Chapter 1

Introduction

Pont de l'Iroise, Brest, Finistère © lazzarello / flickr

Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by processor activity at maximum switching speeds. William J. Dally calls this "The End of Denial Architecture" [Dal09]. To overcome thermal limitations, chip makers increased the number of cores on each die, producing multicore processors, an entirely new direction of development. As transistors have continued to shrink, multiple cores have provided additional computing power by putting more in the same die area with no increase in clock frequency. A second power wall is forecast when transistors become so densely packed that silicon cannot conduct enough heat even at current, fixed clock rates. To continue improving application performance as this second wall approaches, co-processors have emerged and paved the way to heterogeneous computing, where several computation units of different types collaborate. There are many types of co-processors, from specialized ones, which are highly efficient at certain tasks, to more general ones. All share the goals of low execution time and power consumption. In this manner, two general paths to high performance have arisen: homogeneous many-core machines and heterogeneous machines featuring various accelerator technologies. Of course, creative processor architects have produced exceptions to prove this rule. Intel's Larrabee [SCS+08] processor, for example, is a multicore design with a vector arithmetic unit on-chip. The emerging design space is complex and likely to change continuously.

Beginning some thirty years ago, engineers built a large variety of parallel machines. The Connection Machine and the MasPar MP2 are notable examples. At that time, supercomputers like the Cray-1 cost $58 M and could perform an average of 80 Mflops. By 2006, the PlayStation 3 (ps3) video game console based on the Cell be architecture cost $500 and could achieve 230 Giga FLoating point Operations per Second (flops) thanks to a combination of general-purpose and specialized processors. Fewer than a hundred units of the Cray were built, while more than 50 million ps3 consoles were produced. This change of market inevitably created a need for better development tools. By 2006, parallelism was no longer the concern of specialists, as it was in the Cray era.

More can be learnt from the ps3 story: although the ps3 was launched by Sony one year after Microsoft's xbox360, the latter had greater success, in great part due to a larger game catalog. It turned out that many game development studios found development for the ps3 and its heterogeneous architecture to be too difficult. Indeed, the Cell be architecture involved separate memory spaces, manual control of data transfers, manual handling of a 128-entry vector register file in a vector-only fashion, and manual cache management. This complexity proved too much for average developers to master.

Actually, there are, and probably always will be, three ways to handle such hardware complexity: hire an expert engineer who has a comprehensive understanding of the machine; develop a specialized library that exposes the hardware capabilities in an Application Programming Interface (api); or build a compiler that translates high-level code into the target machine language.

The first option is versatile but costly. The second is efficient but lacks flexibility. Drawing maximum performance from complex hardware using only a limited number of api calls may be impossible. The last approach combines the advantages of the first two, given that the needed compiler can be written at reasonable cost. nVidia, for example, has successfully adopted the compilation approach for their General Purpose gpus (gpgpus) technology. Most nVidia gpgpu programmers use an extension of C/C++—the Compute Unified Device Architecture (cuda) language—and the nVidia compiler nvcc.

Indeed, heterogeneous computing places entirely new demands on the compilation process. Quoting the dragon book [ALSU06],

Definition 1.1. A compiler is a program that can read a program in one language—the source language—and translate it into an equivalent program in another language—the target language.

There is obviously no mention that a compiler may need to transform its one or more inputs into several outputs—one for each accelerator processor in a heterogeneous system, each in its own language. Previous architectures simply did not require this complex behavior. In the common case where one input language results in several target programs, the job is complicated, requiring the compiler to internally model complete system performance in order to make sound optimization decisions.


The most difficult case occurs when the source code is in a legacy language with no explicitly parallel constructs. Compilers failed in the 80's because of the difficulty of extracting parallelism from sequential codes. Automatic parallelization is impossible in the general case, because parallel algorithms are completely different from sequential algorithms and a compiler cannot create new algorithms. Even for cases where parallelism detection is within the scope of an automated tool, it is made difficult when code is obfuscated by unconventional coding methods originally intended to optimize performance on earlier machines and compilers. As parallelism becomes ubiquitous, the way algorithms are designed will also evolve and will take parallel aspects into account. For that reason, it seems reasonable to focus on the compilation of explicitly parallel programs for heterogeneous targets, not on the automatic extraction of parallelism.

Heterogeneous environments sharpen the analogy given in [ABC+06], which depicts software as a bridge¹ between hardware and applications. If software is a bridge, then the compiler is the bridge-builder, responsible for making both ends meet. Many blocks for building such bridges in heterogeneous environments can be found in thirty-year-old literature, but not all. In [Pat10], David Patterson summarizes many aspects of research in parallel computing since the 1960's. As one might expect, his conclusion is that there is no free lunch. Although there have been success stories, there exist no global solutions to the problem of parallel computing, but rather only local solutions to local problems. It follows that there is no good reason to believe that building a compiler for any heterogeneous device will be a straightforward task.

There are glimpses of hope, however. Local solutions for particular devices can be viewed as building blocks. If these are configured for easy reuse, then creating compilers for new target devices will be simplified. The idea of building compilers by composing basic blocks in order to suit a particular heterogeneous system is the core of this dissertation. Many interesting questions follow. What are the building blocks relevant to heterogeneous computing? How are building blocks to be chained depending on the target? Is it possible to embody all possible hardware specifications in a single Internal Representation (ir)? Ultimately, is there a standard methodology to build compilers for heterogeneous devices?

In a survey of hardware-software co-design techniques published in 1994 [Wol94], Wayne H. Wolf stated that

To be able to continue to make use of the ever-higher performance CPUs made possible by Moore's Law (. . . ), we must develop new design methodologies and algorithms which allow designers to predict implementation costs, incrementally refine a design over multiple levels of abstraction, and create a working first implementation.

Seventeen years later, Wolf's challenge is largely unmet. Neither programmers' expertise nor the source code they produce has evolved as fast as hardware architectures. As a result, applications have not greatly benefited from recent alternative hardware designs, except where intense efforts could be dedicated. The sheer amount of legacy code does not admit economically feasible solutions other than compilers, which appear to be the only way to bridge the hardware-software gap. The advice Wolf gives to hardware designers still holds for compiler designers, who face the tremendous task of porting legacy codes that implement sequential algorithms written in a sequential language to ever-changing parallel co-processors: incrementally refine a design, use multiple levels of abstraction and create a working first implementation. These tasks are made more difficult by tools based on traditional compilation flows targeting a single language per tool, which are ill-suited to heterogeneous platforms.

1. Inspired by the cover of Communications of the ACM, Vol. 52 No. 10, we illustrate each chapter of this thesis with photos of classical bridges in Brittany.

This dissertation adopts Wolf's three steps to assemble compilers for various heterogeneous platforms, ranging from classical multicores to a Field Programmable Gate Array (fpga)-based image processor via a Graphical Processing Unit (gpu) and small vector units. Our objective is not to build the best possible compiler for each target, but to realize a reasonable compilation scheme for each while reusing as many building blocks as possible.

To achieve this goal, we begin with a study of heterogeneous devices and existing programming paradigms in Chapter 2. We note three families of constraints resulting from corresponding dimensions of heterogeneity: the Instruction Set Architecture (isa), the memory architecture, and the source of acceleration. Chapter 3 shows that traditional compiler frameworks lack the flexibility required to compose the fine-grained code transformations needed to overcome these constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 further examine the three families of constraints and detail innovative target-independent transformations that form the building blocks of our approach.

This dissertation is not solely a theoretical work. Several compilers have been realized with our ideas. Their design and implementation, along with performance benchmarks, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws final conclusions.

All the transformations and compilers described in this document have been implemented using the Paralléliseur Interprocedural de Programmes Scientifiques (pips) compiler infrastructure developed by the Centre de Recherche en Informatique (cri) of mines ParisTech, with contributions from the hpc project startup, Télécom Sud-Paris and Télécom Bretagne. A quick review of this project is given in Appendix A.

Most of the ideas developed in this thesis are already integrated into the Par4All tool within the hpc project.


Chapter 2

Heterogeneous Computing Paradigm

Pont Fleuri, Quimperlé, Finistère © Jean Louis Lemoigne

On February 15, 2007, nVidia introduced a way to program general-purpose applications on Graphical Processing Units (gpus), a class of hardware generally confined to the manipulation of computer graphics, through the Compute Unified Device Architecture (cuda) language and its associated Software Development Kit (sdk). Since then, General Purpose gpus (gpgpus) have dug their way into physical simulations, video file conversions and cryptography, and many heterogeneous devices have appeared as efficient solutions to perform specific computations. Many firms now propose dedicated hardware: Field Programmable Gate Arrays (fpgas) from Xilinx, Systems on Chip (soc) and Multi-Processor Systems-on-Chip (mp-soc) from Texas Instruments, or micro-controllers from Atmel.

In November 2010, the Tianhe-1A supercomputer, a cluster powered by more than 7,000 Tesla cards, ranked first in the top500. 1 Since then, heterogeneous computing has been assessed as a viable alternative to multicore computing for scientific computations.

1. As of June 2011, the second, third and fifth systems of the top500 are using nVidia gpus.



However, heterogeneous computing is rather different from homogeneous computing, particularly with respect to three critical aspects: memory, parallelism and instruction sets. This leads to a complicated interleaving of concepts that are specific to each target, but nonetheless independent from the high-level organization of the source code. An interesting illustration of this complexity was given at the Supercomputing 2010 conference, where panelists discussed the three P's of heterogeneous computing.

The first P stands for Performance: unlike general-purpose processors, which must deliver average performance for any application, the goal of hardware accelerators is to deliver high peak performance for specific applications. They achieve this goal through extensive use of parallelism, whether Single Instruction stream, Multiple Data stream (simd), Multiple Instruction stream, Multiple Data stream (mimd) or pipelining, or through the use of hard-coded, optimized routines.

The second P stands for Power: power consumption plays a key role both in embedded systems, to maximize usage time between two recharges, and in supercomputers, to minimize electricity costs and building size. Hardware accelerators are relevant candidates to improve the FLoating point Operations per Second (flops) per Watt metric, e.g. because less power is spent in non-computational operations.

The third P stands for Programmability, and it is the major weakness of hardware accelerators. Indeed, programming a hardware accelerator implies a paradigm shift from sequential programming: to circuit programming via the vhsic Hardware Description Language (vhdl) [TM08] for fpga boards, to 3D programming via e.g. the Open Graphics Library (opengl) or DirectX for gpus, or to parallel programming via e.g. cuda or the Open Computing Language (opencl) for gpgpus. Moreover, the basic execution model generally introduces the additional complexity of remote memory management, a time-consuming task that developers are not used to.

This chapter presents the heterogeneous computing model in detail in Section 2.1 and its consequences for the programming model in Section 2.2. The concept of hardware constraints is introduced in Section 2.3 as a way to model the interaction between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of using the C language as an input language to program specific hardware components. As an illustration of a standardized programming model, we give an analysis of the opencl model in Section 2.5. Other programming models are presented in Section 2.6.

2.1 Heterogeneous Computing Model

The opencl specification [KOWG10] illustrates a heterogeneous computer organization with Figure 2.1, which shows that the main characteristic of a heterogeneous device is the presence of several computational units with different capabilities.

[Figure 2.1: Heterogeneous computing model.]

The key to performance is to use these different capabilities to the best of their capacity in order to achieve efficient computations, because each device is specialized in one kind of computation: e.g. a gpgpu is well suited to intensive, regular computations, while a General Purpose Processor (gpp) performs better on irregular, unpredictable algorithms, and a dedicated fpga board performs better on the particular streaming signal processing computation it was designed to handle. Even more specialized computing units, Application-Specific Integrated Circuits (asics), are also used to match very specific designs. 2 Similarly, some hardware accelerators provide functionalities dedicated to a specific task: the ClearSpeed Advance board [GG] accelerates the Intel Math Kernel Library (mkl), and fpga-based chips have been used to speed up Photoshop™ image processing [Sin98]. Ten years later, the Cray XD1 combined amd Opterons with Xilinx fpgas to speed up the Smith-Waterman algorithm [Sto08]. The same approach is used in Intel's Stellarton, the combination of an Atom E600 and an Altera fpga, to find a balance between performance and flexibility, which shows the mainstream interest in such platforms.

2. Their main advantages over fpgas are lower unit costs, the ability to customize the circuit form and the possibility of full customization.

The devices enumerated above do not collaborate in a purely decentralized way. A host device, generally a gpp, is in charge of scheduling the computational tasks among the hardware accelerators, in a master-slave fashion. In many cases, each device has its own memory and has to communicate with the others in order to share computational tasks: this strong paradigm shift from sequential programming to distributed computing constitutes the first difficulty of heterogeneous computing, the platform model. It also has an influence on the memory model.

Heterogeneity is also found in the computational units themselves, as illustrated by Figure 2.2. Figure 2.2a describes a typical von Neumann architecture and Figure 2.2b describes the opencl view of a generic computational device.

In a recent article [Wol11], Michael Wolfe goes through no fewer than eight levels of parallelism that must be mastered to reach exascale performance. Heterogeneity is found at several levels, as illustrated by the French experimental grid Grid5000 [INR], a gathering of 9 clusters counting around 1,500 nodes, almost 3,000 processors and more than 7,000 cores. Heterogeneity is a fundamental characteristic of this grid: at the node level, with nodes from Altix, Bull, Carri System, Dell, hp, ibm or Sun; at the socket level, with for instance nVidia Tesla S1070 units available at the Grenoble site; at the core level, with 17 different kinds of processors from two main families (amd Opteron and Intel Xeon); and thus at the vector level, with different supported Streaming simd Extension (sse) versions, not to mention the different instruction sets supported from one node to another. An application that is not aware of the specificities of each node it runs on cannot achieve minimal execution times.


[Figure 2.2: von Neumann architecture vs. opencl architecture. (a) The von Neumann architecture: a control unit and an Arithmetic Logic Unit connected to memory and I/O. (b) The generic opencl node architecture: processing elements with private memories, grouped around local memories, backed by a global/constant memory data cache, global memory and constant memory.]

From the single processing element per computational unit found in homogeneous computing, heterogeneous computing moves to multiple processing elements per device. In Flynn's taxonomy [Fly72], this means a move from the Single Instruction stream, Single Data stream (sisd) paradigm to either simd or mimd. A survey [BDH + 10] by André Rigland Brodtkorb et al. further refines this view and enumerates possible organizations of a modern computational device. The main concept is that acceleration is achieved through specialization, so specialized hardware with specialized instructions or organizations is used. This constitutes the second difficulty of heterogeneous computing: the execution model.

Moreover, the memory described in Figure 2.2b is not flat and transparently shared among processing elements; rather, a memory hierarchy is exposed inside the computational device, in addition to the memory partitioning imposed by remote execution. The potential benefits for the developer are optimized cache management and data movement handling, but this also seriously complicates the development of applications. This combination of distributed memory and a hierarchical memory organization is the third difficulty posed to developers by heterogeneous computing: the memory model.

2.2 Influence on Programming Model

The heterogeneous computing model has an important impact on the programming models used to effectively program the devices. Considering the three difficulties raised in the section above, it is no surprise that many programming models have been proposed to develop applications for heterogeneous architectures.

The memory model involves ideas from the distributed computing community. Message passing protocols have been reviewed in [McB94]. Among them, active messages [vECGS92] offer a particularly elegant and efficient solution to distributed memory management, and the Message Passing Interface (mpi) [WW94] has emerged as a leading, standardized solution. Such interfaces enable fine-grain control over data movement and potentially higher performance, at the expense of manual management of data consistency. Several approaches have been proposed to relieve developers from the manual declaration of data transfers: one alternative is to use a shared virtual memory space, as described in the survey [IS99]; another is to use Remote Procedure Calls (rpc) and to rely on the runtime or the language to automatically transfer arguments [BN84]. Both approaches face the problem of scheduling calls over the underlying architecture. However, this topic is far beyond the scope of this thesis, for we only consider heterogeneous platforms with a single host and a single accelerator. Automatic generation of data transfers and scheduling of computational tasks was an active topic in the days of High Performance Fortran [Wol96, ACIK97], and gpus have brought renewed interest in the topic [LVM + 10, JPJ + 11]. Another issue of distributed computing is the difference in data representation across architectures: it forces the use of a common representation or adds translation costs to all data transfers.


[Figure 2.3: Impact of heterogeneous architecture on compilation. A sequential code is manually ported to host code and to device 0, 1 and 2 codes; each is processed by its own compiler (the host compiler and compilers 0, 1 and 2) to produce the host object and objects 0, 1 and 2.]

The load-work-store idiom is typical of rpc. It puts a new constraint on the host code, because it serializes the processing of a computational task by a computational device in three to five steps:

1. allocate: the host allocates memory on the remote accelerator. On embedded devices, this step may be optional, for memory allocation may be managed by the user;
2. load: the host transfers the data from its memory to the allocated remote (accelerator) memory;
3. work: the accelerator performs the computation on the loaded data and notifies the host when the computation ends;
4. store: the host transfers the data back from the remote memory to its own memory;
5. deallocate: the host frees the memory allocated in step 1. Likewise, this step may be optional.
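As a concrete illustration, here is a minimal sketch of this idiom written against the cuda runtime api; the kernel scale and all the names are our own, and error checking is omitted:

#include <cuda_runtime.h>

/* Illustrative kernel: scales a vector in place (device side of step 3). */
__global__ void scale(float *v, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= k;
}

void host_scale(float *data, int n, float k) {
    float *d;
    size_t size = n * sizeof(float);
    cudaMalloc((void **) &d, size);                    /* 1. allocate   */
    cudaMemcpy(d, data, size, cudaMemcpyHostToDevice); /* 2. load       */
    scale<<<(n + 255) / 256, 256>>>(d, n, k);          /* 3. work       */
    cudaMemcpy(data, d, size, cudaMemcpyDeviceToHost); /* 4. store      */
    cudaFree(d);                                       /* 5. deallocate */
}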

The main drawback of this approach is that only step 3 contributes to acceleration. All the other steps actually slow down the process.

The architecture model implies that each device involved in a heterogeneous computation may use a different instruction set, and devices generally have to be programmed in different languages. For a collection of n different accelerators, this means n different versions of the code to maintain and to evolve simultaneously, even if we use a single accelerator at a time. Figure 2.3 illustrates this concept.

The interactions get more complex as several accelerators are combined in the same heterogeneous platform. Moreover, because accelerators are not general-purpose processors, they are likely to require cross compilation. Code is strongly dependent on the device architecture combination, in either source or binary form, which is a limitation for application portability. This limitation can however be overcome by bundling all possible object codes or source codes, at the expense of larger binaries, and by making the host code aware of possible hardware changes, as supported by middleware such as StarPU [ATNW09] or kaapi [HRF + 10]. An interesting consequence of Figure 2.3 is that the host code is relatively independent of the device codes, except for the development part. Following Amdahl's law [Amd67], the host should also execute the less computation-intensive part of the code. As a result, there are very few constraints on this part of the code.

The execution model varies a lot across accelerators: there is little in common between a pipelined vector processor and a pure simd processor. However, most hardware accelerators get their speedup from parallelism, and many programming models have been proposed for parallel computing. Among them, we note automatic parallelization through loop nest analysis [Lam74, IT88, WL91b, IJT91, DSV96] or for irregular applications [DUSsH93], partitioned global address space languages [CDMC + 05], domain-specific languages like Chunk [WC03], and libraries like opengl [SA94]. Stream processing [Ste97] takes advantage of a limited form of parallelism. Task-level parallelism has also gained in popularity in the multicore era with languages like Cilk [BJK + 95], as well as directive-oriented parallelization such as Open Multi Processing (openmp) [Ope11]. This small sample of the large set of approaches introduced to bridge the gap between the user and the various parallel paradigms should convince anybody that there is no holy grail in the field. An interesting observation, however, is that most of the approaches listed above involve either the C or the Fortran language. Similar approaches have been proposed for more recent languages like Chapel or X10, or for general-purpose languages like Java, but they have attracted a smaller audience, certainly owing to the amount of legacy code. This also shows that the diversity of the hardware automatically leads to ad-hoc responses and to a one-to-one binding between compilers and hardware platforms: the hardware vendor provides a means to program the device (a language, a C extension or a library) and the user must cope with it. A notable success of this approach is the nVidia cuda [NVI11] language, which makes it possible to program nVidia graphical devices. The host code is not subject to these considerations, and the language used to develop it is not as relevant for high performance computing as it is for device codes, provided a binding to a lower-level language such as C is available, as illustrated by the interaction between gpgpus and high-level languages provided by gpulib. 3

3. gpulib [MMG08] is the evolution of the pystream project, a Python binding for cuda.

2.3 Hardware Constraints<br />

In the previous section, we described three aspects of the heterogeneous computing<br />

model. A key aspect is that there is not a unique heterogeneous computing model, but a<br />

collection of models, one per hardware target, all of them fitting in a general and extensible<br />

model, as illustrated by the feature diagram from Figure 2.4.<br />

The <strong>to</strong>pic of this dissertation is not <strong>to</strong> propose yet another taxonomy of parallel machines<br />

[Dun90, Che94, HP06], even with respect <strong>to</strong> the limited scope of heterogeneous<br />

machines. As a consequence, we selected a limited number of features among the existing<br />

ones, based on their presence on the following hardware:<br />

3. gpulib [MMG08] is the evolution of the pystream project, a Python binding <strong>for</strong> cuda.


– a desktop computer with several cores and a modern gpu board;
– a laptop with several cores with vector instruction units;
– an embedded processor with a single processor and a vector instruction unit;
– an embedded device with an fpga-based accelerator.

[Figure 2.4: Example of hardware feature diagram. A Hardware Device is described by an optional memory feature (rom or ram, shared or distributed), a mandatory isa feature and a mandatory Acceleration feature, refined into Specialization and Parallelism (simd or mimd).]

The above targets exhibit three main sources of heterogeneity that must be dealt with:

Instruction Set Architecture: the presence or absence of the following features at the instruction level requires compiler support:
– vector registers or instructions;
– complex numbers;
– maximum number of operands per instruction (generally 2 or 3);
– supported operations.

Memory: as many applications are memory-bound, taking into account the hardware specificities of memory is often critical:
– memory size;
– memory hierarchy;
– cache management;
– distributed memory;
– Read Only Memory (rom);
– Direct Memory Access (dma) flexibility;
– dma speed.

Acceleration Features: one of the motivations for heterogeneous machines is performance 4, which comes from one or more of the following features:
– specialized computation unit;
– mimd execution mode;
– simd execution mode.

4. Depending on the context, the motivation can also be development costs or power consumption.


[Figure 2.5: Multicore with vector unit feature diagram: shared ram memory, an isa, and Acceleration through Parallelism, both mimd and simd.]

Some important aspects are set aside in this dissertation, especially the memory and cache hierarchy and the availability of asynchronous dma. Both are critical to achieving high performance: cache misses and false sharing can drastically reduce performance on Central Processing Units (cpus), gpus' shared memory has much lower access latency, and asynchronous data transfers are commonly used to overlap communications with computation. The corresponding transformations, although often critical to reach high throughput, are not mandatory to build a working compiler: we follow a conservative approach that aims at producing reasonably efficient code rather than code highly specialized for a single target, following the saying "have a working code before you consider optimizing it".

Taking advantage of read-only memory, shared memory or vector types is optional, while acceleration is a mandatory feature specific to the hardware. A particular hardware device that fits this model can be described using a restricted feature diagram: Figure 2.5 shows the feature diagram of a typical multicore device with vector units.

The difference between an optional and a mandatory feature is important for performance: for a code to be executed on a specific piece of hardware, it must be aware of all the mandatory features. A code aware of optional features can turn this extra knowledge into a performance boost; e.g. for nVidia's gpgpus, the use of shared memory is optional, but it is critical to the performance of some applications while unnecessary for others.



Definition 2.1. A source-to-binary compiler is a compiler that translates its input code into machine code.

Examples of source-to-binary compilers include icc, Intel's compiler for the C++ language, and nvcc, nVidia's compiler for the cuda language. Such compilers typically accept a dedicated dialect as input:

Mitrion-C: used for C to vhdl translation;
c2h: also used for C to vhdl translation;
gcc vector types: used for automatic manipulation of multimedia instruction sets (see the sketch below);
cuda: used for nVidia gpgpu code generation;
opencl: used for generic hardware accelerator code generation.
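For instance, the gcc vector extension mentioned above can be used as follows; this is a minimal sketch, and the type name v4sf is our own:

/* A vector of four floats (16 bytes), mapped onto sse registers
   on x86 when they are available. */
typedef float v4sf __attribute__((vector_size(16)));

v4sf saxpy4(float a, v4sf x, v4sf y) {
    v4sf av = {a, a, a, a};
    return av * x + y;    /* element-wise multiply and add */
}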

The goal of a compiler like c2h or nvcc is to generate hardware code or circuits from standard C code. The accepted idea concerning such high-level synthesis is that what is gained in development time is lost in performance. However, a recent publication [CDL11] shows this approach can both reduce development time and increase performance.

This kind of generator must ensure that an original code matches all the mandatory hardware features, and may exploit the optional ones. Because some features conflict with the language, or because it is easier to support only a subset of it, these compilers move from standard C to dialects. A dialect is a reflection of the hardware features, which we call hardware constraints. A hardware constraint is embodied by a restriction or extension of the original language, and it forms the core of the difficulty of programming hardware devices using vendor compilers.

2.4 Note About the C Language

As shown above, most languages designed to abstract hardware complexity are extensions or dialects of the C language. Historically, there have been three major versions of this language: K&R C (1978), C89 (1989) and C99 (1999). A Greek Athenian comic dramatist once wrote

High thoughts must have high language.
Aristophanes, Frogs, 405 B.C.

However, many compilers still rely on C89 and do not benefit from the advantages of C99, not to mention the planned C1X. As a consequence, critical features such as native complex numbers and variable-length arrays are not used, and are replaced by structures and pointers, respectively. This greatly lowers the expressiveness of the code, making it harder to maintain, and also harder to compile. Figure 2.6 illustrates this difference on a complex matrix-vector multiply.

Most available benchmarks are written in C89; there are two typical arguments in favor of this choice:

1. It is compatible with more C compilers. In particular, the C++ language is not compatible with certain features of C99, such as variable-length arrays; 5
2. A direct code translation into assembly favors the pointer versions.

5. Contrary to the ISO/IEC 14882:2003 standard, the forthcoming C++0x standard includes most of the C99 specificities.



typedef struct { double r, i; } Complex;

void matrix_vector_multiply(int M, int N, Complex *m, Complex *v,
                            Complex *out) {
  int i, j;
  for (i = 0; i < M; i++, out++) {
    Complex *mi = m + i * N, *vi = v;
    out->r = out->i = 0.;
    for (j = 0; j < N; j++, mi++, vi++) {
      out->r += mi->r * vi->r - mi->i * vi->i;
      out->i += mi->r * vi->i + mi->i * vi->r;
    }
  }
}

(a) C89 version.

#include <complex.h>

void matrix_vector_multiply(int M, int N, double complex m[M][N],
                            double complex v[N], double complex out[M]) {
  for (int i = 0; i < M; i++) {
    out[i] = 0.;
    for (int j = 0; j < N; j++)
      out[i] += m[i][j] * v[j];
  }
}

(b) C99 version.

Figure 2.6: Complex matrix-vector multiplication, in C89 and in C99.


[Figure 2.7: Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99. Bar chart of the C99 vs. C89 speedup (from 0.4 to 1.6) for the cfar, ct, db, fdfir, ga, pm, qr, svd and tdfir kernels.]


The gnu C Compiler (gcc), the Intel C++ Compiler (icc), and the compilers from ibm, hp and pgi all support C99, so Argument 1 is mostly wrong. Among modern industrial compilers, only Microsoft's does not provide this feature. We have carried out experiments on the High Performance Embedded Computing (hpec) Challenge benchmarks [LRW05], a benchmark suite written in C89, to verify how the switch to C99 impacts performance. The original version has been completely rewritten to take advantage of C99 features; then both the old and the new versions have been compiled with icc version 12.0.3 on a desktop computer. Each benchmark is run 100 times and the median value is picked. Figure 2.7 shows results normalized against the original version using the -O3 flag. A result greater than one means the C99 version executes faster.

Figure 2.8 shows the result of the C89 to C99 conversion on the CoreMark benchmark [Con]. The behavior of both gcc and icc is evaluated and normalized with respect to the original version. The metric used is the number of iterations per second, as returned by the benchmark. The gcc compiler flags used are -O3 -ffast-math -march=native, and icc's are -O3. A desktop station running a 2.6.38-2-686 GNU/Linux kernel on an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz is used to run all the benchmarks presented above.

The same transformations have been performed on the linpack benchmark [DLP03], used to rank machines in the top500. Results are displayed in Figure 2.9: they show a small slowdown, in exchange for the readability gain displayed in Figure 2.6.

We observe that using C99 can imply a small performance loss, and that icc is more impacted than gcc. This is due to two causes:
– some kernels take advantage of pointer arithmetic to perform optimized iterations over two-dimensional arrays, while indexed arrays suffer from non-optimized address computations;
– using data allocated on the heap as array pointers involves complex casts from void* to, say, int (*)[n][m], which disrupt the pointer analysis (see the sketch below).


[Figure 2.8: Comparison of two versions of the Coremark benchmark: C89 vs. C99. Bar chart of the C99 vs. C89 speedup (from 0.6 to 1.4) for icc and gcc.]

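The second cause can be illustrated by a small sketch of our own, showing the conversion a C99 rewrite introduces when a two-dimensional array lives on the heap:

#include <stdlib.h>

/* View heap storage as a C99 variable-length array: the conversion
   from void* to int (*)[n][m] is what disrupts pointer analyses. */
void fill(int n, int m) {
    int (*a)[n][m] = malloc(sizeof(int[n][m]));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            (*a)[i][j] = i + j;
    free(a);
}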
However, C99 array declarations provide a readability gain. More importantly, polyhedral analyses are made harder by the use of non-linear array accesses. For instance, the Pluto [BBK + 08] compiler only handles affine loop nests. The Paralléliseur Interprocedural de Programmes Scientifiques (pips) [AAC + 11] compiler framework suffers from the same limitations, and most of its analyses lose accuracy in the presence of non-affine subscript expressions. Transformations have been proposed to automatically delinearize array references by separating the elements of a non-affine expression into affine groups [Mas92], or to recover arrays from pointers [FO03]. In this dissertation, we assume that input codes are written using the high-level constructs of the language and that arrays are not linearized.

2.5 OpenCL Programming Model Analysis

opencl [KOWG10] is a recent proposal to standardize the way hardware accelerators are programmed. It provides a unified language derived from C99 to write kernels, and an Application Programming Interface (api) to manage kernel calls.

It is interesting to analyse the differences between opencl and the C99 language 6:

simpler function handling: no recursion, no function pointers, no variable number of arguments;


[Figure 2.9: Comparison of two versions of the Linpack benchmark: C89 vs. C99. Bar chart of the C99 vs. C89 speedup (from 0.6 to 1.4) for icc and gcc.]

limited pointer support: pointers to types that are fewer than 32 bits wide cannot be dereferenced;
no variable-size structures: no variable-length arrays, no structures with flexible array members;
storage qualifiers: C storage qualifiers are forbidden and replaced by __global, __constant, __local or __private;
image support: built-in types and functions to manipulate 2D and 3D images;
math support: built-in geometric and mathematical functions such as cos 7 or dot, but no math library;
vector support: for all primitive types, with up to 16 elements per vector.
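To make these restrictions concrete, here is a minimal, hypothetical opencl kernel; the address-space qualifiers and the float4 type belong to the opencl dialect, and all the names are our own:

__kernel void saxpy4(float a,
                     __global const float4 *x,
                     __global float4 *y) {
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];    /* element-wise vector arithmetic */
}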

Data transfers are managed through synchronous or asynchronous dma, plus prefetching and memory fences. It is possible to manage multiple devices in a thread-safe fashion. The api also includes facilities to use opengl buffers or textures with opencl code.

It appears clearly from these differences that opencl was designed primarily for gpu devices and signal/video processing, as shown by the built-in image support, the vector types and the api sharing with opengl. Lower-level architectures suffer from the type restrictions and the absence of bit-fields. Moreover, the opencl programming model targets hierarchical arrays of processing elements with a corresponding hierarchical memory structure, as found in gpus and multicores, but not in fpgas. To partially address this shortcoming, a relaxed version of opencl, called the opencl embedded profile, has been released. It lowers the requirements on data types (no 64-bit integers, no 3D images), on floating-point compliance (Inf and NaN not required, lower accuracy requirements for some functions) and on hardware capacity (minimal image height/width, local memory size).

6. As specified in the opencl sdk reference manual.
7. Note however that the standard is less restrictive than the IEEE 754 specification concerning the Unit in the Last Place (ulp).


[Figure 2.10: Compilation flow in opencl. A single source combining host code and device code: the host compiler produces the host object, while source-to-binary compilers 0, 1 and 2 produce objects 0, 1 and 2 from the same device code.]

So opencl proposes a generic, library-based approach to the problem of hardware target heterogeneity. The core idea is to expose in a standard api the host calls to the different hardware devices. The device code is generated at runtime using Just In Time (jit) compilation by a source-to-binary compiler, from kernel source shipped within the application as a textual representation. It changes Figure 2.3 into Figure 2.10, where only one device code exists.
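On the host side, this jit step boils down to a handful of api calls. The following sketch is illustrative, not a complete host program; error handling is omitted, and the context and device are assumed to be already created:

#include <CL/cl.h>

/* Build a kernel at runtime from its textual source: the jit step
   that makes a single device source portable across targets. */
cl_kernel build_kernel(cl_context ctx, cl_device_id dev,
                       const char *src, const char *name) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);
    return clCreateKernel(prog, name, &err);
}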

A unique representation of the device code is critical for source code portability. opencl achieves source code portability at the expense of shifting the complexity onto source-to-binary compilers. Since its first release, opencl has been used to generate code for the following targets:

nVidia gpgpus combined with multicores, through nVidia's opencl compiler;
amd gpgpus combined with multicores, through the ATI Stream Software Development Kit;
Intel multicores, through the Intel® opencl sdk;
sse-enabled multicores, through the Fixstars opencl Cross Compiler;
the Cell engine, through the ibm opencl sdk.

To our knowledge, no company supports fpga code generation, and no paper on this subject has been published yet. In addition to the complexity of the translations listed above, this may be due to the existence of several tools that already perform this task [KBM07, GCB07, GNB08]. As a consequence, switching to opencl would imply short-term development costs that outweigh the long-term benefit of code portability.

opencl does not relieve developers from host-side tasks: communication management is still handled manually, the api for marshalling arguments being at a rather low level. 8 As a consequence, input code has to be completely rewritten and split into two parts: the host code with opencl calls, and the kernel code.

Also, opencl does not guarantee performance portability: an opencl kernel tuned for a specific platform is not guaranteed to behave as well on another platform. For instance, a performance study [KSA + 10] shows that moving from one gpu to another requires adjusting kernel parameters to achieve the best results.

8. Close to the way arguments are pushed on the stack during a traditional function call.

From an engineering point of view, the compilation scheme given in Figure 2.10 is limited in two ways. Firstly, there is no sharing of common optimizations between different opencl compilers. The design does not actually prevent such sharing but, as shown in the enumeration above, the trend is for each vendor to develop its own opencl compiler to support its hardware. Secondly, these compilers are basic compilers rather than optimizing compilers, in the sense that their primary goal is to generate code for the hardware, not to optimize user code. opencl leaves this task either to application developers or to compiler developers; given the cost of developing a full-fledged optimizing compiler, the task is generally left to application developers.

Still, opencl paves the way to the normalization of heterogeneous computing. In spite of its limitations, its programming model has received a good following, and the number of compilers that support it, as well as of software development tools (debuggers, profilers, etc.), is growing quickly. Chapter 3 proposes an extended approach that borrows several ideas from the opencl model but makes host-side development easier, while achieving a good level of performance and enforcing compilation-pass reuse.

2.6 Other Programming Models

Heterogeneous computing was first found in clusters of machines, where different nodes had different processors, and is now making its way to desktop computers. Because the performance of heterogeneous computations is linked to the proper scheduling of the different tasks that compose the program, many papers have studied the decomposition of a program into independent tasks and their scheduling. [BSB + 01] compares several static approaches to the problem, while [SSM08] uses a stochastic model. Scheduling decisions can also be dynamic, as in [ZWZD93]. Recently, frameworks such as Hadoop [Whi09] have emerged to provide integrated file systems and job scheduling.

However, the number of devices involved in heterogeneous clusters and in heterogeneous computers is at a different scale, and the data transfer rates differ between the network connections of a cluster and the Peripheral Component Interconnect (pci) connections inside a computer, which introduces different factors. For this reason, heterogeneous computers tend to be simpler to use efficiently. In particular, the scheduling issues are limited to a dozen nodes (e.g. eight for the Cell Processor [PAB + 06]). As a consequence, we do not focus on this topic in this thesis.

Apart from the opencl model discussed in Section 2.5, other models have been proposed. The case of pgi is particularly relevant because it is an industrial compiler, thus driven by user needs and working solutions. [Wol10] proposed an accelerator model coupled to a high-level programming model mostly targeted at gpus. It is based on compiler directives and thus benefits from the associated incremental development concept. The only required directive is #pragma acc, and the others can be used to refine the compiler's analyses (e.g. to specify data movement or loop scheduling), an approach to parallelization that has been shown to provide good results [KS99]. This approach relieves developers from most low-level manipulations and makes it possible to think of the code in terms of kernels only. The issue of performance portability is handled by bundling several versions of the kernel (one per targeted hardware) in the same binary.
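As a rough illustration of this directive style (a sketch in the spirit of the pgi accelerator model, not code taken from [Wol10]), offloading a loop nest can be as simple as:

void vadd(int n, const float *restrict a,
          const float *restrict b, float *restrict r) {
    /* The directive marks the region to offload; data movement and
       loop scheduling are inferred by the compiler unless refined
       by further clauses. */
    #pragma acc region
    {
        for (int i = 0; i < n; ++i)
            r[i] = a[i] + b[i];
    }
}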

The directive approach is also taken by Hybrid Multicore Parallel Programming (hmpp) [BB09], but in that case nothing is automatic and all the decisions must be made by developers, through directives. The advantage is that the user has greater control over the application's behavior, at the expense of less automation.

Hardware-software co-design [Wol94], and especially hardware-compiler co-design [ZPS + 96, WGN + 02], is an alternative approach where the hardware and the software evolve hand in hand, whereas in the current situation the software struggles to match hardware evolution. This approach has been taken in the Delft workbench [PBV06], which relies on the molen [PBV07] machine organization. Retargetable compilation is achieved through the use of reconfigurable hardware to provide a user-specific instruction set, and of code transformations aware of reconfigurable architectures, e.g. to hide reconfiguration latencies with a specific instruction-scheduling algorithm.

A similar approach that also bypasses the limitation induced by communication overhead has been proposed: the Convey HC-1 computer [Bre10] is a hybrid system that uses Xilinx fpgas as co-processors, with the specificity that the processor and the co-processor share a virtual memory. The co-processor is accessed through user-defined instructions that rely on the concept of personalities, namely hardware-level descriptions of intrinsics exposed at the user level. Host and accelerator code are consequently mixed together in a single source code. The benefit of this approach is that developers are relieved from the management of data transfers. However, writing personalities involves writing a hardware description of each intrinsic, so part of the problem related to heterogeneous computing is left unsolved.

The use of an Application-Specific Instruction set Processor (asip) is another way to exploit heterogeneous computing. In a nutshell, instead of relying on a general-purpose instruction set implemented on a general-purpose processor, a dedicated processor is built to run a single soc. This approach is only viable if the process of generating the processor can be automated: a compilation step is naturally involved to generate such processors. In such situations, retargetability of the compiler is a key property for being able to generate a wide range of processors. Such a compiler is described in [GLGP06], using a compilation flow that involves a compiler infrastructure parametrized by a processor model and a Hardware Description Language (hdl) generator. The processor model contains an instruction-set model, and both are described in terms of functional units, custom data types, connectivity, storage, etc.

2.7 Conclusion

Heterogeneous computing and multicores are currently the two most important keywords for High Performance Computing (hpc) and low power. The amount of available hardware implementing different programming models makes it hard for developers to adapt existing software to new architectures. Because of the fast pace of evolution, any porting effort or performance improvement may be jeopardized by a hardware change. In this situation, efforts have been made by the hardware community to standardize their interfaces, and by the software community to propose new languages or libraries to ease the porting of applications. In between, the compiler community tries hard to generate efficient glue code between the hardware layer and the software layer. The difficulty lies in the choice of a combination of a programming model and a programming language that is simple enough for developers to use, but sufficiently rich for the compiler to extract enough information for efficient hardware code generation. Pragmatically, a subset of the C99 language is used, to limit the cost of the technology shift, from both the source-code and the developer points of view. The opencl standard paves the way for such an approach, but it suffers from several design limitations.

The contribution of this chapter is to present a state of the art of the existing alternatives for heterogeneous computing, centered on compilation aspects. A study of existing C dialects shows the advantages of using the C language to program hybrid architectures, and the usual hardware limitations exposed by the language. Three different benchmarks have been manually converted to C99-style variable-length array declarations to show the performance impact of a higher-level description. Although current compilers generate slightly less efficient code from C99 input, the gain in expressiveness favors maintainability and makes it easier to apply high-level transformations such as those from the polyhedral model.


Chapter 3

Compiler Design for Heterogeneous Architectures

Vieux Pont de Dinan, Ille-et-Vilaine © iris.din / Flickr

Until the end of the 20th century, only one kind of architecture was used to build general-purpose computers. As a consequence, typical compilers have been built to efficiently target those architectures: a front-end parses the input code into an Internal Representation, a middle-end performs various optimizations (hopefully language-independent) at the basic-block, loop, function or program level, and a back-end emits target-specific assembly code. These three components have been studied for years in the literature and described intensively, as shown by the periodically updated Dragon Book [ALSU06].

The complexity growth seen in compiler architecture, crystallized by heterogeneous platforms, favors a more modular compiler framework. Indeed, it seems reasonable to reuse existing compilers, software and libraries that already perform a specific task efficiently; combining them, instead of reinventing the wheel again and again, should be possible. Let us take the example of Graphical Processing Unit (gpu) code generation for regular kernels: PLuTo [BBK + 08] combines a C front-end, a polyhedral optimizer, a Compute Unified Device Architecture (cuda) code generator and nVidia's cuda compiler to generate efficient gpu kernels.

Additionally, build processes are growing in complexity in order to assemble object files generated from different languages by different compilation chains. For instance, the sloccount tool finds 210 Source Lines Of Code (sloc) in common.mk, the generic Makefile infrastructure bundled with the cuda distribution. This makes code generation more difficult: not only must code be generated for different targets, but a way to link the resulting objects must also be found. This task is already non-trivial in a homogeneous environment, and it quickly becomes complex in a heterogeneous one.

[Figure 3.1: A classical 3-phase retargetable compiler architecture. C, Fortran and Java front-ends feed a common optimization infrastructure, which in turn feeds x86, ARM and MIPS back-ends.]

This chapter studies the impact of the heterogeneous computing model, presented in Chapter 2, on traditional compiler organization. It proposes the combination of a rich compiler infrastructure with a flexible pass manager as a framework to match the new architecture constraints. This approach focuses on the modularity, re-usability, retargetability and flexibility of the compiler design.

To begin with, Section 3.1 studies the adequacy between mainstream compiler infrastructures and heterogeneous machines, and proposes a model to represent programmable pass management, a critical aspect of compiler design for modularity. Then, Section 3.2 argues in favor of using source-to-source transformations, with source files as a common medium between all existing tools. Finally, Section 3.3 introduces a high-level Application Programming Interface (api) to build pass managers, the entities in charge of managing the chaining of compiler passes. It exposes a sufficient abstraction of the compiler internals to developers who want to contribute at the pass level. The whole scheme is illustrated by the Pythonic PIPS (pyps) interface developed on top of the Paralléliseur Interprocedural de Programmes Scientifiques (pips) compiler infrastructure. Related work is studied in Section 3.4.

3.1 Extending Compiler Infrastructures

3.1.1 Existing Compiler Infrastructures

A compiler is typically separated into three parts (see Figure 3.1):

The Front End is responsible for converting the input source code into the compiler's Internal Representation (ir). A single compiler can have several front ends. For instance, the gnu C Compiler (gcc) offers front ends for Ada, C, C++, Fortran, Java, Chill, Objective-C, Pascal, etc.

The Middle End is in charge of the code optimizations, which are independent of both the input language and the output target. Some are parametrized (e.g. unrolling) and their parameters are target-dependent. Others can benefit from additional assumptions due to the input language, e.g. the absence of aliasing between parameters in Fortran 77. There is usually a single middle end per compiler infrastructure; this is the case for gcc, Low Level Virtual Machine (llvm) and Open64.

The Back End generates target-specific assembly code from the ir. As with front ends, there can be several back ends in a single compiler infrastructure. For instance, llvm, as of version 2.7, can produce assembly code for the following targets: x86, sparc, powerpc, alpha, arm, mips, cellspu, pic16, xcore, msp430, systemz, blackfin, cbackend, msil, cppbackend and mblaze.

For the purpose of code generation for heterogeneous hardware from raw source files (i.e. without directives or language extensions), only the middle end and the back end are affected. In Open Computing Language (opencl), the host compiler is assisted by several source-to-binary compilers, one for each targeted device, that is, one middle end and one back end per targeted device. This scheme could be improved by merging the middle ends of all source-to-binary compilers into a generic middle end feeding as many back ends. This approach, while tempting, is not possible because, as shown in Chapter 2, hardware accelerators are so diverse that the code transformations involved can only be shared partially. An alternative is to provide a common compilation infrastructure on which all middle ends are based. Figure 3.2 illustrates this idea by splitting each source-to-binary compiler into two parts: a source-to-source compiler, called the hardware language translator, and a source-to-binary compiler. The hardware language translator manages the target-specific optimization process at the source level, while the source-to-binary compiler is solely in charge of the translation to binary code. Both are built from basic blocks found in the compiler infrastructure.

Figure 3.2: Improved compilation flow for heterogeneous computing: host and device code go through a host compiler and per-target source-to-source compilers built on a common compiler infrastructure, then through per-target source-to-binary compilers that produce the host object and the device objects.

Section 3.2.1 proposes an approach to make annotated code compatible with Figure 2.3. gcc, llvm and Open64 are examples of such compiler infrastructures. Let us now propose a very generic definition of a compiler infrastructure.

Definition 3.1. A compiler infrastructure is a set of passes and analyses organized by a consistency manager and made available to compiler developers through a pass manager.


Figure 3.3: pips as a sample of a generic compiler infrastructure: compilers and tools (p4a, pipscc, pypsearch, sac, terapyps) rely on pass managers (pyps, tpips), which drive the consistency manager (pipsmake), passes (inlining, unrolling, ...), analyses (DFG, array regions, ...), pretty printers (C, Fortran, XML, ...) and the internal representation.

Passes and analyses are defined formally in Section 3.1.2, so we only give informal definitions here.

Definition 3.2. A pass is a code transformation that modifies its input source to generate a new version.

For instance, loop unrolling, inlining and forward substitution are passes. In the following, passes are also referred to as transformations.

Definition 3.3. An analysis produces an abstraction of the code to be used by further passes.

A call graph, a dependence graph or a polyhedral model are results of analyses.

Definition 3.4. A consistency manager is a component in charge of providing up-to-date analyses to the passes.

Definition 3.5. A pass manager is a component in charge of chaining the passes.

Figure 3.3 shows the hierarchical organization of these components, based on the infrastructure of pips. The infrastructures of gcc [Nov06] and llvm [LA03] follow a similar pattern.

Typically, a compiler for a single target applies the same sequence of passes to each function of its input code. As stated in gcc's manual [Wik09]:

    Its [the pass manager's] job is to run all the individual passes in the correct order, and take care of standard bookkeeping that applies to every pass.

The gcc pass manager works on a sequence of passes, as shown by the implementation of the init_optimization_passes function. An excerpt of this function, taken from the gcc source code, is given in Listing 3.1.


void init_optimization_passes (void)
{
  struct tree_opt_pass **p;

#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &PASS))

  /* Interprocedural optimization passes.  */
  p = &all_ipa_passes;
  NEXT_PASS (pass_early_ipa_inline);
  NEXT_PASS (pass_early_local_passes);
  NEXT_PASS (pass_ipa_cp);
  NEXT_PASS (pass_ipa_inline);
  [...]

Listing 3.1: gcc pass manager initialization.

# dead code elimination + constant propagation + inlining
opt input.bc -o output.bc -dce -constprop -inline

Listing 3.2: Dynamic phase ordering using the llvm pass manager command line interface.

Each pass uses a function pointer bool (*gate)(void) to potentially guard its execution, while unsigned int (*execute)(void) runs the pass. The pass order is essentially fixed, but it is possible to turn off some of the passes using the gate function, and to sidestep some of the limitations using the plug-in mechanism. This intrinsically static pass scheduling is a severe limitation for iterative compilation tools, and its linearity does not match the requirements of heterogeneous compilation.

The pass manager used in llvm is more sophisticated: it can register several kinds of passes, depending on the type of object they work on (the whole program, a function, a call graph, a loop, a basic block or some machine code), while gcc's passes only work on functions. Regarding pass management, compiler developers can either rely on the existing implementation or provide their own. A Command Line Interface (cli) on top of the pass manager, called opt, allows the phase ordering to be changed dynamically, as presented in Listing 3.2.

Iterative compilation is also possible. However, the lack of advanced control structures over pass scheduling hardly makes it a candidate for heterogeneous computing. A compiler for a heterogeneous platform must apply different pass sequences to different parts of the code, say one per target. As the compiler not only optimizes the code for a specific device but also modifies it to meet the hardware constraints, unusual combination patterns can be expected. The compilers built during this thesis and presented in Chapter 7 validate this assumption.


3.1.2 A Simple Model for Code Transformations

The core of a compiler optimizer is the pass ordering. This section studies the interaction between passes from a formal point of view, inspects the consequences of this formalism and elaborates on several transformation composition rules.

3.1.2.1 Transformations: Definition and Compositions

Let us first define formally what a transformation is. Let P be the set of well-formed programs 1, F the set of all possible functions 2 and T the set of code transformations.

Definition 3.6. A program is an n-tuple of functions:

∀p ∈ P, ∃n ∈ N : p ∈ F^n

The first element of a program is called the entry point of the program. The cardinality of a program is given by the | · | operator.

If p is a program, we denote by Vin(p) the set of its possible input values; given vin ∈ Vin(p), P(p, vin) denotes the result of the evaluation of p.

Definition 3.7. A code transformation is an application P → P.

The identity function on T is denoted idT. The set T, together with the function composition operation ◦ and idT, is not a group, because many transformations are not injective and thus have no inverse.

Proof. The transformation that evaluates constant expressions is not injective, as it produces int a = 4; from either int a = 3+1; or int a = 2*2;.

This definition of a transformation ignores an important aspect of code transformations: they can fail. For instance, loop fusion of two loops fails if the second loop carries backward dependencies on the first. A loop-carried backward dependency can prevent loop vectorization, or the pass can even crash because of a lousy implementation 3. As a consequence, we introduce an error state and propose:

Definition 3.8. A code transformation is an application P → (P ∪ {error}) that either succeeds and preserves the semantics of the program, or fails.

One can then revert to the previous state by defining a failsafe operator:

1. We call "well-formed programs" programs whose behavior is not undefined according to the norms of their language.
2. The term module is used in pips instead of the more common term function. As a consequence, the term module appears in some code excerpts.
3. This unfortunately happens a lot in research compilers, to which many PhD students, just like me, contribute.


Definition 3.9. The failsafe operator ˜· : T → T is defined by

∀t ∈ T, ∀p ∈ P, ˜t(p) = t(p) if t(p) ≠ error, and ˜t(p) = p otherwise

and then a failsafe composition:

Definition 3.10. The failsafe composition ˜◦ : T × T → T is defined by

∀(t0, t1) ∈ T^2, t1 ˜◦ t0 = ˜t1 ◦ ˜t0

Transformations can be chained using the ˜◦ operator, and most compilers use this semantics as their primary way to compose transformations. New passes can be defined as the composition of existing passes, promoting modularity over monolithic passes. For instance, a pass that generates Open Multi Processing (openmp) code can be written as the failsafe combination of loop fusion, reduction detection, parallelism extraction and directive generation, instead of a single monolithic directive generation pass. Lattner advocates [Lat11] for this low granularity in pass design.
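To make the compositions concrete, here is a minimal Python sketch (not pips code) that models transformations as callables from a program to a program, with an ERROR sentinel standing for the error state of Definition 3.8:

ERROR = object()  # the error state of Definition 3.8

def failsafe(t):
    # the ~ operator of Definition 3.9: fall back to the unchanged
    # program when the transformation fails
    def t_tilde(p):
        q = t(p)
        return p if q is ERROR else q
    return t_tilde

def failsafe_compose(t1, t0):
    # the failsafe composition of Definition 3.10: t1 ~o t0 = ~t1 o ~t0
    return lambda p: failsafe(t1)(failsafe(t0)(p))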

Still, the fact that a transformation fails carries important information. In the example above, if loop vectorization succeeds, a vector instruction can be generated; otherwise, parallelism extraction may be tried. This kind of behavior is represented by a conditional composition:

Definition 3.11. The conditional composition ◦ : T × T × T → T is defined by

∀(t0, t1, t2) ∈ T^3, ∀p ∈ P, ((t1, t2) ◦ t0)(p) = (t1 ◦ t0)(p) if t0(p) ≠ error, and t2(p) otherwise

The ◦ operator is not used in llvm or gcc, although it offers interesting perspectives. Let us assume the existence of three transformations tgpu, tsse and tomp that convert sequential C code into code with cuda calls, Streaming simd Extension (sse) intrinsic calls and openmp directives, respectively. Then the expression

(idT, tsse ˜◦ tomp) ◦ tgpu

means: try to transform the code into gpu code; if this fails, try to generate openmp directives, then sse intrinsics, whether or not openmp directives were generated. It builds a decision tree that allows complex compilation strategies.
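Continuing the Python sketch above, the conditional composition is a higher-order function, and the gpu-first strategy can be written down directly; t_gpu, t_omp and t_sse are assumed to be transformations defined elsewhere:

def conditional_compose(t0, t1, t2):
    # Definition 3.11: apply t1 after t0 when t0 succeeds, t2 otherwise
    def composed(p):
        q = t0(p)
        return t2(p) if q is ERROR else t1(q)
    return composed

identity = lambda p: p  # id_T

# the decision tree above: (id_T, t_sse ~o t_omp) o t_gpu
# strategy = conditional_compose(t_gpu, identity,
#                                failsafe_compose(t_sse, t_omp))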

If the intended behavior is to stop as soon as an error occurs, and to keep applying transformations otherwise, as in the sequence

(t3, idT) ◦ ((t2, idT) ◦ ((t1, idT) ◦ t0))

then writing the default skip transformation is bothersome. Thus we define an error propagation operator:


Definition 3.12. The error propagation operator ◦ : T × T → T is defined by

∀(t0, t1) ∈ T^2, t1 ◦ t0 = (t1, idT) ◦ t0

which makes it possible to rewrite the above example as

t3 ◦ t2 ◦ t1 ◦ t0

For instance, let us assume the existence of a transformation topt_dma that optimizes the usage of Direct Memory Access (dma) functions by removing redundant transfers, merging them, etc. 4 It is not relevant to apply it if no dma function was generated by a tgen_dma transformation. Supposing that tgen_dma returns an error when it fails to generate dma operations, this interaction can be represented by the expression topt_dma ◦ tgen_dma.

4. A similar transformation is described in Section 6.3.
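In the Python sketch, the error propagation operator is just a conditional composition whose failure branch is the identity, reusing conditional_compose and identity from above; t_gen_dma and t_opt_dma are illustrative placeholders:

def propagate(t1, t0):
    # Definition 3.12: t1 o t0 = (t1, id_T) o t0
    return conditional_compose(t0, t1, identity)

# the dma example above:
# dma_chain = propagate(t_opt_dma, t_gen_dma)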

3.1.2.2 Parametric Transformations

Many transformations are parametric. For instance, loop unrolling not only takes a program as input: to be completely defined, it also needs a particular loop in a particular function and an unroll rate. To represent this, we introduce the concept of transformation generator.

Definition 3.13. An application f is a parametric transformation if there exists a set A such that f : A → T.

For instance, loop unrolling is a parametric transformation for which A = F × L × N^*, where the first argument is the function to work on, the second argument is the loop statement to unroll and the third argument is the unroll rate.

Two particular classes of transformation generators are commonly found in compilers: function transformations and loop transformations.

Definition 3.14. A parametric transformation g : A → T is a function transformation if there exists a set B such that A = F × B.

Definition 3.15. A parametric transformation g : A → T is a loop transformation if there exists a set B such that A = F × L × B.
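In the Python model above, a parametric transformation is naturally a function that returns a transformation; the function name, loop label and placeholder body below are purely illustrative:

def unroll(function, loop_label, rate):
    # an element of A = F x L x N*; the returned closure belongs to T
    def transform(program):
        # a real implementation would replicate the body of loop
        # `loop_label` in `function` `rate` times; this sketch
        # returns the program unchanged
        return program
    return transform

# fully applying the generator yields a transformation t in T:
t = unroll("fir", "l99999", 4)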

3.1.2.3 From Model to Implementation

Moving from a formal description of pass operators to a programming language can be tricky. Although it is tempting to design a new language that directly reflects the ˜·, ˜◦, ◦ and ◦ operators, we make the following points:

– a transformation changes a program state and behaves like a method in Object Oriented Programming (oop);


– transformation generators are methods with extra parameters; in particular, function transformation generators can be represented by methods of a hypothetical function class, and loop transformation generators by methods of a hypothetical loop class;

– the composition operator is similar to the sequence operator found in many programming languages;

– the failsafe operator is similar to a try ... catch block that surrounds every transformation;

– the conditional composition is similar to an if ... then ... else block;

– the error propagation operator has a semantics similar to exception propagation.

As a consequence, we choose to use a general-purpose programming language instead of designing our own: the class hierarchy with the appropriate methods described in the next section embodies all the concepts detailed in this section.

3.1.3 Programmable Pass Management

3.1.3.1 A Class Hierarchy for Pass Management

The hierarchy between the host and the accelerators, a consequence of the master-worker paradigm, implies a similar hierarchy between the host code and the accelerator code. However, it is not enough in practice: depending on the input code, a single function can have several loops that are candidates for offloading. Moreover, opencl takes self-contained code as input, so the notion of compilation unit, i.e. the set of variables, types and functions defined in the same source file, is also important. Because compilation for heterogeneous computing usually leads to the creation of new functions, it is also important to be able to keep track of them and to adapt the processing according to their origin. This relationship must be visible to the pass manager because different compilation schemes apply to each part. The model presented in the previous section, confirmed by the experience gathered during the development of several heterogeneous compilers, has led to the hierarchy described below:

Program: Code transformations manipulate programs. Interprocedural analyses such as the call-graph computation, and transformations such as constant propagation, transform the program as a whole. This is the coarsest transformation grain.

Compilation units: Processing can differ depending on the enclosing compilation unit. A typical example is hand-optimized runtime code, passed to the compiler so that it has all the definitions required for interprocedural analyses of generated code (see Section 3.2.1). This code should only be analyzed by the compiler, not modified. The same situation occurs when some properties of a source file have been certified: the certification no longer holds if the code is changed.

Functions: We have modeled a program as a set of functions, and the pass manager generally works at this level of granularity, as gcc does. It is especially relevant for heterogeneous computing, where some functions are executed on the host and some on an accelerator.


Figure 3.4: pyps class hierarchy, relating the Workspace, Program, Maker, Compilation Unit, Function and Loop classes.

Loops: Many scientific programs spend most of their time in loops, and Single Instruction stream, Multiple Data stream (simd) parallelism is commonly found in loops. Exposing the loop hierarchy to the pass manager is pertinent in many cases, for example to apply loop-level transformations, to outline a loop into a new function (see Section 4.6.2), or to isolate it from the rest of the memory, as in Section 6.1. Likewise, the loop nest hierarchy provides significant information concerning the code structure and the potential candidates for loop transformations such as loop tiling or loop fusion.

These relationships can be represented by the class hierarchy shown in Figure 3.4. The combination of control flow and hierarchical structure makes it possible to express constructs such as “select each loop nest of the program that does not belong to the runtime, compare its number of operations to its number of memory accesses and, if it is computationally intensive enough, outline the loop nest into a new function in the gpu space; then transform this new function into a suitable kernel for a gpu device”.

Note that, using the formalism presented above, this sentence can be written

∀p ∈ P, ∀f ∈ F \ R, ∀l ∈ L, (G(f, f′) ◦ O(f, l, f′) ◦ C(f, l))(p)

where R is the set of functions representing the runtime; C : F × L → T is a loop transformation that raises an error if the given loop is not computationally intensive; O : F × L × F → T is a loop transformation that outlines the given loop into a new function; and G : F × F → T is a function transformation that turns a function into a kernel and a kernel call.
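Rendered as pass manager code, the same construct reads naturally; in the pyps-flavoured sketch below, is_computation_intensive, outline and gpuify are illustrative stand-ins for the actual passes, not pips method names:

for function in workspace.all_functions:
    if function.cu == "myruntime":
        continue                              # skip the runtime (the set R)
    for loop in function.loops:
        if loop.is_computation_intensive():   # the C transformation
            kernel = function.outline(loop)   # the O transformation
            kernel.gpuify()                   # the G transformation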

Figure 3.4 also introduces a Maker class. This component represents the build process and is involved both in the source code generation process and in its final compilation into machine code. Indeed, if the compilation chain is to remain independent of the targeted language (remember that everything is represented in C), a specialization step is needed. This specialization step, called post-processing, takes care of the various steps required to switch from a C representation to the targeted dialect, say cuda or opencl. The Maker class plays this role and describes the final steps required to generate the proper external representation. Additionally, it generates a Makefile to automate the complex build process resulting from the combination of different targets in the same build.



3.1.3.2 Control Flow and Pass Management

Control flow features are profitable in many situations. The use cases presented in this section are excerpts from existing compiler instances presented in Chapter 7. The examples are written using the pyps interface described in Section 3.3, but they should nevertheless be understandable on their own.

Sequences represent the ◦ operator. They are common to all pass managers and are used whenever transformations are chained.

Methods represent the mathematical function definition. They are used to structure the compiler code and to enforce reuse. For instance, the generation process for multimedia instructions is common to the sse, Advanced Vector eXtensions (avx) and neon instruction sets, with minor parameter changes (e.g. the size of the vector registers). It can be packed into a function at the pass manager level, as shown in Listing 7.1.

Conditionals are extensions of the ◦ operator. They are used when the pass scheduling depends on compiler switches or on the current compilation status. Figure 3.5 shows two situations where this occurs: the pass manager has to activate or deactivate some passes, or to modify the compilation scheme.

# check whether if_conversion is requested
if params.get('if_conversion'):
    module.if_conversion()

(a) Using conditionals as switches.

# optimize a module if it does not belong to the runtime
if module.cu != "myruntime":
    module.optimize()

(b) Using conditionals to change the compilation scheme.

Figure 3.5: Usage of conditionals at the pass manager level.

Do Loops represent the map over a set; they are used to perform repetitive tasks. Such tasks are often found in a pass manager, e.g. applying a pass to each loop or function of a set, or applying a transformation iteratively with varying parameters (see Figure 3.6).

Exceptions are related to the error state. They can be used by a pass to signal an unexpected situation. For instance, the inlining phase raises an exception if the function to inline has no caller, and an attempt to offload a loop nest fails if generic pointers are involved and the engine does not know how to transfer them to the hardware accelerator. Listing 3.3 shows an example of such a situation.

Transformation generators are described as class methods with parameters. To feed these parameters, and more generally to feed conditionals or loop ranges, some information concerning the code being compiled is necessary. This raises the issue of the pass manager


for kernel in gpu_kernels:
    kernel.generate_communications()

(a) Iterate over sets.

for pattern in ["add", "minus", "mul"]:
    module.pattern_recognition(pattern)

(b) Iterate over parameters.

Figure 3.6: Usage of loops at the pass manager level.

try:
    module.isolate_statement()
    # if isolate_statement succeeds,
    # go on up to kernel generation
    module.outline()
    ...
except RuntimeError as re:
    print("Unable to generate GPU code: " + str(re))
    # maybe try SSE instructions instead?

Listing 3.3: Usage of exceptions at the pass manager level.

interface granularity: should all the compiler internals be exposed to the pass manager, or should it only show a subset of relevant information? The former approach is taken by Rudy et al. [RCH+10]: they use the lua scripting language to expose the whole ir to the pass manager, based on which the compiler performs gpu code generation, using an iterative algorithm expressed at the script level.

The benefits of this approach are unclear from a separation-of-concerns point of view: high-level code is mixed with lower-level code, without a clear boundary between the two concerns. We propose an alternative approach based on the following assessment: if access to the ir is needed, then a pass should be used, to simplify the pass manager's job; otherwise, the work can be done at the pass manager level. That is, high-level transformations should be managed by a high-level language with the proper abstractions, and low-level transformations should be managed by the native language, which has access to all the infrastructure capabilities. Eventually these languages could be the same, but the fact that they address different needs, basically programmability vs. performance, should lead to different engineering choices.


Table 3.1: Comparison of source-to-source compilation infrastructures: pips, pocc, mercurium, cetus, rose, gecos and hmpp, by supported front ends (C, C++, Fortran) and targets (openmp, cuda, sse, vhdl, mpi). ✓: feature officially supported; ≡: feature mentioned in a paper.

3.2 On Source-to-Source Compilers

In the previous sections, we separated the job of the source-to-source compiler from the job of the source-to-binary compiler. The source-to-source compiler takes care of language-independent transformations and optimizations based on a common framework, and source-to-binary compilers are used as back ends.

Definition 3.16. A source-to-source compiler is a compiler that takes source code written in a high-level language as input, and outputs code in a high-level language.

3.2.1 Exploring Source-to-Source Opportunities

Historically, many transformations have been expressed at the source code level, especially parallelizing transformations. Many successful compilers are source-to-source compilers, such as HPFC [Coe93], cilk [FLR98] and acotes [MAB+10], or are based on source-to-source compiler infrastructures such as pips [IJT91], suif [WFW+94], rose [Qui00], cetus [iLJE03], mercurium [AMG+99] or GeCoS [DMM+]. Such infrastructures provide many code transformations relevant to our problem, such as parallelism detection algorithms, variable privatization, etc.

Table 3.1 gathers the results of our study of the number of hardware targets for seven source-to-source research or industrial compilers still in use or in development. Most of them are used to generate code for more than one target.

All the compilers considered so far take a C dialect as input language and do not provide a binary interface. Thus, a hardware translator is required to generate C code as the result of its processing. As stated above, a post-processor is needed to handle the syntactic differences between plain C and the targeted dialect. Such modifications can easily be done at the source level using common techniques. The first one is regular expressions: many language adjustments, such as adding the triple chevron of cuda, can be performed with a relevant regular expression. However, most programming languages are not regular, so


regular expressions alone are not sufficient for textual substitution. When combined with the C macro processor, which has a weaker rewriting engine but is capable of diverting function calls using macro functions, they can catch more patterns, although still not providing a reliable tool for all situations (e.g. no pairing of brackets). This combination has been successfully used for the three compiler implementations described in Chapter 7.
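As an illustration of the regular-expression technique, and under the assumption that the generated C code marks kernel launches with a placeholder call named KERNEL_CALL (an illustrative convention, not a pips one), the cuda triple chevron can be produced as follows:

import re

# rewrite KERNEL_CALL(name, grid, block, args...) into the cuda
# launch syntax name<<<grid, block>>>(args...)
chevron = re.compile(r'KERNEL_CALL\((\w+),\s*(\w+),\s*(\w+),\s*(.*)\)')

line = "KERNEL_CALL(vadd, grid, block, a, b);"
print(chevron.sub(r'\1<<<\2, \3>>>(\4)', line))
# prints: vadd<<<grid, block>>>(a, b);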

In addition to the intuitive collaboration with source-to-binary compilers, source-to-source compilers can also collaborate with each other to achieve their goal, using source files as a common medium, at the expense of extra processing for the additional switches between the Textual Representation (tr) and the ir. Figure 3.7 illustrates this generic behavior. For instance, the optimization of loop nests can be delegated to the Polyhedral Compiler Collection (pocc), a compiler specialized in polyhedral transformations.

Figure 3.7: Source-to-source cooperation with external tools: source-to-source compilers and external tools exchange the textual representation (tr) of the program.

More traditional advantages of source-to-source compilers include their debugging features, since the ir can be dumped as a tr at any time, then compiled and executed. For the same reason, they are very pedagogical tools and make it easy to illustrate the behavior of a transformation. As claimed above, many transformations, such as loop interchange or loop unrolling, are easily described as source-to-source transformations.
or loop unrolling, are easily described as source-<strong>to</strong>-source trans<strong>for</strong>mations.


3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure

There are as many C dialects as hardware devices. As a consequence, a source-to-source compiler that aims at generating code for several targets has two possibilities: either write as many pretty-printers as there are dialects, or regenerate C code and use an external tool to perform the translation. The latter approach requires a post-processing step to fill the gap between the C code augmented with runtime functions and the hardware language, as shown in Figure 3.8.

Figure 3.8: Heterogeneous compilation stages: .c → source-to-source compiler → .c → post-processor → .c dialect → source-to-binary compiler → machine code.

The combination of Figure 3.8 with Figure 3.2 results in the final compilation infrastructure diagram described in Figure 3.9. The main achievement shown by this figure is that most developments can be done in a source-to-source infrastructure using a common ir. This is great progress compared to the opencl situation described in Figure 2.10 on page 39: it favors re-usability.
on page 39: it favors re-usability.


Figure 3.9: Source-to-source heterogeneous compilation scheme: host code and device code go through source-to-source compilers built on a common source-to-source compiler infrastructure, then through post-processors (PP 0, PP 1, PP 2) and source-to-binary compilers that produce the host object and the device objects.

3.3 pyps, a High Level Pass Manager api

The model presented in Section 3.1.2 leads to the design of an api for pass managers. All the compilers for heterogeneous devices presented in this thesis are based on this pass manager. It uses a dynamic object-oriented scripting language, Python, for flexibility and ease of development without much performance loss, since the compiler passes are still implemented in a compiled language, C. An object-oriented language makes it possible to represent the class hierarchy of Figure 3.4, to enforce code reuse and to facilitate compiler composition. This section describes in detail the methods exposed at the pass manager level for each of the classes identified in the previous section: program, function, loop and maker.

3.3.1 api Description

This pass manager is implemented in Python on top of the pips compiler infrastructure 5 and named pyps. It consists of fewer than 700 sloc. In addition to the language properties mentioned above, Python has the advantage of a rich set of libraries and a dynamic community. As an example of the benefits of using a mature and feature-rich language, combining pyps with the enhanced Python interpreter ipython has led to a powerful cli for pyps at almost no development cost, unlike other scripting tools implemented on top of pips such as tpips. The integration with the C language is simple enough to allow an easy binding with the pips libraries. Note, however, that the api design is completely independent of the underlying compiler infrastructure, which makes it a suitable candidate for other compiler infrastructures.

In this api, two main entities are used to abstract the source-to-source compiler: a workspace and a Maker. The former represents a whole program and the transformations applied to it. The latter represents the global compilation scheme: post-processing, source-to-binary compiler calls, etc.

5. An overview of the pips compiler infrastructure is given in Appendix A.

A workspace provides the following methods:

init(sources, flags): a workspace is initialized from a set of files and preprocessor flags. Once created, it has knowledge of the full program code and access to all the relevant runtimes;

save(dir): once all transformations have been performed, this method regenerates all source files, without post-processing;

build(dir, maker): uses a Maker to save the current workspace and perform the post-processing. The default Maker performs no post-processing and generates a Makefile for a traditional compilation scheme;

compile(dir, maker): gathers all source files, saves them in dir, post-processes them with maker and uses the Makefile generated by the build method to compile them;

checkpoint(): saves the current workspace state, returning an identifier;

restore(chk_id): restores the workspace back to the state in which chk_id was obtained.

A typical use case is to create a workspace using init, perform some transformations, save the result to check the sequential code, generate a build chain using build and run compile to check the generated executable; a sketch of this workflow is given below. The purpose of this division is to provide entry points (junction points in aspect-programming terminology) where compiler developers can insert their own code. Indeed, the default workspace provides no direct facilities for heterogeneous computing. Instead, a source-to-source compiler must inherit from it and implement the target-specific transformations using method overloading. More generically, given a code that contains n code fragments to be run on m distinct targets, translator developers must compose n workspaces together, using multiple inheritance over existing or newly created workspaces.
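Spelled out as a script, the typical use case reads as follows; "foo.c", the inlining call and the pyps.Maker spelling are placeholders that follow the api just described, with simplified signatures:

import pyps

w = pyps.workspace("foo.c")      # init: load the sources
for f in w.all_functions:
    f.inlining()                 # perform some transformations
w.save("seq")                    # check the sequential result
maker = pyps.Maker()             # default maker: no post-processing
w.build("out", maker)            # generate the build chain
w.compile("out", maker)          # compile the generated code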

Let us give a practical example: pyps ships with many workspace types, including one to instrument an application for benchmarking purposes, one to generate openmp code, one to generate avx code and one to generate cuda code. To build a compiler for a heterogeneous machine made of an nVidia gpu and several Intel cores with avx support, a classical hardware configuration nowadays, one can rely on existing components and implement the compilation scheme as described in Listing 3.4. In this example, a new workspace that inherits from the existing ones is created; in effect, a new compilation scheme is implemented at a high level, relying on existing ones. Code generation is automatically forwarded to the proper base class and does not need to be specified here.

A source-to-source compiler typically inherits from the workspace class. The init method can be overridden to pass extra flags to the workspace and to provide additional source files to the compiler infrastructure. The latter can be third-party library stubs, declarations of functions used as parameters for a pattern-matching engine, or runtime declarations needed by the code generation process. In a similar manner, the save method can be used for header substitution, i.e. to add extra header files and #include them 6.

6. When the ir cannot represent preprocessor symbols.


# load relevant packages
import pyps, sac, openmp, cuda

# assemble workspaces, order usually matters
class my_workspace(sac, openmp, cuda):
    pass

# provide a per-function compilation scheme
def my_compilation_scheme(module):
    w = module.workspace     # recover the current workspace
    chk = w.checkpoint()     # save the current state
    try:                     # CUDA code generation
        module.cuda()        # raises an exception in case of failure
    except RuntimeError:
        w.restore(chk)       # restore the pre-CUDA state
        try:                 # OpenMP and AVX code generation
            module.openmp()
            module.avx()
        except:
            pass

Listing 3.4: Example of workspace composition at the pass manager level using pyps.

When the translator generates constructs that cannot be represented in C, or uses a specific compilation process, a specific Maker is fed to build to perform the post-processing steps.

For instance, the multimedia instruction generator described in Section 7.4 overrides the init method to add its own runtime files to the workspace, and then forwards the call to its parent. Likewise, the save method is overridden to add a generic header file at the top of each source file, a step that cannot be performed earlier as it requires an additional preprocessor run. The Maker class is extended to add special processing and, depending on the maker given as argument to the build method, different code is generated. With the default maker, the sequential version of the generated vector instructions is used; with an sse-enabled maker, the proper compiler flags and post-processing are activated.
an sse enabled maker, proper compiler flags and post-processing are activated.<br />

The composition of workspaces relies on two assumptions:<br />

1. all classes inheriting from workspace <strong>for</strong>ward calls <strong>to</strong> their parent;<br />

2. the compiler developer guarantees that the composition makes sense.<br />

Assumption 1 makes it possible <strong>to</strong> compose workspace and follows the idea that a<br />

workspace takes care of its target specific processing and delegates parts he does not know<br />

how <strong>to</strong> handle <strong>to</strong> its parent—in the end the default workspace manages the lef<strong>to</strong>ver. Assumption<br />

2 guarantees that the composition leads <strong>to</strong> error-free code: it does not make<br />

sense <strong>to</strong> generate avx calls inside a cuda kernel but it does make sense <strong>to</strong> do so in a loop<br />

annotated with openmp directives.<br />

A program is no more than a set of compilation units. All methods of programs are


available at the workspace level.

A compilationUnit does not provide any additional method, but is used as a structuring element by some passes. Indeed, an important characteristic of heterogeneous computing is that different compilation units may have different targets, and thus use different source-to-binary compilers.

A function provides the following methods:

get_code():string: the fundamental feature of a source-to-source compiler is the capability of switching between the ir and the tr. This method builds the current tr as a string;

set_code(code): this method replaces the current code by a new version given as a string. Combined with the previous method, it makes it possible to call an external tool on the tr and use its output to build a new ir (see the sketch after this list);

callers, callees:functions: make it possible to navigate the (static 7) call graph;

passXYZ(params): all compiler transformations are exposed as function methods.

Sub-classing of the function class is used to provide new code transformations as a complex chaining of existing transformations.
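The tr/ir round trip allows delegation to any text-level tool. The helper below is hypothetical: it pipes the textual representation through the GNU indent formatter, a stand-in for any real source-to-source tool:

import subprocess

def apply_external_tool(function, cmd=("indent", "-st")):
    tr = function.get_code()                 # ir -> tr
    result = subprocess.run(cmd, input=tr, text=True,
                            capture_output=True, check=True)
    function.set_code(result.stdout)         # tr -> new ir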

The Loop class provides the members below, mainly used for inter-pass communication:

label: a read-only value used to uniquely identify the loop and the statement holding it;

pragma: a list of directives attached to the loop;

loops: a list of all loops directly contained in this loop.

The Maker class provides a single method:

generate(dir, sources): eventually post-processes the files given as sources and found in dir, and generates a Makefile to compile them.

The generate method can be overridden to change the generated Makefile and to add post-processing steps. For instance, the Maker found in the openmp package adds the proper -fopenmp compilation flag to the Makefile.
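Such an override could look like the sketch below; the assumption that generate returns the path of the produced Makefile is ours, for illustration:

import pyps

class openmp_maker(pyps.Maker):
    def generate(self, dir, sources):
        # forward to the parent, then patch the generated Makefile
        makefile = super().generate(dir, sources)
        with open(makefile, "a") as f:
            f.write("CFLAGS += -fopenmp\n")
        return makefile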

Let us illustrate the relevance of this architecture in a practical situation: sse code generation from plain C code. The technical parts are detailed in Section 7.4; only the compiler architecture is described here.

Source-to-Source Compiler: this compiler is a classical vectorizer that generates C intrinsics to represent vector operations. Intrinsics are used for both data movement and vector operations, and each has a sequential version written in C. An example of generated code is given in Listing 3.5, and the sequential implementation of one intrinsic is given in Listing 3.6.

Post-Processing: because the generated code is still C, it can be executed, although not efficiently, on a sequential processor. The only post-processing step is to add the relevant #include to all source files.

7. The call graph is static because we do not consider the case of function pointers.
7. The call graph is static because we do not consider the case of function pointers.


void vadd_l99999(float a[4], float b[4])
{
   // PIPS:SAC generated v4sf vector(s)
   v4sf vec00, vec10;
   SIMD_LOAD_V4SF(vec00, &a[1-1]);
   SIMD_LOAD_V4SF(vec10, &b[1-1]);
l99999: ;
   SIMD_ADDPS(vec00, vec00, vec10);
   SIMD_STORE_V4SF(vec00, &a[1-1]);
}

Listing 3.5: sse C intrinsics generated for a scalar product.

void SIMD_ADDPS(float *dst, ...)
{
    int i;
    va_list ap;
    va_start(ap, dst);
    /* the remainder of the body is an assumed completion: an
       element-wise addition of the two source vectors passed as
       variadic arguments */
    float *src1 = va_arg(ap, float *);
    float *src2 = va_arg(ap, float *);
    for (i = 0; i < 4; i++)
        dst[i] = src1[i] + src2[i];
    va_end(ap);
}

Listing 3.6: Sequential implementation of the SIMD_ADDPS intrinsic.


3.3.2 Usage Example

The five compilers presented in this document are based on the pyps api. The fact that we were able to write them is a first step toward the validation of the api.

As a further example, we have used pyps to perform fuzz testing on the pips compiler infrastructure. Fuzz testing is a software testing technique that injects random input into a piece of software to test its behavior. The technique used is described in Algorithm 1, and the equivalent pyps code is given in Listing 3.8.

Data: p ← a program
Data: g ← a function transformation generator
binary ← compile(p);
repeat
    f ← random_function(p);
    p′ ← g(f)(p);
    binary′ ← compile(p′);
until exec(binary) ≠ exec(binary′);

Algorithm 1: Fuzz testing at the pass manager level.

This simple program assumes that the input code produces a reproducible output on standard output. We have used it conjointly with the random C code generator of Eric Eide and John Regehr [ER08], which generates random C programs with deep call graphs that print a hash value representing their execution on standard output. We tested 10 transformations with this fuzzer; for each of them, an erroneous instance was found. 8

8. Out of courtesy to the pips development team, I only tested transformations I contributed to. And I fixed most of the bugs that were found.

3.4 Related Work

Compilers have been built for many years. The first complete Fortran compiler was released by an IBM team in 1956 [Bac57]. The first beta release of gcc by Richard M. Stallman dates back to the 22nd of March, 1987. Compilers are known to be complex pieces of software that evolve slowly, while hardware keeps evolving at a steady rate. We have run David A. Wheeler's sloccount [Whe01] on two leading open source compilation projects, gcc and llvm, and reproduce its output in Table 3.2. It shows that a compiler project requires a broad range of development skills: many languages are used, and the total number of sloc exceeds 2·10^6 for gcc and 5·10^5 for llvm. These projects accumulate several difficulties for newcomers: low-level languages, a large code base and a diversity of languages.

To tackle this problem, several approaches have been proposed by the research community. M. Zenger and M. Odersky proposed in [ZO01] a compiler framework to quickly


import pyps
import random, sys

while True:  # loop as long as no error is found
    # instantiate a compiler from the source in first argument
    w = pyps.workspace(sys.argv[1])
    # compile it using the default source-to-binary compiler
    b = w.compile()
    # keep the output as a reference
    (r_ref, o_ref, e_ref) = w.run(b)
    # pick a random function in the input code
    f = random.choice(w.all_functions)
    # select the transformation given in second argument
    # and apply it
    getattr(f, sys.argv[2])()
    # compile the transformed code
    b = w.compile()
    # get its output
    (r, o, e) = w.run(b)
    # close the compiler instance
    w.close()
    # check output versus reference and eventually raise an error
    if r != r_ref or o != o_ref or e != e_ref:
        sys.exit(1)

Listing 3.8: Fuzz testing with pyps.
Listing 3.8: Fuzz testing with pyps.


ansic     2100307  (48.66%)
java       681858  (15.80%)
ada        680664  (15.77%)
cpp        594473  (13.77%)
f90         79927   (1.85%)
sh          47006   (1.09%)
asm         44318   (1.03%)
xml         29271   (0.68%)
exp         18422   (0.43%)
objc        15086   (0.35%)
fortran      9849   (0.23%)
perl         4462   (0.10%)
ml           2814   (0.07%)
pascal       2176   (0.05%)
awk          1706   (0.04%)
python       1486   (0.03%)
yacc          977   (0.02%)
cs            879   (0.02%)
tcl           392   (0.01%)
lex           192   (0.00%)
haskell       109   (0.00%)

(a) gcc sloccount report.

cpp        453835  (88.82%)
ansic       16764   (3.28%)
asm         13711   (2.68%)
sh          12828   (2.51%)
python       4322   (0.85%)
ml           4274   (0.84%)
perl         2093   (0.41%)
pascal       1489   (0.29%)
exp           431   (0.08%)
objc          334   (0.07%)
xml           283   (0.06%)
ada           235   (0.05%)
lisp          187   (0.04%)
csh           117   (0.02%)
f90            36   (0.01%)

(b) llvm sloccount report.

Table 3.2: sloccount reports for the gcc and llvm compilers.

experiment with new language features. They focus on two aspects: the extension of the internal representation and the composition of compiler components. The former is achieved through the use of extensible algebraic types, to extend simultaneously the Abstract Syntax Tree (ast) and the existing phases; the latter relies on an original design pattern called Context Component, to provide an extensible and hierarchical component system.

In a keynote talk [Coo04], K. D. Cooper emphasises the limitations of existing compiler architectures for iterative compilation at the pass level and favors the use of complex patterns, beyond list scheduling, supported by a flexible architecture to explore several phase orderings.

Given the difficulty of mastering existing compiler implementations, several recent papers have focused on the interaction between enlightened compiler developers and compiler infrastructures. Lattner and Adve introduced in [LA03] a modular pass manager for llvm. Later on, gcc overcame its own limitations thanks to a plug-in mechanism described in the "Plugins" chapter of [Wik09]. Likewise, the extensible micro-architectural optimizer described in [HRTV11] relies on the flexibility of its pass manager to load additional passes at run time using a plug-in mechanism.



Rudy et al. presented in [RCH+10] an interactive pass manager based on the lua language [Ier06]. The pass manager is used to scan various parameters of polyhedral transformations, such as the loop unrolling rate or the blocking size, and to selectively apply transformations such as loop interchange. The resulting code is turned into a problem-specific cuda kernel, and the most efficient one is selected, together with the associated set of transformations. The advantage of the approach is that the compiler configuration for the specific target is stored as the pass manager script itself, which can be reused without re-evaluation for further re-compilation.

In [Yi11], the separation between analyses and transformations is enforced at the pass-manager level: a compiler is used to generate a valid sequence of passes, as a script written in a dedicated language; this sequence is then executed by the pass manager. This approach tries to decouple analyses and transformations, but it cannot verify the validity of the generated script (doing so would require the generator to effectively execute the transformations) and it does not favor reuse of the compiler infrastructure, as the pass manager executes passes written in another compiler.

Our approach basically turns a compiler into a program transformation system, which is an active research area. FermaT [War99] focuses on program refinement, and its composition is limited to sequences. cil [NMRW02] only provides a flag-based composition system, that is, the activation or deactivation of passes in a predefined sequence, without taking into account any feedback from the processing. The stratego [OOVV05] software transformation framework does not separate concepts as clearly as we do, but it uses the concept of strategies to describe the chaining of transformations using dedicated operators, an approach similar to ours (see Section 3.1.2). However, its implementation relies on a new language, while we chose to map the concepts to existing constructs. Work on optimix by Aßmann [Aß96] proposes considering asts as graphs and passes as graph transformations, and using a graph rewriting system to specify transformations. All the transformations and their composition are done at the ast level.

In parallel <strong>to</strong> the pass management <strong>to</strong>pic, a growing number of studies has been conducted<br />

over the past few years <strong>to</strong> tackle heterogeneous plat<strong>for</strong>ms. In [ABCR10], a scheme<br />

<strong>to</strong> couple Just In Time (jit) compilation <strong>for</strong> multiple targets is described. They designed<br />

a new Object Oriented (oo) language named lime which has the property of being convertible<br />

in<strong>to</strong> two representations: one targets regular Central Processing Units (cpus) and<br />

one targets Field Programmable Gate Arrays (fpgas) through a complex compilation flow<br />

that involves the verilog language. An originality of the approach is that they provide<br />

a runtime that plays a bridging role between the two representations, allowing a so called<br />

“mixed mode execution” where the best representation is selected at runtime. In fact, the<br />

jit approach is orthogonal <strong>to</strong> our concern. It tackles a slightly different problem: code<br />

portability. Once a code has been compiled <strong>for</strong> a target, is it possible <strong>to</strong> retarget it <strong>for</strong><br />

another hardware, without the need of generating another binary? In the static compilation<br />

model, it is not possible 9 <strong>to</strong> address new hardware, hardware that is not known at<br />

9. pgi’s compiler works around this limitation by bundling several binaries, one per target, in the same<br />

executable. It does provide a kind of retargetability, but it does not address the issue of supporting new



compilation time, while theoretically an update of the jit compiler does.

Ocelot [DKYC10] is a similar project that translates Parallel Thread eXecution (ptx) code into x86 emulation code, amd gpu code, or llvm code for parallel execution on multicores, thanks to a jit compilation infrastructure. Choosing ptx as a front-end language provides the advantage of having a direct description of the parallelism, but it does not address legacy code.

An obvious approach to tackle heterogeneity while taking advantage of existing transformations is to extend traditional compilers to support heterogeneous targets. For instance, gomet [BL10] is a gcc-based compiler that proposes the following compilation flow: first, take advantage of gcc to parse the input code, generate the gimple representation and apply high-level passes such as Static Single Assignment (ssa) conversion or polyhedral loop transformations; then, build an interprocedural hierarchical dependence graph that is combined with a high-level target description to generate source files to be compiled for the target. The target architecture description takes into account three characteristics: a cost model to decide offloading profitability, the parallelism description for simd and/or mimd code generation, and the current load for runtime scheduling.

This approach shares several aspects with our methodology, but code generation is performed under the assumption that the target hardware compiler accepts plain C code as input, which is indeed the case for openmp and the cell engine, but not for gpus or fpga-based processors. This explains the lack of an Instruction Set Architecture (isa) description in the architecture model. Moreover, the architecture description consists in a set of C functions that implement the correct behavior, which lacks abstraction and flexibility. The compilation process is hard-coded for each target and not retargetable.

3.5 Conclusion

In this chapter, we have presented the impact of heterogeneous computing on compiler infrastructures. We have pointed out that the use of the C language with the proper conventions is a good choice of ir. We have described a compilation infrastructure that takes advantage of this choice, combined with source-to-source capabilities, to enforce code reuse and to make it easier to interact with third-party tools. Based on a new model for the combination of code transformations, we have specified an api for pass managers, called pyps, and combined it with a generic programming language to end up with a flexible way to design compilers for heterogeneous devices. The pyps api and its Python implementation are publicly available as part of the pips project.

This api is used to combine the code transformations needed to match the hardware constraints identified in Chapter 2 and presented in the next three chapters.




Chapter 4

Representing the Instruction Set Architecture in C

Pont de Sainte Catherine, Plounevezel, Finistère © Jean-Claude Even

In a keynote given at the Fusion Developers Summit 2011, in Bellevue, Washington, Phil Rogers announced that

    The Fusion System Architecture (fsa) is Instruction Set Architecture (isa) agnostic for both Central Processing Units (cpus) and Graphical Processing Units (gpus). This is very important because we're inviting partners to join us in all areas; other hardware companies to implement fsa and join in the platform. . .

Under the hood, the Fusion System Architecture (fsa) relies on a virtual isa. In Chapter 3, we stated that keeping the Internal Representation (ir) independent from the targeted hardware is a requirement to reach a good level of abstraction. But is it possible to represent all the refinements of the targeted isa in an ir that stays close to the C language [ISO99]? Quoting Brian Kernighan [Ker03],

    C is perhaps the best balance of expressiveness and efficiency that has ever been seen in programming languages. (. . . ) It was so close to the machine



    that you could see what the code would be (and it wasn't hard to write a good compiler), but it still was safely above the instruction level and a good enough match to all machines that one didn't think about specific tricks for specific machines.

This reminds us that the C language was designed to be "close to the metal". So even if the chosen ir remains as close as possible to the C language, it can be sufficiently low-level to express some of the specificities of the targeted isa. This is examined in Section 4.1. We then go through all the aspects of an isa and show that, provided some conventions and minor transformations, it is possible to adapt a C code to meet isa constraints. Section 4.2 examines native data types; Section 4.3 reviews the use of specific registers; Section 4.4 details the link between intrinsics and instructions; and Section 4.5 goes through the differences related to the memory architecture. Issues related to function boundaries are examined in Section 4.6, and external libraries are examined in Section 4.7.

4.1 C as a Common Denominator

The C language is the de facto standard for programming low-level devices. In this section, we study the relationships between the standard language and the dialects used to program hardware accelerators, and argue that the C language can be used to embody some aspects of the isa.

4.1.1 C Dialects and Heterogeneous Computing

A problem raised by heterogeneous architectures from a compiler point of view is the choice of a suitable ir. Representing all hardware specificities in a unique language is not a feasible task. Although there has been recent work [PBdD11] to express hardware constraints at the ir level, representing all of them in a common ir is not easy. Some compilers have however chosen to extend their ir to represent target-specific features: Low Level Virtual Machine (llvm) adds vector types to its basic types. Alternatively, a target-specific language can be used as a basis for other architectures: fcuda [PGS+09] translates Compute Unified Device Architecture (cuda) kernels into Field Programmable Gate Array (fpga) circuits, and swan [HF11] translates cuda codes into Open Computing Language (opencl) ones.

An opposite approach is to use the versatility of the C language to represent both high-level and low-level concepts, using an ir that matches the initial language without target-specific extensions. In essence, this is similar to the concept of language virtualization. The definition of language virtualization given by Hassan Chafi et al. [CDM+10] is the following:

    A programming language is virtualizable with respect to a class of embedded languages if and only if it can provide an environment to these embedded languages that makes the embedded implementations essentially identical to



C dialect            parent language   target
Handel-C [LWFK02]    C                 fpga
Mitrion-C            C++               fpga
c2h                  C                 fpga
cuda                 C++               gpu
opencl               C99               gpu & manycore

Table 4.1: C dialects and targeted hardware.

    corresponding stand-alone language implementations in terms of expressiveness, performance and safety—with only modestly more effort than implementing the simplest possible complete embeddings.

We propose a relaxed definition for a programming language that can be derived from another programming language through minor rewriting while maintaining an equivalent semantics.

Definition 4.1. A programming language L0 targeting a hardware H0 is embodied by another programming language L1 targeting a hardware H1 if there exists a code transformation τ : L0 → L1 such that the execution on H0 of a program P written in L0 yields the same operational result as the execution of τ(P) on H1.

This definition is deliberately imprecise on the term "yields the same operational result", because execution order or varying type sizes introduce output changes that may not be significant to the end user.

Table 4.1 lists a number of languages used to program hardware accelerators and their target platforms. It clearly shows that C dialects are often chosen as the interface between the programmer and the hardware.

4.1.2 From the ISA to C

In [JRR99], Jones et al. presented a language called C-- that aims to be

    the interface between high-level compilers and retargetable, optimizing code generators.

This language restricts the possibilities of the C language to make it easier to manipulate. It is quite low-level and does not match the requirements of a high-level language for abstracting concepts, but it paves the way for the idea of using a subset of C as an ir.

C extensions such as Embedded C [ISO08] have also been proposed to handle some specificities of heterogeneous computing, like dedicated registers, fixed-point arithmetic, etc., but, most important to us, the availability of multiple address spaces. In the ISO/IEC TR 18037:2008 specification, a global address space is assumed; additional ones can be declared, in a nested fashion, and used as qualifiers. However, operations on disjoint address spaces are not allowed, thus data communication must be handled through overlapping address spaces or with intrinsic functions.
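As an illustration, a minimal sketch of TR 18037-style named address space qualifiers; the names _Mask and _Image are hypothetical, since the TR leaves the actual address space names implementation-defined:

    /* Named address spaces used as type qualifiers (TR 18037 style).
     * _Mask and _Image are illustrative, implementation-defined names. */
    _Mask const int coeffs[9];       /* small read-only memory */
    _Image int pixels[320 * 240];    /* larger read-write memory */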

In the context of heterogeneous computing, an ir must achieve two goals:



__m128 _mm_set1_ps(float);

Listing 4.1: Broadcast a single value in sse.

typedef struct {
    float data[4];
} __m128;

Listing 4.2: Vector type emulation in C.

1. make it easy for compiler developers to write passes and analyses;
2. make it easy for compiler developers to write back-ends.

The Hierarchical Control Flow Graph (hcfg) used in Paralléliseur Interprocedural de Programmes Scientifiques (pips) [CJIA11] somehow matches the first point under the constraint of source-to-source compilation, the idea being that the code hierarchy maps the hardware hierarchy. For instance, if you target Instruction Level Parallelism (ilp), you can focus on loops and ignore the rest of the program; for task parallelism, you can focus on functions; etc. The second point raises the question: how can a hardware-specific isa be modeled at the ir level? As described in [PH09], each piece of hardware has its own isa, and it is not sustainable to extend the ir to match each new hardware specificity, or each time a new feature appears. This is however the path taken by some compilers such as rose [Qui00, SQ03], where Parallel Thread eXecution (ptx) concepts are exhibited at the ir level. A direct consequence of this approach is that all existing algorithms must be extended to take these new constructs into account. In this section, we propose an alternate approach that consists in modeling enough of the isa at the C level to leverage existing analyses.

Let us take a short example, taken from the C to Streaming simd Extension (sse) compiler described in Chapter 7. To duplicate a single single-precision floating-point value across all 4 slots of an sse register, an intrinsic is available with the signature given in Listing 4.1. Instead of adding vector type support to the ir, one can emulate its behavior using an array type, encapsulated in a structure so that it can be used as the return value of a function, as shown in Listing 4.2, and provide a sequential implementation that performs the same computations and has the same memory effects. That way the same result is produced and the data dependencies are still correct. In this case, it leads to the code in Listing 4.3.

To validate our approach, we have written a replacement for the header file xmmintrin.h that does not make use of vector extensions. Said otherwise, we embody the sse extension using the C language. An excerpt of this file is shown in Appendix D. It has been successfully tested on an sse-based implementation of the SHA-1 algorithm: the project is compiled twice, once using the default configuration and once using the same configuration plus an additional flag that tells the compiler to use our sequential implementation of xmmintrin.h. The same inputs are processed and we verify that we get the same checksum.



__m128 _mm_set1_ps(float v) {
    __m128 res = { { v, v, v, v } };
    return res;
}

Listing 4.3: Sequential implementation of _mm_set1_ps.

Additionally, we measured that the sequential implementation runs 25 times slower than the optimized version, mainly due to the repeated copies between union data types, 1 plus the absence of parallelism. 2

4.2 Native Data Types

As a result of the specialization of hardware devices, it is more common to find a device that does not support a data type than a device that supports a data type not supported by the C language. This section examines how to handle languages that do not support all the type constructions allowed by the C language.

4.2.1 Scalar Types

A common case is the absence of a floating-point type. The usual fallback is fixed-point arithmetic. This approach is favored when a Floating-Point Unit (fpu) would be too expensive, in area or energy consumption, or when the accuracy loss is not a problem. Kum et al. [KKS00] propose an approach to convert C programs with floating-point types into equivalent programs with fixed-point arithmetic.

Some hardware supports standard types with unusual sizes. For instance, the terapix architecture [BLE+08] uses integer registers of 36 bits. This kind of situation is easily emulated using a type definition, in the same way the C99 standard headers define the int32_t and int64_t types.
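A minimal sketch of such an emulation, assuming a hypothetical int36_t for the terapix registers, stored in the smallest standard type wide enough:

    #include <stdint.h>

    /* Hypothetical 36-bit integer type, emulated on int64_t in the
     * same spirit as the <stdint.h> int32_t/int64_t definitions. */
    typedef int64_t int36_t;

    /* Keep a value within 36 bits, sign-extending bit 35 to
     * approximate the register's wrap-around behavior. */
    static inline int36_t int36_wrap(int64_t v) {
        uint64_t u = (uint64_t)v & ((UINT64_C(1) << 36) - 1);
        if (u & (UINT64_C(1) << 35))
            u |= ~((UINT64_C(1) << 36) - 1);
        return (int64_t)u;
    }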

4.2.2 Records

Handling the absence of structures requires more care. The obvious solution is to split each variable that has a structure type into as many variables as the number of fields. The first step builds the set of all final types involved in a type definition. To do so, we introduce a function final_types : T × I → P(T × I), characterized by the following induction rules defined on the LuC language (see Appendix B):

1. These copies could be avoided by using the C++ language, constant references and return value optimization, at the expense of losing C compatibility.
2. Neither the gnu C Compiler (gcc), the Intel C++ Compiler (icc) nor pips is capable of automatically revectorizing the sequential code, because of the additional structure copies.



final_types(int, id) = {⟨int, id⟩}
final_types(float, id) = {⟨float, id⟩}
final_types(complex, id) = {⟨complex, id⟩}
final_types(type[expr], id) = ∪ { ⟨t[expr], i⟩ | ⟨t, i⟩ ∈ final_types(type, id) }
final_types(struct s { fields }, id) = ∪ { final_types(t, new_id(id, i)) | ⟨t, i⟩ ∈ fields }

where new_id is a function that constructs a new identifier unique to the program p, using a prefix pre such that no variable declared in p is prefixed by pre:

new_id : I × I → I
(i0, i1) ↦ prei0_i1

Then all references are rewritten using the rename : R → R function defined by the induction rules:

rename(id) = id if id is a scalar
rename(ref[expr]) = rename(ref)[renamee(expr)]
rename(ref.id) = rename(ref)_id

where renamee : E → E is defined by:

renamee(cst) = cst
renamee(ref) = rename(ref)
renamee(expr0 op expr1) = renamee(expr0) op renamee(expr1)

This transformation is illustrated by Listing 4.4, under the assumption pre = __.



typedef struct { int f0; float f1; float f2[3]; } my;
my var;
var.f0 = 2;
var.f2[var.f0] = 4.2;

⇓

int __var_f0;
float __var_f1;
float __var_f2[3];
__var_f0 = 2;
__var_f2[__var_f0] = 4.2;

Listing 4.4: Example of structure removal.

This trans<strong>for</strong>mation is problematic in three cases:<br />

– when a variable of structure type is passed as a function parameter;<br />

– when the address of a variable involved in a structure is taken;<br />

– when the sizeof opera<strong>to</strong>r is used.<br />

These situations cannot be handled by our <strong>to</strong>y language but arise in practice. For the<br />

first case, it is still possible <strong>to</strong> extend the function definition <strong>to</strong> take one parameter per<br />

final_types(t) element, as illustrated by Listing 4.5. In the second case we cannot do much<br />

because the memory layout is completely changed by the trans<strong>for</strong>mation and we cannot<br />

assume any previous pointer arithmetics is still correct. The third case is easy <strong>to</strong> handle,<br />

by replacing the sizeof expression by the actual type size. 3<br />

4.2.3 Arrays

It is also possible to handle the absence of array types under two conditions: all arrays have a fixed size, and all array accesses use constant indices. 4 This is simply done by creating as many variables as the array size and then using a renaming convention for all constant array indices. The declaration transformation d : T × I → (T × I)+ is given by:

d(int, id) = {⟨int, id⟩}
d(float, id) = {⟨float, id⟩}
d(complex, id) = {⟨complex, id⟩}
d(type[cst], id) = ∪ (i = 1 .. cst) d(type, new_id(id, i))

3. At that point we are already at the specialization step, so portability across hardware platforms is no longer an issue, and considerations like "the size of a type is platform dependent" are no longer relevant.
4. This situation was found in code automatically generated from the Faust programming language [GO03].



typedef struct { double re, im; } complex;
void cmul(complex *res, const complex *c0, const complex *c1) {
    res->re = c0->re*c1->re - c0->im*c1->im;
    res->im = c0->re*c1->im + c0->im*c1->re;
}
/* ... */
complex a = {1, 0}, b = {0, 1}, c;
cmul(&c, &a, &b);

⇓

void cmul(double *res_re, double *res_im,
          const double *c0_re, const double *c0_im,
          const double *c1_re, const double *c1_im) {
    *res_re = *c0_re * *c1_re - *c0_im * *c1_im;
    *res_im = *c0_re * *c1_im + *c0_im * *c1_re;
}
/* ... */
double a_re = 1, a_im = 0, b_re = 0, b_im = 1, c_re, c_im;
cmul(&c_re, &c_im, &a_re, &a_im, &b_re, &b_im);

Listing 4.5: Structure removal in the presence of a function call.

and the reference renaming r : R → R is similarly given by

r(id) = id if id is a scalar
r(ref[cst]) = new_id(r(ref), cst)
r(ref.id) = r(ref).id

which is similar to the structure case and suffers from the same limitations.
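For instance, a two-element array accessed only through constant indices can be scalarized as follows (a minimal sketch, reusing the prefix __ from Listing 4.4):

    /* initial code */
    float sum2(void) {
        float a[2];
        a[0] = 1.0f;
        a[1] = a[0] + 2.0f;
        return a[1];
    }

    /* after constant array scalarization */
    float sum2_scalarized(void) {
        float __a_0, __a_1;
        __a_0 = 1.0f;
        __a_1 = __a_0 + 2.0f;
        return __a_1;
    }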

A common case is the absence of array types but support for pointers. The transformation from arrays to pointers requires two steps:

1. a linearization step, where multi-dimensional arrays are converted to uni-dimensional ones;
2. an array-to-pointer conversion step, using the equivalence between a[i] and *(a+i).

Listing 4.6 illustrates the two steps of this transformation.

4.3 Registers

Depending on the isa, some registers can only be used for particular operations. For instance, the x86 architecture distinguishes between general-purpose registers and floating-point registers.



// initial code
float a[n][3];
a[2*i][1]++;

⇓

// after linearization
float a[3*n];
a[2*i*3+1]++;

⇓

// after pointer conversion
float *a = alloca(sizeof(*a)*n*3);
(*(a+2*i*3+1))++;

Listing 4.6: Two-step transformation from multi-dimensional arrays to pointers.

In ptx [NVI10], special registers are dedicated to thread identifiers (%tid) and warp identifiers (%warpid).

According to the compilation scheme proposed in Figure 3.9, low-level transformations are taken care of by the vendor compiler. However, there are situations where such compilers are not available and only an assembler is provided; we found ourselves in this situation for the terapix machine.

In that case, a specific naming scheme is used to mark special registers, and hardware-specific heuristics are used to distinguish general-purpose registers from the others.

Let us take the example of the terapix architecture [BLE+08], which uses three kinds of registers with three prefixes: im for image pointers, ma for mask pointers and re for scalar variables. For each variable, depending on its type (array, pointer or scalar), it is possible to determine whether it needs the re prefix. Then, to distinguish between mask data, stored in a small read-only memory, and image data, stored in a bigger read-write memory, we use a heuristic: any array that is written at least once goes to the image memory, and an array that is only read goes to the mask memory if its memory footprint is statically known to be less than a given constant. Listing 4.7 illustrates this transformation for a kernel that lightens an image by a constant, given in a mask.

4.4 Instructions

Simple instructions usually have a C equivalent: set and move can be represented as assignments, though more complex functions may be needed to represent loads from remote memory, as is the case for vector registers. Likewise, reads and writes from external devices



void launcher_0_microcode(int I, int *in, int *out, int *cst)
{
    int j;
    int *out0, *in0;
    in000 = in00;
    out000 = out00;
    for (j = 0; j < I_18; j += 1) {
        *out000 = *in000 + *cst;
        in000 = in000 + 1;
        out000 = out000 + 1;
    }
}

⇓

void launcher_0_microcode(int *FIFO2, int *FIFO1, int *FIFO0, int N0)
{
    int *im0, *im1, *im2, *im3, *ma4;
    int re0;
    ma4 = FIFO2;
    im3 = FIFO1;
    im2 = FIFO0;
    im0 = im2;
    im1 = im3;
    for (re0 = 0; re0 < N0; re0 += 1) {
        *im1 = *im0 + *ma4;
        im0 = im0 + 1;
        im1 = im1 + 1;
    }
}

Listing 4.7: Using a naming convention to distinguish registers for the terapix architecture.



can be abstracted as memcpy using the proper memory location abstractions, as discussed in Section 4.5.

Basic arithmetic and logic operations are available in C, and those that are not (e.g. Fused Multiply-Add (fma) or the min and max operators) can be represented by an equivalent function.
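A minimal sketch of such equivalent functions; the names are illustrative, not those used by our compiler:

    /* Scalar stand-ins for ISA operations missing from plain C.
     * A backend can later match these calls to real instructions. */
    static inline float fma_op(float a, float b, float c) {
        return a * b + c;   /* fused multiply-add, up to rounding */
    }
    static inline int min_op(int a, int b) { return a < b ? a : b; }
    static inline int max_op(int a, int b) { return a > b ? a : b; }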

Control flow instructions such as branches, loops, conditional branches, indirect branches, jumps, etc. all have their C equivalent. If a complex control-flow construct is missing in the targeted language, it is generally possible to lower it to the desired level. Such transformations include the conversion of for loops to while loops, of repeat until to while, or even of while to gotos. if then else tests can be split into two if then blocks, etc.

The maximum number of operands can be restricted by the isa. In that case, complex C expressions can be split to take this constraint into account.

Parallel instructions, as found in the Advanced Vector eXtensions (avx) instruction set, can be emulated by their sequential counterparts, which correspond to one of the possible ways of scheduling the parallel operations.

4.4.1 Instruction Selection

Specialized instruction sets are used to speed up execution. At the extreme, this has led to Complex Instruction Set Computers (cisc). This specialization is often seen in dsps: fma, saturated arithmetic, min/max, etc. While not mandatory to obtain correct code, taking advantage of these instructions is mandatory for performance.

The process of mapping the target-independent ir to a target-specific instruction set is called instruction selection and is one of the last steps performed in a traditional compiler. The problem is generally solved by using dynamic programming on expression trees [AJ75] or by sub-graph partitioning [API03].

In our case, this process should be delegated to the source-to-binary compiler, not to the source-to-source compiler. However, it may be necessary to perform this step:
– before Single Instruction stream, Multiple Data stream (simd) instruction generation. For instance, the neon instruction set supports a vectorized fma, so the fma pattern must be found prior to simdization;
– when the source-to-binary compiler does not perform complex instruction selection.

Basically, a few instructions have to be converted to function calls at the source level. As the expression trees of these few expressions either do not overlap (e.g. fma and maximum) or are subsets of one another (e.g. maximum and saturated add), a greedy algorithm performs well enough.

An instruction is described by its name, the number and type of its operands, and its expression tree. In a very source-to-source manner, we represent an instruction by a regular C function, using the function name as instruction identifier, the parameters as operands and the body as expression tree. 5 From a list of patterns, the algorithm iteratively performs a

5. This approach is quite similar to the way C intrinsics move assembly operations to the C function level. For instance, gcc defines the intrinsic __sync_bool_compare_and_swap to issue an atomic compare-and-swap. Compiling the call with gcc 4.6.1 on an x86 computer translates into the assembly instruction lock cmpxchgl, which can only be represented in C through an asm(...) statement.



pattern-matching pass for each element of the sequence.
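A minimal sketch of such a pattern and the source-level rewrite it drives; the function name fma and the surrounding code are illustrative:

    /* Pattern: a regular C function whose name identifies the
     * instruction, whose parameters are the operands and whose
     * body is the expression tree to match. */
    float fma(float a, float b, float c) {
        return a + b * c;
    }

    /* before pattern matching:  d = x + y * z;    */
    /* after pattern matching:   d = fma(x, y, z); */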

4.4.2 N-Address Code Generation

When a program contains long expressions, it is sometimes necessary to break these expressions into smaller ones. This transformation is called n-address code generation. It is generally used for assembly code generation. It can however be useful at the source level, either because the back-end compiler takes assembly-like code as input, or to create opportunities for further transformations like invariant code motion.

The problem can first be stated as follows: given a statement of the form id = expr, where id is an identifier, and an integer n ∈ N∗, transform it into a sequence of assignments such that no expression on the right-hand side of an assignment involves more than n references.
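For example, with n = 3 the decomposition proceeds as follows (a sketch; the temporary name d0 is illustrative):

    void naddress_example(float a, float b, float c, float e) {
        float d, d0;
        /* original statement: d = a * b + c * e; */
        /* after n-address code generation with n = 3: each
         * right-hand side involves at most three references */
        d  = a * b;
        d0 = c * e;
        d  = d + d0;
        (void)d;   /* the result is unused in this sketch */
    }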

Let us define the transformation a : N+ × S → S:

a(n, id = cst) = id = cst                                                      (4.1)
a(n, id = ref) = id = ref                                                      (4.2)
a(n, id = e0 op e1) =
    id = e0 op e1                                    if depth(e0) + depth(e1) < n
    a(n, id = e0); a(n, id0 = e1); id = id op id0    otherwise                 (4.3)

where id0 is a new identifier unique to the program, and depth : E → N is defined by

depth(cst) = 1
depth(id) = 1
depth(ref[expr]) = 1 + depth(expr)
depth(expr0 op expr1) = depth(expr0) + depth(expr1)

Let us prove that this transformation is correct, that is, given n ∈ N+, a memory state σ ∈ Σ and a statement s ∈ S such that s = id = expr, we have:

S(s, σ) = S(a(n, s), σ)

Proof. We proceed by induction. The input statement is unchanged by Equations (4.1) and (4.2), so the equality is verified in these two cases.

Case (4.3) when depth(expr0) + depth(expr1) ≥ n leads to

S(a(n, s), σ)
= S(a(n, id = expr0); a(n, id0 = expr1); id = id op id0, σ)
= S(id = id op id0, S(a(n, id0 = expr1), S(a(n, id = expr0), σ)))
= S(id = id op id0, S(id0 = expr1, S(id = expr0, σ)))      [recursion hypothesis]
= S(id = id op id0, S(id0 = expr1, σ[R(id, σ) → E(expr0, σ)]))      [call this state σ′]
= S(id = id op id0, σ′[R(id0, σ′) → E(expr1, σ′)])
= S(id = id op id0, σ′[R(id0, σ) → E(expr1, σ)])      [call this state σ″]
= σ″[R(id, σ″) → E(id op id0, σ″)]
= σ″[R(id, σ″) → E(id, σ″) op E(id0, σ″)]
= σ″[R(id, σ″) → σ″(R(id, σ″)) op σ″(R(id0, σ″))]

Because id0 is an identifier, ∀(σ0, σ1) ∈ Σ², R(id0, σ0) = R(id0, σ1), hence

= σ″[R(id, σ) → σ″(R(id, σ)) op σ″(R(id0, σ))]
= σ″[R(id, σ) → E(expr0, σ) op E(expr1, σ)]
= S(id = expr0 op expr1, σ)[R(id0, σ) → E(expr1, σ)]

The update of location R(id0, σ) can be safely ignored as id0 is a new, unique identifier.

In the definition of the problem, we state that an assignment must not contain more than n references. However, a reference can itself contain expressions. Let us define an auxiliary function ar : N+ × R → (R × S) with the following syntactic rules:

ar(n, id) = ⟨id, ;⟩
ar(n, ref[expr]) = ⟨id0[id1], sr; id0 = rr; id1 = expr⟩ where ⟨rr, sr⟩ = ar(n, ref)
ar(n, ref.id) = ⟨id0.id, sr; id0 = rr⟩ where ⟨rr, sr⟩ = ar(n, ref)

Given a positive value n ∈ N+, a memory state σ ∈ Σ and a reference r ∈ R, let us prove that:


82CHAPTER 4. REPRESENTING THE INSTRUCTION SET ARCHITECTURE IN C<br />

⟨rr, sr⟩ = ar(n, r) implies R(r, σ) = R(rr, S(sr, σ))

Proof. We use an inductive reasoning over r. The property is true for the case id, as the denoted statement is unchanged.

Consider the case r = ref[expr], that is

ar(n, r) = ⟨rr, sr⟩ = ⟨id0[id1], s′r; id0 = r′r; id1 = expr⟩ where ⟨r′r, s′r⟩ = ar(n, ref)

We have

R(rr, S(sr, σ))
= R(rr, S(s′r; id0 = r′r; id1 = expr, σ))
= R(rr, S(id1 = expr, S(id0 = r′r, S(s′r, σ))))      [call S(s′r, σ) = σ′]
= R(rr, S(id1 = expr, σ′[id0 → σ′(R(r′r, σ′))]))      [call this state σ″]
= R(rr, σ″[id1 → E(expr, σ″)])      [call this state σ‴]
= R(id0[id1], σ‴)
= R(id0, σ‴)[E(id1, σ‴)]
= R(id0, σ‴)[σ‴(id1)]
= R(id0, σ‴)[E(expr, σ″)]
= R(id0, σ‴)[E(expr, σ)]      [expr references neither id0 nor id1]
= R(r′r, σ′)[E(expr, σ)]
= R(r′r, S(s′r, σ))[E(expr, σ)]
= R(ref, σ)[E(expr, σ)]      [induction hypothesis]
= R(ref[expr], σ)

Next, consider the case r = ref.id, that is

ar(n, r) = ⟨rr, sr⟩ = ⟨id0.id, s′r; id0 = r′r⟩ where ⟨r′r, s′r⟩ = ar(n, ref)



We have

R(rr, S(sr, σ))
= R(rr, S(s′r; id0 = r′r, σ))
= R(rr, S(id0 = r′r, S(s′r, σ)))      [call S(s′r, σ) = σ′]
= R(rr, σ′[id0 → σ′(R(r′r, σ′))])
= R(rr, σ′[id0 → σ′(R(ref, σ))])      [induction hypothesis]
= R(id0.id, σ′[id0 → σ′(R(ref, σ))])
= R(id0, σ′[id0 → σ′(R(ref, σ))]).id
= R(ref, σ′[id0 → σ′(R(ref, σ))]).id
= R(ref, σ).id
= R(r, σ)

The application of ar generates valid input for a, so that code containing no more than n identifiers per statement can be generated.

4.5 Memory Architecture

The logical memory model of the C language is flat. Qualifiers can be used to change some of the storage properties: const, volatile, register, etc. Heap and stack concepts are not part of the C language itself.

However, heterogeneous machines have several separate memories. It is still possible to emulate these different memories using a different address space for each memory, much like user space is separated from kernel space. To distinguish between two memory spaces, we use a naming convention on the variable type name. For instance, variables allocated on a gpu have their type name prefixed by __gpu_. Another option is the qualifier extension proposed for Embedded C [ISO08], but it implies an extension of the ir.
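A minimal sketch of this convention; the __gpu_ prefix is the one quoted above, while the typedef and variable names are illustrative:

    /* Host-side and gpu-side variants of the same element type,
     * distinguished only by a prefix on the type name. The types
     * are identical to the C compiler; the prefix carries the
     * address space information for our analyses. */
    typedef float __gpu_float;       /* lives in the gpu memory */

    float       host_buffer[1024];   /* host address space */
    __gpu_float gpu_buffer[1024];    /* gpu address space, by convention */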

4.6 Function Calls

4.6.1 Removing Function Calls

Some isas simply do not support function calls. At the source level, inlining [CH89, JM99] is a convenient solution to sidestep the issue without breaking the code structure, as stack emulation or indirect gotos would, as long as recursive calls are not involved.



Although inlining seems a straightforward transformation, there are several details that must be taken care of in the context of source-to-source compilation, because the generated code must still be correct:

1. if the inlined function contains static declarations, these declarations must be made global;
2. if the inlined function contains references to global variables, they must be declared with external storage at the call site; 6
3. if the inlined function contains references to enumerations or structure fields that are not visible from the caller's compilation unit, they must be redeclared in this compilation unit; 7
4. most naming conflicts can be dealt with using an extra block declaration, but not when static variables are promoted to globals, nor for label names or effective parameter names;
5. return instructions must be replaced by a goto, possibly preceded by an assignment, as in the sketch below.
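A minimal sketch of point 5, on a hypothetical inlined call site:

    int inlined_call(int x) {
        int y;
        /* before inlining: y = f(x); where
         *   int f(int a) { if (a < 0) return 0; return a * 2; } */
        int a = x;              /* by-copy parameter passing */
        if (a < 0) {
            y = 0;
            goto f_end;         /* was: return 0; */
        }
        y = a * 2;              /* was: return a * 2; */
    f_end:
        return y;
    }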

A basic inlining simulates by-copy parameter passing through additional assignments and generates a lot of gotos. Yet it is possible to use forward substitution [Muc97] to remove the spurious assignments and goto elimination [Ero95] to restructure the control flow.

4.6.2 Outlining

Most languages dedicated to hardware accelerators use functions to separate the host code from the accelerator code, generally using a new qualifier (e.g. __kernel__) to identify accelerator functions. However, the code fragment that is to be promoted as a kernel is usually part of another statement, so it is necessary to extract a function from a code fragment, i.e. from a statement. This transformation is called outlining.

4.6.2.1 Outlining Algorithm

There exist two main ways of passing parameters:

by copy: during the function call, the formal parameter is replaced by a copy of the actual parameter. It holds the same value but has a (possibly) different name and a different location. This is the passing mode used in C.

by reference: during the function call, the formal parameter is directly replaced by the actual parameter. It holds the same value and has the same location. It is emulated in C by passing the address of a variable instead of the variable itself. This is the passing mode used in Fortran.

6. This also includes function calls.
7. This also includes calls to static functions.



by constant reference: during the function call, the formal parameter is directly replaced by the effective parameter, as in by-reference parameter passing, but it is guaranteed that the corresponding memory locations are not written in the function body.

Passing a parameter by copy generates a copy of the full variable. A common optimization for a variable that uses a non-trivial amount of memory is to pass it by reference in order to avoid the extra copies. To make it clear that the variable is read-only, it is common practice to pass it as a constant reference. 8 The three modes are sketched below.
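A minimal sketch of the three modes as C signatures; the function names are illustrative:

    /* by copy: the callee works on its own copy of v */
    void by_copy(float v);

    /* by reference: emulated in C by passing the address;
     * the callee may write through the pointer */
    void by_reference(float *v);

    /* by constant reference: same location, but read-only */
    void by_constant_reference(const float *v);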

The goal when designing the outlining transformation is to pass as few parameters as possible to the generated functions, with the most restrictive and efficient parameter passing mode.

Let outline : S × Σ → P(R) × P(R) × P(R) be a function that maps a statement in a memory state to a triplet containing the parameters passed by copy, the parameters passed by reference and the parameters passed by constant reference. It is defined by:

outline(s, σ) = ⟨ {r | r ∈ (Ri(s, σ) − Ro(s, σ)) ∧ typeof(r) ∈ Tscalar},
                  Ro(s, σ),
                  {r | r ∈ (Ri(s, σ) − Ro(s, σ)) ∧ typeof(r) ∉ Tscalar} ⟩

where Tscalar is the set of all scalar types.

Each reference gathered by the function is textually used as an effective parameter and replaced in s by a new identifier of the corresponding type. Note that Ri(s, σ) and Ro(s, σ) automatically filter out private variables and locally declared variables.

Listing 4.8 illustrates the outlining process, in which the internal loop is outlined as a new function kernel.

The example from Listing 4.8 can be further improved by using a variant of common subexpression elimination: the variable i is only used to compute the sub-arrays in[i] and out[i]. It is possible to detect this situation and generate code as in Listing 4.9. The basic idea behind this transformation is to scan all statement references and look for a constant prefix. Let us first introduce the concept of reference prefix.

Definition 4.2. Given (r, r′) ∈ R², r′ is a prefix of r, denoted r′ ≺ r, if and only if

∃n ∈ N∗, ∃(e1, . . . , en) ∈ E^n : r = r′[e1][. . .][en]

For instance, a[2*i] is a prefix of a[2*i][k+1].

Given a set of references R ∈ outline(s, σ) and a reference r ∈ R, if ∃r′ ∈ R such that r′ ≺ r, and if ∀k ∈ 1..n, ∀x ∈ refs(ek), x ∉ Rw(s, σ), then the prefix is constant with respect to statement s and a constant prefix reference is found. r′ is used as the effective parameter instead of r, and the substitution in s is performed accordingly.

8. This practice is extensively used in C++.


void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i < ...   [listing truncated]

Listing 4.8: Outlining the internal loop of erode into a new function kernel.

4.6.2.2 Using Outlining to Reduce Compilation Time

Outlining can also be useful to reduce compilation time. Indeed, many compiler algorithms have polynomial if not exponential complexity. 9 In that case it is beneficial to split a complex function into several parts that can be processed independently: if the complexity c(s) of a statement transformation verifies the property c(s0; s1) > c(s0) + c(s1), then it is profitable to split computations.

For instance, a sequence of two loops can be outlined into two distinct functions and compiled separately. Of course, this decision can kill some optimization opportunities and thus must be used with care. We have however found it especially useful in some situations, like the generation of vector instructions in a function that contains two loop nests that cannot be merged.

As many transformations focus on loop nests, we propose to isolate each loop nest in a single function using outlining and to apply further transformations on these functions. Note that loop fusion is tried before outlining, because it would be difficult to apply it afterwards. This process is described in Algorithm 2.

Data: f ← a function
Result: L, a list of functions
loop_fusion(f);
k ← 0;
L ← ∅;
for l ∈ outer_loops(f) do
    outline(l, fk);
    L ← L ∪ {fk};
    k ← k + 1;
end
return L;

Algorithm 2: Compilation complexity reduction with outlining.

A variant of Algorithm 2 is used for Multimedia Instruction Set (mis) generation: after some loop tiling to improve locality, the code that performs the computations on a tile is outlined to a new function, because each loop can then be considered independently of the others.

We carried out the following experiment on the pips convex array region analysis: starting from two consecutive matrix multiplications, each loop nest is outlined to a new function and its innermost loop is unrolled by a given rate. This results in several code versions with an increasing number of statements. We then performed the computation of the convex array regions on each version.

Figure 4.1 shows that an overhead linked to the inter-procedural analysis exists, but when the loop body holds enough statements, it is beneficial to apply the analysis on separate functions.

9. Some cases of loop fusion are solvable in polynomial time and others are NP-complete [Dar99].


[Figure: analysis time (s) as a function of the unroll rate, with and without outlining]

Figure 4.1: Using outlining to reduce analysis time on an unrolled sequence of matrix multiplications.

4.7 Library Calls

The C language is minimalist: little functionality is present in the language itself and many concepts are implemented in libraries (e.g. threading support, logging, etc.). As a consequence, the use of libraries is very common, and we cannot assume that the input code contains no library calls.

The problem with libraries is that their source code may not be available on the targeted platform, or a similar library exists but with a slightly different Application Programming Interface (api). This raises two issues:

1. How do we perform an inter-procedural analysis on external calls?
2. How can accelerator code call an external library?

To handle this problem, we introduce the concept of a stub broker. A stub broker is a runtime library that interacts with the compiler to manage a collection of functions, representing each function as a 3-tuple ⟨stub, seq, {⟨archi, impl⟩}⟩, where stub is a C stub of the function that has the same memory effects and is analyzable by a compiler, and seq is a sequential version of the function that has the same semantics as the original function. 10 A couple ⟨archi, impl⟩ stores the implementation of the function for a particular architecture. The compiler infrastructure performs requests to the function broker, asking for a function

10. It may be similar to stub, but it can also forward the call to external libraries, something that a stub cannot do because it must be analyzable by the compiler infrastructure, and thus be self-contained.



The broker answers with the appropriate code, or with an error to notify the infrastructure that the request cannot be fulfilled.
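As an illustration, the broker-side record for one function could be organized as the following C structures; all type and field names are assumptions made for this sketch, not the actual pips interface:

/* one ⟨archi, impl⟩ couple: the implementation of the function
 * for a particular architecture */
typedef struct {
    const char *archi; /* target name, e.g. "sse" or "terapix" */
    const char *impl;  /* architecture-specific implementation */
} arch_impl;

/* the 3-tuple ⟨stub, seq, {⟨archi, impl⟩}⟩ managed by the broker */
typedef struct {
    const char *stub;       /* analyzable C stub, same memory effects */
    const char *seq;        /* sequential version, same semantics */
    const arch_impl *impls; /* one couple per supported target */
    int nimpls;
} broker_entry;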

During parsing and analysis of its input, the compiler infrastructure asks for stubs. During translation from the ir to the Textual Representation (tr), it asks for a sequential version, and during specialization, i.e. the post-processing step described in Figure 3.9, the target-specific implementation is used.

A particular case of external calls is I/O. Depending on the hardware, I/O may be limited to data transfers through a Direct Memory Access (dma) call, or it can be done through different devices (screen/printer/socket/...). As the input code is not hardware-specific, it cannot be aware of these facilities. However, the stub broker abstraction can take advantage of them to provide working implementations of external calls using hardware-specific calls.

4.8 Conclusion

In this chapter, we have enumerated the different aspects of an isa and we have shown that most characteristics can be represented at the C level using proper conventions. To this end, we have listed basic transformations that can be applied incrementally to lower a C code down to assembly-like code while remaining compatible with a C compiler: conversion from arrays to pointers, structure removal, constant array scalarization, n-address code generation, inlining, outlining, instruction selection, etc. have been detailed as source-to-source transformations. This set of fine-grained transformations promotes reuse and adaptability to the target.

This approach enforces the principle of “C as an Internal Representation” and is the key to using a source-to-source compiler as a bridge between regular C code and C dialects such as the ones used to program many hardware accelerators.

Experimental results are given in Chapter 7, Sections 7.2, 7.3 and 7.4, where the validity of the proposed transformations is illustrated. The next chapter discusses the impact of parallelism constraints on compilers for heterogeneous platforms.




Chapter 5

Parallelism with Multimedia Instructions

Pont de Saint Goustan, Morbihan © Gwenael AB / flickr

A recurring feature of hardware accelerators is their use of parallelism to provide speedup. This parallelism can take various forms and levels, generally a mixture of Single Instruction stream, Multiple Data stream (simd) and Multiple Instruction stream, Multiple Data stream (mimd) parallelism, as found in General Purpose gpus (gpgpus). Both kinds of parallelism have been studied for a long time, with a focus on loop parallelization: hyperplane loop transformation [Lam74], handling of control dependence [AKPW83], loop vectorization [AK87], parallelism extraction [WL91b], supernode partitioning [IT88], communication optimizations [DUSsH93], interaction with caching [KK92] and tiling [DSV96, AR97, YRR+10]. David F. Bacon et al. wrote an interesting survey [BGS94] on compiler transformations for High Performance Computing (hpc) that includes many loop transformations. Vivek Sarkar studied the automatic selection of transformations based on a cost model [Sar97]. These techniques have been applied successfully in both research compilers (SUIF [WFW+94], Polaris [PEH+93], Paralléliseur Interprocedural de Programmes Scientifiques (pips) [IJT91, AAC+11], Rose [Qui00], Pocc [PBB10]) and production compilers (IBM XL [Sar97], Low Level Virtual Machine (llvm) [GZA+11], gnu C Compiler (gcc) [TCE+10], Intel C++ Compiler (icc) [DKK+99], pgi [Wol10]), and are used to detect or extract parallelism and to overcome some parallelization issues.

In this chapter, we focus on two aspects of code parallelization: Instruction Level Parallelism (ilp) and reduction parallelization. The former takes advantage of intra-loop or intra-sequence parallelization opportunities using the Multimedia Instruction Sets (miss) available in most modern processors and is detailed in Section 5.1. The latter is a critical concern when a code involving a reduction must be mapped onto pure simd hardware and is addressed in Section 5.2. Section 5.3 proposes a simple model based on the parallelism found in remote accelerators to decide whether or not it is profitable to offload a computation.

5.1 Super-word Level Parallelization

Many processors now include a small vector unit, used by the main processor as a small accelerator to speed up regular computations, typically multimedia applications. Modern Central Processing Units (cpus) have 128-bit (e.g. arm Cortex-A), 256-bit (e.g. Intel Sandy Bridge) or even 512-bit (e.g. Intel Larrabee) vector units. To automatically take advantage of this extra computing power, there are two approaches: vector parallelism, at the loop level, and Super-word Level Parallelism, at the block level. This section presents an algorithm to combine both approaches at source level while maintaining retargetability: the proposed algorithm is parametrized by the targeted mis.

5.1.1 Related Work

Several approaches are able to take advantage of miss such as those found in Intel, amd and arm processors. Writing inline assembly code remains the best option for those who seek maximum speedup, but prohibitive development costs, difficulty of maintenance and limited portability all restrict this approach to critical code segments. For instance, the source code of the open-source project mplayer contains many multimedia kernels that use manually tuned assembly code, such as the excerpt in Listing 5.1.

Figure 5.1 illustrates several abstractions that can improve code portability beyond plain assembly:

intrinsics: C functions that directly map to a sequence of one or more assembly instructions. This option remains a low-level one, but it is portable across compilers;

vector types: syntactic sugar has been added in gcc with predefined vector types, but it is not portable to other compilers. Moreover it only deals with arithmetic operators, so it exposes a limited set of operations. The ArBB library [NSL+11] uses a similar approach based on C++ templates and operator overloading;

auto-vectorization: for simple cases, an alternative is to let the compiler automatically vectorize the sequential version. It is the only approach that does not change the development cost, but it offers few guarantees of performance.
development cost, but it offers few guarantees of per<strong>for</strong>mance.



__asm__ volatile (
    "movd          %4, %%xmm5 \n"
    "pxor       %%xmm7, %%xmm7 \n"
    "pshuflw $0, %%xmm5, %%xmm5 \n"
    "movdqa        %6, %%xmm6 \n"
    "punpcklqdq %%xmm5, %%xmm5 \n"
    "movdqa        %5, %%xmm4 \n"
    "1: \n"
    "movq     (%2,%0), %%xmm0 \n"
    "movq     (%3,%0), %%xmm1 \n"
    "punpcklbw %%xmm7, %%xmm0 \n"

Listing 5.1: Excerpt from the libmpcodecs/vf_gradfun.c file of the mplayer source tree.

Nonetheless, most developers of non-time-critical code rely on automatic vectorization. In that field, proprietary compilers, such as icc, still outperform open-source ones like gcc or llvm on their processors. This can be checked using the linpack [DLP03] benchmark to compare the llvm, gcc and icc vectorization engines. Figure 5.2 shows the scores of icc, gcc and llvm on a desktop station running 2.6.38-2-686 GNU/Linux on an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz. icc version 12.0.3 is run using the -O3 flag, llvm version 2.7 is run using the -O3 -march=native -ffast-math flags, and gcc version 4.6.1 is run using the -O3 -march=native -ffast-math flags.

Of course gcc and, to a lesser extent, llvm support a wider variety of targets than icc, including arm processors. On the other hand, icc achieves better performance. Nonetheless, from the application developer's point of view, auto-vectorization is the way to go, provided the generated vector code is efficient.

In fact, from a compiler developer's point of view, the following points are strong constraints:
– instruction sets are in constant evolution: Figure 5.3 summarizes their evolution over the past ten years and shows, for example, a steady evolution from Matrix Math eXtension (mmx) to Advanced Vector eXtensions (avx) in x86 processors;
– debugging generated code (or intermediate code) is difficult;
– integrating new code transformations is a long-term task;
– dealing with code written in a legacy instruction set is difficult.
This means that in addition to the constraints of auto-vectorization and efficiency of the generated code, compilers must also be retargetable to keep up with the pace of hardware design.

Several solutions have been proposed to tackle the challenge of efficient simd code generation: the detailed view of icc internals given in [Bik04] shows that performance is reached through the intensive use of loop vectorization techniques [BGS94] and that it relies on re-rolling techniques to vectorize already manually unrolled loops. For obvious economic reasons, it does not focus on retargetability issues, whereas llvm [LA04, CSY10] and gcc [GS04, RNZ07] do. The latter two both provide auto-vectorizers, in an early stage for llvm [CSY10] and more advanced for gcc, with two approaches.



movaps -24(%ebp), %xmm0
movaps -40(%ebp), %xmm1
addps  %xmm1, %xmm0
movaps %xmm0, -56(%ebp)

#include <xmmintrin.h>
foo() {
    __m128 v0, v1, v2;
    v0 = _mm_add_ps(v1, v2);
}

#include <xmmintrin.h>
foo() {
    __v4sf v0, v1, v2;
    v0 = v1 + v2;
}

foo() {
    float v0[4], v1[4], v2[4];
    for (int i = 0; i < 4; i++)
        v0[i] = v1[i] + v2[i];
}

Figure 5.1: The same vector addition expressed in assembly, with sse intrinsics, with gcc vector types, and in plain C.



Figure 5.2: Comparison of the llvm, gcc and icc vectorizers using linpack (score in FLOPS).

Figure 5.3: Multimedia Instruction Set history for x86 processors, 1996 to 2012: mmx (Pentium I), 3DNow! (amd), sse (Pentium III), sse2 (Pentium 4), sse3 (Pentium 4 'Prescott'), sse4.2 (Core i7), avx (Sandy Bridge).



One [RNZ07] is based on the combination of loop vectorization techniques and Super-word Level Parallelism (slp) [LA00, SCH03, SHC05]; the other relies on the strength of the polyhedral model [TNC+09, GZA+11].

Indeed, the historic approach to simd instruction generation is inherited from loop vectorization techniques, where loop nests are first optimized (e.g. using loop fusion, interchange, skewing, distribution, ...) then strip-mined by the register width. Larsen et al. introduced in [LA00] the slp algorithm, which uses a pattern matching algorithm to find vector instructions in code sequences, thus capturing potentially more parallelism than the previous method. However, it cannot optimize loop nests with the same efficiency as polyhedron-based approaches. A good illustration of this assessment is the introduction of the unroll-and-jam transformation in the context of slp: it is a particular case of loop tiling combined with loop unrolling [WL91a].

Some papers [ZC98, BGGT02, JMH+05] focus on the discovery of instruction-set-specific patterns, e.g. Fused Multiply-Add (fma) or horizontal add. Finding such patterns provides significant improvements for some kernels, depending on the target architecture.

The growing number of miss and the steady pace at which they evolve have led to several attempts to build a retargetable vectorizer: swarp [PBSB04] used annotated C code to describe simd instruction patterns and combined this representation with a generic pattern matching engine. This approach offers a very flexible way to describe miss, but as the register width grows, as one can expect considering the forthcoming Larrabee that contains a 512-bit vector processing unit, pattern matching gets slower and less practical. Moreover, the pattern description proved, according to its author, to be “too hard to maintain”. The approach of [HEL+09] achieves retargetability through a detailed description of each instruction at the source level. Pre-processing phases are in charge of applying unrolling and scalar expansion before their vectorization engine runs.

Recently, joint work by Nuzman et al. [NRD+11] has proposed an interesting new approach to the problem. The vectorization engine generates calls to an abstract mis parametrized by the vector length. A Just In Time (jit) compiler is in charge of generating target-dependent code at execution time. Likewise, optimizations related to data alignment can be deferred. That way, there is no need to recompile an application to benefit from the vector instruction unit. But because vectorization is performed in a target-independent way at compile time, this approach does not take advantage of some optimization opportunities: intra-loop vectorization or slp, cycle shrinking [BGS94], etc. It is also difficult to perform a vectorization profitability estimation without information about the target.

In spite of the many efforts from the research community, production compilers either target a single architecture (relatively) efficiently, e.g. icc, or target multiple architectures inefficiently, e.g. gcc.



typedef float v4sf[4];

Listing 5.2: Sample representation of a vector register using a C type.

5.1.2 A Meta-Multimedia Instruction Set

To achieve retargetability, it is necessary to decouple the code transformations from the targets: we propose a generic and parametric mis as the single target of the whole transformation process. This kind of meta-mis has already been proposed in the past [Roj04, Sch09]. Our instruction set is parametrized by the vector size but, unlike these earlier approaches, the size is set at compile time rather than at execution time, which provides more vectorization opportunities. Another difference with existing approaches is that all instructions and vector types are described in C, following the principle given in Section 4.1, so that the compiler infrastructure can analyse them and perform transformations that are correct with respect to the sequential implementation.

The supported simd types are vectors of scalars: single- and double-precision floating point, and 8- to 128-bit integers. The length of these vectors is a parameter. They are represented internally as plain arrays of the corresponding types and sizes. A typedef is used to wrap these vector types to ease code generation. Listing 5.2 illustrates this approach for a vector of 4 single-precision floats. The typedef naming convention used is gcc's. Complex numbers are treated as arrays of two elements to be compatible with the vector representation.

The vector operations supported by the meta-mis are taken from the union of the avx and neon miss, restricted to pure simd instructions:
– common mathematical operators: addition, subtraction, multiplication and division;
– trigonometric functions: sine, cosine;
– comparisons: equal, greater/lesser than;
– the multiply-add operation, implemented in neon and proposed in avx, but not yet implemented in the Sandy Bridge architecture;
– logical operations;
– data movement: packed and unpacked loads and stores;
– memory reorganization operations: shuffle and broadcast (a possible C description of both is sketched below).
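As an illustration of how such operations can be described in C, here is a possible rendering of the broadcast and shuffle operations for 4-float vectors, in the style of Listing 5.2; the names and signatures are illustrative, not taken from the pips implementation:

typedef float v4sf[4];

/* broadcast: replicate a scalar in every lane of the vector */
void SIMD_BROADCAST_V4SF(v4sf dst, float v) {
    for (int i = 0; i < 4; i++)
        dst[i] = v;
}

/* shuffle: permute the lanes of src according to four lane indices;
 * dst and src are assumed to be distinct arrays */
void SIMD_SHUFFLE_V4SF(v4sf dst, const v4sf src,
                       int i0, int i1, int i2, int i3) {
    dst[0] = src[i0];
    dst[1] = src[i1];
    dst[2] = src[i2];
    dst[3] = src[i3];
}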

Non-simd instructions such as haddps from Streaming simd Extension (sse) 4.2, a horizontal add on single-precision floats, are more difficult to deal with because the pattern matching algorithm has a non-linear complexity.¹

The meta-mis is parametric in the vector length: at code generation time and depending on the targeted mis, all operations are generated for the given vector length. Figure 5.4 shows a sample usage of this mis for a vector size of 128 bits.

An n-instance of the meta-mis, denoted n-mis, is the set of all types and operations for a vector of n bits. For instance, Figure 5.4 is an example of the 128-mis. Given an n-mis, it is possible to generate a set of patterns that fully describes the mis in a form easier to match.

1. The authors of the paper on swarp gave up on the generic approach when avx was released because the pattern matching algorithm did not scale well for 256-bit vector registers.



v4sf vec0, vec1, vec2;
/*...*/
for (i0 = 0; i0 ...

Figure 5.4: Sample usage of the meta-mis for a vector size of 128 bits.



void SIMD_MULADD_PS(float w[4], float x[4], float y[4], float z[4]) {
    for (int i = 0; i < 4; i++)
        w[i] = x[i] + y[i] * z[i]; /* operand order assumed; the body of the
                                      original listing was truncated */
}






a[0] = a[0] + b[0] * c[0]; /* s0 */
a[1] = a[1] + b[1] * c[1]; /* s1 */
a[1] = a[1] + b[1] * c[2]; /* s2 */

Listing 5.7: C excerpt to illustrate statement closeness.

5.1.4 Generation of Optimized simd Instructions

5.1.4.1 Statement Closeness

An important aspect of the generation of efficient simd code is the use of packed data. To create such packs, we introduce a notion of closeness between two statements that match the same pattern. It represents the likelihood of forming a perfectly packed operation from these statements. Intuitively, two statements are close if they share the same pattern and each array reference involved in one statement is close to an array reference in the other statement. For instance, in Listing 5.7, statement s1 is closer to s0 than s2 is, because it has more references close to those of s0.

Let s_0 = ⟨p, r^0_0, ..., r^0_{n-1}⟩ and s_1 = ⟨p, r^1_0, ..., r^1_{n-1}⟩ be two statements that match the same pattern p. The statement closeness c(s_0, s_1) is given by

    c(s_0, s_1) = \sum_{k=0}^{n-1} \bar{d}(r^0_k, r^1_k)^2

where

    \bar{d}(r^0_i, r^1_i) = c_max if r^0_i = r^1_i, and d(r^0_i, r^1_i) otherwise,

and c_max ∈ ℕ is chosen so that

    ∀ r^0_i, r^1_j, d(r^0_i, r^1_j) ≠ ∞ ⇒ d(r^0_i, r^1_j) < c_max

Given a statement s_o, a set of statements {s_0, ..., s_n} that share the same pattern as s_o can be ordered using the following comparison function:

    cmp(s_i, s_j) = −1 if c(s_o, s_i) < c(s_o, s_j)
                     0 if c(s_o, s_i) = c(s_o, s_j)
                     1 if c(s_o, s_i) > c(s_o, s_j)

which ensures that the statements with the most and closest memory references to s_o are ranked first. This method is used in the “select_closest” function below.
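A minimal sketch of this computation in C is given below; it assumes that a reference is summarized by a base array and a constant offset and that d is the absolute offset difference, which are simplifications of what an implementation over the pips internal representation would do:

#include <stdlib.h>

enum { CMAX = 1 << 16 }; /* chosen larger than any finite distance */

typedef struct { int base; int offset; } ref; /* simplified reference */

/* distance between two references, kept strictly below CMAX;
 * references to different arrays are treated as maximally distant */
static int d(ref a, ref b) {
    if (a.base != b.base)
        return CMAX - 1;
    int dist = abs(a.offset - b.offset);
    return dist < CMAX ? dist : CMAX - 1;
}

/* c(s0, s1) = sum over k of dbar(r0_k, r1_k)^2; identical references
 * contribute the dominant CMAX^2 term */
static long long closeness(const ref *s0, const ref *s1, int n) {
    long long c = 0;
    for (int k = 0; k < n; k++) {
        long long db = (s0[k].base == s1[k].base &&
                        s0[k].offset == s1[k].offset)
                           ? CMAX : d(s0[k], s1[k]);
        c += db * db;
    }
    return c;
}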

5.1.4.2 Parametric Vector Instruction Generation Algorithm

The algorithm presented in this section generates optimized vector instructions given a register width w, a basic block denoted b and a set of patterns denoted patterns. It



is inspired by the preliminary work of François Ferrand [Fra03]. The originality of this algorithm, with respect to the original version by Samuel Larsen and Saman P. Amarasinghe [LA00], lies in the generation of load, store and shuffle operations. It is presented in Algorithm 3 and makes extensive use of the “statement closeness” between two statements that share the same pattern (see Section 5.1.4.1).

A block of statements, b, is processed statement by statement. For each statement s that matches a simd pattern from the input patterns, the statements that can be moved right after it, according to the dependence graph, are extracted by the function extract_no_conflict. Among them, those that match the same pattern are selected by the function extract_isomorphics. They are then ordered using the comparison function introduced in Section 5.1.4.1 and the first w − 1 elements are extracted to form a pack. A set of loaded vectors and stored vectors is then derived from this pack.

Let v_i denote the i-th vector of the pack, that is

    v_i = ⟨r^0_i, ..., r^{w-1}_i⟩

If the corresponding memory locations are written, then a store from the vector to the memory locations is unconditionally generated, a binding between the vector and the memory locations is added to live_registers, and all the previous vectors referencing these locations are removed from live_registers. If the memory locations are read, live_registers is scanned for an existing vector that already holds their values in the same order. If there is one, no load is generated and the previous vector is used. If ∀ r ∈ v_i, r = r^0_i, a broadcast operation is generated. Otherwise all permutations of the memory locations are checked: if a binding exists, then a shuffle operation, an operation that performs a permutation of the register content, is generated instead of a load. In all cases, the association between the vector and the memory references is stored in live_registers.
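To make these three cases concrete, the sketch below shows the kind of code this step may emit; the SIMD_* helpers are assumed names in the meta-mis style (declared as prototypes for brevity, in line with the broadcast and shuffle sketches of Section 5.1.2):

typedef float v4sf[4];

/* assumed meta-mis entry points, see Section 5.1.2 */
void SIMD_LOAD_V4SF(v4sf dst, const float *src);
void SIMD_BROADCAST_V4SF(v4sf dst, float v);
void SIMD_SHUFFLE_V4SF(v4sf dst, const v4sf src,
                       int i0, int i1, int i2, int i3);

void load_generation_cases(float a[8], float b[8]) {
    v4sf v0, v1, v2;
    /* <a[0],a[1],a[2],a[3]> is not bound in live_registers: packed load */
    SIMD_LOAD_V4SF(v0, &a[0]);
    /* every lane reads b[0]: a broadcast is generated instead of a load */
    SIMD_BROADCAST_V4SF(v1, b[0]);
    /* <a[1],a[0],a[3],a[2]> is a permutation of v0, which is live:
     * a shuffle is generated instead of a load */
    SIMD_SHUFFLE_V4SF(v2, v0, 1, 0, 3, 2);
}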

5.1.5 Pattern Discovery

The vectorization algorithm proposed in Section 5.1.4.2 is only efficient for sequences that contain enough statements to reveal patterns. This is sometimes the case for manually unrolled loops, such as the one found in the linpack benchmark.³ To obtain more patterns, we successively apply well-known loop vectorization techniques: loop interchange to improve locality, loop tiling to favor data packing, and finally loop unrolling. Data dependences inherited from reductions are removed through expansion of the reduction variables.

5.1.6 Loop Tiling

The loop tiling strategy used for vectorization is rather simple: given a loop nest L of depth n with an innermost loop body B_L, the tiling matrix is chosen as a diagonal matrix.

3. In that case, the unroll rate, 5, is not a power of 2 and offers very poor intra-loop parallelism.



Data: w ← width of vector register
Data: patterns ← set of patterns characterizing the instruction set
Data: b ← list of statements
Result: list of potentially vectorized statements
visited ← ∅;
new_b ← ∅;
live_registers ← ∅;
while b ≠ ∅ do
    s ← head(b);
    if s ∉ visited then
        visited ← visited ∪ {s};
        if match(s, patterns) then
            nconflict ← extract_no_conflict(tail(b), s);
            iso_stats ← extract_isomorphics(nconflict, s);
            if iso_stats ≠ ∅ then
                simd_s ← select_closest(iso_stats, s, w);
                load_s ← gen_load(simd_s, live_registers);
                store_s ← gen_store(simd_s, live_registers);
                update_live_registers(simd_s, live_registers);
                new_b ← new_b ; load_s;
                new_b ← new_b ; simd_s;
                new_b ← new_b ; store_s;
                for s′ ∈ simd_s do
                    visited ← visited ∪ {s′};
                end
            else
                new_b ← new_b ; s;
                update_live_registers(s, live_registers);
            end
        else
            new_b ← new_b ; s;
            update_live_registers(s, live_registers);
        end
    end
    b ← tail(b);
end

Algorithm 3: Parametric vector instruction generation algorithm.



for (it = 0; it <= ...; it++)
    for (jt1 = 0; jt1 <= ...; jt1++)
        for (i11 = 4*it; i11 <= 4*it+3; i11++)
            ...



Data: w ← width of vector register
Data: prog ← whole program
Result: vectorized program
for f ∈ functions(prog) do
    if_conversion(f) ([AKPW83])
    n_address_code_generation(3, f)
    for l ∈ loops(f) do
        loop_interchange(f, l) (if profitable)
        loop_tiling(f, l) (see § 5.1.6)
    end
    for l ∈ innermost_loops(f) do
        unroll(f, l, w)
    end
    reduction_parallelization(f) (see § 5.2)
    for b ∈ basic_blocks(f) do
        scalar_renaming(f, b) ([ASU86])
        slp(f, b, w) (see § 5.1.4.2)
    end
    dead_code_elimination(f)
    redundant_load_store_elimination(f) (see § 6.3)
end

Algorithm 4: Hybrid vectorization at the pass manager level.



5.2 Reduction Parallelization

A reduction is informally defined as the processing of a data structure of n elements to compute n − 1 or fewer values. For instance the computation of a histogram over a dataset of n elements split into k < n categories is a reduction. Formally, a reduction operation occurs when an associative operator, say ⊗, operates on a variable x as in x = x ⊗ expression and x is not referenced in expression.

Code with reductions is not parallel as such, and reductions are indeed a bottleneck for many scientific applications. For instance [GPZ+01] reported the presence of reductions in several hpc benchmarks and measured that the parallelization of those reductions led to an average speedup of ×2.7 on 16 processors. For that reason, techniques to parallelize reductions have been developed.

Parallelization of reduction algorithms themselves has been well studied and efficient algorithms are available [KRS90, Lei92]. The challenge is first to detect the reduction, then to parallelize it depending on the hardware features. The former is a well-known subject [JD89, ZC91] and involves the detection of the reduction pattern, a check of the reduction operator property and a check of the data dependencies with the loop body: reduction parallelization is only valid if the reduction variable is not used outside of the reduction. The latter requires more attention. A common way to parallelize reductions is to place them into a critical section, as proposed in Open Multi Processing (openmp) [Ope11] for non-atomic reductions, or to use atomic versions of the reduction operators when they are available, but this puts all the contention in a single place. To overcome this, parallel prefixes [LF80, Ble89] are generally used. They rely on the associativity of the reduction operator to perform partial reductions in parallel.
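As a minimal illustration of the principle, the following C function performs a tree-shaped partial reduction that relies only on the associativity of the operator; it is a sketch of the idea, not a particular parallel-prefix implementation:

/* sums x[0..n-1] in ceil(log2(n)) rounds; within each round, the
 * iterations of the inner loop are independent and can run in parallel */
float tree_sum(float x[], int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
    return n > 0 ? x[0] : 0.f;
}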

However, these generic algorithms do not take advantage of the specific hardware features that may optimize the parallel reduction. In [GPZ+01], María Jesús Garzarán proposed a hardware design that makes it possible to perform reductions efficiently thanks to the delegation of the merging phase to the hardware. Field Programmable Gate Array (fpga) designs for such algorithms exist [Zim97] and take into account the speedup/area ratio. More recently, versions of parallel prefix have been implemented for nVidia Graphical Processing Units (gpus) [SHG08], while the Brook language [BFH+04] provides built-in support for reductions on gpus.

This leads to the idea that performing a reduction efficiently on a specific hardware must take the hardware's specificities into account. However it is difficult for a compiler to automatically generate an optimized, target-dependent reduction algorithm for non-trivial cases. It is more practical to call a generic routine or use a pre-defined stub instead. Two strategies are explored: Section 5.2.1 details a template-based approach and Section 5.2.2 details how to delegate reduction handling to a third-party function.

5.2.1 Reduction Detection Inside a Sequence

The slp algorithm presented in Section 5.1 only works on sequences. Because reductions introduce data dependencies that prevent vectorization, they need to be removed, something typically done in sse using a partial-sum vector for loops.

We have extended this approach to process sequences, as shown by Figure 5.5.

int a, b, c, d;
int r = 0;
r += a;
r += b;
r += c;
r += d;

(a) Reduction in a sequence before parallelization.

int a, b, c, d;
// PIPS generated variable
int RED0[4];
int r = 0;
RED0[0] = 0;
RED0[1] = 0;
RED0[2] = 0;
RED0[3] = 0;
RED0[0] = RED0[0] + a;
RED0[1] = RED0[1] + b;
RED0[2] = RED0[2] + c;
RED0[3] = RED0[3] + d;
r = RED0[3] + RED0[2] + RED0[1] + RED0[0] + r;

(b) Reduction in a sequence after parallelization.

Figure 5.5: Parallelizing reductions in a sequence.

To achieve this goal, we first use the reduction analysis presented in [JD89] to perform a semantic detection of reduction statements, which associates to each statement a set of couples ⟨reduction, operator⟩. Once all statements holding a reduction are flagged, these reductions are aggregated at the sequence level to form a set of pairs {⟨⟨reduction_i, operator_i⟩, n_i⟩} where n_i is the number of times reduction_i is performed in the sequence. During the aggregation process, any reduction that is referenced by a non-reduction statement is pruned out. Then for each reduction_i, an array a_red_i of n_i elements is created to hold the intermediate values. A prelude fills the array with the neutral value of the reduction operator operator_i, and a postlude performs the reduction using the same operator. They are added before and after the sequence, respectively.

If the statement block is the body of a loop and there is no data dependency between the loop index and the reduction variable, then the prelude and postlude can be moved outside the surrounding loops. This behavior is shown in Figure 5.6, starting from an extract of the ddot_r function.


108 CHAPTER 5. PARALLELISM WITH MULTIMEDIA INSTRUCTIONS<br />

Figure 5.6: Reduction parallelization applied to an extract of the ddot_r function.
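As an illustration of the end result, the sketch below shows a dot product with a vector of partial sums written with sse2 intrinsics; it is a reconstruction given for illustration, not the exact code of Figure 5.6:

#include <emmintrin.h>

/* the prelude zeroes the accumulator, the loop performs packed
 * multiply-adds, and the postlude merges the two lanes */
double ddot(int n, const double *x, const double *y) {
    __m128d xres = _mm_setzero_pd();
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);
        __m128d vy = _mm_loadu_pd(&y[i]);
        xres = _mm_add_pd(xres, _mm_mul_pd(vx, vy));
    }
    double tmp[2];
    _mm_storeu_pd(tmp, xres);
    double r = tmp[0] + tmp[1];
    for (; i < n; i++) /* remainder iterations */
        r += x[i] * y[i];
    return r;
}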



void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i ...

Listing 5.9: The erode kernel used as running example.



– an execution time on the accelerator, given as the ratio between the sequential execution time on the host processor, t_h, and the average relative speedup provided by the accelerator, a_th.

This model is based on two assumptions: the data transfer time is a linear function of the amount of data transferred, and the accelerator execution time is proportional to the host execution time. These assumptions are discussed in Section 5.3.2.

The profitability of the offloading can be expressed as Inequality (5.2), which turns Equation (5.1) into Inequality (5.3).

    t_a(s, σ) < t_h(s, σ)    (5.2)

    τ_0 + V(s, σ)/B < t_h(s, σ) × (a_th − 1)/a_th    (5.3)

τ_0, B, a_th and t_h(s, σ) are parameters that depend on the hardware and host target. On the other hand V(s, σ) is program-dependent and can be computed at compile time. As a consequence, the offloading decision is postponed to runtime. For instance, in the case of a matrix multiplication between two n × n matrices, t_h = O(n³), while V(s, σ) = O(n²); in the absence of constraints on n, an off-line asymptotic decision would unconditionally offload the kernel to an accelerator, whereas a high τ_0 value should prevent the offloading for small values of n.
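At run time, this test reduces to a one-line predicate over the calibrated platform parameters; the following function is a direct transcription of Inequality (5.3), with illustrative parameter names (τ_0 in seconds, B in bytes per second, V in bytes):

#include <stdbool.h>

/* offload iff tau0 + V/B < th * (ath - 1) / ath, i.e. Inequality (5.3) */
static bool profitable_to_offload(double tau0, double B, double ath,
                                  double V, double th) {
    return tau0 + V / B < th * (ath - 1.0) / ath;
}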

5.3.2 Limitations of the Model

The model described in this section does not take into account several aspects of data transfers. Data transfer time cannot solely be represented by τ_0 + V(s, σ)/B. In the case of gpu boards, data alignment has a significant impact on performance, and zero-copy mechanisms can be used for data that are read/written only once, but these aspects are ignored. Asynchronous transfers are often used to overlap communications with computations and hide the data transfer cost, which makes our approach over-pessimistic. In a similar manner, a succession of kernels can result in redundant data transfers: our method only takes into account local information.

5.4 Conclusion

In this chapter, we have focused on Super-word Level Parallelism and proposed an original algorithm that combines the traditional loop-based approach with the more recent sequence-based pattern matching, parametrized by the Multimedia Instruction Set description and without the need for loop re-rolling. This combination makes it possible to discover parallelism outside of loops or in manually unrolled loops, while still benefiting from the research led over the past decades in loop parallelization. Its validity is examined on several linpack kernels in Chapter 7.



In a similar manner, we have extended reduction parallelization to sequences, where it previously only held for loops, which leads to more parallelization opportunities when combined with the slp algorithm. We also propose a methodology to ignore hardware-specific mechanisms for reductions at compilation time.

In the next chapter, we examine a last category of hardware constraints: distributed memory.


Chapter 6

Transformations for Memory Size and Distribution

Pont de Pacé, Ille-et-Vilaine © Pymouss / Wikipedia

Wm. A. Wulf and Sally A. McKee concluded their article [WM95] “Hitting the Memory Wall: Implications of the Obvious”, published in 1995, with the following sentence:

    The most “convenient” resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology (...).

Fifteen years later, we are still not aware of any such technology, and memory remains a critical issue for many parallel applications. In the context of heterogeneous computing, where the host and accelerator memory spaces are often separated, it is important to handle this hardware constraint with care. To this end, we introduce three generic transformations: statement isolation, which separates the accelerator memory space from the host memory space, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest so that the inner loops fit into the target memory, presented in Section 6.2; and redundant load-store elimination, presented in Section 6.3.



void foo(int i) {
    int j;
    j = i*i; // ...


6.1 Statement Isolation

where i′ are new identifiers unique to the program, ids_s is a function that collects all identifiers syntactically used by a given statement, and s_{i→i′} is a statement where the identifier i is syntactically changed into i′.

Definition 6.1. The ids_s : S → P(I) function is defined by the syntactic rules:

    ids_s(;) = ∅
    ids_s({ type id ; stat }) = ids_t(type) ∪ ids_s(stat)
    ids_s(stat_0 ; stat_1) = ids_s(stat_0) ∪ ids_s(stat_1)
    ids_s(ref = expr) = ids_r(ref) ∪ ids_e(expr)
    ids_s(ref = read) = ids_r(ref) ∪ {i_stdin}
    ids_s(write expr) = {i_stdout} ∪ ids_e(expr)
    ids_s(f(ref)) = ids_r(ref)
    ids_s(if (expr) { stat_0 } else { stat_1 }) = ids_e(expr) ∪ ids_s(stat_0) ∪ ids_s(stat_1)
    ids_s(while (expr) { stat }) = ids_e(expr) ∪ ids_s(stat)

where ids_t : T → P(I) is given by:

    ids_t(int | float | complex) = ∅
    ids_t(struct id { fields }) = ∅
    ids_t(type [ expr ]) = ids_t(type) ∪ ids_e(expr)

where ids_e : E → P(I) is given by:

    ids_e(cst) = ∅
    ids_e(ref) = ids_r(ref)
    ids_e(expr_0 op expr_1) = ids_e(expr_0) ∪ ids_e(expr_1)

and ids_r : R → P(I) is given by:

    ids_r(id) = {id}
    ids_r(ref [ expr ]) = ids_r(ref) ∪ ids_e(expr)
    ids_r(ref . fieldname) = ids_r(ref)

Definition 6.2. We denote s_{i→i′} the statement where the identifier i is syntactically changed into i′, where i′ ∈ I \ ids_s(s) ∧ ∀σ ∈ Σ, σ(i′) = unbound. e_{i→i′} and r_{i→i′} have a similar meaning in the context of expressions and references.



A weaker version of Theorem (6.1) is given in Theorem (6.2).

Theorem 6.2. The evaluation of a statement where one identifier has been isolated yields the same memory state as the evaluation of the original statement. Given a statement s ∈ S, a memory state σ ∈ Σ and i ∈ ids_s(s) s.t. I(i) ∉ {I(i_stdin), I(i_stdout)},

    S({typeof(i) i′ ; i′ = i ; s_{i→i′} ; i = i′ ; }, σ) = S(s, σ)

Theorem (6.1) results from the iterative application of Theorem (6.2) to all variables referenced by s. The remainder of this section is dedicated to the proof of Theorem (6.2).
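For instance, isolating the identifiers of the statement j = i*i shown at the beginning of this section, one application of Theorem 6.2 per identifier, produces code of the following shape; i0 and j0 are illustrative names for the fresh identifiers:

void foo(int i) {
    int j;
    {
        int i0;       /* typeof(i) i' */
        int j0;       /* typeof(j) j' */
        i0 = i;       /* i' = i */
        j0 = j;       /* j' = j: copies an uninitialized value, exactly the
                         kind of useless transfer Section 6.1.2 removes */
        j0 = i0 * i0; /* s with i -> i', j -> j' */
        i = i0;       /* i = i' */
        j = j0;       /* j = j' */
    }
}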

6.1.1.1 Expression Renaming

Lemma 6.3. The evaluation of an expression e ∈ E in state σ ∈ Σ is not changed by the renaming of an identifier:

    ∀i ∈ ids_e(e), ∀i′ ∉ ids_e(e) s.t. typeof(i′) = typeof(i),
    E(e, σ) = E(e_{i→i′}, σ[I(i′) → σ(I(i))])

Proof. Let us prove this lemma by induction on the syntactic elements of the expression domain. Let e be an expression and σ a memory state. We choose i ∈ ids_e(e) and i′ ∉ ids_e(e) s.t. typeof(i) = typeof(i′).

Constants. If e = cst, we have

    E(e_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(cst_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(cst, σ[I(i′) → σ(I(i))])
    = cst
    = E(e, σ)

Identifiers. If e = id, we have

    E(e_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(id_{i→i′}, σ[I(i′) → σ(I(i))])

If i = id,

    = E(i′, σ[I(i′) → σ(I(i))])
    = (σ[I(i′) → σ(I(i))])(I(i′))
    = σ(I(i))
    = E(e, σ)

otherwise id_{i→i′} = id and

    = E(id, σ[I(i′) → σ(I(i))])
    = E(id, σ)
    = E(e, σ)

which terminates the induction proof for the initial elements of E.

References. We now consider non-initial elements. If e = ref . fieldname, we have:

    E(e_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(ref_{i→i′} . fieldname, σ[I(i′) → σ(I(i))])
    = (σ[I(i′) → σ(I(i))])(R(ref_{i→i′}, σ[I(i′) → σ(I(i))])).fieldname
    = σ(R(ref, σ)).fieldname    (induction hypothesis)
    = σ(R(ref, σ).fieldname)
    = E(e, σ)

If e = ref [ expr ], we have:

    E(e_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(ref_{i→i′} [ expr_{i→i′} ], σ[I(i′) → σ(I(i))])
    = (σ[I(i′) → σ(I(i))])(R(ref_{i→i′}, σ[I(i′) → σ(I(i))])[E(expr_{i→i′}, σ[I(i′) → σ(I(i))])])
    = (σ[I(i′) → σ(I(i))])(R(ref_{i→i′}, σ[I(i′) → σ(I(i))])[E(expr, σ)])    (induction hypothesis)
    = σ(R(ref, σ)[E(expr, σ)])    (induction hypothesis)
    = E(e, σ)

Arithmetic Operations. In the case of e = expr_0 op expr_1, we have

    E(e_{i→i′}, σ[I(i′) → σ(I(i))])
    = E((expr_0 op expr_1)_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(expr_0_{i→i′}, σ[I(i′) → σ(I(i))]) op E(expr_1_{i→i′}, σ[I(i′) → σ(I(i))])
    = E(expr_0, σ) op E(expr_1, σ)    (induction hypothesis)
    = E(e, σ)

6.1.1.2 Type Renaming

We state a similar lemma for type evaluation:

Lemma 6.4. The evaluation of a type t ∈ T in state σ ∈ Σ is not changed by the renaming of an identifier:

    ∀i ∈ ids_t(t), ∀i′ ∉ ids_t(t) s.t. typeof(i) = typeof(i′),
    T(t, σ) = T(t_{i→i′}, σ[I(i′) → σ(I(i))])

Proof. We use an induction proof on the type definition.

Scalar Types. The equality is direct for int, float and complex, which are independent of the memory state and unchanged by the renaming rule.

Structures. In the case of t = struct id { fields }, we have

    T(t_{i→i′}, σ[I(i′) → σ(I(i))])
    = \sum_{⟨f,i⟩ ∈ fields} T(f_{i→i′}, σ[I(i′) → σ(I(i))])
    = \sum_{⟨f,i⟩ ∈ fields} T(f, σ)    (induction hypothesis)
    = T(t, σ)

Arrays. In the case of t = type [ expr ], we get

    T(t_{i→i′}, σ[I(i′) → σ(I(i))])
    = T(type_{i→i′}, σ[I(i′) → σ(I(i))]) × E(expr_{i→i′}, σ[I(i′) → σ(I(i))])
    = T(type, σ) × E(expr, σ)    (induction hypothesis and Lemma 6.3)
    = T(t, σ)



6.1.1.3 Statement Renaming

Lemma (6.3) and Lemma (6.4) can be extended to the statement domain:

Lemma 6.5. The evaluation of a statement s ∈ S in memory state σ ∈ Σ is not changed by the renaming of an identifier:

    ∀i ∈ ids_s(s) s.t. I(i) ∉ {I(i_stdin), I(i_stdout)}, ∀i′ ∉ ids_s(s) s.t. typeof(i) = typeof(i′),
    S(s_{i→i′}, σ[I(i′) → σ(I(i))]) = S(s, σ)[I(i) → σ(I(i)), I(i′) → (S(s, σ))(I(i))]

We were not able to prove this lemma formally. Informally, it describes the result of the evaluation of a statement where variable i is syntactically changed into i′, in a memory state where each memory location associated with i′ holds the value of the corresponding memory location of i. It states that this new memory state is the same as the one resulting from the evaluation of the initial statement in the initial memory state, with the memory locations associated with i unchanged, and the memory locations associated with i′ holding the updated values.

6.1.1.4 Restricted Statement Isolation

We can now prove Theorem (6.2).

Proof. Let σ′ = loc(i′, T(typeof(i), σ), σ) be the memory state in which the fresh identifier i′ has been allocated. Then

    S({typeof(i) i′ ; i′ = i ; s_{i→i′} ; i = i′ ; }, σ)
    = unbind(S(i′ = i ; s_{i→i′} ; i = i′, σ′), i′)
    = unbind(S(i = i′, S(s_{i→i′}, S(i′ = i, σ′))), i′)
    = unbind(S(i = i′, S(s_{i→i′}, σ′[I(i′) → σ′(I(i))])), i′)

We can apply Lemma 6.5 to the evaluation of s_{i→i′} to get

    = unbind(S(i = i′, S(s, σ′)[I(i) → σ′(I(i)), I(i′) → (S(s, σ′))(I(i))]), i′)
    = unbind(S(s, σ′)[I(i) → (S(s, σ′))(I(i)), I(i′) → (S(s, σ′))(I(i))], i′)
    = S(s, σ)

since unbinding i′ removes its location and the remaining memory state coincides with the evaluation of s in σ.



6.1.2 Statement Isolation and Convex Array Regions

The application of Theorem (6.1) leads to correct but very inefficient code. Indeed any variable referenced in the isolated statement is transferred back and forth, even if it is only read, or only written, by the statement. In a similar manner, arrays are transferred as a whole, while only a sub-array may be needed. This section studies the interaction between convex array regions and statement isolation, using the former to reduce the data transfers generated by the latter.¹

In a nutshell, the approach is similar to using Theorem 6.1 for all variables, then calling an enhanced version of dead code elimination to remove all the “dead” data transfers.
an enhanced version of dead code elimination <strong>to</strong> remove all the “dead” data transfers.<br />

Given a statement s, it is possible to compute an estimate of the array regions imported or exported by this statement for each array reference r referenced by s. These regions are denoted R_i(s, σ)[r] and R_o(s, σ)[r], respectively. Depending on the accuracy of the analysis, these regions are either exact, denoted R^=, or over-approximated. There is a strong relationship between these array regions and the data to be transferred. Considering a statement s ∈ S:

Transfers from the accelerator. All data that may be exported by s must be copied back to the host from the accelerator:

    T_{H←A} : S × Σ → P(R) = (s, σ) ↦ R_o(s, σ)    (6.1)

Transfers to the accelerator. All data that may be imported by s must be copied from the host to the accelerator:

    T_{H→A} : S × Σ → P(R) = (s, σ) ↦ R_i(s, σ)

Indeed, all data for which we have no guarantee of a preliminary write by s must be copied in; otherwise, uninitialized data might be transferred back to the host. So the extended formula is:

    T_{H→A} : S × Σ → P(R) = (s, σ) ↦ R_i(s, σ) ∪ (R_o(s, σ) − R^=_o(s, σ))    (6.2)

Based on Equations (6.1) and (6.2), it is possible to allocate new variables on the accelerator, to generate copy operations from the old variables to the newly allocated ones, and to perform the required change of frame on s. Listing 6.2 illustrates this transformation on the running example from Listing 5.9. It presents the variable replacement, the data allocation and the 2D data transfers. Thanks to the region analysis, in0 is not copied out and out0 is not copied in. The generated data transfers are target-independent and their implementation is specialized depending on the targeted accelerator.

1. Statement isolation can also be used to generate thread-local storage, or to improve cache behavior.



void erode(int n, int m, int in[n][m], int out[n][m]) {
    int (*out0)[n][m] = 0, (*in0)[n][m+1] = 0;
    P4A_accel_malloc((void **) &in0, sizeof(int)*n*(m+1));
    P4A_accel_malloc((void **) &out0, sizeof(int)*n*m);
    P4A_copy_to_accel_2d(sizeof(int), n, m, n, m+1, 0, 0, &in[0][0], *in0);
    P4A_copy_to_accel_2d(sizeof(int), n, m, n, m, 0, 0, &out[0][0], *out0);
    for (int i = 0; i ...

Listing 6.2: Statement isolation applied to the running example from Listing 5.9.


6.2 Memory Footprint Reduction

To compute V_l, we gather all the identifiers declared in s and sum up their type sizes:

    V_l : S → E = s ↦ \sum_{i ∈ decls(s)} sizeof(i)

where decls : S → P(I) is a function similar to ids_s : S → P(I) that collects all the identifiers declared in a statement.

This approach would be too naive for variables declared outside of s, as it would include all array elements, even those that are never written or read. Convex array regions are useful here, and we can use the formula:

    V_o : S × Σ → E = (s, σ) ↦ |R_r(s, σ) ∪ R_w(s, σ)|

where the cardinal operator |·| counts the number of elements in the resulting set. It can be split in a per-variable form:

    V_o(s, σ) = \sum_{i ∈ decls(s)} |R_r(s, σ)[i] ∪ R_w(s, σ)[i]|

where the [·] operator selects all the references prefixed by the given identifier. If we compute the convex hull of the region union, it is possible to count its cardinal symbolically using Ehrhart polynomials [Cla96]:

    V_o(s, σ) ≤ \sum_{i ∈ decls(s)} |R_r(s, σ)[i] ∪̄ R_w(s, σ)[i]|

where ∪̄ is the convex union. Because arrays in C have rectangular shapes³, it is more realistic to consider the rectangular hull of the regions, which leads to Equation (6.3):

    V(s, σ) ≤ V_l(s) + \sum_{i ∈ decls(s)} |⌈R_r(s, σ)[i] ∪̄ R_w(s, σ)[i]⌉|    (6.3)

where ⌈·⌉ is the rectangular hull. When s is surrounded by loops and the above expression depends on the loop indices, it is possible to transform these loops to change the memory footprint.

6.2.2 Symbolic Rectangular Tiling

Let us consider a perfectly nested loop s of depth n. In order to work out the tiling parameters, a two-step process is used: n symbolic values denoted p1, . . . , pn are introduced to represent the computational blocks, and a symbolic tiling, parameterized by these values, is performed. It generates n outer loops and n inner loops. The statement carrying the inner loops is denoted s_inner and the memory state before its execution is denoted σ_inner.

2. […] compute their exact value. It is however possible to compute an over-estimation of this volume thanks to statement preconditions, but this dissertation does not dive into these details.
3. It is possible to allocate non-rectangular convex shapes using pointer arrays…

The idea is to run the inner loops on the accelerator once the pk are chosen so that the memory footprint of s_inner does not exceed a threshold defined by the hardware. To this end, the memory footprint V(s_inner, σ_inner) is computed and one of the solutions satisfying Condition (6.4) is searched:

    V(s_inner, σ_inner) ≤ Vmax_a    (6.4)

where Vmax_a is the memory size of the considered accelerator a. This gives one inequality over the pk. Other constraints are derived from the accelerator model specified in Section 2.1: e.g. a vector accelerator requires p1 to be set to the vector size. The algorithm is given in a synthetic form in Algorithm 5:

Data: ln ← a perfect loop nest of depth n
Data: Vmax ← a maximum memory footprint
Data: c ← an additional system of linear inequalities
Result: a statement that matches c and the volume constraint
l2n ← rectangular_tiling(ln, ⟨x1, . . . , xn⟩);
l′ ← inner_loop(l2n, n);
p ← memory_footprint(l′);
⟨p1, . . . , pn⟩ ← solve(c ∧ (p ≤ Vmax), ⟨x1, . . . , xn⟩);
return ⟦int x1 = p1; . . . ; int xn = pn;⟧ ; l2n

Algorithm 5: Memory footprint reduction algorithm.

Listing 6.3 shows the effect of symbolic tiling and of the resulting array region analysis on the running example. As a result, the memory footprint of s_inner is given as a function of p1, p2 in Equation (6.5):

    V(s_inner, σ_inner) = 2 × p1 × p2    (6.5)

For terapix, the constraint system is

    x1 ≤ 128
    2 × x1 × x2 ≤ 1024

and the tuple ⟨128, 512⟩ is the maximal solution.

6.3 Redundant Load Store Optimization

At every parallelism level, be it the node, cpu or instruction level, data transfers are often the performance bottleneck: the time spent transferring data does not contribute directly to the computation. There are two complementary approaches to limit this time loss: moving transfers upward so that they are executed as early and as rarely as possible, and removing redundant transfers when a load meets a store.



void erode(int n, int m, int in[n][m], int out[n][m]) {
    int p_1, p_2;
    for (int it = 0; it < ...

Listing 6.3: Symbolic tiling applied on the running example.



We also introduce a function, written Bern : S × S → {true, false} in the following equations, that checks whether two statements satisfy Bernstein's conditions [Ber66].

Characterizations of Direct Memory Access (dma) are used in the form of load and store statements.

Definition 6.3. A statement s ∈ S in memory state σ ∈ Σ is a dma statement if it verifies the following properties:
1. s is a function call;
2. s writes a single convex array region: ∃i ∈ I s.t. Rw(σ, s) = {i[φ0, . . . , φk]}.

A dma statement is thus a function call that writes data to a single location. As such, the assignment operator "=" is a form of dma. Loads and stores are distinguished by their name—the Internal Representation (ir) does not distinguish between host and remote memory.

Given a dma statement, we define its reciprocal as follows.

Definition 6.4. The reciprocal of a dma statement d is a statement denoted d⁻¹ that verifies the following property:

    ∀σ ∈ Σ, ∀l ∈ L \ R(Rw(σ, d), σ), S(d ; d⁻¹, σ)(l) = S(d, σ)(l)

For instance, the statement "memcpy(a,b,10*sizeof(int));" is a dma and its reciprocal is "memcpy(b,a,10*sizeof(int));". The idea is that in the sequence memcpy(a,b,10*sizeof(int)); memcpy(b,a,10*sizeof(int));, the second call is useless.

6.3.1 Redundant Load Elimination

The algorithm used to move load statements upward is based on a simple idea: step by step, move load operations upward in the hcfg so that they are executed as early as possible. Combined with the redundant store elimination transformation described in Section 6.3.2, it can lead to two optimizations:
– moving load operations outside of loops, an optimization related to invariant code motion;
– removing load and store operations when they meet.

The next sections define the legality conditions for moving a statement across the three most common control flow constructs—sequences, tests and loops—and show how this can be done interprocedurally.

6.3.1.1 Sequences

Let us consider a statement sequence where sl is a load statement:

    s = s0 ; sl

Bernstein's conditions give a condition under which it is valid to swap the two statements, as shown in Equation (6.6):

    Rl(s) = sl ; s0   if Bern(s0, sl)
            s         otherwise            (6.6)
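As a minimal illustration (hypothetical code, reusing the load() helper that appears in Listing 6.4): the first statement touches neither argument of the load, so Bern(s0, sl) holds and the load can be hoisted.

    y = x + 1;        /* s0: independent of j and k */
    load(k, j);       /* sl                         */

    /* after Rl(s): the load is executed as early as possible */
    load(k, j);
    y = x + 1;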

6.3.1.2 Tests

Let us consider a branch statement:

    s = if (ec) { s0 ; st } else { s1 ; sf }

Depending on the nature of s0 and s1, it may be possible and profitable to move them before the condition. If s0 and s1 are similar, there is an opportunity to merge both statements into a single one.

Let σc denote the memory state after the evaluation of ec. Both s0 and s1 are evaluated in the same memory state σc. If they are both load statements, satisfy Bernstein's conditions and are textually equal (written s0 ≡ s1), they can be moved upward as a single statement, as summarized by Equation (6.7):

    Rl(s) = s0 ; if (ec) { st } else { sf }   if Bern(s0, ec) ∧ s0 ≡ s1
            s                                 otherwise                    (6.7)

Similarly, if only s0 or s1 is a load statement, and it satisfies Bernstein's conditions, then it can be moved outside the test.
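A small sketch of Equation (6.7) (hypothetical code): both branches start with the same load, which commutes with the evaluation of the condition, so a single copy is hoisted.

    if (c > 0) { load(k, j); f(k); } else { load(k, j); g(k); }

    /* after Rl(s) */
    load(k, j);
    if (c > 0) { f(k); } else { g(k); }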

6.3.1.3 Loops

Let us consider a loop statement:

    s = do { sl ; s0 } while (ec)

A sufficient condition to move sl out of the loop is that sl satisfies Bernstein's conditions with both s0 and ec, and that sl is idempotent, leading to Equation (6.8):

    Rl(s) = sl ; do { s0 } while (ec)   if Bern(s0, sl) ∧ Bern(sl, ec) ∧ S(sl ; sl) = S(sl)
            s                           otherwise                                            (6.8)

Proof. We proceed by induction on the number of iterations of loop s. Let sⁿ denote s when the loop body executes n times. The property to prove is that Equation (6.8) holds for all sⁿ, n ∈ ℕ*.

For n = 1:

    s¹ = sl ; s0 ; ec = Rl(s¹)

Let us assume the property is true for n ∈ ℕ*. Then:

    sⁿ⁺¹ = do { sl ; s0 } while (ec)
         = sl ; s0 ; ec ; sⁿ
         = sl ; s0 ; ec ; Rl(sⁿ)                       (induction hypothesis)
         = sl ; s0 ; ec ; sl ; do { s0 } while (ec)    (definition of Rl)
         = sl ; sl ; s0 ; ec ; do { s0 } while (ec)    (Bern(sl, s0) ∧ Bern(sl, ec))
         = sl ; s0 ; ec ; do { s0 } while (ec)         (sl is idempotent)
         = Rl(sⁿ⁺¹)
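Concretely (a hypothetical sketch): a load whose arguments are written neither by the loop body nor by the exit condition is hoisted out of the do-while loop.

    do { load(k, j); use(k); } while (--i >= 0);

    /* after Rl(s) */
    load(k, j);
    do { use(k); } while (--i >= 0);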

6.3.1.4 Interprocedurally

As a result of moving load statements upward in the hcfg, a load can reach the entry point of a function. In that case, it may be interesting to move the load to the call sites. To do so, one must first ensure that the memory state before each call site is the same as the memory state at the function entry point; this is the case if there is no write effect on the function parameters. In that situation, the load statement can be moved before each call site, after backward translation from formal parameters to effective parameters.
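For instance (hypothetical code): if f does not write its parameters before the load, the load migrates to the callers.

    void f(int k[2], int j[2]) { load(k, j); use(k); }
    ... f(a, b); ...

    /* after interprocedural propagation */
    void f(int k[2], int j[2]) { use(k); }
    ... load(a, b); f(a, b); ...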

6.3.2 Redundant Store Elimination

This section describes the conditions under which store statements can be moved downward in the hcfg. The equations are similar to those of redundant load elimination.

6.3.3 Sequences

This problem is quite similar to its load counterpart from Section 6.3.1.1. Consider the sequence

    s = ss ; s0

where ss is a store statement. Bernstein's conditions give a condition under which it is valid to swap the two statements, as shown in Equation (6.9):

    Rs(s) = s0 ; ss   if Bern(s0, ss)
            s         otherwise            (6.9)

6.3.4 Tests

Let us consider a branch statement:



    s = if (ec) { st ; s0 } else { sf ; s1 }

We get an equation that mirrors Equation (6.7), except for the condition over ec:

    Rs(s) = if (ec) { st } else { sf } ; s0   if s0 ≡ s1
            s                                 otherwise            (6.10)

6.3.5 Loops

Let us consider a loop statement:

    s = do { s0 ; ss } while (ec)

The store version is given by Equation (6.11):

    Rs(s) = do { s0 } while (ec) ; ss   if Bern(ss, s0) ∧ Bern(ss, ec) ∧ S(ss ; ss) = S(ss)
            s                           otherwise                                            (6.11)

The proof follows the same idea as for Equation (6.8).
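Mirroring the load case (hypothetical code): a store that is independent of the loop body and of the exit condition sinks out of the loop, so only one copy-out survives.

    do { compute(x); store(j, k); } while (--i >= 0);   /* body touches neither j nor k */

    /* after Rs(s) */
    do { compute(x); } while (--i >= 0);
    store(j, k);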

6.3.6 Interprocedurally

If the same store statement is found at each exit point of a function, it may be possible to move it past its call sites. To do so, one must ensure that the store statement only depends on formal parameters and that these parameters are not written by the function. If this is the case, the store statement can be removed from the function and added after each call site, after parameter backward translation.

6.3.7 Combining Load and Store Elimination

This section examines the interaction between loads and stores in two situations: in a sequence, when a load is followed by a store, and in loops, when the loop body is surrounded by a load and a store. These two situations are eventually triggered by the upward motion of dma in the hcfg.

6.3.7.1 Sequence

Let us consider a simple sequence of two statements:

    s = s0 ; s1

By definition, if s0 is a dma and s1 its reciprocal, then we have:

    s = s0 ; s0⁻¹ = s0

which eliminates the second call and may make it possible to continue the upward propagation.
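In practice (reusing the memcpy example of Definition 6.4), the pattern looks like:

    memcpy(a, b, 10 * sizeof(int));   /* s0: a dma            */
    memcpy(b, a, 10 * sizeof(int));   /* s0⁻¹: its reciprocal */

    /* after combination, only the first call remains */
    memcpy(a, b, 10 * sizeof(int));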

6.3.7.2 Loops

Let us consider a loop statement whose body is surrounded by dma calls:

    s = do { sl ; s0 ; ss } while (ec)

It can be transformed into Equation (6.12):

    R(s) = sl ; do { s0 } while (ec) ; ss    (6.12)

under the following conditions:

    sl = ss⁻¹       (6.13)
    Bern(ss, s0)    (6.14)
    Bern(ss, ec)    (6.15)

Proof. We proceed by induction on the number of iterations of loop s. Let sⁿ denote s when the loop body executes n times, and let Lk denote the inner loop do { s0 } while (ec) executed k times, so that R(sⁿ) = sl ; Ln ; ss.

Equation (6.12) is true when n = 1:

    s¹ = sl ; s0 ; ss ; ec
       = sl ; s0 ; ec ; ss                   (hypothesis 6.15)
       = R(s¹)

Let us assume it is true when the loop iterates n times. A loop that iterates n + 1 times can then be decomposed as follows:

    sⁿ⁺¹ = sⁿ ; sl ; s0 ; ss ; ec
         = sl ; Ln ; ss ; sl ; s0 ; ss ; ec  (induction hypothesis)
         = sl ; Ln ; ss ; s0 ; ss ; ec       (hypothesis 6.13: ss ; ss⁻¹ ≡ ss)
         = sl ; Ln ; s0 ; ss ; ec            (hypothesis 6.14, ss being idempotent)
         = sl ; Ln ; s0 ; ec ; ss            (hypothesis 6.15)
         = sl ; Ln+1 ; ss = R(sⁿ⁺¹)
= R(s n+1 )



6.3.8 Main Algorithm

Iteratively applying redundant load elimination, redundant store elimination and load/store combination may lead to fewer data communications. This process is detailed in Algorithm 6.

Data: p ← a program
repeat
    p′ ← p;
    p ← redundant_load_elimination(p);
    p ← redundant_store_elimination(p);
    p ← combine_load_store(p);
    p ← dead_code_elimination(p);
until p = p′;

Algorithm 6: Redundant load store elimination algorithm at the pass manager level.

Listing 6.4 illustrates the result of this algorithm on an example taken from the Paralléliseur Interprocédural de Programmes Scientifiques (pips) validation suite. It demonstrates the interprocedural elimination of data communications represented by the load and store functions: these functions are first moved outside of the loop, then outside of the function a, and finally the redundant loads are eliminated.

6.4 Conclusion

In this chapter, we have presented and proved Theorem (6.1) to completely isolate a statement from its original memory. This transformation is the basic building block for many transformations related to heterogeneous computing, as they usually target a separate memory space.

The generated data transfers are not optimized globally. Hence we have proposed Algorithm 6 to iteratively merge these transfers in order to suppress redundant ones. This algorithm is independent from the previous one and also works with the dma generated by Algorithm 3.

We have also developed Algorithm 5 to take into account the limited memory size of the targeted hardware, based on loop tiling and memory footprint estimation.

The experiments related to the usage of these transformations are presented with the compiler implementations in Chapter 7.



void a(int i, int j[2], int k[2]) {
    while (i-- >= 0) {
        load(k, j); // ...

Listing 6.4: Interprocedural redundant load store elimination example (excerpt).




Chapter 7

Compiler Implementations and Experiments

Pont de Bruz, Ille-et-Vilaine © Pymouss / Wikipedia

This thesis introduces and describes a methodology to customize compilers for different heterogeneous platforms, building on a rich toolbox of source-to-source transformations, a programmable pass-manager Application Programming Interface (api) and a simple hardware description. It would not be complete without an experimental validation.

The methodology claims to make it easier to assemble compilers. To validate it, we have chosen five different targets: three general-purpose Central Processing Units (cpus) with different vector instruction units, a Field Programmable Gate Array (fpga)-based image processor [BLE+08] and an nVidia Graphical Processing Unit (gpu). For each of them, we have developed a compiler prototype using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these research compilers is measured using benchmarks or applications from the relevant domain.

This chapter begins with a simple Open Multi Processing (openmp) directive generator in Section 7.1 to show how to apply the principles discussed in this thesis to a simple, yet real, example. The compiler for gpus implemented by hpc project based on our work is detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from C to terasm, the assembly language of the terapix image processor. Finally, a retargetable compiler for Multimedia Instruction Sets (mis) is described in Section 7.4 for three targets: Streaming simd Extension (sse), Advanced Vector eXtensions (avx) and neon.



[Figure 7.1: Multicore hardware feature diagram. Features: shared ram memory (mandatory); mimd parallelism (optional).]

7.1 A Simple OpenMP Compiler

The goal of this section is to illustrate the ideas developed in this thesis on a simple example: a multicore machine.

7.1.1 Architecture Description

The first step is to list the hardware constraints of the target machine. The (simple) hardware feature diagram is given in Figure 7.1. The only constraint is Multiple Instruction stream, Multiple Data stream (mimd) parallelism, and it is optional. As a consequence, the only required transformation is mimd parallelism detection/extraction. Optional features and optimizations are not taken into account.

7.1.2 Compiler Implementation

The input language is C and the output language is C with openmp directives. As directives can be represented in the Internal Representation (ir), no post-processor is needed. So we have a very classical source-to-source compilation flow, detailed in Figure 7.2.

Algorithm 7 is used by the source-to-source compiler. It involves privatization, parallelism detection, reduction detection and directive generation. Additionally, loop fusion is used to improve locality. If parallelism detection fails, the loops are distributed using the Allen & Kennedy algorithm [AK87], and the detection is tried again.

For reference, the Pythonic PIPS (pyps) script executed by the pass manager is given in Listing 7.1.
in Listing 7.1.



[Figure 7.2: Source-to-source compilation scheme for openmp: Sequential Code → Translator → Sequential Code + openmp directives → Compiler → Binary.]

Data: s ← a statement
Result: a statement with openmp directives
s ← loop_fusion(s);
privatization(s);
reduction_detection(s);
if parallelism_detection(s) then
    s ← directive_generation(s);
else
    s ← loop_distribution(s);
    if parallelism_detection(s) then
        s ← directive_generation(s);
    end
end
return s

Algorithm 7: Parallel loop generation algorithm for openmp.
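As a hedged sketch (illustrative code, not the verbatim prototype output), the effect of Algorithm 7 on a dot-product loop would look like:

    double dot(int n, double a[n], double b[n]) {
        double s = 0., tmp;
        #pragma omp parallel for private(tmp) reduction(+:s)
        for (int i = 0; i < n; i++) {
            tmp = a[i] * b[i];   /* tmp detected by privatization */
            s += tmp;            /* s detected as a sum reduction */
        }
        return s;
    }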



def openmp(m, verbose=False, **props):
    """ parallelize function with openmp """
    ...  # initialization stuff
    m.loop_fusion()
    # some analyses perform better after this
    m.split_initializations(**props)
    # privatize scalar variables
    m.privatize_module(**props)
    # try openmp parallelization, coarse grain
    # custom functions
    try:
        m.coarse_grain_parallelization(**props)
    except:
        m.internalize_parallel_code(**props)
    # directive generation
    m.ompify_code(**props)
    m.omp_merge_pragma(**props)
    # eventually print the resulting code
    if verbose:
        m.display(**props)

Listing 7.1: Original PyPS script for openmp code generation.



CFLAGS += -fopenmp
LIBS += -fopenmp
## pipsrules ##

Listing 7.2: Makefile stub for openmp compilation.

        translator   post-processor   maker   #passes involved
SLOC    41           0                2       8

Table 7.1: sloccount report for an openmp directive generator prototype written in pyps.

The build process is mostly unchanged, except for an additional flag that tells the compiler to interpret openmp directives. The makefile stub is given in Listing 7.2: no additional rules are provided, but the compiler and linker flags are changed.

7.1.3 Experiments & Validation<br />

The aim of this section is not <strong>to</strong> build an efficient openmp code genera<strong>to</strong>r, but <strong>to</strong><br />

provide a sample example as an introduction <strong>to</strong> the next sections. As a consequence we do<br />

not choose focus on getting impressing speedups on real-world applications, but rather on<br />

giving evidences that a compiler pro<strong>to</strong>type can achieve reasonable results in spite of the<br />

little amount of work dedicated <strong>to</strong> its construction.<br />

The benchmark suite used is the polybench. Although this benchmark is intended<br />

at testing polyhedral trans<strong>for</strong>mations, it contains numerous kernels that are easily au<strong>to</strong>matically<br />

parallelized, so they do not stress our naïve implementation, while showing the<br />

relevancy of the approach. Figure 7.3 shows the median speedup of the accelerated version,<br />

measured by taking the median over 100 executions and <strong>for</strong> the default benchmark<br />

sizes.<br />

The reference timings are obtained on a laptop running a 2.6.38-2-686 GNU/Linux kernel, with an Intel Core2 Duo T9600 cpu (2 cores) @ 2.80 GHz. The code is compiled with gnu C Compiler (gcc) version 4.6.1 and the -O3 -ffast-math flags. The accelerated code is obtained by running Algorithm 7 on each function of the program; it is compiled with the same compiler and the same flags, plus the openmp flag.

To obtain this result, we have directly scripted the compiler with pyps. This script is a good illustration of pyps flexibility. It is reproduced in Appendix C.

For each compiler we have implemented, we issue a small report that states the number of SLOC for the source-to-source compiler, the post-processor and the maker. We also compute the number of passes and analyses involved in the whole compilation process. The result for the openmp prototype is given in Table 7.1. It shows that the prototype is simple and really capitalizes on existing transformations: the assembling is done in a very lightweight way.



[Figure 7.3: Performance of an openmp directive generator prototype on the polybench benchmark. The plot gives the relative execution time of the generated OpenMP code (y-axis, 0 to 2.5) for each polybench kernel: lu, covariance, correlation, jacobi-2d-imper, fdtd-2d, jacobi-1d-imper, adi, seidel, fdtd-apml, gauss-filter, reg-detect, durbin, symm, symm.exp, gemm, mvt, bicg, trisolv, trmm, 2mm, syrk, gemver, cholesky, atax, syr2k, doitgen, gesummv, 3mm, ludcmp, dynprog, gramschmidt.]



7.2 A GPU Compiler

This section describes a prototype compiler for machines with nVidia gpus. It is not an optimizing compiler: it does not take advantage of some hardware capabilities, but it still generates speedups for computationally intensive applications.

7.2.1 Architecture Description

A gpu is an accelerator that does not share a memory space with its host. It couples mimd parallelism at coarse grain with Single Instruction stream, Multiple Data stream (simd) parallelism at fine grain. Two characteristics are important: a huge number of cores that provide an important theoretical speedup, and a constrained memory: shared memory of limited size, and a low transfer rate between the gpu and the cpu when compared with the transfer rate between the cpu and main memory, not to mention that coalesced accesses are critical to reach high throughput.

An nVidia gpu board is a set of multiprocessors. For instance, a GTX580 board has 16 multiprocessors containing up to 32 thread processors each, that is 512 Compute Unified Device Architecture (cuda) cores as a whole. It can run a maximum of 1536 threads per multiprocessor. The blocks of threads are scheduled transparently by the hardware, which assumes each multiprocessor is independent.

The memory hierarchy is quite complex:
– gpus have registers for fast 1-cycle thread-local accesses, so they are to be privileged for computations. Unfortunately, since there are only 32 KB of registers per multiprocessor and thousands of threads on an nVidia GTX580, this is a scarce resource;
– each thread has a local memory;
– the shared memory is local to a multiprocessor and shared by all the threads of the multiprocessor. It is 16 or 48 KB large but is accessed as fast as registers. It is often used as a scratchpad memory to drastically increase performance;
– the extended memory, called global memory (typically 1 to 6 GB), can be accessed by each thread but with a far bigger latency (800–1000 cycles). Accesses must be coalesced by the use of a large number of threads per block;
– the texture cache uses a small portion of the global memory. With a 50-cycle access time, it must be privileged over the global memory when possible. It is read-only, coherence with the global memory is not ensured, and its amount is only 8 KB per multiprocessor;
– the 64 KB constant memory.

The different memory levels and the Processing Element (pe) layout are visible on the Fermi chip shown in Figure 7.4. They are synthesized in the form of a hardware feature diagram in Figure 7.5. In this diagram, many features are flagged as optional. This allows an incremental compiler development: first, mandatory constraints are taken into account, then optional constraints are integrated to enhance performance. In this thesis, we focus on building a prototype that works for the mandatory features. Building a compiler that takes into account all gpu features is a PhD subject on its own!



[Figure 7.4: nVidia Fermi architecture.]

[Figure 7.5: gpu hardware feature diagram. Features: rom and ram memory, distributed and shared; isa acceleration; simd and mimd parallelism; several features flagged as optional, the others as mandatory.]



7.2.2 Compiler Implementation

The hardware feature diagram of a gpu has some similarities with the hardware feature diagram of terapix (see Figure 7.10): both have their own memory and benefit from simd acceleration. As a consequence, the two compilers roughly use the same algorithm to split the input code into a host part and an accelerator part. In a similar manner, Direct Memory Access (dma) generation uses the same analyses, although the api naturally differs.

The main difference lies in the kernel Instruction Set Architecture (isa). Firstly, unlike for terapix, a compiler from a C++ dialect, cuda, to the gpu isa, Parallel Thread eXecution (ptx), already exists and takes care of the low-level transformations. However, some constraints, such as the lack of support for variable-length arrays or the normalization of the iteration spaces, require additional transformations.

Secondly, cuda extends the C89 syntax with function qualifiers (e.g. __global__) and kernel calls, using the triple chevron syntax (<<< >>>). We do not extend the Paralléliseur Interprocédural de Programmes Scientifiques (pips) ir to cover these extensions but use a macro-based compatibility header.

The compilation scheme for gpu code generation is given in Figure 7.6. The new parts are the compatibility headers, the source-to-source compiler and the cuda translator. Other modules are simply reused.

Other modules are simply reused.<br />

Compatibility header A compatibility header is a set of macro functions that per<strong>for</strong>ms<br />

a translation from C syntax <strong>to</strong> cuda syntax. For instance the return type of a kernel<br />

can be set <strong>to</strong> a typedef, say typedef __global__void void, which is correct C, that<br />

is defined as #define __global__void __global__ void in the compatibility header. 1<br />
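A minimal sketch of such a header (hypothetical, following the __global__void example above; __CUDACC__ is the macro predefined by the nvcc compiler):

    /* compat.h: the same source admits two readings */
    #ifdef __CUDACC__
    #define __global__void __global__ void   /* cuda: kernel qualifier  */
    #else
    typedef void __global__void;             /* plain C: alias for void */
    #endif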

Initial translator The initial translator takes a sequential code, detects parallel loops and splits each of them into two parts: a sequential C code that contains a call to a kernel, and a sequential code that embodies the kernel. A loop proxy code applies the kernel to each element of the loop iteration space. This loop proxy code is needed for the sequential semantics but does not appear in the final code, nor in Figure 7.6. Let us take a simple example, the sum of two arrays: the two arrays are initialized, then a loop over the array elements performs the addition and stores the result in a third array. The initial translator splits this code in three parts: the host code, which performs the initializations and calls a kernel; the loop proxy code, which iterates over all the elements and calls the kernel code for each of them; and the kernel code, which performs the addition. Figure 7.7 illustrates this structure, and a sketch of the split is given after the figure.

Once the code is split into host and kernel parts, statement isolation is used to generate the data transfers in the host part.

cuda translator The cuda dialect is very close to the C language. The cuda translator takes care of the following syntactic changes: it converts variable-length arrays to pointers using array linearization, normalizes the iteration space using loop normalization, makes sure no additional iterations are performed using iteration clamping, converts C99 complex types to cuda complex types and finally takes care of the cuda-specific syntax. Language differences are handled by the compatibility header.

1. Sometimes, the C preprocessor is not sufficient. In that case, regular expressions or (better) C++ templates with type inference can come in handy. The reader may doubt the relevancy of using third-party tools instead of a large and monolithic ir. However, we are confident that using a wide range of specialized tools is more flexible.
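For instance, iteration clamping guards a kernel against the extra threads launched when the grid size does not divide the iteration space (a hedged sketch; blockIdx, blockDim and threadIdx are the standard cuda builtins):

    __global__void set_zero(int n, double *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per iteration   */
        if (i < n)                                      /* clamp: the grid may exceed n */
            out[i] = 0.;
    }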



[Figure 7.6: Source-to-source compilation scheme for gpu. The Initial Translator splits the Sequential Code into a Sequential C Code + kernel call and a Sequential C Code = kernel; the kernel side goes through the cuda Translator, producing C Code + compatibility layer. Both sides are preprocessed with the compatibility header by the C preprocessor into cuda host and kernel code, then compiled by the cuda compiler into the host binary.]



[Figure 7.7: Host, loop proxy and kernel codes for the sum of two arrays; the original listing begins with double in0[n], in1[n], out[n]; followed by the addition loop.]
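A minimal sketch of the three-part split (hypothetical code; the function names are illustrative):

    /* kernel: performs the addition for one element */
    void kernel(int i, int n, double in0[n], double in1[n], double out[n]) {
        out[i] = in0[i] + in1[i];
    }

    /* loop proxy: iterates over the whole space; kept for the
       sequential semantics, absent from the final cuda code */
    void loop_proxy(int n, double in0[n], double in1[n], double out[n]) {
        for (int i = 0; i < n; i++)
            kernel(i, n, in0, in1, out);
    }

    /* host code: initializes in0 and in1, then calls loop_proxy,
       which later becomes a kernel launch */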



        translator   post-processor   maker   #passes involved
SLOC    3836         0                N/A     20

Table 7.2: sloccount report for a cuda generator prototype written in pyps.


7.2.3 Experiments & Validation

We have validated the tool on a set of image processing kernels: a convolution with a window size of 5 × 5, and a finite impulse response filter with a window size of n/1000. The erode used as a running example so far does not pass the computational intensity test (see Section 5.3) on the considered machine, and is not included in the benchmark.

Measurements have been made using a desktop station hosting a 64-bit Debian/testing with gcc 4.3.5 and a 2-core 2.4 GHz Intel Core2 cpu. The cuda 3.2 compiler is used and the generated code is executed on a Quadro FX 2700M card. Compilation is fully automatic. The whole run is measured, i.e. timings include gpu initialization, data transfers, kernel calls, etc., and the median over 100 runs is taken. Figure 7.8 shows additional results for digital signal processing kernels extracted from [Orf95] and available on the website http://www.ece.rutgers.edu/~orfanidi/intro2sp: an N-Discrete Time Fourier Transform and a sample cross-correlation.

The sloccount report for each part of the prototype is given in Table 7.2.

7.3 An FPGA Image Processor Accelerator Compiler

Heterogeneous computing is all about balancing the hardware and the associated costs, say intellectual property rights, energy consumption, volume, throughput, maintenance or development costs. For embedded devices, the balance is all the more difficult to find as the constraints are tighter. As a consequence, the hardware is likely to be highly specialized, which often means difficult to program. The terapix platform is a good illustration of this phenomenon: it is a low-power, high-throughput device specialized for image processing, based on fpga, developed by thales. There are two main motivations for this machine:

1. to be able to process a stream of images directly on the camera that generates them, the so-called "intelligent camera". In the context of event recognition, if the events are scarce, it is too expensive to transfer all the data to a remote processing engine; performing the detection in place allows transferring only valuable data;

2. to be independent of a circuit provider. For long-term maintenance, it is not acceptable to depend on third-party, closed-source hardware. Choosing an fpga-based circuit unties the machine from the hardware.
circuit unties the machine from the hardware.



[Figure 7.8: Median execution time on a gpu for dsp kernels. Four panels plot execution time (s) against input size, on GPU versus on CPU: (a) Convolution, (b) fir, (c) N-Discrete Time Fourier Transform, (d) Correlation.]

100000



[Figure 7.9: terapix architecture.]

This section presents the compilation chain for this hardware and its results on a few benchmarks. Section 7.3.1 describes the architecture and models it as a feature diagram. Section 7.3.2 proposes a compilation flow based on this model and the transformations presented in this thesis. Section 7.3.3 validates the approach on various image processing algorithms and compares manually compiled code to automatically generated code.

7.3.1 Architecture Description

In this section, we give a quick summary of the terapix architecture and emphasize the hardware constraints. The reader interested in more details is referred to [BLE+08].

The terapix architecture is an fpga-based circuit implemented on a Virtex-4 SX-55 from Xilinx. A general-purpose softcore microprocessor, the µP, implements the control part, and a simd Processing Unit (pu) is used for the image kernels that require high processing power. This pu consists of 128 pes that run at 150 MHz. The interconnect between pes follows a ring topology, so that each pe has access to its neighbours' memory and to its local Random Access Memory (ram) of 512 × 36 b used for registers. A Read Only Memory (rom) of limited size can be accessed by all pes.

The isa is dedicated to image processing: it uses a Very Long Instruction Word (vliw) instruction set that provides arithmetic operations over integers and a conditional assignment. Neither division nor floating-point operations are available. Direct and indirect addressing modes are supported, as well as a special pattern addressing mode to describe complex memory access patterns: using a vertical pattern, a pe that accesses a[i] retrieves the a[#PE][i] element of the 2D global memory, while using a diag2 pattern, it retrieves the a[#PE][i+#PE] element. The sequencer only provides three control operations: a counter-based loop, a continue and a return.

A vliw instruction consists of the 5 fields given in Table 7.3.



Image   Mask   Register   Alu   Sequence
32b     26b    32b        13b   5b

Table 7.3: Description of a terapix microinstruction.

The image field manipulates pointers to the global ram and to neighbouring pes, the mask field manipulates pointers to the global rom, the register field manipulates pointers to the local ram, the alu field selects arithmetic operations and operators, and the sequence field drives the program counter. An example of vliw assembly code is given in Listing 7.3.

The set of hardware features is described in Figure 7.10, using the methodology proposed in Chapter 2.

This hardware is currently only programmed by hand: the developer writes the C code for the host side and the microcode, i.e. the assembly code, for the accelerator. Three tools are provided to the developer: a compiler from the assembly code that generates a microcode image in the form of an array of bytes defined in a C header for inclusion on the host side, a cycle-accurate simulator to test the resulting code, and a code compactor to pack vliw instructions when possible.

In addition to the restricted isa, programming such a machine is difficult because of the combination of a large simd unit and limited dma. For instance, to perform a point-to-point operation on a vector of 130 elements, one must load the first 128 elements, perform the computation and copy back the result, then load the elements from the third to the 130th, perform the computation and copy the result back, leading to 126 computations being performed twice. Figure 7.11 illustrates this troublesome behavior.

7.3.2 terapix Compiler Implementation

Compiling for terapix requires two steps: the input code is scanned for parallel loops and each of them is split into a host part and an accelerator part, then the accelerator part is translated into terapix assembly code.

A compilation scheme that takes into account the terapix specificities is given in Figure 7.12.

7.3.2.1 Input Code Splitting

The separation of the kernel from its caller is performed by Algorithm 8, based on the transformations presented in Sections 6.1 and 6.2.

Once a kernel has been extracted, it must be converted to meet the hardware constraints. However, no C-to-assembly compiler is available, and we are left with an assembler and a code compactor. As a consequence, we first perform as many refinements as possible at the source level, using the ideas developed in Chapter 4. Then we use an ad hoc C-to-terasm tool developed for this purpose. It generates uncompacted code and pipes it through the code compactor to generate the final assembly code.



prog convol
sub convol
pattern vertic ||         ||            ||       ||
im,i1=FIFO1+NN ||ma,m1=FIFO3 ||         ||       ||
im,i2=FIFO2+NN ||         ||            || do_N1 ||
im,i1=i1+SS    ||         ||            ||       ||
im,i2=i2+SS    ||         ||            ||       ||
im,i3=i1+W     ||         ||            ||       ||
im,i4=i2+W     ||         ||            || do_N2 ||
im,i3=i3+E     || ma=m1   ||P=im*ma     ||       ||
im=im+E        || ma=ma+E ||P=P+im*ma   ||       ||
im=im+E        || ma=ma+E ||P=P+im*ma   ||       ||
im=im+S        || ma=ma+S ||P=P         ||       ||
               ||         ||P=N+im*ma   ||       ||
im=im+S        || ma=ma+S ||P=P         ||       ||
               ||         ||P=N+im*ma   ||       ||
im=im+W        || ma=ma+W ||P=P+im*ma   ||       ||
im=im+W        || ma=ma+W ||P=P+im*ma   ||       ||
im=im+N        || ma=ma+N ||P=P         ||       ||
               ||         ||P=S+im*ma   ||       ||
im=im+E        || ma=ma+E ||P=P+im*ma   ||       ||
im,i4=i4+E     ||         ||P,im=P      || loop  ||
               ||         ||            || loop  ||
               ||         ||            ||       || return
endsub
endprog

Listing 7.3: terapix assembly for a 3 × 3 convolution kernel.

[Figure 7.10: terapix hardware feature diagram. Features (all mandatory): distributed rom and ram memory; isa acceleration; simd parallelism.]



[Figure 7.11: terapix redundant computations. (a) First step: no redundant computations. (b) Second step: redundant computations.]

[Figure 7.12: Source-to-source compilation scheme for terapix. The Translator splits the Sequential Code into (a) Sequential C Code + kernel call, compiled by a C Compiler into the host binary, and (b) Sequential C Code = kernel → terapix PostProcessor → Assembly (not compacted) → terapix Compactor → Assembly (compacted) → Assembler → Microcode Binary.]



Data: s ← a statement
Data: pe ← the number of Processing Elements
Data: m ← the accelerator memory size
Result: k, a set of kernel codes
for l ∈ loops(s) do
    if depth(l) = 2 then
        declare_variable(s, size_t height);
        declare_variable(s, size_t width);
        l′ ← symbolic_tiling(l, ⟨height, width⟩);
        solve_linear_system(l′, pe, m);
        generate_rom(l′);
        s′ ← isolate_statement(s, l′);
        k ← k ∪ {outline(s, s′)};
    end
end
return k

Algorithm 8: terapix kernel extraction algorithm at the pass manager level.

Algorithm 9 details the steps involved in assembly code generation. It first processes each loop to normalize its iteration space, and converts each do-loop into its while-loop counterpart. Declaration blocks are removed by flatten code, and all array references are replaced by their pointer equivalents using array linearization. Strength reduction transforms pointers into iterators whenever possible. The granularity of the C code is then lowered by split update operator and n-address code generation, as sketched below.
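A hedged illustration of these lowering steps on a single kernel statement (hypothetical code; p_out and p_in stand for the pointers that the loop advances):

    out[i] += in[i+1] * c;            /* original statement            */
    out[i] = out[i] + in[i+1] * c;    /* after split update operator   */
    *p_out = *p_out + *p_in * c;      /* after array linearization and
                                         strength reduction            */
    /* after n-address code generation (n = 2): */
    t0 = *p_in * c;
    *p_out = *p_out + t0;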

7.3.3 Experiments & Validation

We categorize the image operators found in terapix's application domain as either point-to-point, vertical, horizontal or stencil operators. This leaves apart operators, such as histograms, that are not covered by our compilation scheme because of their more complex parallelization scheme. For each category, we choose a specific operator, namely brightness, vertical erode, horizontal convolution and convolution. A terapix expert manually wrote an optimized assembly version of these kernels, and we wrote the text-book version of these algorithms in C and piped it through our automatic compiler. Table 7.4 gives the ratio between microcode cycle counts for automatic and manual code generation. It shows that the execution time of the automatically generated code is close to the manual one. The slowdown of the vertical erode is due to a naïve register allocation scheme that fails to take advantage of the low-latency terapix registers.

The sloccount report for each part of the prototype is given in Table 7.5. This prototype is far more complex than the previous ones, but so is the target.

Listing 7.4 illustrates the behavior of Algorithm 9 on a horizontal erosion for the host side. Listing 7.5 illustrates the accelerator side. Listing 7.6 shows the generated assembly code after compaction; the comparison with the initial listing demonstrates the need for an automatic generation tool.



Data: k ← a kernel from Algorithm 8
Data: I ← {f | f is an instruction in Terasm}
Result: k formatted as a terapix microcode
for l ∈ loops(k) do
    k ← loop_normalize(l, lower_bound=0);
    k ← do_loop_to_while_loop(l);
end
k ← flatten_code(k);
k ← linearize_array(k, pointer_conversion=True);
k ← strength_reduction(k);
k ← split_update_operator(k);
k ← n_address_code_generation(k, 2);
k ← normalize_terapix_microcode(k);
k ← dead_code_elimination(k);
for i ∈ I do
    k ← instruction_selection(k, i);
end
return k

Algorithm 9: C-to-terapix translation algorithm at the pass manager level.

                   brightness   horizontal convolution   vertical erode   convolution
automatic/manual   ×1           ×1.31                    ×2.12            ×1.31

Table 7.4: Ratio between terapix microcode cycle counts for automatic and manual code generation.

        translator   post-processor   maker   #passes involved
SLOC    211          218              18      32

Table 7.5: sloccount report for a terapix assembly generator prototype written in pyps.




7.4 A Retargetable Multimedia Instruction Set Compiler

This section presents a retargetable compiler for mis based on pips. It builds on the work on Super-word Level Parallelism (slp) presented in Section 5.1 and on the communication optimization from Section 6.3. It targets three different instruction sets: sse, avx and neon.

7.4.1 Architecture Description

Multimedia instruction sets rely on the vector unit found in most modern processors. The set of hardware features of such processors is given in Figure 7.13. Each mis has a specific isa, but we have already shown in Section 5.1.2 how to represent them using a generic instruction set. As a consequence, the instruction set is mainly characterized by the number of bits per vector register.

7.4.2 Compiler Implementation

The input language is C and the output language is C with mis intrinsics. As intrinsics are C functions, there is no need for post-processing. However, header substitution is required to specialize the generic mis, as sketched below. Figure 7.14 summarizes the compilation flow. The source-to-source translator relies on Algorithm 3 from Chapter 5 to generate vector code.
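A hedged sketch of such a specialization header for sse (the generic names SIMD_LOAD_V4SF, SIMD_ADDPS and SIMD_STORE_V4SF are illustrative; the _mm_* intrinsics are the standard sse ones from xmmintrin.h):

    #include <xmmintrin.h>
    typedef __m128 v4sf;                   /* four packed single-precision floats */
    #define SIMD_LOAD_V4SF(v, a)   (v) = _mm_loadu_ps(a)
    #define SIMD_ADDPS(r, a, b)    (r) = _mm_add_ps((a), (b))
    #define SIMD_STORE_V4SF(v, a)  _mm_storeu_ps((a), (v))

Swapping this header for an avx or neon one retargets the same generated source.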

7.4.3 Multimedia Instruction Set on Desktop and Embedded Processors

Three sets of experiments have been carried out. They all use the same set of C source files. The sse mis was tested on a Core2 Duo running at 2.2 GHz, using a 2.6.34 Linux kernel. A board with an ARMv7 processor and a 2.6.28 Linux kernel was used for the neon mis. A machine with a 2.6.32 Linux kernel and an Intel SandyBridge (running at 2.6 GHz) executed the avx tests.

Applications have been chosen to point out limitations of compilers (including ours). daxpy_u?r.c, ddot_u?r.c and dscal_u?r.c are taken from the linpack [DLP03] benchmark and illustrate the impact of manual unrolling on vectorization. matrix_*.c are taken from the Coremark [Con] benchmark and show the impact of tiling. stencil.c is a typical stencil application and a good candidate for vectorization.

The other benchmarks are text-book versions of well-known computation kernels (Finite Impulse Response filter, average power, alpha-blending, convolution with a 3 × 3 kernel) taken from a dsp manual [Orf95].



void runner(int n, int img_out[n][n-4], int img[n][n]) {
    for (int y = 0; y < ...

Listing 7.4: Host side of a horizontal erosion processed by the terapix compiler (excerpt).



void launcher_0_microcode(int I_29, int img00[258], int img_out00[254]) {
    for (int x = 0; x < ...

Listing 7.5: Accelerator side of a horizontal erosion processed by the terapix compiler (excerpt).



prog launcher_0_microcode
sub launcher_0_microcode
im,i12=FIFO2   |||| P,re(0)=1                 ||       ||
im,i11=FIFO1   |||| P=P                       ||       ||
im,i10=i11+1*E |||| P=P                       ||       ||
im,i9=i11+2*E  |||| P=P                       ||       ||
im,i8=i11+3*E  ||||                           ||       ||
im,i7=i11+4*E  ||||                           ||       ||
im,i1=i7       ||||                           ||       ||
im,i2=i8       ||||                           ||       ||
im,i3=i9       ||||                           ||       ||
im,i4=i10      ||||                           ||       ||
im,i5=i11      ||||                           ||       ||
im,i6=i12      ||||                           ||       ||
im,i6=i6+1*W   ||||                           ||       ||
im,i5=i5+1*W   ||||                           ||       ||
im,i4=i4+1*W   ||||                           ||       ||
im,i3=i3+1*W   ||||                           ||       ||
im,i2=i2+1*W   ||||                           ||       ||
im,i1=i1+1*W   ||||                           || do_N1 ||
im=i5+1*E      |||| P,re(1)=im*re(0)          ||       ||
im=i4+1*E      |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P,As=P-im*re(0)           ||       ||
               |||| P,re(2)=im*re(0)          ||       ||
im=i3+1*E      |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=re(1)                   ||       ||
               |||| P,re(8)=if(As=1,P,re(2))  ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P,As=re(8)-im*re(0)       ||       ||
               |||| P,re(7)=im*re(0)          ||       ||
im=i2+1*E      |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=re(8)                   ||       ||
               |||| P,re(7)=if(As=1,P,re(7))  ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P,As=re(7)-im*re(0)       ||       ||
               |||| P,re(6)=im*re(0)          ||       ||
im=i1+1*E      |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=re(7)                   ||       ||
               |||| P,re(6)=if(As=1,P,re(6))  ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P,As=re(6)-im*re(0)       ||       ||
               |||| P,re(2)=im*re(0)          ||       ||
im=i6+1*E      |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=P                       ||       ||
               |||| P=re(6)                   ||       ||
               |||| P,im=if(As=1,P,re(2))     || loop  ||
               ||||                           ||       || return
endsub
endprog

Listing 7.6: Illustration of terapix compacted assembly.



[Figure 7.13: mis hardware feature diagram. Features (all mandatory): ram memory; isa acceleration; simd parallelism.]

[Figure 7.14: Source-to-source compilation scheme for mis: Sequential Code → Translator → Sequential Code + Intrinsics → (with the specialization header) Source-to-Binary Compiler → Binary.]



The experiments consist of measuring the execution time of each kernel, using either the initial version or the optimized version generated by our compiler. The compiler used is gcc 4.4.5 for both the i386 and arm architectures, with the -O3 -ffast-math flags, and the median over 150 runs of each program is measured.

The challenge is to reach the same level of performance as the Intel C++ Compiler (icc) for sse and avx, while supporting another architecture, arm, in the same unified infrastructure, pips, to provide performance portability.

7.4.4 Results & Analyses

Figures 7.15a, 7.15b and 7.15c show the results of the experiments, giving the speedup of the vectorized version compared to the reference sequential version. This reference run is given by the original source compiled with gcc with -O3 -ffast-math and -fno-tree-vectorize. These experiments lead to the following assessments:

1. the gcc vectorization engine hardly achieves any speedup: only rather simple kernels get a 2× speedup, and fir even suffers a significant slowdown, while icc gets very good speedups for all kernels;

2. using our vectorization engine and running gcc on the output is almost always beneficial (green bars are above red bars). It is especially visible on the arm processor;

3. we outperform icc (pink bars above blue bars) for matrix-mul-* on sse, due to the combination of tiling and vectorization, but always lose to it for the same kernels on avx;

4. the Fused Multiply-Add (fma) operation is available in the neon mis and gcc does not use it. This explains the super-ideal speedups and shows the benefit of using target-specific instructions (see the sketch after this list);

5. unrolled versions of the linpack kernels are better vectorized by pips, thanks to the slp approach;

6. pips output behaves better when compiled by icc than when compiled by gcc. This illustrates the source-to-source approach, which hooks into the compilation flow to add a feature, here vectorization, and delegates the remaining work to other compilers. In this case, icc performs additional optimizations on the vector code that gcc is not aware of.
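To make the fma point concrete, here is a minimal sketch of the idiom being exploited; fmaf is the C99 math-library fused multiply-add and, together with the saxpy name, is ours, standing in for the target-specific instruction actually selected:

#include <math.h>

/* One multiply-add per element: without fma, a multiply and an add with
   an intermediate rounding; with fma, a single fused operation. */
void saxpy(int n, float a, float x[n], float y[n]) {
  for (int i = 0; i < n; i++)
    y[i] = fmaf(a, x[i], y[i]); /* y[i] = a * x[i] + y[i], one rounding */
}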

These experiments validate the approach: performance within the reach of icc is achieved, and this performance is portable across architectures.

The sloccount report for each part of the prototype is given in Table 7.6. The SLOC for the post-processor and the maker are given for the avx driver; the sse and neon drivers have similar values.



[Figure: three bar charts, y-axis "speedup vs. sequential execution", over the kernels fir, matrix-mul-matrix, corr, convol3x3, ddot-r, dscal-r, matrix-mul-vect, ddot-ur, alphablending, stencil, matrix-mul-const, dscal-ur, daxpy-r and daxpy-ur.]

(a) Vectorization using the sse mis with 32-bit OS (bars: gcc+nopips, gcc+pips, icc+nopips, icc+pips; scale 0.5 to 3).

(b) Vectorization using the avx mis with 64-bit OS (bars: gcc+nopips, gcc+pips, icc+nopips, icc+pips; scale 1 to 8).

(c) Vectorization using the neon mis (bars: gcc, gcc+pips; scale 2 to 12).



        translator   post-processor   maker   #pass involved
SLOC    223          166              1       30

Table 7.6: sloccount report for an avx intrinsic generator prototype written in pyps.

[Figure: bar chart, y-axis "pass count" from 0 to 30, x-axis "# of compilers where a pass is found" from 1 to 4.]

Figure 7.15: Pass reuse among 4 pyps-based compilers.

7.5 Conclusion

We claim in Chapter 3 that an important characteristic of a compiler infrastructure is pass reuse among compilers. To verify that our framework matches this expectation, we run the following experiment over the six compilers presented in this chapter: for each pass and analysis available in pips, we count the number of compilers it is used in. The mis compiler is counted only once. The result of this analysis is given in Figure 7.15. This chart does not take into account parsers and pretty-printers. Although the targets are very different (a multicore device, a mis, a gpu and an embedded accelerator), we still get good pass reuse, as more than 50% of the passes are used in at least two compilers, if they are used at all. The high number of passes used in only one compiler is linked to the fact that each target is subject to very different hardware constraints, especially for the isa. As a consequence, each compiler has many small passes to take target-specific aspects into account. The passes that are reused the most are also the most complex ones: symbolic tiling, outlining, statement isolation, etc. The addition of an Open Computing Language (opencl) compiler would certainly share a lot with the existing cuda compiler.

Table 7.7 summarizes the information presented for each compiler prototype. It shows that each compiler was assembled with a reasonable development effort, but also that the more complex the target is, the more complex the pass manager is. However, pass reuse makes it possible to keep this complexity low.

          translator   post-processor   maker   #pass involved
openmp    41           0                2       8
terapix   211          218              18      32
mis       223          166              1       30
cuda      597          0                N/A     20

Table 7.7: Summary of the sloccount reports for the compiler prototypes written in pyps.

Another point to consider about pass reuse is compiler composition. Some compilers may not share many passes, but we have provided a way to compose them using multiple inheritance. The consequence of this modular design is that each compiler focuses on a specific task instead of serving an all-around purpose.

This chapter presents the design and implementation of four compilers for heterogeneous devices, based on the heterogeneous model analyses from Chapter 2, the compiler infrastructure described in Chapter 3 and the transformations described in Chapters 4, 5 and 6.

This chapter describes how to build a basic openmp compiler using the ideas developed in this thesis: model the architecture, identify the output language and reuse existing transformations.

Using the same methodology, we present three other compiler prototypes built during this PhD: a retargetable compiler for mis that targets sse, avx and neon, a compiler for the terapix image processor that goes from C down to assembly, and a C-to-cuda compiler for nVidia gpus.

Each prototype is validated on a set of benchmarks to ensure that it generates valid and reasonably efficient code. We also provide an analysis of each compiler in terms of Source Lines Of Code (sloc), pass usage and pass reuse.


Chapter 8

Conclusion

Vieux pont du Bono, Morbihan © Pierre Yves Sabas

The path toward performance is led by heterogeneous devices: even the laptop used to write this dissertation can use the processing power of two gpps, the two associated sse vector units, and a gpgpu. The main concern with such devices is programmability. In this thesis, we have taken the path of compilation to automate the production of code for hardware accelerators. We focused on the ability to produce, cheaply, different compilers for different hardware. As modern hardware is usually programmable in a C dialect, we have set the goal of automatically translating textbook algorithms written in the C language into several target-dependent kernels written in a C dialect, and of generating the glue code between the host and the accelerator.

The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the other. It reduces the cost of producing compilers as new targets become available. Moreover, using a source-to-source infrastructure makes it possible to interact with existing tools, especially compilers that generate binary code from C dialects.

To avoid the pitfall of the parallel compilers designed in the 80's, we deliberately chose to take as inputs programs for which the sequential and parallel algorithms are similar. This made it possible to focus on the translation task rather than on parallelism extraction.

Contributions

Methodology to Build Source-to-Source Compilers

We have proposed to model hardware devices with a hardware constraint diagram. This diagram identifies the mandatory and optional features of the hardware, and the manual association between constraints and code transformations guides the compiler developer through the compiler development process.

Generic Compiler Infrastructure Design

The heterogeneity of accelerating devices makes it difficult to build a unique compiler to target them all. Moreover, pieces of software already exist, at different levels, to program these machines. In Chapter 3, we have proposed a compilation flow that combines a comprehensive source-to-source transformation toolbox, an api for pass management and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by hpc project.

Transformations for isa Constraints

Heterogeneous devices provide acceleration through specialization: basically, they perform better on a narrower set of applications. The direct consequence is a specialization of the isa. This specialization is visible in the C dialects proposed to program these devices. Chapter 4 proposes a set of source-to-source transformations to lower the level of the input language, including an original algorithm for outlining based on convex array regions.
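As a reminder of what outlining looks like, here is a hedged before/after sketch; the function names are illustrative, and the real pass derives the parameter list from convex array regions rather than by inspection:

/* Before: the loop nest is marked for extraction. */
void before(int n, float a[n]) {
  for (int i = 0; i < n; i++)
    a[i] = 2.f * a[i];
}

/* After outlining: the marked body lives in a new function and the
   original site becomes a call; "kernel_0" is an illustrative name. */
void kernel_0(int n, float a[n]) {
  for (int i = 0; i < n; i++)
    a[i] = 2.f * a[i];
}

void after(int n, float a[n]) {
  kernel_0(n, a);
}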

Hybrid slp Algorithm

Multimedia instructions are now commonly found in gpps and even provide the basis for acceleration in hybrid cpu/gpu chips. We have developed an original algorithm, based on existing works on loop vectorization and Super-word Level Parallelism, that combines the benefits of the loop-based and sequence-based approaches in a unified algorithm. It is parametrized by a C description of the isa and thus respects the retargetability criteria raised in Chapter 3. The algorithm has been tested for three Multimedia Instruction Sets: sse, avx and neon. This work received the third best poster award at PACT 2011.
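For the sse target, the generated code has the following flavor; this is a hand-written minimal sketch using the standard sse intrinsics, assuming n is a multiple of 4, not the exact compiler output:

#include <xmmintrin.h>

/* Four scalar additions packed into one 4-wide simd operation. */
void vadd(int n, float a[n], float b[n], float c[n]) {
  for (int i = 0; i < n; i += 4) {
    __m128 va = _mm_loadu_ps(&a[i]);
    __m128 vb = _mm_loadu_ps(&b[i]);
    _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
  }
}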


Transformations to Meet Memory Constraints

Memory is critical for many heterogeneous systems: when the accelerator does not share memory with its host, rpc and dma are needed, and the programming model is much more complex than classical ones. We have presented in Chapter 6 three transformations to take this into account: statement isolation, which separates accelerator memory from host memory; memory footprint reduction, which finds a tiling matrix that makes sure there is enough memory on the accelerator to run the tiled code; and redundant load-store elimination, which removes redundant data transfers.
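A hedged sketch of statement isolation on a single loop gives the intuition; the identifiers are ours, and the memcpy-based transfers stand in for the generated dma calls:

#include <string.h>

/* Before isolation the loop touches host memory directly; after
   isolation it only touches the new (accelerator-side) buffer, with an
   explicit copy-in prologue and copy-out epilogue. */
void isolated(int n, float a[n]) {
  float a_new[n];                       /* accelerator-side buffer */
  memcpy(a_new, a, n * sizeof(float));  /* prologue: copy in       */
  for (int i = 0; i < n; i++)
    a_new[i] = 2.f * a_new[i];          /* isolated statement      */
  memcpy(a, a_new, n * sizeof(float));  /* epilogue: copy out      */
}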

Implementation

All the transformations presented in this thesis have been developed in the pips source-to-source compiler infrastructure for the C language and assembled using the pyps pass manager. They have led to the implementation of four compilers: a prototype Open Multi Processing (openmp) directive generator, a retargetable compiler for mis, a kernel generator for an fpga-based image processor, terapix, and a gpu code generator developed by hpc project. This validates both the overall compiler infrastructure design and the algorithms proposed in the thesis. Experiments and compilation flows are detailed in Chapter 7.

Contributions to the pips community

It is difficult to untie a PhD in applied computer science from the development task. Developing new passes, and extending some of the existing passes originally designed for Fortran to the C language, has taken a significant amount of time. As a member of the pips team, I have managed the modernization of the build system and the rationalization of the software packaging.

I have supervised five trainees at Télécom Bretagne during internships related to the pips project and contributed to the scientific dissemination of our works through two tutorials.

Future Work

hpc is steadily changing. Sparc64 processors rank first in the top500 of June 2011, while nVidia's gpus led the race six months before. In this moving environment, nothing is ever settled, and hardware vendors are still pushing their standards to obtain a common programming model supported by efficient engineering tools. This requires cooperation and interoperability between tools. To that extent, bridging the gap between opencl and existing vhdl generators is an interesting challenge, and it remains an open research field. However, hpc is still a niche market compared to embedded systems and smartphones. In these fields, the hardware constraints are all the more important due to limited battery capacity, weight, space, etc. The transformations and the approach studied during this PhD can certainly be applied in those fields.

mis are becoming more and more flexible, and it is common to find non-simd instructions in their C api. These instructions allow more load/store patterns (e.g. strided loads) and help achieve higher throughput in memory-constrained applications. The incremental addition of transformations that handle these instructions to our mis compiler is a promising subject.

We see two possible extensions to the work on pass managers. First, the combination of operators creates an oriented graph that presents interesting parallelization opportunities at the pass-manager level. Exploiting it would make the compilation process itself parallel and improve compilation time. Second, some phase combinations are known to be redundant, meaningless, etc. Adding semantics to code transformations would yield interesting graph pruning, for instance in the context of iterative compilation.


Appendix A

The PIPS Compiler Infrastructure

Pont de Saint Goustan, Morbihan © Gwenael AB / flickr

Paralléliseur Interprocedural de Programmes Scientifiques (pips) [IJT91, ISCKG10, AAC+11] is a source-to-source compiler infrastructure started in 1988 at MINES ParisTech, when parallel architectures were preeminent. Since then it has been successfully used to analyse, check or parallelize industrial Fortran codes. C support started ten years ago, bringing both new challenges and new applications. The key ideas of the framework that make it still relevant in 2011 are: a minimalistic Internal Representation (ir), interprocedural analyses and abstract interpretation on polyhedral lattices. The compiler infrastructure for heterogeneous targets, the algorithms and the passes described in this thesis have been fully implemented on top of pips. Finally, the three compilers described in Chapter 7 are also based on pips. The short overview given in this appendix should help readers not accustomed to pips to understand some technical parts. A more detailed overview is given in [AAC+11], and the interested reader is advised to refer to the theses of Nga Nguyen [Ngu02] and Béatrice Creusillet [Cre96] for a detailed description of the underlying mathematical framework.

165



void foo(int n, int threshold, int a[n], int b[n]) {
  int k = 0;
  for (int h = 0; h < n; h++) {
    if (a[h] > threshold)
      b[k++] = a[h];
  }
}

Listing A.1: A simple loop to illustrate pips analyses.

Available Analyses

In addition to classical compiler analyses such as use-def chains, the dependence graph or read-write effects, pips provides accurate interprocedural analyses. We illustrate each of them on the loop from Listing A.1.

Preconditions and Postconditions are affine predicates over scalar variables that are proved to hold before or after the execution of a given statement, respectively (see Listing A.2).

// P() {}
void foo(int n, int threshold, int a[n], int b[n]) {
  // P() {}
  int k = 0;
  // P(k) {k==0}
  // P(h,k) {k==0}
  for (int h = 0; h < n; h++) {
    ...
  }
}


Transformers are affine relations between the values of scalar variables before and after the execution of a statement (see Listing A.3).

// T() {}
void foo(int n, int threshold, int a[n], int b[n]) {
  // T(k) {k==0}
  int k = 0;
  // T(h) {}
  // T(h,k) {0 ...



Cumulated memory effects list the memory cells that may be read or written by each statement (see Listing A.4).

// <may be read>: a[*] threshold
// <may be written>: b[*]
// <is read>: n
void foo(int n, int threshold, int a[n], int b[n]) {
  // <is written>: k
  int k = 0;
  // <may be read>: a[*] h k threshold
  // <may be written>: b[*] k
  // <is read>: n
  // <is written>: h
  for (int h = 0; h < n; h++) {
    // <is read>: h n threshold
    if (a[h] > threshold) {
      // <may be read>: a[*]
      // <may be written>: b[*]
      // <is read>: h k n
      // <is written>: k
      b[k++] = a[h];
    }
  }
}

Listing A.4: Example of cumulated memory effects analysis.


Appendix B

The LuC language

The LuC language is used in some proofs of this dissertation. It is similar to Fortran with a C syntax. Redundant constructs such as the += operator are left out to keep proofs simple. The main differences with the C language are the removal of recursive calls, unions and pointers, and the addition of a reference passing mode for function parameters. Global variables are not allowed. A short reference of its syntax is given here, using typographic conventions to differentiate non-terminal symbols and terminals from language constructs.

B.1 Syntactic Clauses

prog   : fdecls
fdecls : ∅ | fdecl fdecls
fdecl  : void id ( param ) { stat }
type   : int | float | complex | struct id { fields } | type [ expr ]
param  : type id
fields : ∅ | field fields
field  : type id
expr   : cst | ref | expr op expr
ref    : id | ref [ expr ] | ref . fieldname
stat   : ∅ | ; | { type id ; stat }
       | ref = expr ; | ref = read ;
       | write expr ;
       | ref ( ref ) ;
       | if ( expr ) { stat } else { stat }
       | while ( expr ) { stat }
       | stat ; stat
169



B.2 Semantic Clauses

T(int | float, σ) = 1
T(complex, σ) = 2
T(struct id { fields }, σ) = Σ_{⟨t,i⟩ ∈ fields} T(t, σ)
T(type [ expr ], σ) = T(type, σ) × E(expr, σ)

R(id, σ) = I(id)
R(ref [ expr ], σ) = R(ref, σ)[E(expr, σ)]
R(ref . fieldname, σ) = R(ref, σ).fieldname

E(cst, σ) = cst
E(ref, σ) = σ(R(ref))
E(expr₀ op expr₁, σ) = E(expr₀, σ) op E(expr₁, σ)

S(∅, σ) = σ
S(;, σ) = σ
S({ type id ; stat }, σ) = unbind(S(stat, loc(id, T(type, σ), σ)), id)
S(ref = expr, σ) = σ[R(ref) → E(expr, σ)]
S(write expr, σ) = push(σ(istdout), E(expr, σ)); σ
S(ref = read, σ) = σ[R(ref, σ) → pop(σ(istdin))]
S(id ( ref ), σ) = σ[lₖ → vₖ | lₖ ∈ R(ref, σ) ∧ vₖ = S(body(id), {I(formal(id)) → E(ref, σ)})(lₖ)]
S(if ( expr ) { stat₀ } else { stat₁ }, σ) = if E(expr, σ) then S(stat₀, σ) else S(stat₁, σ)
S(while ( expr ) { stat }, σ) = if E(expr, σ) then S(while ( expr ) { stat }, S(stat, σ)) else σ
S(stat₀ ; stat₁, σ) = S(stat₁, S(stat₀, σ))


Appendix C

Using PyPS to Drive a Compilation Benchmark

This verbatim copy of the script used to benchmark our Open Multi Processing (openmp) translator on the polybench suite is a good illustration of the flexibility of Pythonic PIPS (pyps). It instantiates a compiler for each application found in the polybench source tree, turns each sequential kernel into a parallel kernel, and instruments it to gather execution time information. This is achieved by composing the openmp compiler with an instrumentation compiler and an abstraction, pyrops.pworkspace, that runs each compiler in a new process.

import pyrops
import workspace_gettime
import openmp
from glob import glob
import shutil
from os.path import basename

map(shutil.rmtree, glob("PYPS*"))
map(shutil.rmtree, glob(".*.tmp"))

class workspace(workspace_gettime.workspace, pyrops.pworkspace):
    pass

ITER = 10
result = list()
for src in glob("polybench-2.0/*/*/*.c") + glob("polybench-2.0/*/*/*/*.c"):
    if src[-6:] != "pocc.c":
        name = basename(src).replace("_", "-")
        # workspace.delete(name)
        w = workspace(src, cppflags="-Ipolybench-2.0/utilities/", verbose=False)
        w.fun.main.benchmark_module()
        times0 = w.benchmark(iterations=ITER, LDFLAGS="-lm", CFLAGS='-O3 -ffast-math')
        w.fun.main.openmp(internalize_parallel_code=False)
        times1 = w.benchmark(openmp.ompMaker(), iterations=ITER, LDFLAGS="-lm", CFLAGS='-O3 -ffast-math')
        count = 0
        for line in w.fun.main.code.split('\n'):
            if line.find("#pragma omp ") != -1:
                count += 1
        result.append((name, count, times0['main'][0], times1['main'][0]))
        w.close()

fout = file("polybench-openmp.dat", "w")
for r in result:
    print >> fout, r[0], r[1], r[2], r[3]
fout.close()
fout . close ()


Appendix D

Using C to Emulate sse Intrinsics

An excerpt of the header used as a sequential replacement for the Streaming simd Extension (sse) header xmmintrin.h is reproduced here for the interested reader. It shows how the sse Instruction Set Architecture (isa) can be emulated using pure C code. This header was used to compile sse-enabled applications on processors that do not have sse vector units.

#include <...>

/* Some macros from xmmintrin.h */
#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))



/* data reorganization */
inline __m128i _mm_unpacklo_epi64(__m128i v0, __m128i v1) {
  __m128i ov = { .u64 = { v0.u64[0], v1.u64[0] } };
  return ov;
}

inline __m128i _mm_shufflehi_epi16(__m128i v, int mask) {
  __m128i ov = { .u16 = { v.u16[0], v.u16[1], v.u16[2], v.u16[3],
                          v.u16[4 + ((mask >> 0) & 3)], v.u16[4 + ((mask >> 2) & 3)],
                          v.u16[4 + ((mask >> 4) & 3)], v.u16[4 + ((mask >> 6) & 3)] } };
  return ov;
}

inline __m128i _mm_shufflelo_epi16(__m128i v, int mask) {
  __m128i ov = { .u16 = { v.u16[(mask >> 0) & 3], v.u16[(mask >> 2) & 3],
                          v.u16[(mask >> 4) & 3], v.u16[(mask >> 6) & 3],
                          v.u16[4], v.u16[5], v.u16[6], v.u16[7] } };
  return ov;
}

inline __m128i _mm_shuffle_epi32(__m128i v, int mask) {
  __m128i ov = { .u32 = { v.u32[(mask >> 0) & 3], v.u32[(mask >> 2) & 3],
                          v.u32[(mask >> 4) & 3], v.u32[(mask >> 6) & 3] } };
  return ov;
}

inline __m128i _mm_unpackhi_epi64(__m128i v0, __m128i v1) {
  __m128i ov = { .u64 = { v0.u64[1], v1.u64[1] } };
  return ov;
}

/* pure vector operations */
inline __m128i _mm_or_si128(__m128i v0, __m128i v1) {
  __m128i ov;
  for (size_t i = 0; i < 2; i++)
    ov.u64[i] = v0.u64[i] | v1.u64[i];
  return ov;
}

inline __m128i _mm_add_epi32(__m128i v0, __m128i v1) {
  __m128i ov;
  for (size_t i = 0; i < 4; i++)
    ov.u32[i] = v0.u32[i] + v1.u32[i];
  return ov;
}



/* bit operations on full vectors */
inline __m128i _mm_slli_si128(__m128i v, int count) {
  __m128i ov;
  count *= 8; /* byte count to bit count */
  ov.u64[1] = (count < 64) ? (v.u64[1] << count) | (v.u64[0] >> (64 - count))
                           : v.u64[0] << (count - 64);
  ov.u64[0] = (count < 64) ? v.u64[0] << count : 0;
  return ov;
}

inline __m128i _mm_srli_si128(__m128i v, int count) {
  __m128i ov;
  count *= 8; /* byte count to bit count */
  ov.u64[0] = v.u64[0] >> count;
  ov.u64[0] |= (count < 64) ? v.u64[1] << (64 - count) : v.u64[1] >> count % 64;
  ov.u64[1] = v.u64[1] >> count;
  return ov;
}
}


Code Transformation Glossary

Transformations marked with a † are passes implemented in pips during the PhD, or passes to which I made significant contributions.

array linearization † is the process of converting multidimensional arrays into unidimensional arrays, possibly with a conversion from array to pointer. 144, 150

common subexpression elimination † is the process of replacing similar expressions by a variable that holds the result of their evaluation. 85, 167

constant propagation is a pass that replaces a variable by its value when this value is known at compile time. 51

dead code elimination is the process of pruning from a function all the statements whose results are never used. 120, 167

directive generation is a common name for code transformations that annotate the code with directives. 49, 134

flatten code is the process of pruning declaration blocks from a function body so that all declarations are made at the top level. 150

forward substitution † is the process of replacing a reference read in an expression by the latest expression assigned to it. 46, 84, 167

goto elimination is the process of replacing goto instructions by a hierarchical control flow graph. 84

inlining † is a function transformation. Inlining a function foo in its caller bar consists in substituting the calls to foo in bar with the function body, after replacement of the formal parameters by their effective parameters. 46, 53, 83, 84

instruction selection † is the process of mapping parts of the Internal Representation to machine instructions. 79

invariant code motion is a loop transformation that moves out of a loop the code from its body that is independent of the iteration. 167

iteration clamping is a loop transformation that extends the loop range but guards the loop body with the former range. 144

loop fusion is a loop transformation that replaces two loops by a single loop whose body is the concatenation of the bodies of the two initial loops. 49, 52, 87, 134, 167

loop interchange is a loop transformation that permutes two loops from a loop nest. 102, 167

loop normalization is a loop transformation that changes the loop initial increment value or the loop range to enforce certain values, generally 1. 144

loop rerolling finds manually unrolled loops and replaces them by their non-unrolled version. 108

loop tiling is a loop nest transformation that changes the loop execution order through a partition of the iteration space into chunks, so that the iteration is performed over each chunk and within each chunk. 52, 102, 167

loop unrolling is a loop transformation. Unrolling a loop by a factor of n consists in the substitution of a loop body by itself, replicated n times. A prelude and/or postlude are added to preserve the number of iterations. 46, 50, 102

loop unswitching † is a loop transformation that replaces a loop containing a test independent of the loop execution by a test containing the loop, without the test, in both the true and the false branch. 104

memory footprint reduction † is the process of tiling a loop to make sure the iteration over a tile has a memory footprint bounded by a given value. 121, 163

n address code generation † is the process of splitting complex expressions into simpler ones that take at most n operands. 150

outlining † is the process of extracting part of a function body into a new function and replacing it in the initial function by a function call. 18, 84, 87, 159, 162

parallelism detection is a common name for analyses that detect whether a loop can be run in parallel. 134

parallelism extraction is a common name for code transformations that modify loop nests to make it legal to run them in parallel. 49

privatization is the process of detecting variables that are private to a loop body, i.e. written first, then read. 134

reduction detection is an analysis that identifies statements that perform a reduction over a variable. 49, 134

redundant load-store elimination † is an interprocedural transformation that optimizes data transfers by delaying and merging them. 113, 124, 163

scalar renaming † is the process of renaming scalar variables to suppress false data dependencies. 100

split update operator † is the process of replacing an update operator by its expanded form. 150

statement isolation † is the process of replacing all variables referenced in a statement by newly declared variables. A prologue and an epilogue are added to copy the old variable values to the new variables, back and forth. 114, 120, 141, 159, 163

strength reduction † is the process of replacing an operation by an operation of lower cost. 150


Acronyms

api Application Programming Interface. 2, 7–9, 15, 18, 22, 37–40, 44, 58, 63, 67, 88, 133, 141, 162, 164

asic Application-Specific Integrated Circuit. 6, 27

asip Application-Specific Instruction set Processor. 41

ast Abstract Syntax Tree. 65, 66, 98

avx Advanced Vector eXtensions. iii, xvii, 16, 18, 53, 59, 60, 79, 93, 95, 97, 133, 152, 157–160, 162

cisc Complex Instruction Set Computer. 79

cli Command Line Interface. 47, 58

cpu Central Processing Unit. 9, 15, 18, 33, 66, 69, 92, 121, 123, 133, 139, 144, 162

cri Centre de Recherche en Informatique. 4, 24

cuda Compute Unified Device Architecture. iii, xvii, 2, 5, 7, 8, 10, 16, 22, 25, 26, 31, 34, 43, 44, 49, 52, 55, 59, 60, 66, 70, 71, 139, 141, 142, 144, 159, 160

dma Direct Memory Access. 7, 11, 19, 32, 33, 38, 50, 89, 110, 125, 128–130, 141, 147, 163

dsp Digital Signal Processing. xiv, 79, 145, 152

flops FLoating point Operations per Second. 2, 5, 22, 26

fma Fused Multiply-Add. xi, 11, 79, 96, 98, 99, 157

fpga Field Programmable Gate Array. iii, 4–6, 15, 19, 24–27, 32, 38, 39, 41, 66, 67, 70, 71, 106, 133, 144, 146, 163

fpu Floating-Point Unit. 73

fsa Fusion System Architecture. 9, 10, 69

gcc gnu C Compiler. xi, xiii, xvii, 12, 17, 34, 36, 44–47, 49, 51, 63, 65, 67, 73, 79, 92–97, 137, 144, 157

gpgpu General Purpose gpu. 2, 5, 6, 12, 17, 22, 25, 26, 31, 33, 34, 39, 91, 161

gpp General Purpose Processor. 4, 6, 26, 27, 161, 162

gpu Graphical Processing Unit. xiv, 4, 5, 8, 9, 15, 16, 18, 19, 24–26, 30–33, 38, 40, 41, 43, 49, 52, 54, 59, 67, 69, 71, 83, 106, 111, 133, 139–142, 144, 145, 159, 160, 162, 163

hcfg Hierarchical Control Flow Graph. 72, 124, 125, 127, 128

hdl Hardware Description Language. 42

hmpp Hybrid Multicore Parallel Programming. 7, 41

hpc High Performance Computing. 12, 19, 20, 42, 91, 106, 163

hpec High Performance Embedded Computing. 36

icc Intel C++ Compiler. xiii, 12, 17, 36, 73, 92, 93, 95, 96, 157

ilp Instruction Level Parallelism. 12, 72, 92

ir Internal Representation. 23, 43–45, 54, 56, 57, 59, 61, 67, 69–72, 79, 83, 89, 125, 134, 141, 165, 177

isa Instruction Set Architecture. 4, 6, 9, 10, 16, 18, 24, 32, 33, 67, 69, 70, 72, 76, 79, 83, 89, 134, 140, 141, 146–148, 152, 156, 159, 162, 173

jit Just In Time. 39, 66, 67, 96

llvm Low Level Virtual Machine. xi, xiii, xvii, 12, 45–47, 49, 63, 65, 67, 70, 92, 93, 95, 96

mimd Multiple Instruction stream, Multiple Data stream. 5, 12, 16, 26, 29, 32, 33, 67, 91, 134, 139, 140

mis Multimedia Instruction Set. viii, x, xiii, xiv, 12, 18, 20, 87, 92, 94–97, 99, 111, 124, 133, 152, 156–160, 162–164

mkl Math Kernel Library. 27, 109

mmx Matrix Math eXtension. 93, 95

mp-soc MultiProcessor System-on-Chip. 5, 25

mpi Message Passing Interface. 29, 55

oop Object Oriented Programming. 50

oo Object Oriented. 66

opencl Open Computing Language. xiii, 5–7, 10, 20, 26–28, 34, 37–42, 45, 51, 52, 57, 70, 71, 159, 163

opengl Open Graphics Library. 5, 26, 31, 38

openmp Open Multi Processing. iii, xi, xiv, xv, xvii, 16, 19, 31, 49, 55, 59, 60, 67, 106, 133–138, 160, 163, 171

pci Peripheral Component Interconnect. 40

pe Processing Element. 139, 146, 147, 150

pips Paralléliseur Interprocedural de Programmes Scientifiques. iii, xii, xiii, 4, 8, 9, 12, 19, 24, 37, 44, 46, 48, 55, 58, 63, 67, 72, 73, 87, 91, 130, 141, 152, 157, 159, 163, 165–167, 177, 179

ps3 PlayStation 3. 2, 22

ptx Parallel Thread eXecution. 67, 72, 77, 141

pu Processing Unit. 146

pocc Polyhedral Compiler Collection. 55, 56

pyps Pythonic PIPS. xi, xiii, xiv, xvii, 8, 9, 19, 44, 46, 52, 53, 58–60, 63, 64, 67, 98, 134, 137, 144, 151, 159, 160, 163, 171

ram Random Access Memory. 33, 134, 140, 146, 147

rom Read Only Memory. 32, 140, 146, 147

rpc Remote Procedure Call. 19, 29, 30, 163

sdk Software Development Kit. 5, 8, 25, 37, 39

simd Single Instruction stream, Multiple Data stream. 5, 12, 16, 17, 20, 26, 29, 31–33, 52, 67, 79, 91–93, 96–102, 139–141, 146–148, 156, 164

sisd Single Instruction stream, Single Data stream. 29

sloc Source Lines Of Code. 8, 44, 58, 63, 160

slp Super-word Level Parallelism. viii, xi, 18, 92, 96, 98, 104, 106, 111, 112, 152, 157, 162

soc System on Chip. 5, 25, 41

ssa Static Single Assignment. 67

sse Streaming simd Extension. iii, xi, xiii, 10, 16–18, 29, 39, 49, 53, 55, 60–62, 72, 94, 95, 97, 98, 107, 109, 133, 152, 157, 158, 160–162, 173

tr Textual Representation. 56, 61, 89

ulp Unit in the Last Place. 38

vhdl vhsic Hardware Description Language. 5, 7, 20, 26, 34, 55, 163

vliw Very Long Instruction Word. 16, 146, 147




Bibliography

[AAC+11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge Guelton, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon. PIPS Is not (only) Polyhedral Software. In First International Workshop on Polyhedral Compilation Techniques, IMPACT, April 2011.

[ABC+06] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[ABCR10] Joshua S. Auerbach, David F. Bacon, Perry Cheng, and Rodric M. Rabbah. Lime: a Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the 25th Annual SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, pages 89–108, New York, NY, USA, October 2010. ACM.

[ACIK97] Corinne Ancourt, Fabien Coelho, François Irigoin, and Ronan Keryell. A linear algebra framework for static High Performance Fortran code distribution. Scientific Programming, 6(1):3–27, 1997.

[AJ75] Alfred V. Aho and Stephen C. Johnson. Optimal code generation for expression trees. In Proceedings of the seventh annual symposium on Theory of computing, STOC, pages 207–217, New York, NY, USA, 1975. ACM.

[AK87] Randy Allen and Ken Kennedy. Automatic translation of FORTRAN programs to vector form. Transactions on Programming Languages and Systems, 9:491–542, 1987.

[AKPW83] John R. Allen, Ken Kennedy, Carrie Porterfield, and Joe D. Warren. Conversion of control dependence to data dependence. In Principles of Programming Languages, POPL, pages 177–189, 1983.

[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006.

[Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the spring joint computer conference, AFIPS, pages 483–485, New York, NY, USA, April 1967. ACM.


[AMG+99] Eduard Ayguade, Marc Gonzalez, Jesus Labarta, Xavier Martorell, Nacho Navarro, and Jose Oliver. NanosCompiler: A research platform for OpenMP extensions. In First European Workshop on OpenMP, pages 27–31, 1999.

[API03] Kubilay Atasu, Laura Pozzi, and Paolo Ienne. Automatic application-specific instruction-set extensions under microarchitectural constraints. International Journal of Parallel Programming, 31(6):411–428, 2003.

[AR97] Rumen Andonov and Sanjay V. Rajopadhye. Optimal orthogonal tiling of 2-d iterations. Journal of Parallel Distributed Computing, 45(2):159–165, September 1997.

[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[ATNW09] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In European Conference on Parallel Processing, Euro-Par, pages 863–874, 2009.

[Aß96] Uwe Aßmann. How to uniformly specify program analysis and transformation with graph rewrite systems. In Proceedings of the 6th International Conference on Compiler Construction, CC, pages 121–135, London, UK, 1996. Springer-Verlag.

[Bac57] John W. Backus. The FORTRAN Automatic Coding System for the IBM 704 EDPM. International Business Machines Corporation (IBM), 1957.

[Bas04] Cédric Bastoul. Code generation in the polyhedral model is easier than you think. In International Conference on Parallel Architecture and Compilation Techniques, PACT, pages 7–16, Juan-les-Pins, France, 2004. IEEE Computer Society Press.

[BB09] Francois Bodin and Stephane Bihan. Heterogeneous multicore parallel programming for graphics processing units. Scientific Programming, 17:325–336, December 2009.

[BBK+08] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In Proceedings of the 22nd annual international conference on Supercomputing, ICS, pages 225–234, New York, NY, USA, 2008. ACM.

[BDH+10] André Rigland Brodtkorb, Christopher Dyken, Trond Runar Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli. State-of-the-art in heterogeneous computing. Scientific Programming, 18(1):1–33, 2010.

[Ber66] Arthur J. Bernstein. Analysis of programs for parallel processing. Transactions on Electronic Computers, pages 757–762, 1966.


[BFH+04] Ian Buck, Tim Foley, Daniel Reiter Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. Transactions on Graphics, 23(3):777–786, 2004.

[BGGT02] Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Automatic detection of saturation and clipping idioms. In International Workshop on Languages and Compilers for Parallel Computing, LNCS, pages 61–74. Springer-Verlag, 2002.

[BGS94] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. Computing Surveys, 26(4):345–420, 1994.

[Bik04] Aart J. C. Bik. The Software Vectorization Handbook: Applying Intel Multimedia Extensions for Maximum Performance. Intel Press, 2004.

[BJK+95] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Journal of Parallel and Distributed Computing, JPDC, pages 207–216, New York, NY, USA, 1995. ACM.

[BL10] Nicolas Benoit and Stéphane Louise. Extending GCC with a multi-grain parallelism adaptation framework for MPSoCs. In GCC for Research Opportunities Workshop, January 2010.

[Ble89] Guy E. Blelloch. Scans as primitive parallel operations. Transactions on Computers, 38(11):1526–1538, November 1989.

[BLE+08] Philippe Bonnot, Fabrice Lemonnier, Gilbert Edelin, Gérard Gaillat, Olivier Ruch, and Pascal Gauget. Definition and SIMD implementation of a multiprocessing architecture approach on FPGA. In Design Automation and Test in Europe, DATE, pages 610–615. IEEE Computer Society Press, 2008.

[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. Transactions on Computer Systems, 2:39–59, 1984.

[Bre10] Tony M. Brewer. Instruction set innovations for the Convey HC-1 computer. IEEE Micro, 30(2):70–79, 2010.

[BSB+01] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau L. Boloni, Muthucumaru Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D. Theys, Bin Yao, Debra Hensgen, and Richard F. Freund. A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems. Journal of Parallel Distributed Computing, 61:810–837, June 2001.

[CDL11] Alexandre Cornu, Steven Derrien, and Dominique Lavenier. HLS tools for FPGA: Faster development with better performance. In Reconfigurable Computing: Architectures, Tools and Applications - 7th International Symposium, volume 6578 of LNCS, pages 67–78, Belfast, UK, March 2011. Springer.


[CDM+10] Hassan Chafi, Zach DeVito, Adriaan Moors, Tiark Rompf, Arvind K. Sujeeth, Pat Hanrahan, Martin Odersky, and Kunle Olukotun. Language virtualization for heterogeneous parallel computing. In Proceedings of the international conference on Object oriented programming systems languages and applications, OOPSLA, pages 835–847, New York, NY, USA, 2010. ACM.

[CDMC+05] Cristian Coarfa, Yuri Dotsenko, John M. Mellor-Crummey, François Cantonnet, Tarek A. El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel G. Chavarría-Miranda. An evaluation of global address space languages: co-array Fortran and unified parallel C. In SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPOPP, pages 36–47, New York, NY, USA, 2005. ACM.

[CGO11] Proceedings of the International Symposium on Code Generation and Optimization (CGO), New York, NY, USA, April 2011. ACM.

[CH89] Pohua P. Chang and W.-W. Hwu. Inline function expansion for compiling C programs. In Proceedings of the SIGPLAN Conference on Programming language design and implementation, PLDI, pages 246–257, New York, NY, USA, June 1989. ACM.

[Che94] Wasel Chemij. Parallel Computer Taxonomy. PhD thesis, Aberystwyth University, 1994.

[CJIA11] Fabien Coelho, Pierre Jouvelot, François Irigoin, and Corinne Ancourt. Data and process abstraction in PIPS internal representation. In Workshop on Internal Representations, WIR, Chamonix, France, April 2011.

[Cla96] Philippe Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs. In Proceedings of the 10th international conference on Supercomputing, ICS, pages 278–285, New York, NY, USA, May 1996. ACM.

[Coe93] Fabien Coelho. Étude de la Compilation du High Performance Fortran. PhD thesis, Université Paris VI, 1993.

[Con] Embedded Microprocessor Benchmark Consortium. Coremark. http://www.coremark.org.

[Coo04] Keith D. Cooper. Evolving the next generation of compilers. Keynote talk at CGO'04, 2004.

[Cre96] Béatrice Creusillet. Array Region Analyses and Applications. PhD thesis, MINES ParisTech, 1996.

[CSY10] Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang. An automatic superword vectorization in LLVM. In 16th Workshop on Compiler Techniques for High-Performance and Embedded Computing, pages 19–27, Taipei, 2010.

[Dal09] William J. Dally. The end of denial architecture and the rise of throughput computing. In Design Automation Conference, San Francisco, CA, USA, July 2009.


[Dar99] Alain Darte. On the complexity of loop fusion. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, PACT, pages 149–, Washington, DC, USA, September 1999. IEEE Computer Society Press.

[DKK+99] Carole Dulong, Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei Li, John Ng, and David Sehr. An overview of the Intel IA-64 compiler. Intel Technology Journal, 1999.

[DKYC10] Gregory Frederick Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan Clark. Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In 19th International Conference on Parallel Architecture and Compilation Techniques, PACT, pages 353–364. ACM, September 2010.

[DLP03] Jack Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience, 15(9):803–820, 2003.

[DMM+] Steven Derrien, Daniel Ménard, Kevin Martin, Antoine Floch, Antoine Morvan, Adeel Pasha, Patrice Quinton, Amit Kumar, and Loïc Cloatre. GeCoS: Generic compiler suite. http://gecos.gforge.inria.fr.

[DSV96] Alain Darte, Georges-André Silber, and Frédéric Vivien. Combining retiming and scheduling techniques for loop parallelization and loop tiling, 1996.

[Dun90] R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5–16, February 1990.

[DUSsH93] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan shin Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22:462–479, 1993.

[ER08] Eric Eide and John Regehr. Volatiles are miscompiled, and what to do about it. In International Workshop on Embedded Systems, pages 255–264, 2008.

[Ero95] Ana Maria Erosa. A goto-elimination method and its implementation for the McCat C compiler. Thesis (M.S.), McGill University, Montreal, Canada, May 1995.

[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the SIGPLAN Conference on Program Language Design and Implementation, PLDI, pages 212–223, 1998.

[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. Transactions on Computers, C-21(9):948–960, 1972.

[FO03] Björn Franke and Michael F. P. O'Boyle. Array recovery and high-level transformations for DSP applications. Transactions in Embedded Computing Systems, 2(2):132–162, 2003.


188 BIBLIOGRAPHY<br />

[Fra03] François Ferrand. Optimization and code parallelization for processors with multimedia SIMD instructions. Technical report, Télécom Bretagne, August 2003. Master's thesis report.

[GCB07] Gildas Genest, Richard Chamberlain, and Robin J. Bruce. Programming an FPGA-based supercomputer using a C-to-VHDL compiler: DIME-C. In Adaptive Hardware and Systems (AHS), pages 280–286, 2007.

[GG] J. L. Gustafson and B. S. Greer. ClearSpeed whitepaper: accelerating the Intel Math Kernel Library. http://www.clearspeed.com/docs/resources/ClearSpeedIntelWhitepaperFeb07.pdf.

[GLGP06] Gert Goossens, Dirk Lanneer, Werner Geurts, and Johan Van Praet. Design of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable tool suite. International Symposium on System-on-Chip, 2006.

[GNB08] Zhi Guo, Walid Najjar, and Betul Buyukkurt. Efficient hardware code generation for FPGAs. Transactions on Architecture and Code Optimization, 5(1):1–26, 2008.

[GO03] Etienne Gaudrain and Yann Orlarey. A Faust tutorial. Technical report, GRAME, September 2003.

[GPZ+01] María Jesús Garzarán, Milos Prvulovic, Ye Zhang, Josep Torrellas, Alin Jula, Hao Yu, and Lawrence Rauchwerger. Architectural support for parallel reductions in scalable shared-memory multiprocessors. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT, pages 243–, Washington, DC, USA, 2001. IEEE Computer Society Press.

[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC. Network Theory Ltd., 2004.

[GZA+11] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. Polly - polyhedral optimization in LLVM. In First International Workshop on Polyhedral Compilation Techniques, IMPACT, 2011.

[HEL+09] Manuel Hohenauer, Felix Engel, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr. A SIMD optimization framework for retargetable compilers. Transactions on Architecture and Code Optimization, 6(1), 2009.

[HF11] Matt J. Harvey and Gianni De Fabritiis. Swan: A tool for porting CUDA programs to OpenCL. Computer Physics Communications, 182(4):1093–1099, 2011.

[HP06] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.

[HRF+10] Everton Hermann, Bruno Raffin, François Faure, Thierry Gautier, and Jérémie Allard. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In Euro-Par, volume 6272 of LNCS, pages 235–246. Springer, 2010.

[HRTV11] Robert Hundt, Easwaran Raman, Martin Thuresson, and Neil Vachharajani. MAO – an extensible micro-architectural optimizer. In CGO [CGO11].

[Ier06] Roberto Ierusalimschy. Programming in Lua, Second Edition. Lua.Org, 2006.

[IJT91] François Irigoin, Pierre Jouvelot, and Rémi Triolet. Semantical interprocedural parallelization: an overview of the PIPS project. In International Conference on Supercomputing, ICS, pages 244–251, 1991.

[iLJE03] Sang-Ik Lee, Troy A. Johnson, and Rudolf Eigenmann. Cetus - an extensible compiler infrastructure for source-to-source transformation. In 16th International Workshop on Languages and Compilers for Parallel Computing, LCPC, volume 2958 of LNCS, pages 539–553, College Station, TX, USA, 2003.

[INR] INRIA. Aladdin-g5k. https://www.grid5000.fr.

[IS99] Liviu Iftode and Jaswinder Pal Singh. Shared virtual memory: progress and challenges. In Proceedings of the IEEE, pages 498–507. IEEE Computer Society Press, 1999.

[ISCKG10] François Irigoin, Frédérique Silber-Chaussumier, Ronan Keryell, and Serge Guelton. PIPS tutorial at PPoPP 2010. http://pips4u.org/doc/tutorial, 2010.

[ISO99] ISO. ISO/IEC 9899 Programming languages — C. ISO, 1999.

[ISO08] ISO. ISO/IEC TR 18037:2008 Programming languages — C — Extensions to support embedded processors. ISO, 2008.

[IT88] François Irigoin and Rémi Triolet. Supernode partitioning. In Proceedings of the 15th SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL, pages 319–329, New York, NY, USA, 1988. ACM.

[JD89] Pierre Jouvelot and Babak Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In International Conference on Supercomputing, ICS, pages 186–194, 1989.

[JM99] Simon Peyton Jones and Simon Marlow. Secrets of the Glasgow Haskell compiler inliner. Journal of Functional Programming, 12(4–5):393–434, 2002.

[JMH+05] Weihua Jiang, Chao Mei, Bo Huang, Jianhui Li, Jiahua Zhu, Binyu Zang, and Chuanqi Zhu. Boosting the performance of multimedia applications using SIMD instructions. In International Conference on Compiler Construction, CC, pages 59–75, 2005.

[JPJ+11] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. Automatic CPU–GPU communication management and optimization. In Proceedings of the 32nd SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pages 142–151, New York, NY, USA, June 2011. ACM.

[JRR99] Simon L. Peyton Jones, Norman Ramsey, and Fermin Reig. C--: A portable assembly language that supports garbage collection. In Principles and Practice of Declarative Programming, International Conference, LNCS, pages 1–28, Paris, France, September 1999. Springer.

[KBM07] Volodymyr V. Kindratenko, Robert J. Brunner, and Adam D. Myers. Mitrion-C application development on SGI Altix 350/RC100. In International Symposium on Field-Programmable Custom Computing Machines, FCCM, pages 239–250. IEEE Computer Society Press, 2007.

[Ker03] Brian Kernighan. Interview with Brian Kernighan. Linux Journal, July 2003. http://www.linuxjournal.com/article/7035.

[KK92] Ken Kennedy and Kathryn S. McKinley. Optimizing for parallelism and data locality. In International Conference on Supercomputing, ICS, pages 323–334, New York, NY, USA, 1992. ACM.

[KKS00] Ki-Il Kum, Jiyang Kang, and Wonyong Sung. AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital signal processors. Circuits and Systems II: Analog and Digital Signal Processing, 47(9):840–848, September 2000.

[KOWG10] Khronos OpenCL Working Group. The OpenCL Specification, version 1.1, 2010.

[KRS90] Clyde Kruskal, Larry Rudolph, and Marc Snir. Efficient parallel algorithms for graph problems. Algorithmica, 5:43–64, 1990.

[KS99] Kazuhiro Kusano and Mitsuhisa Sato. A comparison of automatic parallelizing compiler and improvements by compiler directives. In International Symposium on High Performance Computing, ISHPC, pages 95–108, London, UK, 1999. Springer-Verlag.

[KSA+10] Kazuhiko Komatsu, Katsuto Sato, Yusuke Arai, Kentaro Koyama, Hiroyuki Takizawa, and Hiroaki Kobayashi. Evaluating performance and portability of OpenCL programs. In The Fifth International Workshop on Automatic Performance Tuning, June 2010.

[LA00] Samuel Larsen and Saman P. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Programming Language Design and Implementation, PLDI, pages 145–156, 2000.

[LA03] Chris Lattner and Vikram Adve. Architecture for a next-generation GCC. In Proceedings of First Annual GCC Developers' Summit, Ottawa, Canada, May 2003.

[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, CGO, Palo Alto, California, 2004.

[Lam74] Leslie Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, 1974.

[Lat11] Chris Lattner. LLVM. In Amy Brown and Greg Wilson, editors, The Architecture of Open Source Applications, chapter 11. 2011. http://www.aosabook.org.

[Lei92] F. Thomson Leighton. Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[LF80] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27:831–838, 1980.

[LRW05] James Lebak, Albert Reuther, and Edmund Wong. Polymorphous computing architecture kernel-level benchmarks. Project Report PCA-KERNEL-1, MIT Lincoln Laboratory, Lexington, MA, 2005.

[LVM+10] Allen Leung, Nicolas Vasilache, Benoît Meister, Muthu Baskaran, David Wohlford, Cédric Bastoul, and Richard Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU, pages 51–61, New York, NY, USA, 2010. ACM.

[LWFK02] S.M. Loo, B.E. Wells, N. Freije, and J. Kulick. Handel-C for rapid prototyping of VLSI coprocessors for real time systems. In Proceedings of the Thirty-Fourth Southeastern Symposium on System Theory, pages 6–10, 2002.

[MAB+10] Harm Munk, Eduard Ayguadé, Cédric Bastoul, Paul Carpenter, Zbigniew Chamski, Albert Cohen, Marco Cornero, Philippe Dumont, Marc Duranton, Mohammed Fellahi, Roger Ferrer, Razya Ladelsky, Menno Lindwer, Xavier Martorell, Cupertino Miranda, Dorit Nuzman, Andrea Ornstein, Antoniu Pop, Sebastian Pop, Louis-Noël Pouchet, Alex Ramírez, David Ródenas, Erven Rohou, Ira Rosen, Uzi Shvadron, Konrad Trifunović, and Ayal Zaks. ACOTES project: Advanced compiler technologies for embedded streaming. International Journal of Parallel Programming, 2010. Special issue on European HiPEAC network of excellence members projects. To appear.

[Mas92] Vadim Maslov. Delinearization: an efficient way to break multiloop dependence equations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pages 152–161, 1992.

[McB94] Oliver A. McBryan. An overview of message passing environments. Parallel Computing, 20:417–444, 1994.

[MMG08] Peter Messmer, Paul J. Mullowney, and Brian E. Granger. GPULib: GPU computing in high-level languages. Computing in Science and Engineering, 10:70–73, 2008.

[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.

[Muc97] Steven S. Muchnick. Advanced compiler design and implementation, chapter 13. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.

[Ngu02] Thi Viet Nga Nguyen. Efficient and Effective Software Verifications for Scientific Applications using Static Analysis and Code Instrumentation. PhD thesis, MINES ParisTech, 2002.

[NMRW02] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In Compiler Construction, volume 2304 of LNCS, pages 213–228. Springer, April 2002.

[Nov06] Diego Novillo. GCC - an architectural overview, current status and future directions. Ottawa Linux Symposium, July 2006.

[NRD+11] Dorit Nuzman, Ira Rosen, Sergei Dyshel, Ayal Zaks, Erven Rohou, Kevin Williams, Albert Cohen, and David Yuste. Vapor SIMD: Auto-vectorize once, run everywhere. In CGO [CGO11].

[NSL+11] Chris J. Newburn, Byoungro So, Zhenying Liu, Michael D. McCool, Anwar M. Ghuloum, Stefanus Du Toit, Zhi-Gang Wang, Zhaohui Du, Yongjian Chen, Gansha Wu, Peng Guo, Zhanglin Liu, and Dan Zhang. Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language. In CGO [CGO11], pages 224–235.

[NVI10] NVIDIA. PTX: Parallel Thread Execution ISA Version 2.1, NVIDIA compute edition, April 2010.

[NVI11] NVIDIA. NVIDIA CUDA Reference Manual 3.2. http://www.nvidia.com/object/cuda_develop.html, 2011.

[OOVV05] Karina Olmos and Eelco Visser. Composing source-to-source data-flow transformations with rewriting strategies and dependent dynamic rewrite rules. In 14th International Conference on Compiler Construction, volume 3443 of LNCS, pages 204–220. Springer-Verlag, 2005.

[Ope11] OpenMP Architecture Review Board. OpenMP Application Program Interface, 2011.

[Orf95] Sophocles J. Orfanidis. Introduction to signal processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1995.

[PAB+06] Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay Gupta, H. Peter Hofstee, Paul E. Harvey, Charles R. Johns, James A. Kahle, Atsushi Kameyama, John M. Keaty, Bob Le, Sang Lee, Tuyen V. Nguyen, John G. Petrovick, Mydung Pham, Juergen Pille, Stephen D. Posluszny, Mack W. Riley, Joseph Verock, James D. Warnock, Steve Weitzel, and Dieter F. Wendel. Key features of the design methodology enabling a multi-core SoC implementation of a first-generation CELL processor. In Asia and South Pacific Design Automation Conference, ASP-DAC, pages 871–878, 2006.

[Pat10] David A. Patterson. The trouble with multicore. IEEE Spectrum, 2010.

[PBB10] Louis-Noël Pouchet, Cédric Bastoul, and Uday Bondhugula. PoCC: the Polyhedral Compiler Collection, 2010. http://pocc.sf.net.

[PBdD11] Artur Pietrek, Florent Bouchez, and Benoît Dupont de Dinechin. Tirex: A textual target-level intermediate representation for compiler exchange. In Workshop on Intermediate Representations, WIR, Chamonix, France, April 2011.

[PBSB04] Gilles Pokam, Stéphane Bihan, Julien Simonnet, and François Bodin. SWARP: a retargetable preprocessor for multimedia instructions. Concurrency and Computation: Practice and Experience, 16(2-3):303–318, 2004.

[PBV06] E. Moscu Panainte, K.L.M. Bertels, and S. Vassiliadis. Interprocedural compiler optimization for partial run-time reconfiguration. Journal of VLSI Signal Processing, pages 161–172, 2006.

[PBV07] Elena Moscu Panainte, Koen Bertels, and Stamatis Vassiliadis. The Molen compiler for reconfigurable processors. Transactions on Embedded Computing Systems, 2007.

[PEH+93] David A. Padua, Rudolf Eigenmann, Jay Hoeflinger, Paul Petersen, Peng Tu, Stephen Weatherford, and Keith Faigin. Polaris: A new-generation parallelizing compiler for MPPs. Technical Report 1306, University of Illinois, Center for Supercomputing Research and Development, 1993.

[PGS+09] Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-mei W. Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the 7th Symposium on Application Specific Processors, pages 35–42. IEEE Computer Society Press, July 2009.

[PH09] David A. Patterson and John L. Hennessy. Computer organization and design: the hardware/software interface. Morgan Kaufmann Publishers, 2009.

[Qui00] Daniel J. Quinlan. ROSE: Compiler support for object-oriented frameworks. Parallel Processing Letters, 10(2/3):215–226, 2000.

[RBR+05] Noam Rinetzky, Jörg Bauer, Thomas W. Reps, Shmuel Sagiv, and Reinhard Wilhelm. A semantics for procedure local heaps and its abstractions. In Proceedings of the 32nd SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 296–309, New York, NY, USA, January 2005. ACM.

[RCH+10] Gabe Rudy, Chun Chen, Mary Hall, Malick Murtaza Khan, and Jacqueline Chame. A programming language interface to describe transformations and code generation. In The 23rd International Workshop on Languages and Compilers for Parallel Computing, LCPC, pages 136–150, Berlin, Heidelberg, 2010. Springer-Verlag.

[RNZ07] Ira Rosen, Dorit Nuzman, and Ayal Zaks. Loop-aware SLP in GCC - two years later. In GCC Summit, 2007.

[Roj04] Juan Rojas. Multimedia Macros for Portable Optimized Programs. PhD thesis, Northeastern University, 2004.

[SA94] Mark Segal and Kurt Akeley. The OpenGL graphics interface. Technical report, Silicon Graphics Computer Systems, 1994.

[Sar97] Vivek Sarkar. Automatic selection of high-order transformations in the IBM XL FORTRAN compilers. IBM Journal of Research and Development, 41(3):233–264, 1997.

[SCH03] Jaewook Shin, Jacqueline Chame, and Mary W. Hall. Exploiting superword-level locality in multimedia extension architectures. Journal of Instruction-Level Parallelism, 5, 2003.

[Sch09] David Schleef. Oil Runtime Compiler. http://code.entropywave.com/projects/orc, 2009.

[SCS+08] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: a many-core x86 architecture for visual computing. Transactions on Graphics, 27:18:1–18:15, August 2008.

[SHC05] Jaewook Shin, Mary W. Hall, and Jacqueline Chame. Superword-level parallelism in the presence of control flow. In International Symposium on Code Generation and Optimization, CGO, pages 165–175, 2005.

[SHG08] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient parallel scan algorithms for GPUs. Technical report, NVIDIA, 2008.

[Sin98] Satnam Singh. Accelerating Adobe Photoshop with reconfigurable logic. In Symposium on FPGA Custom Computing Machines, pages 18–26. IEEE Computer Society Press, 1998.

[SQ03] Markus Schordan and Daniel J. Quinlan. A source-to-source architecture for user-defined optimizations. In László Böszörményi and Peter Schojer, editors, Modular Programming Languages, volume 2789 of LNCS, pages 214–223, 2003.

[SSM08] Jay Smith, Howard Jay Siegel, and Anthony A. Maciejewski. A stochastic model for robust resource allocation in heterogeneous parallel and distributed computing systems. In International Parallel & Distributed Processing Symposium, IPDPS, pages 1–5. IEEE Computer Society Press, 2008.

[Ste97] Robert Stephens. A survey of stream processing. Acta Informatica, 34:491–541, 1997.

[Sto08] Olaf O. Storaasli. Accelerating genome sequencing 100–1000x with FPGAs. Many-core and Reconfigurable Supercomputing Conference, April 2008.

[TCE+10] Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser, Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sjödin, and Ramakrishna Upadrasta. GRAPHITE two years after: First lessons learned from real-world polyhedral compilation. In GCC Research Opportunities Workshop, GROW, Pisa, Italy, 2010.

[TM08] Donald Thomas and Philip Moorby. The Verilog Hardware Description Language. Springer Publishing Company, Incorporated, 5th edition, 2008.

[TNC+09] Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen. Polyhedral-model guided loop-nest auto-vectorization. International Conference on Parallel Architectures and Compilation Techniques, pages 327–337, 2009.

[vECGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. SIGARCH Computer Architecture News, 20:256–266, 1992.

[War99] Martin P. Ward. Assembler to C migration using the FermaT transformation system. In International Conference on Software Maintenance, ICSM, pages 67–76, 1999.

[WC03] Ge Wang and Perry R. Cook. ChucK: a concurrent, on-the-fly audio programming language. In International Computer Music Conference, ICMC, pages 219–226, 2003.

[WFW+94] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29:31–37, 1994.

[WGN+02] Oliver Wahlen, Tilman Glökler, Achim Nohl, Andreas Hoffmann, Rainer Leupers, and Heinrich Meyr. Application specific compiler/architecture codesign: a case study. SIGPLAN Notices, 37:185–193, 2002.

[Whe01] David A. Wheeler. SLOCCount. http://www.dwheeler.com/sloccount, January 2001.

[Whi09] Tom White. Hadoop: The Definitive Guide. O'Reilly, June 2009.

[Wik09] Wikibooks, editor. GNU C Compiler Internals. http://en.wikibooks.org/wiki/GNU_C_Compiler_Internals, 2006–2009.

[WL91a] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pages 30–44, New York, NY, USA, 1991. ACM.

[WL91b] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. Transactions on Parallel and Distributed Systems, 2(4):452–471, 1991.

[WM95] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23:20–24, March 1995.

[Wol94] Wayne H. Wolf. Hardware-software co-design of embedded systems. In Proceedings of the IEEE, pages 967–989, 1994.

[Wol96] Michael Wolfe. High performance compilers for parallel computing. Addison-Wesley, 1996.

[Wol10] Michael Wolfe. Implementing the PGI accelerator model. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, GPGPU, pages 43–50, New York, NY, USA, 2010. ACM.

[Wol11] Michael Wolfe. Compilers and more: Programming at exascale. HPC Wire, March 2011. http://www.hpcwire.com/hpcwire/2011-03-08/compilers_and_more_programming_at_exascale.html.

[WW94] David W. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing, 20:657–673, 1994.

[Yi11] Qing Yi. Automated programmable control and parameterization of compiler optimizations. In CGO [CGO11].

[YRR+10] Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay V. Rajopadhye, Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. Automatic creation of tile size selection models. In International Symposium on Code Generation and Optimization, CGO, pages 190–199, New York, NY, USA, April 2010. ACM.

[ZC91] Hans Zima and Barbara Chapman. Supercompilers for parallel and vector computers. ACM, New York, NY, USA, 1991.

[ZC98] Julien Zory and Fabien Coelho. Using algebraic transformations to optimize expression evaluation in scientific codes. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT, pages 376–384, 1998.

[Zim97] Reto Zimmermann. Binary adder architectures for cell-based VLSI and their synthesis. PhD thesis, Swiss Federal Institute of Technology (ETH), Zürich, Switzerland, 1997.

[ZO01] Matthias Zenger and Martin Odersky. Implementing extensible compilers. In ECOOP Workshop on Multiparadigm Programming with Object-Oriented Languages, pages 61–80, 2001.

[ZPS+96] V. Zivojnovic, S. Pees, C. Schlager, M. Willems, R. Schoenen, and H. Meyr. DSP processor/compiler co-design: a quantitative approach. In Proceedings of the 9th International Symposium on System Synthesis, ISSS, pages 108–, Washington, DC, USA, 1996. IEEE Computer Society Press.

[ZWZD93] Songnian Zhou, Jingwen Wang, Xiaohu Zheng, and Pierre Delisle. Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software – Practice and Experience, 23:1305–1336, December 1993.


Personal Bibliography

[AAC+11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge Guelton, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon. PIPS Is not (only) Polyhedral Software. In First International Workshop on Polyhedral Compilation Techniques, IMPACT, Chamonix, France, April 2011.

[ACSGK11] Corinne Ancourt, Frédérique Chaussumier-Silber, Serge Guelton, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at International Symposium on Code Generation and Optimization, April 2011. Chamonix, France.

[CSGIK10] Frédérique Chaussumier-Silber, Serge Guelton, François Irigoin, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at Principles and Practice of Parallel Programming, January 2010. Bangalore, India.

[DGG+07] Vincent Danjean, Roland Gillard, Serge Guelton, Jean-Louis Roch, and Thomas Roche. Adaptive loops with KAAPI on multicore and grid: applications in symmetric cryptography. In Parallel Symbolic Computation, PASCO, pages 33–42, 2007.

[GAKC11] Serge Guelton, Mehdi Amini, Ronan Keryell, and Béatrice Creusillet. PyPS, a programmable pass manager. Poster at International Workshop on Languages and Compilers for Parallel Computing, September 2011. Fort Collins, Colorado, USA.

[GGK11] Serge Guelton, Adrien Guinet, and Ronan Keryell. Building retargetable and efficient compilers for multimedia instruction sets. Poster at Parallel Architectures and Compilation Techniques, October 2011. Galveston, Texas, USA.

[GGPV09] Serge Guelton, Thierry Gautier, Jean-Louis Pazat, and Sébastien Varrette. Dynamic Adaptation Applied to Sabotage Tolerance. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP, pages 237–244, Weimar, Germany, February 2009.

[GIK10] Serge Guelton, François Irigoin, and Ronan Keryell. Automatic and source-to-source code generation for vector hardware accelerators. Poster at Colloque National du GDR SOC-SIP, June 2010. Cergy, France.

[GKI11] Serge Guelton, Ronan Keryell, and François Irigoin. Compilation pour cibles hétérogènes : automatisation des analyses, transformations et décisions nécessaires. In 20ème Rencontres Françaises du Parallélisme, Renpar, Saint-Malo, France, May 2011.

[Gue09] Serge Guelton. A genetic and source-to-source approach to iterative compilation. Poster at ACM Student Research Competition, Parallel Architectures and Compilation Techniques, September 2009. Raleigh, North Carolina, USA.

[Gue10] Serge Guelton. Automatic source-to-source code generation for vector hardware accelerators. Poster at International Workshop on Languages and Compilers for Parallel Computing, October 2010. Houston, Texas, USA.

[Gue11] Serge Guelton. Building Source-to-Source Compilers for Heterogeneous Targets. PhD thesis, Télécom Bretagne, 2011.

[GV09] Serge Guelton and Sébastien Varrette. Une approche génétique et source à source de l'optimisation de code. In 19ème Rencontres francophones du parallélisme, Renpar, Toulouse, France, September 2009.

[PVG+08a] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Sandro Bimonte, Hélène Mathian, Serge Guelton, Joël Boulier, and Jérôme Gensel. HyperSmooth: A system for interactive spatial analysis via potential maps. In The 8th International Symposium on Web and Wireless Geographical Information Systems, W2GIS, pages 4–16, 2008.

[PVG+08b] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Jérôme Gensel, Hélène Mathian, Serge Guelton, and Joël Boulier. HyperSmooth : calcul et visualisation de cartes de potentiel interactives. CoRR, abs/0802.4191, 2008.

[TVa+12] Massimo Torquati, Marco Vanneschi, Mehdi Amini, Serge Guelton, Ronan Keryell, Vincent Lanore, François-Xavier Pasquier, Michel Barreteau, Rémi Barrère, Claudia-Teodora Petrisor, Éric Lenormand, Claudia Cantini, and Filippo De Stefani. An innovative compilation tool-chain for embedded multicore architectures. In Embedded World Conference, February 2012.


Index

array linearization, 144, 150
C99, 34
common subexpression elimination, 85, 167
compilation flow, 39, 42, 45, 67, 134, 152
constant propagation, 51
convex array regions, 87, 120, 168
data transfers, 29, 104, 114, 120, 123, 141
dead code elimination, 120, 167
directive generation, 49, 134
distributed memory, 29, 114
flatten code, 150
forward substitution, 46, 84, 167
fuzz testing, 63
goto elimination, 84
header substitution, 59, 99, 108, 152
inlining, 46, 53, 83, 84
instruction selection, 79
invariant code motion, 167
iteration clamping, 144
loop fusion, 49, 52, 87, 134, 167
loop interchange, 102, 167
loop normalization, 144
loop rerolling, 108
loop tiling, 52, 102, 167
loop unrolling, 46, 50, 102
loop unswitching, 104
memory footprint reduction, 121, 163
n-address code generation, 150
outlining, 18, 84, 87, 159, 162
parallelism detection, 134
parallelism extraction, 49
pass manager, 46, 47, 134
privatization, 134
reduction detection, 49, 134
redundant load-store elimination, 113, 124, 163
scalar renaming, 100
split update operator, 150
statement isolation, 114, 120, 141, 159, 163
strength reduction, 150
symbolic tiling, 122, 159
terapix, 16, 17, 19, 73, 77, 78, 123, 133, 141, 144, 146–151, 153–155, 160, 163
variable length array, 34, 141
