Building Source-to-Source Compilers for Heterogeneous Targets
Thesis presented at Télécom Bretagne,
under the seal of the Université Européenne de Bretagne,
to obtain the title of
Docteur de Télécom Bretagne,
in joint accreditation with the Université de Bretagne Occidentale.

Building Source-to-Source Compilers for Heterogeneous Targets

presented by
Serge Guelton

Doctoral school: SICMA
Laboratory: TB/INFO

Thesis defended on 7 October 2011 before a jury composed of:

Albert Cohen, Professor, INRIA
François Irigoin, Professor, MINES ParisTech
Ronan Keryell, Enseignant-chercheur, Télécom Bretagne & HPC Project
Fabrice Lemonnier, Project Manager, Thales Research & Technology
Bernard Pottier, Professor, Université de Bretagne Occidentale
Patrice Quinton, Professor, École Normale Supérieure de Cachan-Bretagne
Sanjay Rajopadhye, Professor, Colorado State University
Eugene Ressler, Professor, United States Military Academy
Order number: 2011telb0203

Under the seal of the Université Européenne de Bretagne
Télécom Bretagne
In joint accreditation with the Université de Bretagne Occidentale
École Doctorale SICMA

Building Source-to-Source Compilers for Heterogeneous Targets

Doctoral Thesis
Field: Information and Communication Sciences and Technologies
Presented by: Serge Guelton
Laboratory: TB/INFO
Thesis director: François Irigoin
Thesis supervisor: Ronan Keryell
Defended on 7 October 2011

Jury:
Albert Cohen, Professor, INRIA
François Irigoin, Professor, MINES ParisTech
Ronan Keryell, Enseignant-chercheur, Télécom Bretagne & HPC Project
Fabrice Lemonnier, Project Manager, Thales Research & Technology
Bernard Pottier, Professor, Université de Bretagne Occidentale
Patrice Quinton, Professor, École Normale Supérieure de Cachan-Bretagne
Sanjay Rajopadhye, Professor, Colorado State University
Eugene Ressler, Professor, United States Military Academy
Abstract

Heterogeneous computers, platforms that combine multiple specialized devices to achieve high throughput or low energy consumption, are difficult to program. Hardware vendors usually provide compilers from a C dialect to their machines, but taking advantage of them frequently requires a complete application rewrite.

In this thesis, we propose a new approach to building bridges between regular applications written in C and these dialects. Starting from an analysis of the hardware constraints, we propose original code transformations that can meet those constraints, and we combine them using a fully programmable pass manager that can handle the complex compilation flows required by code generation for multiple targets. This makes it possible to build a collection of compilers that share the same infrastructure while targeting different architectures.

All code transformations are performed at the source level using the PIPS source-to-source compiler framework. The new transformations are specified using denotational semantics on a simplified language, and they have been combined to build four different compilers: an OpenMP directive generator, a retargetable multimedia instruction generator for SSE, AVX and NEON, an assembly code generator for an FPGA-based image processor, and a CUDA generator.
Résumé

Les machines hétérogènes, des ordinateurs reposant sur la combinaison d'unités de calcul spécialisées pour obtenir des performances élevées et une consommation énergétique moindre, sont difficiles à programmer. Les vendeurs de matériel fournissent généralement un compilateur d'un dialecte du C pour leurs machines, mais il faut dans ce cas réécrire complètement l'application cible pour en profiter.

Dans cette thèse, nous proposons une nouvelle approche pour faire le pont entre des applications classiques écrites en C et ces dialectes. Partant d'une analyse des contraintes imposées par le matériel, nous proposons un ensemble de transformations de code originales qui permettent de répondre à ces contraintes, et nous les combinons à l'aide d'un gestionnaire de passes complètement programmable qui peut gérer les flots de compilation complexes mis en œuvre lors de la génération de code multi-cible. Cela permet d'obtenir un ensemble de compilateurs réutilisant les mêmes éléments de base tout en ciblant des architectures différentes.

Toutes les transformations s'appliquent au niveau source en se basant sur l'infrastructure de compilation PIPS. Les nouvelles transformations proposées sont explicitées en se basant sur la sémantique dénotationnelle et un langage cible simplifié. Elles sont utilisées pour assembler quatre compilateurs différents : un générateur de directives OpenMP, un générateur reciblable d'instructions multimédia pour SSE, AVX et NEON, un générateur de code assembleur pour une machine à base de FPGA spécialisée dans le traitement d'images et un générateur de code CUDA.
Remerciements

Acknowledgments are something better done in one's native tongue, for it is difficult to express the subtlety of feelings in another language…

Cette thèse a été financée par l'Agence Nationale de la Recherche dans le cadre du projet FREIA. Elle a été effectuée en collaboration avec Thales TRT et MINES ParisTech.

Il y a cinq ans de ça, suivant à Grenoble la souriante étudiante qui allait devenir ma femme, j'ai fait mes débuts dans le monde du travail dans une équipe de recherche nommée MESCAL, sous la direction d'un certain Jean-Marc Vincent. Ce chercheur passionné et chaleureux m'a le plus innocemment du monde [1] fait plonger dans un monde merveilleux, plongeon dont je ne ressors qu'après la rédaction de ce manuscrit. Ce sont ces Grenoblois, Thierry, Jean-Louis, Vincent, Frédéric, Arnaud, Bruno et leur joyeuse bande de thésards, Xavier, Sébastien, Maxime, Jean-Noël, Swann, qui m'ont fait pencher du côté lumineux.

Un autre Jean-Louis, rennais celui-là, a su en profiter et équilibrer par la raison la folie qui guette du haut des montagnes grenobloises. Il a pourtant commis une faute irréparable en me jetant dans les rets d'un de mes anciens professeurs bretons, le machiavélique Ronan Keryell, me soustrayant par là même aux saintes intentions d'un ex-thésard grenoblois nouvellement luxembourgeois. Me promettant l'équilibre parfait entre recherche et développement, s'inspirant des démons grenoblois pour mieux m'attirer à lui, il parvint sans peine à me faire franchir la ligne séparant l'ingénierie de la recherche pour me faire commencer la thèse dont ce manuscrit est le fruit.

Que serais-je devenu entre les mains de ce personnage s'il n'avait pas eu l'égarement de me placer sous la direction de François Irigoin ? Il est sûr que ces trois années auraient eu une toute autre saveur, et il m'est encore maintenant difficile de mesurer les bienfaits de sa présence, toujours prompte (bien que parfois inefficace) à contrebalancer les excès d'enthousiasme provoqués par mes travaux.

Ce fut une thèse itinérante, qui a commencé à Rennes, entouré des joyeux symbiotes Dominique, Rayan, Jacques, Pierre et les autres, pour terminer à Brest avec la délicieuse Armelle, les sympathiques Eliya, Zhe, Xu et Jiayi, les indélogeables habitants du D3 128, les joyeux Grégoire, Frédéric, Adrien, Sébastien et les fidèles membres du club de TKD, entrecoupée de séjours bellifontains avec Corinne, Fabien, Pierre, Laurent, Amira… Sans oublier le canal #pipsien et Mehdi, Pierre, Béatrice.

La rédaction de ce manuscrit a tout particulièrement profité des conseils et des relectures avisées de François Irigoin, Ronan Keryell, Béatrice Creusillet, Pierre Jouvelot, Fabien Dagnat et Adrien Guinet. A big thanks to Aimée Johansen for her English advice and for spotting my numerous mistakes. Mes rapporteurs Albert Cohen, Sanjay Rajopadhye et Eugene Ressler n'ont pas hésité à pointer du doigt les nombreux défauts de la version qui leur a été soumise [2] et ont grandement contribué à son amélioration.

Les efforts typographiques consentis à ce document doivent beaucoup à Yannis Haralambous et son Chicago Manual of Style, les atrocités restantes sont l'unique fruit de ma paresse. Le plaisir de la rédaction doit beaucoup au système de composition LaTeX, au paquet TikZ et à l'éditeur de texte Vim.

Mes parents m'ont toujours poussé à faire des études pour garder le plus de portes ouvertes, et m'envoyaient à l'école en me disant « amuse-toi bien ». J'ai essayé de suivre le premier de ces conseils, et il ne fut pas bien difficile de suivre le second.

Et pour les moments difficiles, les baisses de moral, les longues nuits de soumission d'article, les bien plus longs mois de rédaction, pour les sourires radieux, les attentions quotidiennes, 灰太狼十分感谢伟大的红太狼和狼宝宝.

[1] Avec le recul, il y a peut-être là une volonté maligne de conversion de sa part.
[2] Qu'ils soient remerciés pour cette masse de travail supplémentaire :-)
Contents

Remerciements v
Acknowledgments, in French.
Résumé en français 1
Dissertation summary in French.
1 Introduction 21

2 Heterogeneous Computing Paradigm 25
2.1 Heterogeneous Computing Model . . . 26
2.2 Influence on Programming Model . . . 29
2.3 Hardware Constraints . . . 31
2.4 Note About the C Language . . . 34
2.5 OpenCL Programming Model Analysis . . . 37
2.6 Other Programming Models . . . 40
2.7 Conclusion . . . 42
3 Compiler Design for Heterogeneous Architectures 43
3.1 Extending Compiler Infrastructures . . . 44
3.1.1 Existing Compiler Infrastructures . . . 44
3.1.2 A Simple Model for Code Transformations . . . 48
3.1.2.1 Transformations: Definition and Compositions . . . 48
3.1.2.2 Parametric Transformations . . . 50
3.1.2.3 From Model to Implementation . . . 50
3.1.3 Programmable Pass Management . . . 51
3.1.3.1 A Class Hierarchy for Pass Management . . . 51
3.1.3.2 Control Flow and Pass Management . . . 53
3.2 On Source-to-Source Compilers . . . 55
3.2.1 Exploring Source-to-Source Opportunities . . . 55
3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure . . . 57
3.3 PyPS, a High-Level Pass Manager API . . . 58
3.3.1 API Description . . . 58
3.3.2 Usage Example . . . 63
3.4 Related Work . . . 63
3.5 Conclusion . . . 67
4 Representing the Instruction Set Architecture in C 69
4.1 C as a Common Denominator . . . 70
4.1.1 C Dialects and Heterogeneous Computing . . . 70
4.1.2 From the ISA to C . . . 71
4.2 Native Data Types . . . 73
4.2.1 Scalar Types . . . 73
4.2.2 Records . . . 73
4.2.3 Arrays . . . 75
4.3 Registers . . . 76
4.4 Instructions . . . 77
4.4.1 Instruction Selection . . . 79
4.4.2 N-Address Code Generation . . . 80
4.5 Memory Architecture . . . 83
4.6 Function Calls . . . 83
4.6.1 Removing Function Calls . . . 83
4.6.2 Outlining . . . 84
4.6.2.1 Outlining Algorithm . . . 84
4.6.2.2 Using Outlining to Reduce Compilation Time . . . 87
4.7 Library Calls . . . 88
4.8 Conclusion . . . 89
5 Parallelism with Multimedia Instructions 91
5.1 Super-word Level Parallelization . . . 92
5.1.1 Related Work . . . 92
5.1.2 A Meta-Multimedia Instruction Set . . . 97
5.1.2.1 Sequential Implementation . . . 98
5.1.2.2 Target-Specific Implementation . . . 99
5.1.2.3 Conclusion . . . 99
5.1.3 Notations . . . 100
5.1.4 Generation of Optimized SIMD Instructions . . . 101
5.1.4.1 Statement Closeness . . . 101
5.1.4.2 Parametric Vector Instruction Generation Algorithm . . . 101
5.1.5 Pattern Discovery . . . 102
5.1.6 Loop Tiling . . . 102
5.1.7 Combining Loop Vectorization and Super-word Level Parallelism . . . 104
5.2 Reduction Parallelization . . . 106
5.2.1 Reduction Detection Inside a Sequence . . . 106
5.2.2 Delegating to a Library . . . 108
5.3 Computational Intensity . . . 110
5.3.1 Execution Time versus Transfer Time Estimation . . . 110
5.3.2 Limitations of the Model . . . 111
5.4 Conclusion . . . 111
6 Transformations for Memory Size and Distribution 113
6.1 Statement Isolation . . . 114
6.1.1 Formulation of Statement Isolation . . . 114
6.1.1.1 Expression Renaming . . . 116
6.1.1.2 Type Renaming . . . 118
6.1.1.3 Statement Renaming . . . 119
6.1.1.4 Restricted Statement Isolation . . . 119
6.1.2 Statement Isolation and Convex Array Regions . . . 120
6.2 Memory Footprint Reduction . . . 121
6.2.1 Memory Footprint Estimate . . . 121
6.2.2 Symbolic Rectangular Tiling . . . 122
6.3 Redundant Load-Store Optimization . . . 123
6.3.1 Redundant Load Elimination . . . 125
6.3.1.1 Sequences . . . 125
6.3.1.2 Tests . . . 126
6.3.1.3 Loops . . . 126
6.3.1.4 Interprocedurally . . . 127
6.3.2 Redundant Store Elimination . . . 127
6.3.3 Sequences . . . 127
6.3.4 Tests . . . 127
6.3.5 Loops . . . 128
6.3.6 Interprocedurally . . . 128
6.3.7 Combining Load and Store Elimination . . . 128
6.3.7.1 Sequence . . . 128
6.3.7.2 Loops . . . 129
6.3.8 Main Algorithm . . . 130
6.4 Conclusion . . . 130
7 Compiler Implementations and Experiments 133
7.1 A Simple OpenMP Compiler . . . 134
7.1.1 Architecture Description . . . 134
7.1.2 Compiler Implementation . . . 134
7.1.3 Experiments & Validation . . . 137
7.2 A GPU Compiler . . . 139
7.2.1 Architecture Description . . . 139
7.2.2 Compiler Implementation . . . 141
7.2.3 Experiments & Validation . . . 144
7.3 An FPGA Image Processor Accelerator Compiler . . . 144
7.3.1 Architecture Description . . . 146
7.3.2 Terapix Compiler Implementation . . . 147
7.3.2.1 Input Code Splitting . . . 147
7.3.3 Experiments & Validation . . . 150
7.4 A Retargetable Multimedia Instruction Set Compiler . . . 152
7.4.1 Architecture Description . . . 152
7.4.2 Compiler Implementation . . . 152
7.4.3 Multimedia Instruction Set on Desktop and Embedded Processors . . . 152
7.4.4 Results & Analyses . . . 157
7.5 Conclusion . . . 159

8 Conclusion 161

A The PIPS Compiler Infrastructure 165

B The LuC Language 169
B.1 Syntactic Clauses . . . 169
B.2 Semantic Clauses . . . 170

C Using PyPS to Drive a Compilation Benchmark 171

D Using C to Emulate SSE Intrinsics 173

Glossary 177
Acronyms 179
Bibliography 183
Personal Bibliography 197
Index 198
List of Listings

3.1 GCC pass manager initialization. . . . 47
3.2 Dynamic phase ordering using the LLVM pass manager command-line interface. . . . 47
3.3 Usage of exceptions at the pass manager level. . . . 54
3.4 Example of workspace composition at the pass manager level using PyPS. . . . 60
3.5 Streaming SIMD Extension (SSE) C intrinsics generated for a scalar product. . . . 62
3.6 Sequential intrinsic implementation of an SSE scalar product. . . . 62
3.7 Native intrinsic implementation of an SSE scalar product. . . . 62
3.8 Fuzz testing with PyPS. . . . 64
4.1 Broadcast of a single value in SSE. . . . 72
4.2 Vector type emulation in C. . . . 72
4.3 Sequential implementation of _mm_set1_ps. . . . 73
4.4 Example of structure removal. . . . 75
4.5 Structure removal in the presence of a function call. . . . 76
4.6 Two-step transformation from multi-dimensional arrays to pointers. . . . 77
4.7 Using a naming convention to distinguish registers for the Terapix architecture. . . . 78
4.8 Outlining of the inner loop of an erosion kernel. . . . 86
4.9 Enhanced outlining of the inner loop of an erosion kernel. . . . 86
5.1 Excerpt from the libmpcodecs/vf_gradfun.c file from the MPlayer source tree. . . . 93
5.2 Sample representation of a vector register using a C type. . . . 97
5.3 Sample tree pattern in Polish notation used for SLP. . . . 98
5.4 FMA sequential version for a vector of 4 floats. . . . 99
5.5 FMA generic operation implemented for the NEON instruction set. . . . 99
5.6 Vectorized output for a matrix multiply. . . . 100
5.8 Conditional loop tiling on a matrix-vector product. . . . 104
5.9 Horizontal erosion code sample. . . . 110
6.1 Illustration of statement isolation on a scalar assignment. . . . 114
6.2 Code after statement isolation. . . . 121
6.3 Symbolic tiling of the outermost loop of a horizontal erosion with information about in and out regions. . . . 124
6.4 Illustration of the redundant load-store elimination algorithm. . . . 131
7.1 Original PyPS script for OpenMP code generation. . . . 136
7.2 Makefile stub for OpenMP compilation. . . . 137
7.3 Terapix assembly for a 3 × 3 convolution kernel. . . . 148
7.4 Illustration of Terapix code generation, host part. . . . 153
7.5 Illustration of Terapix code generation, accelerator part. . . . 154
7.6 Illustration of Terapix compacted assembly. . . . 155
A.1 A simple loop to illustrate PIPS analyses. . . . 166
A.2 Example of precondition analysis. . . . 166
A.3 Example of transformers analysis. . . . 167
A.4 Example of cumulated memory effects analysis. . . . 168
A.5 Example of convex array regions analysis. . . . 168
List of Figures

2.1 Heterogeneous computing model. . . . 27
2.2 von Neumann architecture vs. OpenCL architecture. . . . 28
2.3 Impact of heterogeneous architecture on compilation. . . . 30
2.4 Example of a hardware feature diagram. . . . 32
2.5 Multicore with vector unit feature diagram. . . . 33
2.6 Comparison of C89 and C99 syntax on a complex matrix-vector multiply. . . . 35
2.7 Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99. . . . 36
2.8 Comparison of two versions of the CoreMark benchmark: C89 vs. C99. . . . 37
2.9 Comparison of two versions of the Linpack benchmark: C89 vs. C99. . . . 38
2.10 Compilation flow in OpenCL. . . . 39
3.1 A classical three-phase retargetable compiler architecture. . . . 44
3.2 Improved compilation flow for heterogeneous computing. . . . 45
3.3 PIPS as a generic compiler infrastructure sample. . . . 46
3.4 PyPS class hierarchy. . . . 52
3.5 Usage of conditionals at the pass manager level. . . . 53
3.6 Usage of loops at the pass manager level. . . . 54
3.7 Source-to-source cooperation with external tools. . . . 56
3.8 Heterogeneous compilation stages. . . . 57
3.9 Source-to-source heterogeneous compilation scheme. . . . 58
4.1 Using outlining to reduce analysis and compilation time on an unrolled sequence of matrix multiplications. . . . 88
5.1 Various ways to program for the SSE MIS. . . . 94
5.2 Comparison of LLVM, GCC and ICC vectorizers using Linpack. . . . 95
5.3 Multimedia instruction set history for x86 processors. . . . 95
5.4 Vectorized implementation of a float vector multiply-addition operation (epilogue omitted). . . . 98
5.5 Parallelizing reductions in a sequence. . . . 107
5.6 Effect of reduction parallelization on an unrolled loop. . . . 108
5.7 Manually vectorizing an inner product vs. using a library. . . . 109
7.1 Multicore hardware feature diagram. . . . 134
7.2 Source-to-source compilation scheme for OpenMP. . . . 135
7.3 Performance of an OpenMP directive generator prototype on the PolyBench benchmark. . . . 138
7.4 NVIDIA Fermi architecture. . . . 140
7.5 GPU hardware feature diagram. . . . 140
7.6 Source-to-source compilation scheme for GPU. . . . 142
7.7 Splitting an array sum example code into host, loop proxy and accelerator parts. . . . 143
7.8 Median execution time on a GPU for DSP kernels. . . . 145
7.9 Terapix architecture. . . . 146
7.10 Terapix hardware feature diagram. . . . 148
7.11 Terapix redundant computations. . . . 149
7.12 Source-to-source compilation scheme for Terapix. . . . 149
7.13 MIS hardware feature diagram. . . . 156
7.14 Source-to-source compilation scheme for MIS. . . . 156
7.15 Pass reuse among 4 PyPS-based compilers. . . . 159
List of Algorithms<br />
1 Fuzz testing at the pass manager level. . . . . . . . . . . . . . . . . . . . . . 63<br />
2 Compilation complexity reduction with outlining. . . . . . . . . . . . . . . . 87<br />
3 Parametric vec<strong>to</strong>r instruction generation algorithm. . . . . . . . . . . . . . . 103<br />
4 Hybrid vec<strong>to</strong>rization at the pass manager level. . . . . . . . . . . . . . . . . 105<br />
5 Memory footprint reduction algorithm. . . . . . . . . . . . . . . . . . . . . . 123<br />
6 Redundant load s<strong>to</strong>re elimination algorithm at the pass manager level. . . . 130<br />
7 Parallel loop generation algorithm <strong>for</strong> openmp. . . . . . . . . . . . . . . . . 135<br />
8 terapix kernel extraction algorithm at the pass manager level. . . . . . . . 150<br />
9 C-<strong>to</strong>-terapix translation algorithm at the pass manager level. . . . . . . . 151<br />
List of Tables

3.1 Comparison of source-to-source compilation infrastructures . . . 55
3.2 sloccount reports for the GCC and LLVM compilers . . . 65
4.1 C dialects and targeted hardware . . . 71
7.1 sloccount report for an OpenMP directive generator prototype written in PyPS . . . 137
7.2 sloccount report for a CUDA generator prototype written in PyPS . . . 144
7.3 Description of a TERAPIX microinstruction . . . 147
7.4 Ratio between TERAPIX microcode cycle counts for automatic and manual code generation . . . 151
7.5 sloccount report for a TERAPIX assembly generator prototype written in PyPS . . . 151
7.6 sloccount report for an AVX intrinsic generator prototype written in PyPS . . . 159
7.7 Summary of the sloccount reports for the compiler prototypes written in PyPS . . . 160
Notations Quick Reference Sheet

This thesis makes use of various notations from different fields of computer science. They are concisely and informally summarized here. Occasionally, a symbol may be used with a different meaning than the one shown here; context should always be sufficient to make these cases clear. For example, P is used to denote both the power set of a set and, as indicated here, the domain of all programs.

Sets and domains are in general denoted with capital letters in a cursive font, for example S. Set members are lowercase, for example s. Semantic functions are denoted in bold capitals, for example S.

Where the notations given here are unclear, the reader is urged to continue, referring back to this sheet as needed.
Set Operators

P(A)     The power set of A.
|A|      The number of elements in the set A.
Ā        The convex hull of set A.
⌈A⌉      The rectangular hull of set A.
A ∪ B    The union of sets A and B.
A ∪̄ B    The convex union of sets A and B.
A^k      The set of all k-tuples of A.
A^*      ∪_{k∈N} A^k.
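As a concrete illustration of the rectangular hull, the following Python sketch (illustrative only, not taken from the thesis) computes ⌈A⌉ for a set of 2-D integer points as the smallest axis-aligned box containing them; the convex hull Ā is always contained in this box:

```python
def rectangular_hull(points):
    """Smallest axis-aligned box containing all points.

    Returns ((xmin, ymin), (xmax, ymax)). The rectangular hull always
    contains the convex hull, which in turn contains the set itself,
    which is why it is a convenient coarse over-approximation.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

A = {(0, 0), (3, 1), (1, 4)}
lo, hi = rectangular_hull(A)
# The box [0,3] x [0,4] covers every point of A, plus points A does not contain.
```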
Syntactic Domains

P    The domain of programs p.
F    The domain of functions f.
S    The domain of statements s.
E    The domain of expressions e.
I    The domain of identifiers id.
L    The domain of memory locations l.
R    The domain of references r.
C    The domain of constants cst.
Op   The domain of operators op.
D    The domain of declarations d.
T    The domain of types t.
Semantic Domains

V                            The domain of denotable values v.
Σ : L → (V ∪ {unbound})      The domain of stores σ.
Semantic Functions

S : S × Σ → Σ       S evaluates statement s in store σ0 to produce store σ1.
P : P × V* → V*     P evaluates the body of function main from the program p in the store {⟨i_stdin, [v0, . . . , vn−1]⟩} and returns the list of values accumulated in i_stdout, where i_stdin is an identifier reserved for standard input and i_stdout is an identifier reserved for standard output.
E : E × Σ → V       E evaluates expression e in store σ to produce value v.
D : D × Σ → Σ       D evaluates declaration d in store σ0 to produce store σ1.
R : R × Σ → P(L)    R evaluates reference r in store σ to produce locations l.
T : T × Σ → N       T returns the number of memory cells occupied by type t.
I : I → P(L)        I returns the locations l associated with the identifier i.
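To make these signatures concrete, here is a toy Python interpreter (an illustrative sketch, not the semantics machinery used in the thesis) implementing miniature versions of E and S over a store represented as a dictionary from locations to values:

```python
UNBOUND = object()  # stands for the 'unbound' element of the store codomain

def E(expr, store):
    """E : E x Sigma -> V -- evaluate an expression in a store."""
    kind = expr[0]
    if kind == "cst":            # constant
        return expr[1]
    if kind == "ref":            # variable reference: look its location up
        v = store.get(expr[1], UNBOUND)
        assert v is not UNBOUND, "unbound reference"
        return v
    if kind == "+":              # binary operator
        return E(expr[1], store) + E(expr[2], store)
    raise ValueError(kind)

def S(stmt, store):
    """S : S x Sigma -> Sigma -- evaluate a statement, produce a new store."""
    kind = stmt[0]
    if kind == "assign":         # id := e
        return {**store, stmt[1]: E(stmt[2], store)}
    if kind == "seq":            # s1; s2
        return S(stmt[2], S(stmt[1], store))
    raise ValueError(kind)

prog = ("seq", ("assign", "x", ("cst", 2)),
               ("assign", "y", ("+", ("ref", "x"), ("cst", 3))))
final = S(prog, {})  # evaluating the program from the empty store
```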
Syntactic Operators

if E then E1 else E2    The statement denoted by the syntactic clause “if E then E1 else E2”.
Miscellaneous Operators

f ◦ g                      The function f composed with the function g.
f[x → y]                   A function identical to f except that it returns y for the element x.
loc : I × V × Σ → Σ        loc produces v new locations from an identifier i, adds them to the store σ and returns this new store.
unbind(σ, id) : Σ × I → Σ  unbind(σ, id) = σ[l → unbound | l ∈ I(id)], i.e. the function that removes from memory all the locations accessible through the identifier id.
formal(id) : I → I         The formal parameter of function id.
body(id) : I → S           The body of function id.
refs(s) : S → P(R)         The set of all references syntactically found in statement s.
Array Regions

An array region is a function that maps a statement and a memory state to a set of references associated with this memory state.

R_r^= : S × Σ → P(R)   The exact array region of the references read by statement s evaluated in store σ.
R_r : S × Σ → P(R)     An over-approximated array region of the references read by statement s evaluated in store σ.
R_i^= : S × Σ → P(R)   The exact array region of the references imported by statement s evaluated in store σ.
R_i : S × Σ → P(R)     An over-approximated array region of the references imported by statement s evaluated in store σ.
R_w^= : S × Σ → P(R)   The exact array region of the references written by statement s evaluated in store σ.
R_w : S × Σ → P(R)     An over-approximated array region of the references written by statement s evaluated in store σ.
R_o^= : S × Σ → P(R)   The exact array region of the references exported by statement s evaluated in store σ.
R_o : S × Σ → P(R)     An over-approximated array region of the references exported by statement s evaluated in store σ.
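The distinction between exact and over-approximated regions can be illustrated with a small Python sketch (illustrative only, not PIPS's actual region analysis): for a strided loop, the exact written region enumerates the touched elements, while a convex interval over-approximation also covers elements that are never written:

```python
def loop_regions(lb, ub, read_index, write_index):
    """Exact and over-approximated regions for the loop
        for i in [lb, ub): a[write_index(i)] = f(b[read_index(i)])
    Exact regions enumerate the touched elements; the over-approximation
    is a convex interval, as used when exactness cannot be established."""
    reads = {read_index(i) for i in range(lb, ub)}
    writes = {write_index(i) for i in range(lb, ub)}
    interval = lambda s: (min(s), max(s))  # convex over-approximation
    return reads, writes, interval(reads), interval(writes)

# for i in [0, 4): a[2*i] = f(b[i+1])  -- strided write, shifted read
reads, writes, r_over, w_over = loop_regions(0, 4,
                                             lambda i: i + 1,   # read index
                                             lambda i: 2 * i)   # write index
# The interval [0, 6] over-approximates the written set {0, 2, 4, 6}:
# it also contains the odd indices, which the loop never touches.
```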
Résumé en français

This section contains, for each chapter of the thesis, a translation of its introduction and a summary of each of its sections. The interested reader is invited to refer to the corresponding chapter for details on any particular topic.

Building Source-to-Source Compilers for Heterogeneous Targets
1. Introduction

Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by the activity of processors run at their maximum frequency. William J. Dally called this phenomenon "The End of Denial Architecture" [Dal09]. To overcome these physical limitations, chip makers increased the number of computing cores per processor, leading to multi-core architectures. Using several computing cores makes it possible to increase the overall available computing power without raising the frequency. The same heat dissipation problem will nevertheless arise when the integration density of these computing nodes becomes such that silicon can no longer dissipate the generated heat, even at lower frequencies. To keep improving application performance, coprocessors, or accelerators, have appeared and paved the way for heterogeneous computing, where several computing units with different capabilities collaborate. There are many kinds of coprocessors, ranging from specialized processors, very efficient on a small number of applications, to more general-purpose ones. Two paths toward high-performance computing are thus now possible: using homogeneous multi-cores, or using a combination of specialized accelerators. The border between these two categories is blurry, as shown by Intel's Larrabee [SCS+08], where several homogeneous computing units are each fitted with a 512-bit vector computing unit.
Thirty years ago, many parallel machines were built; the Connection Machine and the MasPar MP2 are two notable examples. At that time, supercomputers like the Cray-1 cost $5 to $8 million and delivered an average performance of 80 Mflops. Released in 2006, the PlayStation 3 (PS3) game console, based on the Cell BE architecture, cost $500 and could reach 230 GFlops (FLoating point Operations per Second) thanks to the combination of a general-purpose processor and several specialized processors. Fewer than 100 Cray-1s were built, whereas more than 50 million PS3s have been manufactured. This market shift inevitably created a need for development tools. In 2006, parallelism was no longer the business of a few specialists, as it had been in the Cray era.
There is more to learn from the PS3 experience: although launched by Sony one year before Microsoft's Xbox 360, the latter was more successful, partly because of a richer game catalog. It turned out that many video game studios found the PS3 and its heterogeneous architecture too difficult to program. Indeed, the Cell BE architecture uses separate memory spaces and requires manual control of memory transfers, management of 128-bit vector registers, and manual cache management. This complexity was perhaps too much for the average developer.
There are three ways to cope with such complexity: hire an expert engineer of the domain who fully masters the target architecture; develop a specialized library that hides the machine's behavior behind a simplified Application Programming Interface (API); or build a compiler that translates a high-level language into machine language.

The first option is the most flexible, but also the most expensive. The second is efficient but lacks flexibility, especially since extracting maximum performance from complex hardware can rarely be done through a simple API. The last approach combines the advantages of the previous two, provided a compiler can actually be built at a reasonable cost. nVidia, for instance, successfully adopted this approach for their General Purpose GPUs (GPGPUs): most nVidia GPGPU programmers use an extension of the C/C++ language, the Compute Unified Device Architecture (CUDA) language, together with the nvcc compiler provided by nVidia.
Heterogeneous computing indeed puts entirely new constraints on the compilation flow. According to the dragon book [ALSU06],

Definition. A compiler is a program that can read a program in one language, the source language, and translate it into an equivalent program in another language, the target language.

Nothing is said about a compiler that may need to transform its input program into several output programs, one for each accelerator present on the targeted heterogeneous machine, each written in a different language. Existing machine architectures did not call for such processing. When one source file yields several output files, the compiler needs to model the system as a whole in order to make sound decisions, for instance with respect to overall performance.
The most complex case is certainly that of existing code written without any explicit parallel primitives. Parallelizing compilers failed in the 1980s because of the difficulty of extracting parallelism from sequential code. Automatic parallelization is impossible in the general case, since parallel algorithms can be completely different from sequential ones. Even in cases where the parallel algorithm is identical to the sequential one, cases where the parallelism could be detected automatically by a tool, the task may be made difficult by code transformations meant to improve performance in the sequential setting. As parallelism becomes more and more widespread, the way algorithms are designed also evolves and starts to take the parallel dimension into account. For this reason, it seems reasonable to focus on compiling explicitly parallel programs rather than on parallelism extraction.
Heterogeneous machines sharpen the analogy given in [ABC+06], where software stands as a bridge between hardware and applications. One approach to building such bridges is compilation, the compiler being what brings applications and hardware together. Many of the components needed to build such bridges in a heterogeneous environment already exist in a literature more than thirty years rich, but not all of them. In [Pat10], David Patterson surveys many aspects of parallelism research since the 1960s. As one might expect, his conclusion is that there is no free lunch in this field. Although isolated cases have met with some success, no global solution to the parallelism problem has emerged; there have been particular solutions to particular cases. One might therefore think that writing a compiler able to handle heterogeneous computing is an impossible task. Yet there are encouraging signs. These particular solutions can be seen as the foundations on which to build more complex ones. If these solutions are designed to be reusable, then creating new compilers for new targets becomes simpler. The idea of building compilers by composing building blocks is at the heart of this thesis. Many questions follow: What are the relevant building blocks in the context of heterogeneous computing? How should these building blocks be composed depending on the target? Is it possible to capture all hardware targets in a single internal representation? Finally, is there a methodology for building compilers for a heterogeneous target?
In a survey of hardware/software co-design techniques published in 1994 [Wol94], Wayne H. Wolfe stated that

    To be able to continue to exploit the CPU performance increases made possible by Moore's law (. . .), we must develop new design methods and new algorithms that let designers predict implementation costs, incrementally refine machine models across several levels of abstraction, and create a first working implementation.
Seventeen years later, Wolfe's challenge has still not been met. Neither source code nor developers have evolved as fast as hardware. As a consequence, applications have not benefited from new hardware architectures except at the cost of intense efforts. The amount of existing code is such that using anything other than a compiler to bridge the widening gap between hardware and software does not seem economically viable. The advice Wolfe gives to hardware designers holds just as well for compiler designers, the very people who must design tools able to translate the existing code base into code that can exploit the capabilities of parallel coprocessors: incrementally refine compiler designs using several levels of abstraction and create a first working implementation. This task is made all the harder because the tools used in a classical compilation flow are often specific to one programming language, and are not always suited to heterogeneous compilation.
This thesis follows the three-step approach proposed by Wolfe to assemble compilers targeting a range of heterogeneous machines, from the classical multi-core machine to a processor based on Field Programmable Gate Arrays (FPGAs) and specialized in image processing, by way of Graphical Processing Units (GPUs) and the vector instruction units integrated into General Purpose Processors (GPPs). The goal is not to offer the best compiler for each target, but rather to build reasonably efficient compilers while reusing as many building blocks as possible.
To reach this goal, we begin with a survey of heterogeneous machines and existing programming models in Chapter 2. Three families of hardware constraints emerge from this survey, corresponding to as many sources of heterogeneity: the Instruction Set Architecture (ISA), the memory organization and the source of acceleration, namely parallelism. Chapter 3 shows that existing compilation infrastructures lack the flexibility required to compose the fine-grained code transformations needed to satisfy hardware constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 examine the three identified families of constraints and detail several original, target-independent code transformations. They form the building blocks of our approach.
This thesis is not merely a theoretical exercise. Building on the ideas developed in this manuscript, several compilers have been implemented. Their design and implementation, along with performance reports, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws several conclusions.

All the transformations and compilers described in this document have been implemented in the Paralléliseur Interprocédural de Programmes Scientifiques (PIPS) compilation infrastructure, developed by the Centre de Recherche en Informatique (CRI) of MINES ParisTech with contributions from the startup HPC Project, Télécom SudParis and Télécom Bretagne. A more detailed view of the project is given in Appendix A. A large part of the ideas developed in this thesis are already integrated into the Par4All tool developed by HPC Project.
2. Computing on Heterogeneous Machines

On February 15, 2007, nVidia introduced, through the CUDA language and its associated Software Development Kit (SDK), a new way to develop general-purpose applications on GPUs, a kind of hardware previously mostly limited to graphics computing. Since then, GPGPUs have been used to simulate physical phenomena and to perform video processing or encryption, and many heterogeneous machines now offer efficient solutions for scientific computing: FPGAs from Xilinx, Systems on Chip (SoC) or MultiProcessor Systems-on-Chip (MP-SoC) from Texas Instruments, or microcontrollers from Atmel.
In November 2010, the Tianhe-1A supercomputer, a cluster containing more than 7k Tesla boards, reached the top of the TOP500. Since then, heterogeneous computing has stood as a viable alternative to multi-core computing. However, heterogeneous computing is very different from homogeneous computing, particularly regarding the following three critical aspects: memory, parallelism and instruction sets. This leads to an intertwining of concepts that are specific to each target yet independent of the high-level code organization. This complexity was nicely illustrated at the Supercomputing conference in 2010: during a session, a panel of experts discussed the three Ps of heterogeneous computing.
The first P stands for Performance: unlike general-purpose processors, which must deliver average performance across all kinds of applications, the goal of a hardware accelerator is to deliver peak performance on specific applications. This goal is reached through parallelism, of the Single Instruction stream, Multiple Data stream (SIMD), Multiple Instruction stream, Multiple Data stream (MIMD) or pipeline kind, or through hard-wired optimized routines.
The second P stands for Power: energy consumption is a critical constraint both for embedded systems, to maximize usage between two battery charges, and for supercomputers, to minimize electricity costs. Hardware accelerators are good candidates for improving the FLOPS-per-Watt ratio, e.g. because less energy is spent on operations unrelated to the computation.
The third P stands for Programmability, the main weakness of hardware accelerators. Indeed, developing for a hardware accelerator implies a shift from sequential programming to circuit programming through the VHSIC Hardware Description Language (VHDL) [TM08] for FPGA boards, to 3D programming through the Open Graphics Library (OpenGL) or DirectX for GPUs, or to parallel programming, e.g. through CUDA or the Open Computing Language (OpenCL). Moreover, the execution model introduces the additional complexity of shared memory management.
Chapter 2 presents the heterogeneous computing model in detail in Section 2.1 and its consequences on the programming model in Section 2.2. The concept of hardware constraint is introduced in Section 2.3 as a way to model the interactions between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of the C language as an input language for programming these machines. Section 2.5 gives an analysis of a standardized programming model, OpenCL. Other models are presented in Section 2.6.
2.1 Heterogeneous computing models

There are many kinds of heterogeneous machines, depending on the type of accelerators involved: GPGPUs, FPGA boards, Application-Specific Integrated Circuits (ASICs), etc. Scheduling across these different elements is generally handled by a master processor, typically a GPP, according to their capabilities. This implies a switch to a distributed computing model. Each accelerator draws its speedup from architectural specificities that imply a new architectural model. Accelerators may also have a memory separate from the others, with its own hierarchy, which adds the difficulty of a new memory model.
2.2 Influence on the programming model

The complexity of the memory model has a significant impact on the programming model: distributed memory must either be abstracted away or managed through remote calls. Memory size constraints can also be blocking, notably in the field of embedded processors. Likewise, the specificities of the architectural model mean that the same code must be adapted to several ISAs, often implying several versions of the same code, one per target. Finally, the execution model, closely tied to various forms of parallelism, brings back the many difficulties of expressing parallelism at the programming model level.
2.3 Hardware constraints

We propose to model the difficulties of heterogeneous computing as hardware constraints. A constraint may relate to the ISA, to memory or to acceleration. It is either mandatory, in which case it must be satisfied to use the target at all, or optional, in which case better behavior can be expected from the accelerator when it is satisfied. An accelerator is then described by a constraint diagram that helps the developer grasp the target.
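This constraint model can be sketched in a few lines of Python (names and constraint values below are hypothetical, chosen purely for illustration): a target is a set of constraints, each tagged with its kind and whether it is mandatory, and a kernel can only be mapped onto the target when every mandatory constraint is satisfied:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    """One hardware constraint, in the spirit of the feature diagrams of Chapter 2."""
    name: str
    kind: str        # "isa", "memory" or "acceleration"
    mandatory: bool  # mandatory constraints must hold to use the target at all

# Hypothetical description of a GPU-like accelerator (illustrative values).
gpu = [
    Constraint("no recursion in kernels", "isa", True),
    Constraint("explicit host/device transfers", "memory", True),
    Constraint("coalesced accesses", "memory", False),
    Constraint("massive data parallelism", "acceleration", False),
]

def usable(target, satisfied):
    """A kernel may run on the target iff every mandatory constraint holds;
    optional constraints only influence how fast it runs."""
    return all(c.name in satisfied for c in target if c.mandatory)

ok = usable(gpu, {"no recursion in kernels", "explicit host/device transfers"})
```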
2.4 Notes on the C language

Historically, compilers targeting architectures that depart from the von Neumann model have often been based on the C language, which leads us to consider its use for heterogeneous machines. In this document, we deal with high-level codes written in C99 that use variable-length multidimensional arrays, which degrades performance little compared with C89 code using pointers but significantly improves the quality of code analysis results.
2.5 Analysis of the OpenCL programming model

OpenCL is a standard offering a model and a C API for programming in a heterogeneous environment. The various hardware constraints we have mentioned are exposed there at the developer level. Parallelism exists in several forms: vector instructions, across kernels and across contexts. Some combinations of types and qualifiers are forbidden so as to cover a larger number of architectures, and the developer manually manages memory transfers, at several levels, using Direct Memory Accesses (DMAs). The code specific to an accelerator is derived at run time from a generic description, so each hardware vendor must provide its own, along with an implementation of the user API, in order to comply with the standard.
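The explicit transfer management described above can be modeled by a toy sketch (plain Python, deliberately not the real OpenCL API, whose calls and names differ): device memory is invisible to the host except through DMA-like read and write operations enqueued on a command queue:

```python
class DeviceBuffer:
    """Toy model of accelerator memory: only reachable via explicit transfers."""
    def __init__(self, size):
        self.data = [0] * size

class CommandQueue:
    """Toy model of a command queue counting DMAs, the costly operations
    a compiler for such a target tries to minimize."""
    def __init__(self):
        self.transfers = 0

    def write(self, dev, host):       # host -> device DMA
        dev.data[:] = host
        self.transfers += 1

    def read(self, dev):              # device -> host DMA
        self.transfers += 1
        return list(dev.data)

def run_kernel(dev):
    """Stand-in for a kernel launch: squares each element in device memory."""
    dev.data[:] = [x * x for x in dev.data]

q = CommandQueue()
buf = DeviceBuffer(4)
q.write(buf, [1, 2, 3, 4])   # explicit transfer, as the standard requires
run_kernel(buf)
result = q.read(buf)         # results must be copied back explicitly
```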
2.6 Other programming models

There is no standard other than OpenCL for programming heterogeneous machines. However, many approaches have been proposed for particular targets: compiling a subset of the C language, for instance for VHDL generation; extending an existing language with target-specific concepts, e.g. CUDA; or annotating sequential code with directives, e.g. Hybrid Multicore Parallel Programming (HMPP).
3. Designing Compilers for Heterogeneous Machines

Until the end of the 20th century, one kind of hardware architecture dominated the general-purpose machine market: the von Neumann model. Consequently, the main compilers were built to target machines implementing this model efficiently: a front end parses the input code and translates it into an intermediate representation, a middle end applies various optimizations, at the code block, loop, function or program level, and a back end generates target-specific assembly code. These three components have been widely studied and are well understood, as illustrated by the regular updates of the Dragon Book [ALSU06].
The growing complexity of compiler architecture itself, crystallized by the difficulty of adapting to heterogeneous targets, favors a more modular approach. Indeed, it seems reasonable to reuse existing compilers, applications and libraries that each perform one specific task well; combining them instead of reinventing the wheel should be possible. PLuTo [BBK+08] is a good example of such an approach: the tool combines a C front end, a polyhedral optimizer, a CUDA code generator and nVidia's CUDA compiler to generate code that runs efficiently on GPUs.
Application build schemes also get more complex when object files generated from different languages by different compilation chains must be assembled. For instance, the sloccount tool counts 210 Source Lines Of Code (SLOC) in common.mk, a generic Makefile shipped with the CUDA SDK. This makes code generation harder: one must not only generate code for different targets, but also find a way to link the results within a single application. This task, which can already be non-trivial in a homogeneous environment, can become very complex in a heterogeneous one.
Chapter 3 studies the impact of the heterogeneous machine model presented in Chapter 2 on compiler construction. It proposes combining a generic compilation infrastructure with a programmable pass manager to tackle this problem. The approach focuses on modularity, reusability, retargetability and the flexibility of the overall scheme.
First, Section 3.1 studies how well production compilation infrastructures fit the new target that heterogeneous machines represent, and proposes a model to support programmable pass management, a critical aspect of compiler modularity. Then, Section 3.2 argues in favor of source-to-source transformations, using the C source file as a communication medium between tools. Finally, Section 3.3 introduces a high-level API for building pass managers. It exposes enough abstraction of the compiler's internal behavior for developers to reason effectively at the pass level. The complete scheme is illustrated by the Pythonic PIPS (PyPS) interface developed on top of the PIPS compilation infrastructure. Related work is reviewed in Section 3.4.
3.1 Extending compilation infrastructures

Classical compilation infrastructures rely on a three-level architecture: one or more front ends, a common middle end, and one or more back ends. The diversity of accelerators found in a heterogeneous machine makes a single middle end impractical: one is needed per target, which in turn limits the possibility of sharing several back ends. The problems that arise are code reuse across middle ends, and the composition of transformations into complex schemes. We propose a model of code transformations that guarantees the validity of their composition and application. This model is used by the pass manager, the entity responsible for scheduling code transformations within the middle end. A class hierarchy is derived from this model, yielding an API usable by the pass manager.
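This pass model can be sketched in a few lines of Python. The sketch below is illustrative only, not the actual PIPS/PyPS class hierarchy: each pass declares the properties it requires and provides, and the pass manager refuses an invalid composition before running anything.

```python
# Illustrative sketch (not the real PIPS/PyPS API): passes declare the
# properties they require and provide; the pass manager checks that a
# schedule is valid before running it.
class Pass:
    name = "pass"
    requires = set()     # properties that must hold before the pass runs
    provides = set()     # properties guaranteed after the pass
    invalidates = set()  # properties destroyed by the pass

    def run(self, code):
        return code

class Privatize(Pass):
    name = "privatize"
    provides = {"privatized"}

class Parallelize(Pass):
    name = "parallelize"
    requires = {"privatized"}
    provides = {"parallel"}

def schedule(passes):
    """Return pass names in order, rejecting invalid compositions."""
    state, order = set(), []
    for p in passes:
        missing = p.requires - state
        if missing:
            raise ValueError(f"{p.name} missing {sorted(missing)}")
        state -= p.invalidates
        state |= p.provides
        order.append(p.name)
    return order
```

The point of the model is that validity is checked once, at the pass-manager level, instead of being re-implemented inside every transformation.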
3.2 On source-to-source compilers

A source-to-source compiler takes as input code written in a high-level language and produces code written in a high-level language. This approach is often used by parallelizing compilers, which are thereby spared the need to handle binary code generation. It is particularly relevant to heterogeneous computing, since target-specific source-to-binary compilers often exist but generate code of middling quality. A source-to-source compiler can interface with such tools, or with other source-to-source compilers, to generate better code. Put differently, it seems beneficial to use a variety of complementary tools to attack a variety of targets.
3.3 PyPS, a high-level API for pass management

We implemented the aforementioned API in the Python language on top of the PIPS source-to-source compilation infrastructure. Using a scripting language shortens development cycles and makes it possible to quickly prototype complex chains of code transformations while working solely at the pass-manager level. All the compilers assembled during this thesis are based on this implementation, named PyPS.
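The flavor of this approach can be conveyed with a toy, PyPS-like driver. The class and phase names below are illustrative inventions, not the real PyPS API: a workspace wraps the source files, and a whole compilation scheme is an ordinary Python function.

```python
# Toy, PyPS-like driver (illustrative only; phase names are invented).
# Transformations are method calls on a workspace, so a compilation
# scheme is just a Python script.
class Workspace:
    def __init__(self, *sources):
        self.sources = list(sources)
        self.log = []

    def apply(self, phase, **opts):
        self.log.append((phase, opts))  # a real driver would run the pass
        return self

def compile_for_openmp(ws):
    return (ws.apply("privatize_variables")
              .apply("coarse_grain_parallelization")
              .apply("ompify_code"))
```

Because schemes are plain Python, they can be composed, parameterized, and versioned like any other program, which is what makes rapid prototyping possible.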
3.4 Related work

The complex assembly of code transformations inside a compiler has motivated work demonstrating the limitations of the traditional approach in the context of iterative compilation, and work on producing modular, even programmable, pass managers. Other work focuses on the validity of pass composition and on the semantics that can be attached to it. Finally, some efforts address the problems raised by extending the internal representation to new targets while capitalizing on existing passes.
4. Representing an instruction set in the C language

In a talk given at the Fusion Developers Summit 2011 in Bellevue, Washington, Phil Rogers announced that

The Fusion System Architecture (FSA) is ISA-independent, for both GPUs and Central Processing Units (CPUs). This is a very important point
because we are inviting partners to join us in every sector; other hardware companies to implement FSA and to join us around this platform...

Under the hood, the Fusion System Architecture (FSA) relies on a virtual ISA. In Chapter 3, we stated that, to preserve a good level of abstraction, it is important to keep the internal representation hardware-independent. But is it possible to represent all the refinements of the target ISA in an internal representation close to the C language [ISO99]? According to Brian Kernighan [Ker03],

C is perhaps the best balance ever struck by a programming language between expressiveness and efficiency. (...) It was so close to the machine that you could see what the machine code would be (and it was not hard to write a good compiler), but it took care to stay above the instruction level, so that it could target every machine without special tricks for a particular machine.
This reminds us that the C language was designed to be close to the hardware. Thus, even if the chosen internal representation stays close to C, it can be low-level enough to express some of the characteristics specific to the target ISA. This aspect is examined in Section 4.1. We then detail the various aspects of an ISA and show that, provided certain conventions are respected and after suitable transformations, C code can be adapted to the constraints of a specific ISA. Section 4.2 examines basic types; Section 4.3 lists the different kinds of dedicated registers; Section 4.4 details the links between intrinsic functions and machine instructions; and Section 4.5 walks through the differences induced by the memory hierarchy. Problems tied to function-call boundaries are examined in Section 4.6, and calls to external libraries in Section 4.7.
4.1 C as a common denominator

A survey of the languages used by five compilers for accelerators, Handel-C, Mitrion-C, c2h, CUDA and OpenCL, shows that C dialects are often used as input languages by compilers targeting specialized hardware. Translating into intrinsic-free C the xmmintrin.h header file, which provides the intrinsics for the Streaming SIMD Extension (SSE) instruction set, shows that some hardware features can sometimes be represented directly in C.
4.2 Native data types

Some data types native to C may not be supported by the target compiler. In that case, it is sometimes possible to fall back to a supported type through a code transformation: using fixed-point arithmetic when floating point is not supported, splitting structures into as many variables as they have fields, or replacing fixed-size arrays by as many scalars.
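As a concrete illustration of the fixed-point fallback, here is a minimal sketch assuming a Q16.16 format (the choice of 16 fractional bits is an arbitrary example):

```python
# Q16.16 fixed-point sketch: the representation a transformation could
# substitute for 'float' on a target without floating-point support.
Q = 16  # number of fractional bits (assumption: Q16.16 format)

def to_fixed(x: float) -> int:
    """Encode a real value as a fixed-point integer."""
    return int(round(x * (1 << Q)))

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the product must be rescaled."""
    return (a * b) >> Q

def to_float(f: int) -> float:
    """Decode a fixed-point integer back to a real value."""
    return f / (1 << Q)
```

Addition and subtraction need no rescaling; only multiplication and division do, which is why they are the operations a transformation must rewrite.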
4.3 Registers

The C language makes it possible to suggest placing a variable in a register via the register keyword. If the target machine has dedicated registers, this keyword can be repurposed, together with a dedicated naming convention, to assign particular variables to particular registers while staying at the C level.
4.4 Instructions

It is common for a given architecture to provide specific instructions that have no direct C equivalent: vector instructions, atomic operations, Fused Multiply-Add (FMA) or DMA are classic examples. The traditional approach is to represent such instructions as intrinsic functions: functions with a valid C signature that the compiler treats specially in order to emit the appropriate assembly instruction.
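A retargetable generator can keep generic intrinsic names in its internal representation and substitute the ISA-specific spelling only at code-generation time. In the sketch below, SIMD_ADD_PS and SIMD_MUL_PS are invented generic names; the names they map to are the actual SSE, AVX and NEON intrinsics.

```python
# Lowering table from invented generic intrinsics (left) to real
# ISA-specific intrinsics (right: SSE, AVX, NEON).
INTRINSICS = {
    "SIMD_ADD_PS": {"sse": "_mm_add_ps",
                    "avx": "_mm256_add_ps",
                    "neon": "vaddq_f32"},
    "SIMD_MUL_PS": {"sse": "_mm_mul_ps",
                    "avx": "_mm256_mul_ps",
                    "neon": "vmulq_f32"},
}

def lower(call: str, isa: str) -> str:
    """Rewrite a generic intrinsic call into its ISA-specific form."""
    name, args = call.split("(", 1)
    return INTRINSICS[name][isa] + "(" + args
```

Only the table is target-specific; everything upstream of it can be shared across instruction sets.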
4.5 Memory architecture

Heterogeneous computing involves several memory spaces that must communicate with one another. The C language provides no way to distinguish a variable declared in one memory space from a variable declared in another. Here again, suitable naming conventions make it possible to leave the internal representation unchanged.
4.6 Function calls

Some architectures offer no low-level mechanism for performing a function call. In that case, procedure inlining can be applied. Conversely, function calls can be used to model a call to an accelerator. One then needs to select a portion of code and extract it into a new function that will represent the call. This transformation relies on memory-effect analysis to limit the number of parameters of the new function and to restrict the use of pointers, which simulate pass-by-reference, to the cases where they are needed.
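A deliberately naive version of this outlining step can be sketched as follows, with the read and written variable sets given as inputs. The real transformation computes them from convex array regions and rewrites the internal representation rather than raw text.

```python
# Toy outliner: read-only variables are passed by value, written ones by
# pointer. The textual substitution below is only safe for this toy
# example; a real pass works on the internal representation.
def outline(name, stmt, read, written):
    params = [f"int *{v}" if v in written else f"int {v}"
              for v in sorted(read | written)]
    body = stmt
    for v in sorted(written):
        body = body.replace(v, f"(*{v})")
    fn = f"void {name}({', '.join(params)}) {{ {body} }}"
    args = ", ".join(f"&{v}" if v in written else v
                     for v in sorted(read | written))
    call = f"{name}({args});"
    return fn, call
```

Limiting the pointer parameters to the written variables is exactly what the memory-effect analysis buys: fewer aliases for later passes to reason about.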
4.7 Calls to external libraries

In the context of interprocedural analyses, one may need to know the behavior of functions for which no complete implementation is available. The function provider splits this problem into parts. It is a component able to supply several versions of a single function: one that reproduces the function's memory effects, intended for the compiler; and one per accelerator, corresponding to its actual implementation. Target-specific library calls can thus be abstracted away.
5. Exploitation du parallélisme avec des instructions multimedia<br />
Une fonctionnalité récurrente des accélérateurs matériels est l’utilisation qu’ils font du<br />
parallélisme. Ce parallélisme peut se trouver à plusieurs niveaux, et prendre plusieurs<br />
<strong>for</strong>mes, généralement un mélange de parallélisme de type simd et de parallélisme de<br />
type mimd, comme c’est le cas pour les gpgpu. Ces deux types de parallélisme ont<br />
été longuement étudiés, en se concentrant sur la parallélisation de boucle : recherche<br />
d’hyperplans [Lam74], gestion des dépendances de contrôle [AKPW83], vec<strong>to</strong>risation<br />
de boucles [AK87], extraction de parallélisme [WL91b], partitionnement en supernœuds<br />
[IT88], optimisation des communications [DUSsH93], interactions avec la mémoire<br />
cache [KK92], pavage [DSV96, AR97, YRR + 10]. David F. Bacon et al. ont passé en revue<br />
[BGS94] les trans<strong>for</strong>mations de code pour le High Per<strong>for</strong>mance Computing (hpc),<br />
trans<strong>for</strong>mations qui concernent majoritairement les boucles. Vivek Sarkar a étudié l<br />
sélection au<strong>to</strong>matique de trans<strong>for</strong>mations en se basant sur un modèle de coût [Sar97].<br />
Toutes ces techniques ont été appliquées avec succès dans des compilateurs de recherche tels<br />
SUIF [WFW + 94], Polaris [PEH + 93], pips [IJT91, AAC + 11], Rose [Qui00] ou Pocc [PBB10],<br />
et dans des compilateurs utilisés en production comme IBM XL [Sar97], Low Level Virtual<br />
Machine (llvm) [GZA + 11], gnu C Compiler (gcc) [TCE + 10], Intel C++ Compiler<br />
(icc) [DKK + 99], pgi [Wol10].<br />
In this chapter, we focus on two aspects of code parallelization: Instruction Level Parallelism (ILP) and the parallelization of reductions. The former seeks to exploit instruction-level parallelism, as found in code sequences or inside loops, using the Multimedia Instruction Sets (MIS) available on most modern processors. This aspect is described in Section 5.1. The latter is a critical issue when code involving a reduction must run on purely SIMD hardware; it is addressed in Section 5.2. Section 5.3 proposes a simple model, based on the parallelism found on distributed-memory accelerators, for deciding whether offloading a computation is profitable.
5.1 Instruction-level parallelization

Vector instructions can be generated in two ways: at the loop level using vectorization techniques [BGS94, Bik04], or at the sequence level using pattern-detection algorithms [LA00, SHC05]. We propose a hybrid algorithm able to exploit the parallelism present both in loops and in sequences. The idea is to apply high-level loop transformations such as tiling, then unroll the inner loops so as to expose sequences on which pattern detection can be applied. The algorithm operates on a generic instruction set [Roj04] characterized by the size of its vector registers, and can therefore target all the instruction sets it subsumes.

A recurring problem in vector-instruction generation is that of memory transfers. The proposed pattern-detection algorithm maintains a complete state of the contents of the vector registers and uses this knowledge to limit the number of transfers from main memory, replacing them with copy or shuffle operations between vector registers.
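The unroll-then-pack step can be illustrated on the simplest possible pattern: four isomorphic statements produced by unrolling, merged into one generic 4-wide operation (SIMD_ADD_PS is an invented generic intrinsic name, not taken from the thesis).

```python
import re

# Toy pattern detector: four isomorphic, consecutive scalar statements
# (as produced by unrolling) are packed into one generic vector op.
def pack4(stmts):
    pat = re.compile(r"(\w+)\[(\d+)\] = (\w+)\[\2\] \+ (\w+)\[\2\];")
    ms = [pat.fullmatch(s) for s in stmts]
    if len(ms) != 4 or not all(ms):
        return stmts                      # not a recognizable pattern
    arrays = {m.group(1, 3, 4) for m in ms}
    base = int(ms[0].group(2))
    idx = [int(m.group(2)) for m in ms]
    if len(arrays) != 1 or idx != list(range(base, base + 4)):
        return stmts                      # not isomorphic / not contiguous
    c, a, b = ms[0].group(1, 3, 4)
    return [f"SIMD_ADD_PS(&{c}[{base}], &{a}[{base}], &{b}[{base}]);"]
```

The real algorithm matches a full generic ISA and tracks register contents across statements; this toy shows only the recognition of one isomorphic group.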
5.2 Parallelizing reductions

From the parallelism standpoint, reductions are a bottleneck. Several well-known techniques exist for extracting parallelism from loops that carry a reduction [KRS90, Lei92]. These techniques have been extended to instruction sequences, independently of any enclosing loop. The idea is to identify each reduction variable in the sequence [JD89], and to create an array of the appropriate size so as to postpone the reduction to the end of the sequence.

Whether at loop exit or at sequence exit, parallelizing a reduction involves a postlude that is not parallel. Dedicated hardware mechanisms may exist to avoid performing this reduction sequentially. To abstract this notion, handling of the postlude is delegated to a third-party library for which we provide a sequential implementation, which can be replaced by a hardware implementation when one is available.
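On a simple sum, the scheme reads as follows; the postlude is isolated in its own function, standing in for the third-party library that a hardware primitive could replace.

```python
# Sketch of reduction parallelization on a sum: one accumulator per
# SIMD lane, plus a sequential postlude kept behind a library boundary.
def parallel_sum(xs, width=4):
    acc = [0] * width                 # accumulator array, one per lane
    for i, x in enumerate(xs):
        acc[i % width] += x           # lanes are independent (parallel)
    return reduction_postlude(acc)    # postlude: replaceable by hardware

def reduction_postlude(acc):
    """Sequential postlude implementation."""
    total = 0
    for a in acc:
        total += a
    return total
```

Keeping the postlude behind a function boundary is what lets a target-specific implementation be swapped in without touching the parallelized code.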
5.3 Estimating computational intensity

Parallelizing a loop is not always beneficial. This is especially true for accelerators with a distributed memory, because of memory-transfer times. By summing the volumes [Cla96] of the convex array regions read and written by a statement, an upper bound on the statement's memory footprint can be obtained. Likewise, for static control code, the number of instructions executed by that code can be estimated. From this estimate and the memory footprint, the ratio between computation and communication can be estimated, yielding a local, conservative decision criterion: a loop must not be parallelized if the order of magnitude of the memory transfers is not lower than the order of magnitude of the number of executed instructions.
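A toy version of this decision criterion can be written directly in terms of orders of magnitude; the one-decade threshold below is an arbitrary illustration, not a value prescribed by the criterion.

```python
import math

# Toy offloading criterion: offload only when the operation count
# dominates the transfer volume by at least `margin` orders of magnitude
# (margin=1 is an assumed, illustrative threshold).
def should_offload(ops: int, transferred_bytes: int, margin: int = 1) -> bool:
    if transferred_bytes == 0:
        return ops > 0
    return math.log10(ops) - math.log10(transferred_bytes) >= margin

# Example: an N x N matrix product moves O(N^2) data for O(N^3)
# operations, so it passes the test for large N; a vector copy does not.
```

This matches the conservative intent of the criterion: when in doubt (ratios within the same order of magnitude), keep the computation on the host.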
6. Transformations for memory size and distributed memories

Wm. A. Wulf and Sally A. McKee concluded their 1995 paper [WM95] "Hitting the Memory Wall: Implications of the Obvious" with the following sentence:

The most "convenient" resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology (...).

Fifteen years later, no such technology exists yet, and memory issues remain a critical problem for many parallel applications. In the context of heterogeneous computing, where host memory and accelerator memory are often separate, it is important to manage this hardware constraint with care. To that end, we propose three generic transformations: statement isolation, which separates the accelerator's memory space from the host's, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest such that the inner loops fit in the target memory, presented in Section 6.2; and redundant transfer elimination, presented in Section 6.3.
6.1 Statement isolation

Statement isolation is a transformation that isolates a given statement in a new memory space emulated by newly allocated variables. The idea is to take all the variables referenced by the statement and replace them with new variables of the same type. A transfer-generation step then emits copies from the old variables to the new ones, and back, so as to guarantee the consistency of the values read by the statement.

This transformation involves two kinds of optimization. Based on the array regions read and written by the statement, it can place the arrays referenced by the statement into smaller arrays, thereby limiting the size of the transfers. Based on the array regions produced and consumed by the statement, it can generate copies only when they are useful, thereby limiting the number of transfers.
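A schematic version of the generated code can be produced as follows, assuming region analysis has already reduced an array access to the interval [lo, hi); the helper is a toy illustration, not the actual transformation.

```python
# Toy statement isolator: given the sub-range of an array actually
# touched by a statement, emit the smaller replacement array and the
# in/out copies (only when the statement reads, resp. writes, it).
def isolate(array, lo, hi, reads, writes):
    n = hi - lo
    code = [f"int {array}_iso[{n}];"]
    if reads:   # copy-in only if the statement consumes old values
        code.append(f"memcpy({array}_iso, {array} + {lo}, {n} * sizeof(int));")
    code.append(f"/* statement now uses {array}_iso[i - {lo}] */")
    if writes:  # copy-out only if the statement produces new values
        code.append(f"memcpy({array} + {lo}, {array}_iso, {n} * sizeof(int));")
    return code
```

The two `if` tests are the toy counterparts of the two optimizations above: the region bounds shrink the transfer size, and the read/write flags suppress useless copies.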
6.2 Memory footprint reduction

Memory footprint reduction amounts to searching for tiling parameters that guarantee that the memory volume needed to compute one tile stays below a given bound. To this end, a parametric rectangular tiling of the loop nest under consideration is first performed. Once the nest is tiled, an upper bound on the memory footprint of the per-tile computation is computed from the volume of the convex regions read and written. This yields an expression in the tiling parameters that we seek to maximize. The parameters found are then fixed, returning to a static tiling that guarantees the memory capacity constraint is met.
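Under a deliberately crude cost model (a fixed number of arrays, each contributing one element per tile point), the parameter search degenerates into the following one-dimensional sketch:

```python
# Toy footprint solver for a 1-D tile of size t: under the assumed cost
# model, footprint(t) = arrays * t * elem_size bytes. We return the
# largest power-of-two t whose footprint fits the memory budget.
def tile_size(budget_bytes, elem_size=4, arrays=2):
    t = 1
    while arrays * (t * 2) * elem_size <= budget_bytes:
        t *= 2
    return t
```

The real transformation maximizes a symbolic footprint expression over several tile dimensions; this sketch only shows the "largest tile that fits" principle on one parameter.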
6.3 Redundant transfer elimination

This transformation extends the redundant load and store elimination known for scalar registers. It proposes to treat each array as a register, and each memory transfer generated by statement isolation as a register assignment. Based on this formalism, the redundant-transfer-elimination algorithm hoists memory transfers as high as possible in the internal representation, as long as the motion satisfies Bernstein's conditions [Ber66], and uses elimination rules to merge redundant accesses. This traversal of the internal representation can be intra- or interprocedural, depending on the characteristics of the target.
7. Compiler implementations and experiments

This thesis presents and describes a methodology for specializing compilers for different heterogeneous platforms, based on a well-stocked toolbox of source-to-source transformations, an API for programmable pass managers, and a simple description of the hardware. It would not be complete without experimental validation. The proposed methodology claims to make compiler assembly easier. To validate it, we chose five different targets: three general-purpose CPUs with different vector units, an FPGA-based processor [BLE+08] specialized in image processing, and an nVidia GPU. For each of them, we developed a prototype compiler using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these prototype compilers is measured using benchmarks or applications from the relevant domain.
7.1 A naive OpenMP compiler

This chapter starts with a simple Open Multi Processing (OpenMP) directive generator, presented in Section 7.1, to show how the principles discussed in this thesis apply to a real, practical case. The GPU compiler implemented by HPC Project on top of our work is detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from C to terasm, the assembly language used by the Terapix image-processing processor. Finally, a retargetable compiler for multimedia instruction sets is described in Section 7.4 for three targets: SSE, Advanced Vector eXtensions (AVX) and NEON.

For OpenMP directive generation, the available hardware is first characterized as using shared memory and MIMD-style parallelism, which identifies the code transformations to use, mainly parallelism extraction and detection. There is no post-processing step, since OpenMP directives are an integral part of the internal representation. The prototype compiler assembled this way is validated on the polybench benchmark suite.
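The final directive-generation step can be caricatured as follows, with the parallelism analysis replaced by an explicit set of loop positions (a toy stand-in for the detection passes, not the actual generator):

```python
# Toy OpenMP directive generator: loops that the analysis marked as
# parallel (given here as line indices) receive a directive; all other
# lines pass through untouched.
def ompify(lines, parallel_loops):
    out = []
    for i, line in enumerate(lines):
        if i in parallel_loops:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "#pragma omp parallel for")
        out.append(line)
    return out
```

Because OpenMP is expressed as pragmas in the C source itself, this is the whole back end: no separate code-generation step is needed, which is why this compiler makes a good first demonstrator.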
7.2 A compiler for GPUs

Code generation for GPUs is richer: the shared memory must be taken into account. Only SIMD-style parallelism is considered, ignoring the MIMD capabilities of the hardware. The statement isolation and redundant transfer elimination transformations are used. The compilation scheme is also more complex, since it involves a step that separates host code from accelerator code, a step that converts C into the CUDA language used by the accelerator's source-to-binary compiler, and the final generation of the accelerator binary by the latter, in addition to code generation for the host. The pass manager is used to combine these different steps, with procedure extraction generating as many compilation units as needed. The compiler assembled from these building blocks is validated on several signal-processing kernels, on which average speedups of ×25 are obtained on large data sets.
7.3 A compiler for an FPGA-based image processor

The Terapix accelerator, dedicated to image processing, poses an additional challenge: the accelerator's memory cannot hold enough data to process an image in one pass. Moreover, its ISA is very specific, in particular through its use of a Very Long Instruction Word (VLIW) instruction set. The memory footprint reduction transformation overcomes the size-related limitations. The compilation scheme adds one pass compared to GPUs: the translation of a sequential instruction stream into a VLIW instruction stream. This step is handled by a third-party tool, so the generated code must be formatted to satisfy its input constraints, using low-level transformations such as array linearization, iterator detection, or the conversion of for loops into while loops. The complete compilation chain automatically translates image-processing kernels written in C into Terapix assembly, yielding kernels whose performance is close to a hand-optimized version (cycle counts around 125% of the optimum).
7.4 A retargetable compiler for vector instruction sets

Code generation for general-purpose processors with a small vector unit, e.g. SSE-style, poses a different challenge: the memory constraints are weaker, though similar to the previous cases. Here, the key to performance is the automatic extraction of SIMD-style parallelism with short vectors. The hybrid vectorization algorithm presented in Chapter 5 lifts these constraints, and combines with redundant transfer elimination to limit the number of transfers. Since the transformations involved are generic, the compiler assembled this way is target-independent (apart from the vector register size) and is easily retargeted from one instruction set to another. In practice, this compiler generates more efficient code than GCC and slightly less efficient code than ICC, but supports more architectures than ICC, for instance ARM v7 and the NEON instruction set.
8. Conclusion

The quest for performance now goes through heterogeneous machines: even the laptop used to write this thesis can draw on the computing power of two general-purpose processors, their two associated SSE vector units, and a GPGPU. The main problem with these computing units is the difficulty of programming them. In this thesis, we chose the compilation path to automate code production for hardware accelerators. We focused on the ability to quickly produce several compilers for different targets. Since modern hardware is usually already programmable in some C dialect, we set ourselves the goal of automatically translating standard algorithms written in C into various compute kernels written in C dialects, and of generating the code that calls an accelerator from the host processor. The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the next. This reduces the cost of producing compilers, and using a source-to-source compilation infrastructure makes it possible to interact with existing tools, in particular with the compilers that generate dedicated binary code from C dialects.
8.1 Contributions

A methodology for building source-to-source compilers

We proposed modeling hardware accelerators with hardware constraint diagrams. These diagrams identify the optional and mandatory constraints associated with the hardware. Manually mapping these constraints to code transformations guides the compiler developer through the development process.
Design of a generic compilation infrastructure

The heterogeneity of hardware accelerators makes it difficult to build a single compiler able to target them all. Yet many applications already exist that address some of the problems these machines raise. In Chapter 3, we proposed a compilation scheme that combines a toolbox of source-to-source code transformations, an API for pass management, and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by HPC Project.
Transformations for ISA constraints

Hardware accelerators owe their speed to their specialization: they are more efficient over a narrower application domain. The direct consequence is a specialized ISA. This specialization shows up in the C dialects offered to program these accelerators. Chapter 4 proposes a set of source-to-source transformations for refining high-level C code toward lower-level code. This set notably includes an original outlining algorithm based on convex array regions.
A hybrid SLP algorithm

Multimedia instruction sets are now found on all general-purpose processors, and even on hybrid CPU/GPU chips. We developed an original algorithm building on the state of the art in loop and sequence vectorization. This algorithm unifies the two approaches and is parameterized by a C-level description of the ISA. It thus meets the retargetability criteria stated in Chapter 3. The algorithm was validated on three Multimedia Instruction Set families: SSE, AVX and NEON. This work was awarded the third best poster prize at PACT 2011.
Transformations for memory constraints

Memory issues are critical for many heterogeneous systems: when an accelerator does not share a memory space with its host, RPCs and DMAs are required. The programming model is far more complex than the classical ones. In Chapter 6, we presented three code transformations that take these aspects into account: statement isolation separates the accelerator's memory space from the host's; memory footprint reduction finds the tiling matrix guaranteeing that the accelerator has enough memory to execute the application tile by tile; and redundant transfer elimination removes useless data movements.
Implementation

All the transformations presented in this thesis have been developed for the C language in the PIPS source-to-source compilation infrastructure, and assembled using the PyPS pass manager. They led to the implementation of four compilers: a prototype OpenMP directive generator, a retargetable compiler for vector instruction sets, a microcode generator for Terapix, an FPGA-based processor dedicated to image processing, and a GPU code generator developed by HPC Project. This validates both the compilation infrastructure as a whole and the algorithms proposed in this manuscript. The experiments and the target-specific compilation flows are detailed in Chapter 7.
Contributions to the PIPS Community

It is difficult to separate research from development in a computer science thesis. Integrating new transformations into the chosen infrastructure and extending to the C language passes originally designed for Fortran are indispensable activities to support the research work, but they require a significant time investment. As a member of the PIPS team, I took charge of modernizing the project's compilation infrastructure and rationalized its distribution as packages.

I supervised five Télécom Bretagne students during internships around the PIPS project, and contributed to the scientific outreach of the tool through two tutorials given at international conferences.
8.2 Future Work

The HPC world is in constant evolution. A Sparc64-based supercomputer topped the June 2011 Top500, whereas nVidia GPUs were leading the pack six months earlier. In this shifting environment, nothing is settled yet, and hardware vendors keep pushing their standards to obtain a common programming model together with efficient engineering tools. This requires cooperation and numerous interactions between tools. In this context, bridging the gap between the OpenCL standard and existing VHDL generators is an interesting challenge and a still-open research topic.

However, HPC remains a niche market compared to embedded systems and smartphones. In these domains, the hardware constraints are even stronger: power consumption, weight, volume, etc. The code transformations and the approach studied in this thesis can certainly find applications there.

As multimedia instruction sets become more and more flexible, it is increasingly common to find non-SIMD instructions in them. These instructions allow more elaborate load/store patterns (e.g. non-contiguous accesses) and make it possible to reach better performance for applications bound by their memory accesses. Incrementally adding to our compiler for multimedia instruction sets code transformations able to generate these instructions is a promising topic.

We see two possible extensions of our work on pass managers. First, the operator combination we described leads to the construction of a directed graph that exposes coarse-grained parallelization opportunities at the pass-manager level; exploiting them would improve compilation times by processing passes in parallel. Second, it appears that some pass combinations are useless or redundant. Giving code transformations a precise semantics would make it possible to eliminate such call sequences, for instance in the context of iterative compilation.
Chapter 1
Introduction

Pont de l'Iroise, Brest, Finistère © lazzarello / flickr
Moore's law [Moo65] was invalidated five years ago, when transistors became so small that silicon could no longer dissipate the energy released by processor activity at maximum switching speeds. William J. Dally calls this "The End of Denial Architecture" [Dal09]. To overcome thermal limitations, chip makers increased the number of cores on each die, producing multicore processors, an entirely new direction of development. As transistors have continued to shrink, multiple cores have provided additional computing power by putting more in the same die area with no increase in clock frequency. A second power wall is forecast when transistors become so densely packed that silicon cannot conduct enough heat even at current, fixed clock rates. To continue improving application performance as this second wall approaches, co-processors have emerged and paved the way to heterogeneous computing, where several computation units of different types collaborate. There are many types of co-processors, from specialized ones, which are highly efficient at certain tasks, to more general ones. All share the goals of low execution time and low power consumption. In this manner, two general paths to high performance have arisen: homogeneous many-core machines and heterogeneous machines featuring various accelerator technologies. Of course, creative processor architects have produced exceptions to prove this rule. Intel's Larrabee [SCS+08] processor, for example, is a multicore design with a vector arithmetic unit on-chip. The emerging design space is complex and likely to change continuously.
Beginning some thirty years ago, engineers built a large variety of parallel machines. The Connection Machine and the MasPar MP2 are notable examples. At that time, supercomputers like the Cray-1 cost $58 M and could perform an average of 80 Mflops. By 2006, the PlayStation 3 (PS3) video game console based on the Cell BE architecture cost $500 and could achieve 230 GFLOPS (FLoating point Operations per Second) thanks to a combination of general-purpose and specialized processors. Fewer than a hundred Crays were built, while more than 50 million PS3 consoles were produced. This change of market inevitably created a need for better development tools. By 2006, parallelism was no longer the concern of specialists alone, as it had been in the Cray era.
More can be learnt from the PS3 story: although the PS3 was launched by Sony one year after Microsoft's Xbox 360, the latter had greater success, in great part due to a larger game catalog. It turned out that many game development studios found development for the PS3 and its heterogeneous architecture too difficult. Indeed, the Cell BE architecture involved separate memory spaces, manual control of data transfers, manual handling of the 128-entry vector register file in a vector-only way, and manual cache management. This complexity proved too much for average developers to master.
There are, and probably always will be, three ways to handle such hardware complexity: hire an expert engineer who has a comprehensive understanding of the machine; develop a specialized library that exposes the hardware capabilities through an Application Programming Interface (API); or build a compiler that translates high-level code into the target machine language.
The first option is versatile but costly. The second is efficient but lacks flexibility: drawing maximum performance from complex hardware using only a limited number of API calls may be impossible. The last approach combines the advantages of the first two, provided the needed compiler can be written at reasonable cost. nVidia, for example, has successfully adopted the compilation approach for its General Purpose GPU (GPGPU) technology. Most nVidia GPGPU programmers use an extension of C/C++, the Compute Unified Device Architecture (CUDA) language, and the nVidia compiler nvcc.
Indeed, heterogeneous computing places entirely new demands on the compilation process. Quoting the dragon book [ALSU06],

Definition 1.1. A compiler is a program that can read a program in one language—the source language—and translate it into an equivalent program in another language—the target language.
There is obviously no mention that a compiler may need to transform its one or more inputs into several outputs, one for each accelerator processor in a heterogeneous system, each in its own language. Previous architectures simply did not require this complex behavior. In the common case where one input language results in several target programs, the job is harder: the compiler must internally model the performance of the complete system in order to make good optimization decisions.
The most difficult case occurs when the source code is in a legacy language with no explicitly parallel constructs. Compilers failed in the 1980s because of the difficulty of extracting parallelism from sequential codes. Automatic parallelization is impossible in the general case, because parallel algorithms are completely different from sequential algorithms and a compiler cannot create new algorithms. Even in cases where parallelism detection is within the scope of an automated tool, it is made difficult when the code is obfuscated by unconventional coding methods originally intended to optimize performance on earlier machines and compilers. As parallelism becomes ubiquitous, the way algorithms are designed will also evolve and will take parallel aspects into account. For that reason, it seems reasonable to focus on the compilation of explicitly parallel programs for heterogeneous targets, not on the automatic extraction of parallelism.
Heterogeneous environments sharpen the analogy given in [ABC+06], which depicts software as a bridge 1 between hardware and applications. If software is a bridge, then the compiler is the bridge-builder, responsible for making both ends meet. Many blocks for building such bridges in heterogeneous environments can be found in thirty-year-old literature, but not all. In [Pat10], David Patterson summarizes many aspects of research in parallel computing since the 1960s. As one might expect, his conclusion is that there is no free lunch. Although there have been success stories, there exist no global solutions to the problem of parallel computing, only local solutions to local problems. It follows that there is no good reason to believe that building a compiler for any heterogeneous device will be a straightforward task.
There are glimpses of hope, however. Local solutions for particular devices can be viewed as building blocks. If these are configured for easy reuse, then creating compilers for new target devices will be simplified. The idea of building compilers by composing basic blocks to suit a particular heterogeneous system is the core of this dissertation. Many interesting questions follow. What are the building blocks relevant to heterogeneous computing? How are building blocks to be chained depending on the target? Is it possible to embody all possible hardware specifications in a single Internal Representation (IR)? Ultimately, is there a standard methodology to build compilers for heterogeneous devices?
In a survey of hardware-software co-design techniques published in 1994 [Wol94], Wayne H. Wolfe stated that

To be able to continue to make use of the ever-higher performance CPUs made possible by Moore's Law (. . . ), we must develop new design methodologies and algorithms which allow designers to predict implementation costs, incrementally refine a design over multiple levels of abstraction, and create a working first implementation.
Seventeen years later, Wolfe's challenge is largely unmet. Neither programmers' expertise nor the source code they produce has evolved as fast as hardware architectures. As a result, applications have not greatly benefited from recent alternative hardware designs, except where intense efforts could be dedicated to them. The sheer amount of legacy code does not

1. Inspired by the cover of Communications of the ACM, Vol. 52 No. 10, we illustrate each chapter of this thesis with photos of classical bridges in Brittany.
admit economically feasible solutions other than compilers, which appear to be the only way to bridge the hardware-software gap. The advice Wolfe gives to hardware designers still holds for compiler designers, who face the tremendous task of porting legacy codes, implementing sequential algorithms written in a sequential language, to ever-changing parallel co-processors: incrementally refine a design, use multiple levels of abstraction and create a working first implementation. These tasks are made more difficult by tools based on traditional compilation flows targeting a single language per tool, which are ill-suited to heterogeneous platforms.
This dissertation adopts Wolfe's three steps to assemble compilers for various heterogeneous platforms, ranging from classical multicores to a Field Programmable Gate Array (FPGA)-based image processor, via a Graphical Processing Unit (GPU) and small vector units. Our objective is not to build the best possible compiler for each target, but to realize a reasonable compilation scheme for each while reusing as many building blocks as possible.
To achieve this goal, we begin with a study of heterogeneous devices and existing programming paradigms in Chapter 2. We note three families of constraints resulting from corresponding dimensions of heterogeneity: the Instruction Set Architecture (ISA), the memory architecture, and the source of acceleration. Chapter 3 shows that traditional compiler frameworks lack the flexibility needed to compose fine-grained code transformations, which are needed to overcome these constraints, and we propose a model to solve this problem. Chapter 4, Chapter 5 and Chapter 6 further examine the three families of constraints and detail innovative target-independent transformations that form the building blocks of our approach.
This dissertation is not solely a theoretical work. Several compilers have been realized with our ideas. Their design and implementation, along with performance benchmarks, are presented in Chapter 7. Chapter 8 summarizes the contributions of this thesis and draws final conclusions.
All the transformations and compilers described in this document have been implemented using the Paralléliseur Interprocédural de Programmes Scientifiques (PIPS) compiler infrastructure developed by the Centre de Recherche en Informatique (CRI) of MINES ParisTech, with contributions from the HPC Project startup, Télécom SudParis and Télécom Bretagne. A quick review of this project is given in Appendix A.

Most of the ideas developed in this thesis are already integrated in the Par4All tool by HPC Project.
Chapter 2
Heterogeneous Computing Paradigm

Pont Fleuri, Quimperlé, Finistère © Jean Louis Lemoigne
On February 15, 2007, nVidia introduced a way to program general purpose applications on Graphical Processing Units (GPUs), a class of hardware previously confined to the manipulation of computer graphics, through the Compute Unified Device Architecture (CUDA) language and its associated Software Development Kit (SDK). Since then, General Purpose GPUs (GPGPUs) have dug their way into physical simulations, video file conversions and cryptography, and many heterogeneous devices have appeared as efficient solutions to perform specific computations. Many firms now propose dedicated hardware, such as Field Programmable Gate Arrays (FPGAs) from Xilinx, Systems on Chip (SoCs) or Multi-Processor Systems-on-Chip (MP-SoCs) from Texas Instruments, and micro-controllers from Atmel.
In November 2010, the Tianhe-1A supercomputer, a cluster powered by more than 7,000 Tesla cards, ranked first in the Top500. 1 Since then, heterogeneous computing has been assessed as a viable alternative to multicore computing for scientific computations.

1. As of June 2011, the second, third and fifth systems of the Top500 use nVidia GPUs.
However, heterogeneous computing is rather different from homogeneous computing, particularly with respect to three critical aspects: memory, parallelism and instruction sets. This leads to a complicated interleaving of concepts specific to each target, yet independent from the high-level organization of the source code. An interesting illustration of this complexity was given at the Supercomputing 2010 conference, where panelists discussed the three P's of heterogeneous computing.
The first P stands for Performance: unlike general purpose processors, which must deliver average performance on any application, the goal of hardware accelerators is to deliver high peak performance on specific applications. They achieve this goal through extensive use of parallelism, either Single Instruction stream, Multiple Data stream (SIMD), Multiple Instruction stream, Multiple Data stream (MIMD) or pipelining, or through the use of hard-coded, optimized routines.
The second P stands for Power: power consumption plays a key role both in embedded systems, to maximize usage time between two recharges, and in supercomputers, to minimize electricity costs and building size. Hardware accelerators are relevant candidates to improve the FLoating point Operations per Second (FLOPS) per Watt metric, e.g. because less power is spent in non-computational operations.
The third P stands for Programmability, the major weakness of hardware accelerators. Indeed, programming a hardware accelerator implies a paradigm shift from sequential programming: to circuit programming via the VHSIC Hardware Description Language (VHDL) [TM08] for FPGA boards; to 3D programming, e.g. via the Open Graphics Library (OpenGL) or DirectX, for GPUs; or to parallel programming, e.g. via CUDA or the Open Computing Language (OpenCL), for GPGPUs. Moreover, the basic execution model generally introduces the additional complexity of remote memory management, a time-consuming task that developers are not used to.
This chapter presents the heterogeneous computing model in detail in Section 2.1 and its consequences for the programming model in Section 2.2. The concept of hardware constraints is introduced in Section 2.3 as a way to model the interaction between the hardware and a hypothetical compiler, and Section 2.4 discusses the relevance of using the C language to program specific hardware components. As an illustration of a standardized programming model, we analyze OpenCL's in Section 2.5. Other programming models are presented in Section 2.6.
2.1 Heterogeneous Computing Model

The OpenCL specification [KOWG10] illustrates heterogeneous computer organization with Figure 2.1, which shows that the main characteristic of a heterogeneous device is the presence of several computational units with different capabilities.

Figure 2.1: Heterogeneous computing model.

The key to performance is to use these different capabilities to the best of their capacity in order to achieve efficient computations, because each device is specialized in one kind of computation: e.g. a GPGPU is well suited for intensive, regular computations, while a General Purpose Processor (GPP) performs better on irregular, unpredictable algorithms. A dedicated FPGA board performs better on the particular streaming signal processing computation it was designed to handle. Even more specialized computing units, Application-Specific Integrated Circuits (ASICs), are also used to match very specific designs 2. Similarly, some hardware accelerators provide functionalities dedicated to a specific task: the ClearSpeed Advance board [GG] accelerates the Intel Math Kernel Library (MKL), and FPGA-based chips have been used to speed up Photoshop™ image processing [Sin98]. Ten years later, the Cray XD1 combined AMD Opterons with Xilinx FPGAs to speed up the Smith-Waterman algorithm [Sto08]. The same approach is used in Intel's Stellarton, which pairs an Atom E600 with an Altera FPGA to strike a balance between performance and flexibility, showing the mainstream interest in such platforms.
The devices enumerated above do not collaborate in a purely decentralized way. A host device, generally a GPP, is in charge of scheduling the computational tasks among the hardware accelerators, in a master-slave fashion. In many cases, each device has its own memory and has to communicate with the others in order to share computational tasks: this is a strong paradigm shift from sequential programming to distributed computing, and it emphasizes the first difficulty of heterogeneous computing: the platform model. It also influences the memory model.
Heterogeneity is also found in the computational units themselves, as illustrated by Figure 2.2: Figure 2.2a describes a typical von Neumann architecture, and Figure 2.2b describes the OpenCL view of a generic computational device.
In a recent article [Wol11], Michael Wolfe goes through no fewer than eight levels of parallelism that must be mastered to reach exascale performance. Heterogeneity is found at several levels, as illustrated by the French experimental grid Grid5000 [INR], a gathering of 9 clusters counting around 1,500 nodes, almost 3,000 processors and more than 7,000 cores. Heterogeneity is a fundamental characteristic of this grid: at the node level, with nodes from Altix, Bull, Carri System, Dell, HP, IBM or Sun; at the socket level, with for instance the nVidia Tesla S1070 available at the Grenoble site; at the core level

2. The main advantages over FPGAs are the ability to customize the circuit form, lower unit costs and full customization possibility.
Figure 2.2: von Neumann architecture vs. OpenCL architecture. (a) von Neumann architecture; (b) generic OpenCL node architecture.
with 17 different kinds of processors from two main families (AMD Opteron and Intel Xeon); and thus at the vector level, with different supported Streaming SIMD Extension (SSE) versions, not to mention that different instruction sets are supported from one node to another. An application that is not aware of the specificities of each node it runs on cannot achieve minimal execution times.
Where homogeneous computing assumes a single processing element per computational unit, heterogeneous computing considers multiple processing elements per device. In Flynn's taxonomy [Fly72], this means a move from the Single Instruction stream, Single Data stream (SISD) paradigm to either SIMD or MIMD. A survey [BDH+10] by André Rigland Brodtkorb et al. further refines this view and enumerates possible organizations of a modern computational device. The main concept is that acceleration is achieved through specialization, so specialized hardware with specialized instructions or organizations is used. This shows the second difficulty related to heterogeneous computing: the execution model.
Moreover, the memory described in Figure 2.2b is neither flat nor uniformly shared among processing elements; rather, a memory hierarchy is exposed inside the computational device, in addition to the memory partitioning imposed by remote execution. The potential benefits for the developer are optimized cache management and data movement handling, but this also seriously complicates the development of applications. The combination of distributed and hierarchical memory is the third difficulty posed to developers by heterogeneous computing: the memory model.
2.2 Influence on Programming Model

The heterogeneous computing model has an important impact on the programming models used to effectively program the devices. Considering the three difficulties raised in the previous section, it is no surprise that many programming models have been proposed to develop applications for heterogeneous architectures.

The memory model involves ideas from the distributed computing community. Message passing protocols have been reviewed in [McB94]. Among them, active messages [vECGS92] offer a particularly elegant and efficient solution to distributed memory management, and the Message Passing Interface (MPI) [WW94] has emerged as a leading and standardized solution. Such interfaces enable fine-grain control over data movement and potentially higher performance, at the expense of manual management of data consistency. Several approaches have been proposed to relieve developers from the manual declaration of data transfers: one alternative is to use a shared virtual memory space, as described in the survey [IS99]; another is to use Remote Procedure Calls (RPCs) and to rely on the runtime or the language to automatically transfer arguments [BN84]. Both approaches face the problem of scheduling calls over the underlying architecture. However, this topic is beyond the scope of this thesis, for we only consider heterogeneous platforms with a single host and a single accelerator. Automatic generation of data transfers and scheduling of computational tasks was an active topic in the days of High Performance Fortran [Wol96, ACIK97]
Figure 2.3: Impact of heterogeneous architecture on compilation (the sequential code is manually split into host code and per-device codes, each fed to its own compiler to produce the host and device objects).
and gpus have brought renewed interest in the topic [LVM+10, JPJ+11]. Another issue of distributed computing is the difference of data representation across architectures. It forces the use of a common representation or adds translation costs to all data transfers.

The load-work-store idiom is typical of rpc. It puts a new constraint on the host code, because it serializes the processing of a computational task by a computational device in three to five steps:

1. allocate: the host allocates memory on the remote accelerator. On embedded devices, this step may be optional, for memory allocation is managed by the user;
2. load: the host transfers the data from its memory to the allocated remote (accelerator) memory;
3. work: the accelerator performs computation on the loaded data and notifies the host of the computation end;
4. store: the host transfers the data back from the remote memory to its own memory;
5. deallocate: the host frees the memory allocated in step 1. Likewise, this step may be optional.

The main drawback of this approach is that only step 3 contributes to acceleration. All the other steps actually slow down the process.
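The five steps above can be sketched in plain C. This is only an illustrative model, not a real accelerator api: the device memory is simulated by a separate heap buffer and the transfers by memcpy, and the function names (device_kernel, offload_scale) are ours.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Step 3 (work): the kernel the accelerator would run; here a simple scaling. */
static void device_kernel(float *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        data[i] *= 2.f;
}

/* Host-side driver following the load-work-store idiom. */
void offload_scale(float *host, size_t n) {
    float *dev = malloc(n * sizeof *dev);  /* 1. allocate device memory  */
    if (!dev)
        return;
    memcpy(dev, host, n * sizeof *dev);    /* 2. load: host -> device    */
    device_kernel(dev, n);                 /* 3. work on the device copy */
    memcpy(host, dev, n * sizeof *dev);    /* 4. store: device -> host   */
    free(dev);                             /* 5. deallocate              */
}
```

Only step 3 performs useful computation; steps 1, 2, 4 and 5 are pure overhead, which is why the idiom only pays off when the kernel is expensive enough.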
The architecture model implies that each device involved in a heterogeneous computation may use a different instruction set. Devices generally have to be programmed in different languages. For a collection of n different accelerators, it means n different versions of the code to maintain and to evolve simultaneously, even if we use a single accelerator at a time. Figure 2.3 illustrates this concept.

The interactions get more complex as several accelerators are combined in the same heterogeneous platform. Moreover, because accelerators are not general purpose processors, they are likely to require cross compilation. Code is strongly dependent on the device architecture combination, either in source or binary form, which is a limitation for application portability. This limitation can however be overcome by bundling all possible object
code or source code, at the expense of larger binaries, and by having a host code aware of the possible change of hardware, as supported by middlewares such as StarPU [ATNW09] or kaapi [HRF+10]. An interesting consequence of Figure 2.3 is that the host code is relatively independent of the device codes, except for the development part. Following Amdahl's law [Amd67], the host should also execute the least computation-intensive part of the code. As a result, there are very few constraints on this part of the code.
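Amdahl's law bounds the overall speedup when only a fraction p of the execution time is accelerated by a factor s; the following helper (the function name is ours, not from the thesis) makes the bound explicit:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the sequential
 * execution time is accelerated by a factor s, while the remaining
 * (1 - p), the host part, runs unchanged. */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}
```

Even with an infinitely fast accelerator, the speedup is capped at 1/(1 - p), hence the host part, however small, still bounds the whole computation.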
The execution model varies a lot across accelerators: there is little in common between a pipelined vector processor and a pure simd processor. However, most hardware accelerators get their speedup from parallelism, and many programming models have been proposed for parallel computing. Among them, we note automatic parallelization through loop nest analysis [Lam74, IT88, WL91b, IJT91, DSV96] or for irregular applications [DUSsH93], partitioned global address space languages [CDMC+05], domain-specific languages like Chunk [WC03] or libraries like opengl [SA94]. Stream processing [Ste97] takes advantage of a limited form of parallelism. Task level parallelism has also gained in popularity in the multicore era with languages like Cilk [BJK+95], as well as directive-oriented parallelization such as Open Multi Processing (openmp) [Ope11]. This small sample of the large set of approaches introduced to bridge the gap between the user and the various parallel paradigms should convince anybody that there is no holy grail in the field. An interesting note however is that most approaches listed above involve either the C or the Fortran language. Similar approaches for more recent languages like Chapel or X10, or for general purpose languages like Java, have been proposed, but they have a smaller audience, certainly due to the amount of legacy code. It also shows that the diversity of the hardware automatically leads to ad-hoc responses and a one-to-one binding between compilers and hardware platforms: the hardware vendor provides a means to program the device, either a language, a C extension or a library, and the user must cope with it. A notable success of this approach is the nVidia cuda [NVI11] language that makes it possible to program nVidia graphical devices. The host code is not subject to these considerations and the language used to develop it is not as relevant for high performance computing as for device codes, provided a binding to a lower level language, say C, is available, like the interaction between gpgpu and high level languages, as provided by the gpulib. 3
2.3 Hardware Constraints

In the previous section, we described three aspects of the heterogeneous computing model. A key aspect is that there is not a unique heterogeneous computing model, but a collection of models, one per hardware target, all of them fitting in a general and extensible model, as illustrated by the feature diagram in Figure 2.4.

[Figure 2.4: Example of hardware feature diagram, showing a Hardware Device with memory (rom/ram, shared/distributed), isa and Acceleration (Specialization, Parallelism: simd/mimd) features, each marked optional or mandatory.]

The topic of this dissertation is not to propose yet another taxonomy of parallel machines [Dun90, Che94, HP06], even with respect to the limited scope of heterogeneous machines. As a consequence, we selected a limited number of features among the existing ones, based on their presence in the following hardware:

– a desktop computer with several cores and a modern gpu board;
– a laptop with several cores with vector instruction units;
– an embedded system with a single processor and a vector instruction unit;
– an embedded device with a fpga-based accelerator.

3. gpulib [MMG08] is the evolution of the pystream project, a Python binding for cuda.

The above targets exhibit three main sources of heterogeneity that must be dealt with:

Instruction Set Architecture: The presence or absence of the following features at the instruction level requires compiler support:
– vector registers or instructions;
– complex numbers;
– maximum number of operands per instruction (generally 2 or 3);
– supported operations.
Memory: As many applications are memory-bound, taking into account the hardware specificities of memory is often critical:
– memory size;
– memory hierarchy;
– cache management;
– distributed memory;
– Read Only Memory (rom);
– Direct Memory Access (dma) flexibility;
– dma speed.
Acceleration Features: One of the motivations for heterogeneous machines is performance 4, performance that comes from one or more of the following features:
– specialized computation unit;
– mimd execution mode;
– simd execution mode.

[Figure 2.5: Multicore with vector unit feature diagram. A Multicore Device has a shared ram memory, an isa, and Acceleration through Parallelism (mimd and simd).]
Some important aspects are set aside in this dissertation, especially the memory / cache hierarchy and the availability of asynchronous dma. Both are critical to achieve high performance: cache misses and false sharing can drastically reduce performance on Central Processing Units (cpus), gpus' shared memory has a lower access latency, and asynchronous data transfers are commonly used to overlap communications and computation. Those transformations, although often critical to reach high throughput, are not mandatory to build a working compiler: we follow a conservative approach that aims at producing reasonably efficient code, not code highly specialized for a single target, following the saying "have a working code before you consider optimizing it".

Taking advantage of read-only memory, shared memory or vector types is optional, while acceleration is a mandatory feature specific to the hardware. A particular hardware device that fits in this model can be described using a restricted feature diagram. Figure 2.5 shows the feature diagram of a typical multicore device with vector units.

The difference between an optional and a mandatory feature is important for performance: for a code to be executed on a specific piece of hardware, it must be aware of all the mandatory features. A code aware of optional features can turn this extra knowledge into a performance boost, e.g. for nVidia's gpgpus, the use of shared memory is optional but is critical to the performance of some applications, while unnecessary for others.

4. Depending on the context, it can also be development costs or power consumption.
Definition 2.1. A source-to-binary compiler is a compiler that translates its input code into machine code.

Examples of source-to-binary compilers include icc, the Intel compiler for the C++ language, or nvcc, the nVidia compiler for the cuda language. Further examples of such tools and dialects include:

Mitrion-C used for C to vhdl translation;
c2h also used for C to vhdl translation;
gcc vector types used for automatic multimedia instruction set manipulation;
cuda used for nVidia gpgpu code generation;
opencl used for generic hardware accelerator code generation.
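The gcc vector types mentioned above extend C with fixed-size vector values that the compiler maps to multimedia instruction sets (sse, altivec, ...). A minimal sketch, assuming a gcc-compatible compiler; the type and function names are ours:

```c
#include <assert.h>

/* A vector of four 32-bit ints, 16 bytes wide; gcc lowers arithmetic
 * on such types to simd instructions when the target supports them. */
typedef int v4si __attribute__((vector_size(16)));

/* Element-wise a*x + y on four ints at once. */
v4si saxpy4(int a, v4si x, v4si y) {
    v4si va = {a, a, a, a};  /* broadcast the scalar into a vector */
    return va * x + y;       /* one vector multiply-add, not four scalar ones */
}
```

The same source compiles for any target: when no vector unit is available, gcc falls back to scalar code, which is precisely the kind of portability the vendor-specific dialects listed above lack.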
The goal of a compiler like c2h or nvcc is to generate hardware code or circuits from standard C code. The accepted idea concerning such high level synthesis is that what is gained in development time is lost in performance. However, a recent publication [CDL11] shows this approach can both reduce development time and increase performance.

This kind of generator must ensure an original code matches all hardware features, and may use its optional features. Because some features are in conflict with the language, or because it is easier to support a subset of it, these tools move from standard C to dialects. A dialect is a reflection of the hardware features, which we call hardware constraints. A hardware constraint is embodied by a restriction or extension of the original language and forms the core of the difficulty of programming hardware devices using vendor compilers.
2.4 Note About the C Language

As shown above, most languages designed to abstract hardware complexity are extensions or dialects of the C language. Historically, there have been three major versions of this language: K&R C (1978), C89 (1989) and C99 (1999). A Greek Athenian comic dramatist once wrote:

High thoughts must have high language.
Aristophanes, Frogs, 405 B.C.

However, many compilers still rely on C89 and do not benefit from the advantages of C99, not to mention the planned C1X. As a consequence, critical features such as native complex numbers and variable length arrays are not used, and are replaced by structures and pointers, respectively. This greatly lowers the expressiveness of the code, making it harder to maintain, and also harder to compile. Figure 2.6 illustrates this difference on a complex matrix-vector multiply.

Most available benchmarks are written in C89; there are two typical arguments in favor of this choice:

1. It is compatible with more C compilers. In particular, the C++ language is not compatible with certain features of C99, such as variable length arrays; 5

5. Contrary to the ISO/IEC 14882:2003 standard, the forthcoming C++0x standard includes most of the C99 specificities.
typedef struct { double r, i; } Complex;
void matrix_vector_multiply(int M, int N, Complex *m, Complex *v,
                            Complex *out) {
  int i, j;
  for (i = 0; i < M; i++) {
    out->r = out->i = 0.;
    for (j = 0; j < N; j++) {
      Complex *mi = m + i * N + j, *vi = v + j;
      out->r += mi->r * vi->r - mi->i * vi->i;
      out->i += mi->r * vi->i + mi->i * vi->r;
    }
    out++;
  }
}

(a) C89 version.

void matrix_vector_multiply(int M, int N, complex m[M][N],
                            complex v[N], complex out[M]) {
  for (int i = 0; i < M; i++) {
    out[i] = 0.;
    for (int j = 0; j < N; j++)
      out[i] += m[i][j] * v[j];
  }
}

(b) C99 version.

Figure 2.6: Complex matrix-vector multiply, C89 version vs. C99 version.
[Figure 2.7: Comparison of two versions of the HPEC Challenge benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for the cfar, ct, db, fdfir, ga, pm, qr, svd and tdfir kernels.]
2. A direct code translation into assembly favors the pointer versions.

The gnu C Compiler (gcc), the Intel C++ Compiler (icc), and ibm's, hp's and pgi's compilers all support C99, so Argument 1 is mostly wrong. Among modern industrial compilers, only Microsoft's does not provide this feature. We have carried out experiments on the High Performance Embedded Computing (hpec) Challenge benchmarks [LRW05], a benchmark suite written in C89, to verify how the switch to C99 impacts performance. The original version has been completely rewritten to take advantage of C99 features, then both the old and the new versions have been compiled with icc version 12.0.3 on a desktop computer. Each benchmark is run 100 times and the median value is picked. Figure 2.7 shows results normalized against the original version using the -O3 flag. A result greater than one means the C99 version executes faster.

Figure 2.8 shows the result of the C89 to C99 conversion on the CoreMark benchmark [Con]. The behavior of both gcc and icc is evaluated and normalized with respect to the original version. The metric used is the number of iterations per second, as returned by the benchmark. The gcc compiler flags used are -O3 -ffast-math -march=native and icc's are -O3. A desktop station running a 2.6.38-2-686 GNU/Linux on 2 Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz is used to run all the benchmarks presented above.

The same transformations have been performed on the linpack benchmark [DLP03], used to rank machines in the top500. Results are displayed in Figure 2.9. It shows a small slowdown for the readability gain displayed in Figure 2.6.
[Figure 2.8: Comparison of two versions of the Coremark benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for icc and gcc.]

We observe that using C99 can imply a small performance loss, and icc is more impacted than gcc. This is due to two causes:
– some kernels take advantage of pointer arithmetic to perform optimized iterations over two-dimensional arrays, while indexed arrays suffer from non-optimized address computations;
– using data allocated on the heap as array pointers involves complex casts from void* to, say, int (*)[n][m] that disrupt the pointer analysis.
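The heap-allocation pattern referred to above can be sketched as follows: a flat block returned by malloc is reinterpreted as a pointer to a two-dimensional C99 variable-length array, which keeps the indexed m[i][j] syntax but introduces exactly the kind of cast that disrupts pointer analyses. The function name and fill pattern are ours, chosen only for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Fill an n x m matrix with i + j, using a C99 pointer-to-VLA view
 * of a flat heap allocation; returns the bottom-right element. */
int fill(size_t n, size_t m) {
    int (*a)[n][m] = (int (*)[n][m])malloc(sizeof *a);  /* the disruptive cast */
    if (!a)
        return -1;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            (*a)[i][j] = (int)(i + j);                  /* indexed, not pointer, access */
    int corner = (*a)[n - 1][m - 1];
    free(a);
    return corner;
}
```

Compared to the manual p[i * m + j] linearization of C89, the accesses stay affine in i and j, which is what the polyhedral analyses discussed next rely on.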
However, C99 array declarations provide a readability gain. More importantly, polyhedral analyses are made harder by the use of non-linear array accesses. For instance, the Pluto [BBK+08] compiler only handles affine loop nests. The Paralléliseur Interprocédural de Programmes Scientifiques (pips) [AAC+11] compiler framework suffers from the same limitations and most of its analyses lose accuracy in the presence of non-affine subscript expressions. Transformations have been proposed to automatically delinearize array references by separating elements of a non-affine equation into affine groups [Mas92] or to recover arrays from pointers [FO03]. In this dissertation, we assume that input codes are written using the high level constructs of the language and that arrays are not linearized.
2.5 OpenCL Programming Model Analysis

opencl [KOWG10] is a recent proposal to standardize the way hardware accelerators are programmed. It provides a unified language derived from C99 to write kernels and an Application Programming Interface (api) to manage kernel calls.

It is interesting to analyse the differences between opencl and the C99 language 6 :

simpler function handling: no recursion, no function pointers, no variable number of arguments;
limited pointer support: pointers to types that are fewer than 32 bits wide cannot be dereferenced;
no variable size structures: no variable-length arrays or structures with flexible array members;
storage qualifiers: C99 storage qualifiers are forbidden and replaced by __global, __constant, __local or __private;
image support: built-in types and functions to manipulate 2D and 3D images;
math support: built-in geometric and mathematical functions such as cos 7 or dot, but no math library;
vector support: for all primary types, with up to 16 elements per vector.

6. As specified in the opencl sdk reference manual.

[Figure 2.9: Comparison of two versions of the Linpack benchmark: C89 vs. C99. The plot shows the C99 vs. C89 speedup for icc and gcc.]
Data transfers are managed through synchronous or asynchronous dma, plus prefetching and memory fences. It is possible to manage multiple devices in a thread-safe fashion. The api also includes facilities to use opengl buffers or textures with opencl code.

It appears clearly from these differences that opencl was designed primarily for gpu devices and signal/video processing, as shown by the built-in image support, the vector types and the api sharing with opengl. Lower level architectures suffer from the type restrictions and the absence of bit-fields. Moreover, the opencl programming model targets hierarchical arrays of processing elements with a corresponding hierarchical memory structure, as found in gpus and multicores, but not in fpgas. To partially address this shortcoming, a relaxed version of opencl, called opencl embedded profile, has been released. It lowers the requirements on data types (no 64 bit integers, no 3D images), on floating-point compliance (Inf and NaN not required, lower requirements for some function accuracy) and on hardware capacity (minimal image height/width, local memory size).

7. Note however that the standard is less restrictive than the IEEE 754 specification concerning the Unit in the Last Place (ulp).

[Figure 2.10: Compilation flow in opencl. The host code and the device code are bundled together; the host compiler produces the host object, while source-to-binary compilers 0, 1 and 2 each produce their own object (object 0, 1 and 2) from the single device code.]

So opencl proposes a generic, library-based approach to the problem of heterogeneous hardware targets. The core idea is to expose in a standard api the host calls to the different hardware devices. The device code is generated at runtime using Just In Time (jit) compilation, by a source-to-binary compiler, from code shipped within the application as a textual representation. It changes Figure 2.3 into Figure 2.10, where only one device code exists.
A unique representation of the device code is critical for source code portability. It achieves source code portability at the expense of shifting the complexity to source-to-binary compilers. Since its first release, opencl has been used to generate code for the following targets:

multiple nVidia gpgpus combined with multicores, through nVidia's opencl compiler;
multiple amd gpgpus combined with multicores, through the ATI Stream Software Development Kit;
Intel multicores, through the Intel® opencl sdk;
sse-enabled multicores, through the Fixstars opencl Cross Compiler;
the Cell engine, through the IBM opencl sdk.

To our knowledge, no company supports fpga code generation and no paper on this subject has been published yet. In addition to the complexity of the translations listed above, this may be due to the existence of several tools to perform this task [KBM07, GCB07, GNB08]. As a consequence, switching to opencl would imply short-term development costs that overtake the long-term benefit of code portability.
opencl does not relieve developers from host-side tasks: communication management is still handled manually, the api for marshalling arguments being at a rather low level 8 . As a consequence, input code has to be completely rewritten and split into two codes: the host code with opencl calls and the kernel code.

Also, opencl does not guarantee performance portability: an opencl kernel tuned for a specific platform is not guaranteed to behave as well on another platform. For instance, a performance study [KSA+10] shows that moving from one gpu to another requires adjusting kernel parameters to achieve the best results.
From an engineering point of view, the compilation scheme given in Figure 2.10 is limited in two ways. Firstly, there is no sharing of common optimizations between different opencl compilers. Actually the design does not prevent such sharing but, as shown in the above enumeration, the trend is to have each vendor develop its own version of an opencl compiler to support its hardware. Secondly, these compilers are basic compilers more than optimizing compilers, in the sense that their primary goal is to generate code for the hardware, not to optimize user code. opencl leaves this task either to application developers, or to compiler developers. Given the cost of developing a full-fledged optimizing compiler, the task is generally left to developers.

Still, opencl paves the way to heterogeneous computing normalization. In spite of its limitations, its programming model has received a good following, and the number of compilers that support it, as well as of software development tools (debuggers, profilers, etc.), is quickly growing. Chapter 3 proposes an extended approach that borrows several ideas from the opencl model but makes host-side development easier, while achieving a good level of performance and enforcing compilation pass reuse.
2.6 Other Programming Models

Heterogeneous computing was first found in clusters of machines, where different nodes had different processors, and is now making its way to desktop computers. Because the performance of heterogeneous computations is linked to the proper scheduling of the different tasks that compose the program, many papers have studied the decomposition of a program into independent tasks and their scheduling. [BSB+01] compares several static approaches to the problem, while [SSM08] uses a stochastic model. Scheduling decisions can also be dynamic, as in [ZWZD93]. Recently, frameworks such as Hadoop [Whi09] have emerged to provide file system and job scheduling integration.
However, the number of devices involved in heterogeneous clusters and in heterogeneous computers is at a different scale, and the data transfer rates of the network connections in a cluster and of the Peripheral Component Interconnect (pci) connections inside a computer introduce different factors. For this reason, heterogeneous computers tend to be simpler to use efficiently. In particular, the scheduling issues are limited to a dozen nodes (e.g. eight for the Cell Processor [PAB+06]). As a consequence, we do not focus on this topic in this thesis.

8. Close to the way arguments are pushed on the stack during a traditional function call.
Apart from the opencl model discussed in Section 2.5, other models have been proposed. The case of pgi is particularly relevant because it is an industrial compiler, thus driven by user needs and working solutions. [Wol10] proposed an accelerator model coupled to a high level programming model mostly targeted at gpus. It is based on compiler directives and thus benefits from the associated incremental development concept. The only required directive is #pragma acc, and the others can be used to refine the compiler's analyses (e.g. to specify data movement or loop scheduling), an approach to parallelization that has been shown to provide good results [KS99]. This approach relieves developers from most low level manipulations and makes it possible to think of the code in terms of kernels only. The issue of performance portability is handled by bundling several versions of the kernel, one per targeted hardware, in the same binary.

The directive approach is also taken by Hybrid Multicore Parallel Programming (hmpp) [BB09], but in that case nothing is automatic and all the decisions must be made by developers through directives. The advantage is that the user has greater control over the application's behavior, at the expense of less automation.
Hardware-software co-design [Wol94], and especially hardware-compiler co-design [ZPS+96, WGN+02], is an alternate approach where the hardware and the software evolve hand-in-hand, whereas in the current situation, the software is struggling to match hardware evolution. This approach has been taken in the Delft workbench [PBV06], which relies on the molen [PBV07] machine organization. Retargetable compilation is achieved through the use of reconfigurable hardware to provide a user-specific instruction set, and of code transformations aware of reconfigurable architectures, e.g. to hide reconfiguration latencies with a specific instruction scheduling algorithm.
A similar approach that also bypasses the limitation induced by communication overhead has been proposed: the Convey HC-1 computer [Bre10] is a hybrid system that uses Xilinx fpgas as co-processors, with the specificity that the processor and the co-processor share a virtual memory. The co-processor is accessed through user-defined instructions that rely on the concept of personalities, hardware-level descriptions of intrinsics exposed at the user level. Host and accelerator codes are consequently mixed together in a unique source code. The benefit of this approach is that developers are relieved from the management of data transfers. However, writing personalities involves writing a hardware description of each intrinsic, so part of the problem related to heterogeneous computing is not solved.
The usage of an Application-Specific Instruction set Processor (asip) is another way
to exploit heterogeneous computing. In a nutshell, instead of relying on a general-purpose
instruction set implemented on a general-purpose processor, a dedicated processor is built to
run a single soc. This approach is only viable if the process of generating the processor can
be automated: a compilation step is naturally involved to generate such processors. In such
42 CHAPTER 2. HETEROGENEOUS COMPUTING PARADIGM<br />
situations, retargetability of the compiler is a key property to be able to generate a wide
range of processors. Such a compiler is described in [GLGP06], using a compilation flow
that involves a compiler infrastructure parametrized by a processor model and a Hardware
Description Language (hdl) generator. The processor model contains an instruction set
model, and both are described in terms of functional units, custom data types,
connectivity, storage, etc.
2.7 Conclusion<br />
<strong>Heterogeneous</strong> computing and multicores are the two current most important keywords<br />
<strong>for</strong> High Per<strong>for</strong>mance Computing (hpc) and low power. The amount of available hardware<br />
implementing different programming models makes it hard <strong>for</strong> developers <strong>to</strong> adapt existing<br />
software <strong>to</strong> new architectures. Because of the fast pace of evolution, any porting ef<strong>for</strong>t or<br />
per<strong>for</strong>mance improvement may be jeopardized by a hardware change. In this situation,<br />
ef<strong>for</strong>ts have been made by the hardware community <strong>to</strong> leverage their interface, and by the<br />
software community <strong>to</strong> propose new languages or libraries <strong>to</strong> ease the port of applications.<br />
In between, the compiler community tries hard <strong>to</strong> generate efficient glue code between the<br />
hardware layer and the software layer. The difficulty lies in the choice of a combination of<br />
a programming model and a programming language that is simple enough <strong>for</strong> developers <strong>to</strong><br />
use, but sufficiently rich so that the compiler can extract enough in<strong>for</strong>mations <strong>for</strong> efficient<br />
hardware code generation. Pragmatically, a subset of the C99 language is used, <strong>to</strong> limit<br />
the cost of the technology shift, from both source code and developer points of view. The<br />
opencl standard paves the way <strong>for</strong> such an approach but it suffers from several design<br />
limitations.<br />
The contribution of this chapter is to present a state of the art of the existing alternatives
for heterogeneous computing, centered on compilation aspects. A study of existing
C dialects shows the advantages of using the C language to program hybrid architectures,
and the usual hardware limitations exposed by the language. Three different benchmarks
have been manually converted to C99-style variable-length array declarations to show the
performance impact of a higher-level description. Although current compilers generate
slightly less efficient code from C99 input, the gain in expressiveness favors maintainability
and makes it easier to apply high-level transformations such as those from the polyhedral
model.
Chapter 3

Compiler Design for Heterogeneous Architectures
Vieux Pont de Dinan, Ille-et-Vilaine © iris.din / Flickr
Until the end of the 20th century, only one kind of architecture was used to build general-purpose
computers. As a consequence, typical compilers have been built to efficiently target
those architectures: a front-end parses the input code into an Internal Representation, a
middle-end performs various optimizations (hopefully language-independent), either at the
basic block level, loop level, function level or program level, and a back-end emits
target-specific assembly code. These three components have been studied for years in the literature
and described intensively, as shown by the periodically updated Dragon Book [ALSU06].

The complexity growth seen in compiler architecture and crystallized by heterogeneous
platforms favors a more modular compiler framework. Indeed, it seems reasonable to reuse
existing compilers, software and libraries that already perform a specific task efficiently.
Combining them, instead of reinventing the wheel over and over again, should be
possible. Let us take the example of Graphical Processing Unit (gpu) code generation
for regular kernels. PLuTo [BBK+08] is an example of this approach: it combines a
C front-end, a polyhedral optimizer, a Compute Unified Device Architecture (cuda) code
generator and nVidia's cuda compiler to generate efficient gpu kernels.
Additionally, build processes are growing in complexity in order to assemble object
files generated from different languages by different compilation chains. For instance, the
Figure 3.1: A classical 3-phase retargetable compiler architecture: C, Fortran and Java front-ends feed a common optimization infrastructure, which feeds x86, ARM and MIPS back-ends.
sloccount tool reports 210 Source Lines Of Code (sloc) in common.mk, the generic
Makefile infrastructure bundled with the cuda distribution. This makes code generation
more difficult: not only must code be generated for different targets, but a way to link the
resulting objects must also be found. This task is already non-trivial in a homogeneous
environment, and it quickly becomes complex in a heterogeneous one.
This chapter studies the impact of the heterogeneous computing model, presented in
Chapter 2, on traditional compiler organization. It proposes the combination of a rich
compiler infrastructure with a flexible pass manager as a framework to match the new
architecture constraints. This approach focuses on modularity, re-usability, retargetability
and flexibility of the compiler design.

To begin with, Section 3.1 studies the adequacy between mainstream compiler infrastructures
and heterogeneous machines, and proposes a model to represent programmable
pass management, a critical aspect of compiler design for modularity. Then, Section 3.2
argues in favor of using source-to-source transformations, using source files as a common
medium between all existing tools. Finally, Section 3.3 introduces a high-level Application
Programming Interface (api) to build pass managers, the entities in charge of managing
the chaining of the compiler passes. It exposes a sufficient abstraction of the compiler
internals to developers who want to contribute at the pass level. The whole scheme is
illustrated by the Pythonic PIPS (pyps) interface developed on top of the Paralléliseur
Interprocedural de Programmes Scientifiques (pips) compiler infrastructure. Related work
is studied in Section 3.4.
3.1 Extending Compiler Infrastructures<br />
3.1.1 Existing Compiler Infrastructures<br />
A compiler is typically separated into three parts (see Figure 3.1):

The Front End is responsible for converting the input source code into the compiler's Internal
Representation (ir). A single compiler can have many front-ends. For instance, the
gnu C Compiler (gcc) offers a front-end for Ada, C, C++, Fortran, Java, Chill,
Objective-C, Pascal, etc.
The Middle End is in charge of the code optimizations that are independent of the
Figure 3.2: Improved compilation flow for heterogeneous computing: host and device code go through source-to-source compilers built on a common compiler infrastructure, then through a host compiler and per-target source-to-binary compilers, producing the host object and one object per device.
input language or of the output target. Some are parametrized (e.g. unrolling) and
their parameters are target-dependent. Others can benefit from additional assumptions
due to the input language, e.g. no aliasing between parameters in Fortran 77. There
is usually a single middle end per compiler infrastructure—this is the case for gcc,
Low Level Virtual Machine (llvm) and Open64.
The Back End generates target-specific assembly code from the ir. In a
similar manner to front ends, there can be several back ends in a single compiler
infrastructure. For instance, llvm can produce assembly code for the following targets,
as of version 2.7: x86, sparc, powerpc, alpha, arm, mips, cellspu, pic16, xcore,
msp430, systemz, blackfin, cbackend, msil, cppbackend, mblaze.
For the purpose of code generation for heterogeneous hardware from raw source files
(i.e. without directives or language extensions), only the middle end and the back end are
affected. In Open Computing Language (opencl), the host compiler is assisted by several
source-to-binary compilers, one for each targeted device, that is, a middle end and a back
end per targeted device. This scheme could be improved by merging the middle ends
of each source-to-binary compiler into a generic middle end feeding as many back-ends. This
approach, while tempting, is not possible because, as we have shown in Chapter 2, there is
a lot of diversity in hardware accelerators, so the code transformations involved can only
be partially shared. One alternative is to provide a common compilation infrastructure, on
which all middle-ends are based. Figure 3.2 illustrates this idea by splitting each
source-to-binary compiler into two parts: a source-to-source compiler, called the hardware language
translator, and a source-to-binary compiler. The hardware language translator manages
the target-specific optimization process at the source level, while the source-to-binary compiler is solely
in charge of the translation to binary code. Both are built from building blocks found in the
compiler infrastructure.
Section 3.2.1 proposes an approach to make annotated code compatible with Figure 2.3.
gcc, llvm and Open64 are examples of such compiler infrastructures. Let us now propose
a very generic definition of a compiler infrastructure.

Definition 3.1. A compiler infrastructure is a set of passes and analyses organized by a
consistency manager and made available to compiler developers through a pass manager.
<strong>Compilers</strong> & Tools<br />
p4a pipscc pypsearch sac terapyps<br />
Analyses<br />
DFG, array regions...<br />
Pass Manager<br />
pyps tpips<br />
Consistency Manager<br />
pipsmake<br />
Passes<br />
inlining, unrolling...<br />
Internal Representation<br />
Pretty Printers<br />
C, Fortran, XML...<br />
Figure 3.3: pips as a generic compiler infrastructure sample.<br />
Passes and analyses are defined formally in Section 3.1.2, so we only give informal
definitions here.

Definition 3.2. A pass is a code transformation that modifies its input source to generate
a new version.

For instance, loop unrolling, inlining or forward substitution are passes. In the following,
passes are also referred to as transformations.

Definition 3.3. An analysis produces an abstraction of the code to be used by further
passes.

A call graph, a dependence graph or a polyhedral model are results of analyses.

Definition 3.4. A consistency manager is a component in charge of providing up-to-date
analyses to the passes.

Definition 3.5. A pass manager is a component in charge of chaining the passes.
Figure 3.3 shows the hierarchical organization of these components, based on the infrastructure<br />
of pips. The infrastructures of gcc [Nov06] and llvm [LA03] follow a similar<br />
pattern.<br />
Typically, a compiler for a single target applies the same sequence of passes to each
function of its input code. As stated in gcc's manual [Wik09]:

Its [the pass manager's] job is to run all the individual passes in the correct
order, and take care of standard bookkeeping that applies to every pass.

The gcc pass manager works on a sequence of passes, as shown by the implementation
of the init_optimization_passes function. An excerpt of this function taken from the gcc
source code is given in Listing 3.1.
void init_optimization_passes (void)
{
  struct tree_opt_pass **p;

#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &PASS))

  /* Interprocedural optimization passes.  */
  p = &all_ipa_passes;
  NEXT_PASS (pass_early_ipa_inline);
  NEXT_PASS (pass_early_local_passes);
  NEXT_PASS (pass_ipa_cp);
  NEXT_PASS (pass_ipa_inline);
  [...]

Listing 3.1: gcc pass manager initialization.
# dead code elimination + constant propagation + inlining
opt input.bc -o output.bc -dce -constprop -inline

Listing 3.2: Dynamic phase ordering using the llvm pass manager command line interface.
Each pass provides a function pointer bool (*gate)(void) that potentially guards its execution,
while unsigned int (*execute)(void) runs the pass. The pass order is essentially fixed, but
it is possible to turn off some of the passes using the gate function and to sidestep some of
the issues using the plug-in mechanism. This intrinsically static pass scheduling is a severe
limitation for iterative compilation tools, and its linearity does not match heterogeneous
compilation requirements.
The pass manager used in llvm is more sophisticated: it can register several kinds of
passes, depending on the type of object they work on—the whole program, a function, a
call graph, a loop, a basic block or some machine code—while gcc's passes only work on
functions. Regarding pass management, compiler developers can either rely on the existing
one or provide their own implementation. A Command Line Interface (cli) on top of the
pass manager, called opt, makes it possible to dynamically change the phase ordering, as
presented in Listing 3.2.
Iterative compilation is also possible. However, the lack of advanced control structures
over pass scheduling hardly makes it a candidate for heterogeneous computing. A compiler
for a heterogeneous platform must apply different sequences to different parts of the code,
say one per target. As the compiler not only optimizes the code for a specific device, but
also modifies the code to meet the hardware constraints, it can be expected that unusual
combination patterns happen. The compilers built during this thesis, presented in
Chapter 7, validate this assumption.
3.1.2 A Simple Model for Code Transformations

The core of a compiler optimizer is the pass ordering. This section studies the interaction
between passes from a formal point of view, inspects the consequences of this formalism
and elaborates on several transformation composition rules.

3.1.2.1 Transformations: Definition and Compositions
Let us first define formally what a transformation is. Let P be the set of well-formed
programs 1, F the set of all possible functions 2 and T the set of code transformations.

Definition 3.6. A program is an n-tuple of functions:

∀p ∈ P, ∃n ∈ N : p ∈ Fⁿ

The first element of a program is called the entry point of the program. The cardinality
of a program is given by the | · | operator.

If p is a program, let Vin(p) denote the set of its possible input values; given
vin ∈ Vin(p), P(p, vin) denotes the result of the evaluation of p.

Definition 3.7. A code transformation is a P → P application.
The identity transformation is denoted idT. The set T, together with the function
composition ◦ and idT, is not a group because many transformations are not
injective and thus have no inverse.

Proof. The transformation that evaluates constant expressions is not injective, as it
produces int a = 4; from either int a = 3+1; or int a = 2*2;.
The <strong>for</strong>mer definition of a trans<strong>for</strong>mation ignores an important aspect of code trans<strong>for</strong>mations:<br />
they can fail. For instance loop fusion of two loops fails if the second loop<br />
carries backward dependencies on the first. A loop-carried backward dependency can prevent<br />
loop vec<strong>to</strong>rization, or the pass can even crash because of a lousy implementation 3 .<br />
As a consequence, we introduce an error state and propose:<br />
Definition 3.8. A code trans<strong>for</strong>mation is an application P → (P ∪ {error}) that either<br />
succeeds or preserves the semantics of the program or fails.<br />
And one can revert <strong>to</strong> the previous state by defining a failsafe opera<strong>to</strong>r:<br />
1. We call “well-formed programs” programs whose behavior is not undefined according to the norms
of their language.
2. The term module is used in pips instead of the more commonly used term function. As a consequence,
the term module is used in some code excerpts.
3. This unfortunately happens a lot in research compilers where many PhD students—just like me—
contribute.
Definition 3.9. The failsafe operator ˜· : T → T is defined by

∀t ∈ T, ∀p ∈ P, ˜t(p) = t(p) if t(p) ≠ error, and ˜t(p) = p otherwise
and then a failsafe composition:

Definition 3.10. The failsafe composition ˜◦ : T × T → T is defined by

∀(t0, t1) ∈ T², t1 ˜◦ t0 = ˜t1 ◦ ˜t0
Transformations can be chained using the ˜◦ operator, and most compilers use
this semantics as their primary way to compose transformations. New passes can
be defined as the composition of existing passes, promoting modularity instead of monolithic
passes. For instance, a pass that generates Open Multi Processing (openmp) code
can be written as the failsafe combination of loop fusion, reduction detection, parallelism
extraction and directive generation, instead of a single directive-generation pass. Lattner
advocates [Lat11] for this low granularity in pass design.
Still, the fact that a transformation fails carries some important information. In the
example above, if the loop vectorization succeeds, a vector instruction can be generated;
otherwise, parallelism extraction may be tried. This kind of behavior is represented by a
conditional composition:
Definition 3.11. The conditional composition ◦ : T × T × T → T is defined by

∀(t0, t1, t2) ∈ T³, ∀p ∈ P, ((t1, t2) ◦ t0)(p) = (t1 ◦ t0)(p) if t0(p) ≠ error, and t2(p) otherwise
The ◦ operator is not used in llvm or gcc, although it provides interesting perspectives.
Let us assume the existence of three transformations tgpu, tsse and tomp that convert a
sequential C code into a code with cuda calls, Streaming simd Extension (sse) intrinsic
calls and openmp directives, respectively. Then the following expression:

(idT, tsse ˜◦ tomp) ◦ tgpu

means: try to transform the code into a gpu code; if that fails, generate openmp
directives and then sse intrinsics, whether openmp directives were generated or not. It builds
a decision tree that allows complex compilation strategies.
If the intended behavior is to stop as soon as an error occurs, and to keep on applying
transformations otherwise, as in the sequence:

(t3, idT) ◦ (t2, idT) ◦ (t1, idT) ◦ t0

then writing the default skip transformation is bothersome. Thus we define an error
propagation operator:
Definition 3.12. The error propagation operator ◦ : T × T → T is defined by

∀(t0, t1) ∈ T², t1 ◦ t0 = (t1, idT) ◦ t0

which makes it possible to rewrite the above example as

t3 ◦ t2 ◦ t1 ◦ t0
For instance, let us assume the existence of a transformation topt_dma that optimizes
the usage of Direct Memory Access (dma) functions by trying to remove redundant ones,
to merge them, etc. 4 It is not relevant to apply it if no dma function was generated by
a tgen_dma transformation. Supposing that tgen_dma returns an error if it fails to generate
dma operations, this kind of interaction can be represented using the expression:
topt_dma ◦ tgen_dma.
3.1.2.2 Parametric Transformations

Many transformations are parametric. For instance, loop unrolling not only takes a
program as input: it also needs a particular loop in a particular function and an unroll
rate to be completely defined. To represent this, we introduce the concept of transformation
generator.

Definition 3.13. An application f is a parametric transformation if there exists a set A
such that f : A → T.

For instance, loop unrolling is a parametric transformation for which A = F × L × N*,
where the first argument is the function to work on, the second argument is the loop statement
to unroll and the third argument is the unroll rate.
Two particular classes of transformation generators are commonly found in compilers:
function transformations and loop transformations.

Definition 3.14. A parametric transformation g : A → T is a function transformation if
there exists a set B such that A = F × B.

Definition 3.15. A parametric transformation g : A → T is a loop transformation if
there exists a set B such that A = F × L × B.
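As an illustration, a parametric transformation can be sketched as a Python function that receives the parameters in A and returns the actual transformation as a closure. The toy program representation below (a mapping from function names to loop bodies) is an assumption made for the example, not a real intermediate representation:

```python
ERROR = object()  # the distinguished error state of Definition 3.8

def unroll(function_name, loop_index, rate):
    """Parametric loop transformation with A = F x L x N*: the three
    parameters select the target, the returned closure is the element
    of T that performs the unrolling."""
    def transformation(program):
        # toy representation: {function name: [loop bodies]}
        loops = program.get(function_name)
        if loops is None or loop_index >= len(loops):
            return ERROR  # the pass fails when its target does not exist
        new_loops = list(loops)
        new_loops[loop_index] = new_loops[loop_index] * rate  # replicate body
        new_program = dict(program)  # P -> P: never mutate the input
        new_program[function_name] = new_loops
        return new_program
    return transformation
```

Here `unroll("main", 0, 2)` is a plain transformation that can be chained with the composition operators of the previous section.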
3.1.2.3 From Model to Implementation

Moving from a formal description of pass operators to a programming language can be
tricky. Although it is tempting to design a new language that directly reflects the ˜·, ˜◦,
and ◦ operators, we make the following points:
– a transformation changes a program state and behaves like a method in Object
Oriented Programming (oop);
4. A similar transformation is described in Section 6.3.
– transformation generators are methods granted with extra parameters; in particular,
function transformation generators can be represented by methods of a hypothetical
function class and loop transformation generators by methods of a hypothetical loop
class;
– the composition operator is similar to the sequence operator found in many programming
languages;
– the failsafe operator is similar to a try ... catch block that surrounds every transformation;
– the conditional composition is similar to an if ... then ... else block;
– the error propagation operator has a similar semantics to exception propagation.

As a consequence, we choose to use a general-purpose programming language instead
of designing our own: the class hierarchy with the appropriate methods described in the next
section embodies all the concepts detailed in this section.
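These correspondences can be sketched in Python, with exceptions standing for the error state. The pass names and their failure condition are hypothetical, chosen only to mirror the loop-vectorization example above:

```python
class TransformationError(Exception):
    """Raised by a pass that fails (the error state of Definition 3.8)."""

def vectorize(code):
    # hypothetical pass: fails on loop-carried backward dependencies
    if "backward_dep" in code:
        raise TransformationError("cannot vectorize")
    return code + " [sse]"

def parallelize(code):
    # hypothetical pass that always succeeds
    return code + " [omp]"

def apply_failsafe(transformation, code):
    """A try/except block plays the role of the failsafe operator."""
    try:
        return transformation(code)
    except TransformationError:
        return code  # revert to the previous program state

def vectorize_or_parallelize(code):
    """Branching on failure plays the role of the conditional composition."""
    try:
        return vectorize(code)
    except TransformationError:
        return parallelize(code)
```

Error propagation then comes for free: an uncaught `TransformationError` simply aborts the rest of the sequence, exactly like the ◦ operator skipping downstream passes.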
3.1.3 Programmable Pass Management<br />
3.1.3.1 A Class Hierarchy for Pass Management
The hierarchy between the host and the accelerators, a consequence of the master-worker
paradigm, implies a similar hierarchy between the host code and the accelerator
code. However, it is not enough in practice: depending on the input code, a single function
can have several loops that are candidates for offloading. Moreover, opencl takes as input a
self-contained code, so the notion of compilation unit—the set of variables, types and functions
defined in the same source file—is also important. Because compilation for heterogeneous
computing usually leads to the creation of new functions, it is also important to be able
to keep track of them and adapt the processing according to their origin. This relationship
must be visible to the pass manager because different compilation schemes apply to
each part. The model presented in the previous section, confirmed by the experience gathered
during the development of several heterogeneous compilers, has led to the hierarchy
described below:
Program: Code transformations manipulate programs. Interprocedural analyses such as
the call-graph computation and transformations such as constant propagation transform
the program as a whole. This is the coarsest transformation grain.
Compilation units: Processing can be different depending on the enclosing compilation
unit. A typical example is hand-optimized runtime code: code passed to the compiler
so that it has all the definitions required for interprocedural analyses of generated
code (see Section 3.2.1). This code should only be analyzed by the compiler, but not
modified. The same situation occurs when some properties of a source file have been
certified: the certification no longer holds if the code is changed.
Functions: We have modeled a program as a tuple of functions, and the pass manager
generally works at this level of granularity; gcc works at this level. It is especially
relevant for heterogeneous computing, where some functions are executed on the host
and some are executed on an accelerator.
Figure 3.4: pyps class hierarchy (Workspace, Maker, Program, Compilation Unit, Function and Loop classes).
Loops: Many scientific programs spend a lot of time in loops, and Single Instruction
stream, Multiple Data stream (simd) parallelism is commonly found in loops. Exposing
the loop hierarchy to the pass manager is pertinent in many cases, for example
to apply loop-level transformations, to outline a loop into a new function (see Section
4.6.2), or to isolate it from the rest of the memory, as in Section 6.1. Likewise,
the loop nest hierarchy provides significant information concerning the code structure
and potential candidates for loop transformations such as loop tiling or loop fusion.

These relationships can be represented by the class hierarchy shown in Figure 3.4.
The combination of control flow and hierarchical structure makes it possible to express
constructs such as “select each loop nest of the program that does not belong to the
runtime, compare its number of operations to its number of memory accesses and, if it is
computationally intensive enough, outline the loop nest into a new function in the gpu
space; then transform this new function into a suitable kernel for a gpu device”.
Note that, using the formalism presented above, this sentence can be denoted as:

∀p ∈ P, ∀f ∈ F \ R, ∀l ∈ L,  (G(f, f′) ∘ O(f, l, f′) ∘ C(f, l))(p)

where R is the set of functions representing the runtime; C : F × L → T is a loop transformation that raises an error if the given loop is not computationally intensive; O : F × L × F → T is a loop transformation that outlines the given loop into a new function; and G : F × F → T is a function transformation that turns a function into a kernel and a kernel call.
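Read operationally, this formula is just a loop nest with error handling at the pass manager level. The sketch below mimics it with stand-in Python classes; the class, method and threshold names are invented for illustration and are not part of pips or pyps:

```python
class NotIntensiveError(Exception):
    """Raised by the intensity check, mirroring the error state of C."""
    pass

class Function:
    """Minimal stand-in for a pyps function object (illustrative only)."""
    def __init__(self, name, loops, flops, mem_accesses):
        self.name, self.loops = name, loops
        self.flops, self.mem_accesses = flops, mem_accesses

    def check_intensity(self, loop):          # plays the role of C(f, l)
        if self.flops < 2 * self.mem_accesses:
            raise NotIntensiveError(loop)

    def outline(self, loop):                  # plays the role of O(f, l, f')
        return Function(self.name + "_" + loop, [],
                        self.flops, self.mem_accesses)

    def gpuify(self):                         # plays the role of G(f, f')
        self.name += "_kernel"
        return self

runtime = {"runtime_helper"}                  # R: functions left untouched
program = [Function("saxpy", ["l0"], flops=100, mem_accesses=10),
           Function("copy", ["l0"], flops=10, mem_accesses=10),
           Function("runtime_helper", ["l0"], flops=100, mem_accesses=1)]

kernels = []
for f in program:                             # forall f in F \ R
    if f.name in runtime:
        continue
    for l in f.loops:                         # forall l in L
        try:
            f.check_intensity(l)              # C raises if not intensive enough
            kernels.append(f.outline(l).gpuify())  # G o O
        except NotIntensiveError:
            pass                              # skip loops not worth offloading
```

Only the computationally intensive `saxpy` loop survives the chain; the exception plays exactly the role of the error state T.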
Figure 3.4 also introduces a Maker class. This component represents the build process and is involved in both the source code generation process and its final compilation into machine code. Indeed, if the compilation chain is to remain independent of the targeted language (remember everything is represented in C), a specialization step is needed. This specialization step, called post-processing, takes care of the various steps needed to switch from a C representation to the targeted dialect, say cuda or opencl. The Maker class plays this role and describes the final steps required to generate the proper external representation. Additionally, it generates a Makefile to automate the complex build process resulting from the combination of different targets in the same build.
3.1.3.2 Control Flow and Pass Management

Control flow features are profitable in many situations. The use cases presented in this section are excerpts of existing compiler instances presented in Chapter 7. The examples are written using the pyps interface described in Section 3.3, but they should nevertheless be understandable.
Sequences represent the ◦ operator. They are common to all pass managers and are used each time transformation chaining occurs.
Methods represent the mathematical function definition. They are used to structure the compiler code and to enforce reuse. For instance, the generation process for multimedia instructions is common to the sse, Advanced Vector eXtensions (avx) and neon instruction sets, with minor parameter changes (e.g. the size of the vector registers). It can be packed into a function at the pass manager level, as shown in Listing 7.1.
Conditionals are extensions of the ◦ operator. They are used when the pass scheduling depends on compiler switches or on the current compilation status. Figure 3.5 shows two situations where this can occur: the pass manager has to activate or deactivate some passes, or modify the compilation scheme.
# check if if_conversion is asked for
if params.get('if_conversion'):
    module.if_conversion()
(a) Using conditionals as switches.

# optimize a module if it does not belong to the runtime
if module.cu != "myruntime":
    module.optimize()
(b) Using conditionals to change the compilation scheme.

Figure 3.5: Usage of conditionals at the pass manager level.
Do Loops represent the map over a set; they are used to perform repetitive tasks. Such tasks are often found in a pass manager, e.g. applying a pass to each loop or function of a set, or applying a transformation iteratively with varying parameters.
Exceptions are related to the error state. They can be used by a pass to signal an unexpected situation. For instance, the inlining phase raises an exception if the function to inline has no caller, and an attempt to offload a loop nest fails if generic pointers are involved and the engine does not know how to transfer them to the hardware accelerator. Listing 3.3 shows an example of such a situation.
Transformation generators are described as class methods with parameters. To feed these parameters, and more generally to feed conditionals or loop ranges, some information concerning the code being compiled is necessary. This raises the issue of the pass manager
for kernel in gpu_kernels:
    kernel.generate_communications()
(a) Iterate over sets.

for pattern in ["add", "minus", "mul"]:
    module.pattern_recognition(pattern)
(b) Iterate over parameters.

Figure 3.6: Usage of loops at the pass manager level.
try:
    module.isolate_statement()
    # if isolate_statement succeeds,
    # go on up to kernel generation
    module.outline()
    ...
except RuntimeError as re:
    print("Unable to generate GPU code: " + str(re))
    # maybe try SSE instructions instead?

Listing 3.3: Usage of exceptions at the pass manager level.
interface granularity: should all the compiler internals be exposed to the pass manager, or should it only show a subset of relevant information? The former approach is taken by Rudy et al. [RCH + 10]: they use the lua scripting language to expose the whole ir to the pass manager, on top of which the compiler performs gpu code generation, using an iterative algorithm expressed at the script level.
The benefits of such an approach are unclear from a separation-of-concerns point of view: high-level code is mixed up with lower-level code, without a clear boundary between the two concerns. We propose an alternative approach based on the following assessment: if an access to the ir is needed, then a pass should be used, to simplify the pass manager's job; otherwise the work can be done at the pass manager level. That is, high-level transformations should be managed by a high-level language with the proper abstractions, and low-level transformations should be managed by the native language that has access to all the infrastructure capabilities. Eventually these languages could be the same, but the fact that they address different needs, basically programmability vs. performance, should lead to different engineering choices.
Table 3.1: Comparison of source-to-source compilation infrastructures. [The table compares seven compilers (pips, pocc, mercurium, cetus, rose, gecos, hmpp) by front end (C, C++, Fortran) and target (openmp, cuda, sse, vhdl, mpi); a check mark denotes a feature officially supported, ≡ a feature mentioned in a paper.]
3.2 On Source-to-Source Compilers

In the previous sections, we have separated the job of the source-to-source compiler from the job of the source-to-binary compiler. The source-to-source compiler takes care of language-independent transformations and optimizations based on a common framework, and source-to-binary compilers are used as back ends.

Definition 3.16. A source-to-source compiler is a compiler that takes source code written in a high-level language as input, and outputs code in a high-level language.
3.2.1 Exploring Source-to-Source Opportunities

Historically, many transformations have been expressed at the source code level, especially parallelizing transformations. Many successful compilers are source-to-source compilers: HPFC [Coe93], cilk [FLR98], acotes [MAB + 10]; or are based on source-to-source compiler infrastructures: pips [IJT91], suif [WFW + 94], rose [Qui00], cetus [iLJE03], mercurium [AMG + 99], or GeCoS [DMM + ]. Such infrastructures provide many code transformations relevant to our problem, such as parallelism detection algorithms, variable privatization, etc.
Table 3.1 gathers the results of our study of the number of hardware targets for seven source-to-source research or industrial compilers still in use or in development. Most of them are used to generate code for more than one target.
All the compilers considered so far take a C dialect as input language, and do not provide a binary interface. Thus, a hardware translator built on them is required to generate C code as a result of its processing. As stated above, a post-processor is needed to adjust syntactic differences between plain C and the targeted dialect. Such modifications can easily be done at the source level using common techniques. The first one is regular expressions: many language adjustments, such as adding the cuda triple-chevron kernel-launch syntax, can be performed with a relevant regular expression. However, most programming languages are not regular, so
Figure 3.7: Source-to-source cooperation with external tools. [Diagram: a source-to-source compiler and an external tool exchange textual representations (tr) back and forth.]
regular expressions are not sufficient for textual substitution. When combined with the C macro processor, which has a weaker rewriting engine but is capable of diverting function calls using macro functions, they can catch more patterns, although they still do not provide a reliable tool for all situations (e.g. no pairing of brackets). This combination has been successfully used for the three compiler implementations described in Chapter 7.
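As an illustration, a hypothetical runtime call marking a kernel launch can be rewritten into the cuda triple-chevron syntax with a single substitution. The KERNEL_LAUNCH macro name and its calling convention are invented for this sketch; the point is only that the rewrite is purely local, so a regular expression suffices:

```python
import re

# Hypothetical call emitted by the source-to-source compiler.
c_code = 'KERNEL_LAUNCH(vadd, grid, block, (a, b, n));'

# One regular expression suffices because the pattern is local:
# no nested-bracket matching is required for this particular rewrite.
cuda_code = re.sub(
    r'KERNEL_LAUNCH\((\w+),\s*(\w+),\s*(\w+),\s*\((.*)\)\);',
    r'\1<<<\2, \3>>>(\4);',
    c_code)

print(cuda_code)  # vadd<<<grid, block>>>(a, b, n);
```

A construct spanning nested brackets, on the other hand, would defeat this approach, which is exactly the limitation discussed above.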
In addition to the intuitive collaboration with source-to-binary compilers, source-to-source compilers can also collaborate with each other to achieve their goal, using source files as a common medium, at the expense of extra processing for the additional switches between Textual Representation (tr) and ir. Figure 3.7 illustrates this generic behavior. For instance, the optimization of loop nests can be delegated to the Polyhedral Compiler Collection (pocc) tool, a compiler specialized in polyhedral transformations.
More traditional advantages of source-to-source compilers include their debugging features: the ir can be dumped as a tr at any time, then compiled and executed. For the same reason, they are very pedagogical tools and make it easy to illustrate the behavior of a transformation. As claimed above, many transformations, such as loop interchange or loop unrolling, are easily described as source-to-source transformations.
Figure 3.8: Heterogeneous compilation stages. [Pipeline: .c → source-to-source compiler → .c → post-processor → .c dialect → source-to-binary compiler → machine code.]
3.2.2 Impact of Source-to-Source Compilation on the Compilation Infrastructure

There are as many C dialects as hardware devices. As a consequence, a source-to-source compiler that aims at generating code for several targets has two possibilities: either write as many pretty-printers as there are dialects, or regenerate C code and use an external tool to perform the translation. The latter approach requires a post-processing step to fill the gap between the C code augmented with runtime functions and the hardware language, as shown in Figure 3.8.
Combining the compilation stages of Figure 3.8 with the source-to-source cooperation of Figure 3.7 results in the final compilation infrastructure diagram described in Figure 3.9. The main achievement shown by this figure is that most developments can be done in a source-to-source infrastructure using a common ir. This is great progress compared to the situation with opencl described in Figure 2.10 on page 39: it favors re-usability.
Figure 3.9: Source-to-source heterogeneous compilation scheme. [Within a common source-to-source compiler infrastructure, the host code goes to the host compiler while each target-specific part goes through its own source-to-source compiler, post-processor (PP) and source-to-binary compiler, producing one object file per target.]
3.3 pyps, a High Level Pass Manager api
The model presented in Section 3.1.2 leads to the design of an api for pass managers. All the compilers for heterogeneous devices presented in this thesis are based upon this pass manager. It uses a dynamic, object-oriented scripting language, Python, for flexibility and ease of development without much performance loss, as the compiler passes are still implemented in a compiled language, C. An object-oriented language also makes it possible to represent the class hierarchy from Figure 3.4, to enforce code reuse and to facilitate compiler composition. This section describes in detail the methods exposed at the pass manager level for each of the classes identified in the previous section: program, function, loop and maker.
3.3.1 api Description

This pass manager is implemented in Python on top of the pips compiler infrastructure 5 and named pyps. It consists of fewer than 700 sloc. In addition to the language properties mentioned above, Python has the advantage of a rich set of libraries and a dynamic community. As an example of the benefits of using a mature and feature-rich language, combining pyps with the enhanced Python interpreter ipython has led to a powerful cli for pyps at almost no development cost, unlike other scripting tools implemented on top of pips such as tpips. The integration with the C language is simple enough to allow an easy binding with the pips libraries. Note however that the api design is completely independent of the underlying compiler infrastructure, which makes it a suitable candidate for other compiler infrastructures.
In this api, two main entities are used to abstract the source-to-source compiler: a workspace and a Maker. The former represents a whole program and the transformations applied to it. The latter represents the global compilation scheme: post-processing, source-to-binary compiler calls, etc.
5. An overview of the pips compiler infrastructure is given in Appendix A.
A workspace provides the following methods:
init(sources, flags): a workspace is initialized from a set of files and preprocessor flags. Once created, it has knowledge of the full program code, and access to all the relevant runtimes;
save(dir): once all transformations have been performed, this method is used to regenerate all source files, without post-processing;
build(dir, maker): uses a Maker to save the current workspace and perform the post-processing. The default Maker performs no post-processing and generates a Makefile for a traditional compilation scheme;
compile(dir, maker): gathers all source files, saves them in dir, post-processes them with maker and uses the Makefile generated by the build method to compile them;
checkpoint(): saves the current workspace state, returning an identifier;
restore(chk_id): restores the workspace back to the state it had when chk_id was obtained.
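The checkpoint/restore pair gives transactional behavior around risky transformations. The stub below imitates only the semantics; in the real pyps the snapshot covers the pips database, not a Python list, and all names here are illustrative:

```python
class Workspace:
    """Minimal stand-in for the pyps workspace state handling."""
    def __init__(self, sources):
        self.sources = sources
        self.transformations = []
        self._checkpoints = {}

    def apply(self, name):
        # Stand-in for running a real pips transformation.
        self.transformations.append(name)

    def checkpoint(self):
        # Return an identifier for the current state.
        chk_id = len(self._checkpoints)
        self._checkpoints[chk_id] = list(self.transformations)
        return chk_id

    def restore(self, chk_id):
        # Roll the state back to the recorded snapshot.
        self.transformations = list(self._checkpoints[chk_id])

w = Workspace(["main.c"])
w.apply("inline")
chk = w.checkpoint()
w.apply("gpuify")          # suppose this turned out to be invalid
w.restore(chk)             # roll back to the pre-gpuify state
print(w.transformations)   # ['inline']
```

Listing 3.4 below uses exactly this pattern to fall back from cuda to openmp code generation.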
A typical use case is to create a workspace using init, perform some transformations, save it to check the sequential result, generate a build chain using build, and run compile to check the generated executable. The purpose of this division is to provide entry points (junction points in aspect-oriented programming terminology) where compiler developers can insert their own code. Indeed, the default workspace provides no direct facilities for heterogeneous computing. Instead, a source-to-source compiler must inherit from it and implement the target-specific transformations using method overloading. More generally, given code that contains fragments to be run on m distinct targets, translator developers must compose the m corresponding workspaces together, using multiple inheritance and existing or newly created workspaces.
Let us give a practical example: pyps ships with many workspace types, including a workspace to instrument an application for benchmarking purposes, one to generate openmp code, one to generate avx code and one to generate cuda code. To build a compiler for a heterogeneous machine made of an nVidia gpu and several Intel cores with avx support (a classical hardware configuration nowadays), one can rely on existing components and implement the compilation scheme as described in Listing 3.4. In this example, a new workspace that inherits from the existing ones is created. In effect, a new compilation scheme is implemented at a high level, relying on existing ones. Code generation is automatically forwarded to the proper base class and does not need to be specified here.
A source-to-source compiler typically inherits from the workspace class. The init method can be overridden to pass extra flags to the workspace and to provide additional source files to the compiler infrastructure. The latter can be stubs for third-party libraries, declarations of functions used as parameters for a pattern-matching engine, or runtime declarations needed by the code generation process. In a similar manner, the save method can be used for header substitution, i.e. to add extra header files and #include them 6.
6. When the ir cannot represent preprocessor symbols.
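Overriding init and save composes naturally through calls to the parent class. The stub below shows the shape of such a specialization; the base class, the file names and the runtime header are all invented for the example:

```python
class Workspace:
    """Stub base workspace: init registers sources, save emits them."""
    def __init__(self, *sources):
        self.sources = list(sources)

    def save(self, dir):
        # Pretend to regenerate one source file per registered input.
        return {s: "/* code of %s */" % s for s in self.sources}

class SimdWorkspace(Workspace):
    """Adds runtime stubs at init time and a header at save time."""
    def __init__(self, *sources):
        # Provide an extra runtime stub file to the infrastructure.
        super().__init__(*sources, "simd_runtime.c")

    def save(self, dir):
        # Header substitution: prepend an #include to each generated file.
        files = super().save(dir)
        return {name: '#include "simd.h"\n' + code
                for name, code in files.items()}

w = SimdWorkspace("kernel.c")
files = w.save("out")
print(sorted(files))  # ['kernel.c', 'simd_runtime.c']
```

The multimedia instruction generator of Section 7.4 follows this very pattern, as described below.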
# load relevant packages
import pyps, sac, openmp, cuda

# assemble workspaces, order usually matters
class my_workspace(sac, openmp, cuda):
    pass

# provide a per-function compilation scheme
def my_compilation_scheme(module):
    w = module.workspace   # recover current workspace
    chk = w.checkpoint()   # save current state
    try:                   # CUDA code generation
        module.cuda()      # raises an exception in case of failure
    except RuntimeError:
        w.restore(chk)     # restore pre-CUDA state
        try:               # OpenMP and AVX code generation
            module.openmp()
            module.avx()
        except: pass

Listing 3.4: Example of workspace composition at the pass manager level using pyps.
When the translator generates constructs that cannot be represented in C, or uses a specific compilation process, a specific Maker is fed to build to perform the post-processing steps.
For instance, the multimedia instruction generator described in Section 7.4 overrides the init method to add its own runtime files to the workspace and then forwards the call to its parent. Likewise, the save method is overridden to add a generic header file at the top of each source file, a step that cannot be performed earlier as it requires an additional preprocessor run. The Maker class is extended to add special processing and, depending on the maker the build method is given as argument, it generates different code. With the default maker, the sequential version of the generated vector instructions is used. With an sse-enabled maker, the proper compiler flags and post-processing are activated.
The composition of workspaces relies on two assumptions:
1. all classes inheriting from workspace forward calls to their parent;
2. the compiler developer guarantees that the composition makes sense.
Assumption 1 makes it possible to compose workspaces and follows the idea that a workspace takes care of its target-specific processing and delegates the parts it does not know how to handle to its parent; in the end, the default workspace manages the leftovers. Assumption 2 guarantees that the composition leads to error-free code: it does not make sense to generate avx calls inside a cuda kernel, but it does make sense to do so in a loop annotated with openmp directives.
A program is no more than a set of compilation units. All methods of programs are available at the workspace level.
A compilationUnit does not provide any additional method but is used as a structuring element by some passes. Indeed, an important characteristic of heterogeneous computing is that different compilation units may have different targets, and thus use different source-to-binary compilers.
A function provides the following methods:
get_code():string: the fundamental feature of a source-to-source compiler is the capability of switching between the ir and the tr. This method builds the current tr as a string;
set_code(code): this method replaces the current code by a new version given as a string. Combined with the previous method, it makes it possible to call an external tool on the tr and use the output to build a new ir;
callers, callees:functions: make it possible to navigate the (static 7) call graph;
passXYZ(params): all compiler transformations are exposed as function methods.
Sub-classing of a function is used to provide new code transformations as a complex chaining of existing transformations.
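The get_code/set_code pair makes any text-based tool a potential compiler pass. The round trip below stands in for the external tool with a plain Python function; in practice it would be a subprocess call to, say, pocc, and the Function stub here is invented for illustration:

```python
def external_tool(tr):
    """Stand-in for an external source-to-source tool working on the TR."""
    return tr.replace("float", "double")

class Function:
    """Stub holding a textual representation, mimicking get_code/set_code."""
    def __init__(self, code):
        self._code = code

    def get_code(self):
        return self._code    # IR -> TR

    def set_code(self, code):
        self._code = code    # TR -> IR (a real reparse in pyps)

f = Function("float dot(float *a, float *b, int n);")
f.set_code(external_tool(f.get_code()))
print(f.get_code())  # double dot(double *a, double *b, int n);
```

The extra cost of this pattern is exactly the tr/ir switch overhead discussed in Section 3.2.1.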
The Loop class provides the members below, mainly used for inter-pass communication:
label: a read-only value used to uniquely identify the loop and the statement holding it;
pragma: a list of directives attached to the loop;
loops: a list of all loops directly contained in this loop.
The Maker class provides a unique method:
generate(dir, sources): post-processes the files given as sources and found in dir if needed, and generates a Makefile to compile them.
The generate method can be overridden to change the generated Makefile and to add post-processing steps. For instance, the Maker found in the openmp package adds the proper -fopenmp compilation flag to the Makefile.
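Specializing generate then amounts to ordinary subclassing. The stub below fakes the Makefile generation so the shape of such an override stands out; the class names and the Makefile content are invented:

```python
class Maker:
    """Stub of the pyps Maker: emits a Makefile for the given sources."""
    cflags = ""

    def generate(self, dir, sources):
        # Post-processing would happen here; the default maker does none.
        objs = " ".join(s.replace(".c", ".o") for s in sources)
        return "CFLAGS = %s\nall: %s\n" % (self.cflags, objs)

class OpenmpMaker(Maker):
    """Specialization adding the compiler flag that openmp code requires."""
    cflags = "-fopenmp"

makefile = OpenmpMaker().generate("out", ["kernel.c", "main.c"])
print(makefile)
```

A cuda or sse maker would extend generate the same way, adding its own post-processing step before delegating to the parent class.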
Let us illustrate the relevance of this architecture in a practical situation: sse code generation from plain C code. The technical parts are detailed in Section 7.4; only the compiler architecture is described here.
Source-to-Source Compiler: this compiler is a classical vectorizer that generates C intrinsics to represent vector operations. Intrinsics are used for both data movement and vector operations and have a sequential version written in C. An example of such generated code is given in Listing 3.5, and its sequential implementation is given in Listing 3.6.
Post-Processing: because the generated code is still C, it can be executed, although not efficiently, on a sequential processor. The only post-processing step is to add the relevant #include to all source files.
7. The call graph is static because we do not consider the case of function pointers.
void vadd_l99999(float a[4], float b[4])
{
    // PIPS: SAC generated v4sf vector(s)
    v4sf vec00, vec10;
    SIMD_LOAD_V4SF(vec00, &a[1-1]);
    SIMD_LOAD_V4SF(vec10, &b[1-1]);
l99999: ;
    SIMD_ADDPS(vec00, vec00, vec10);
    SIMD_STORE_V4SF(vec00, &a[1-1]);
}

Listing 3.5: sse C intrinsics generated for a vector addition.
void SIMD_ADDPS(float *dst, ...)
{
    int i;
    va_list ap;
    va_start(ap, dst);
    float *src0 = va_arg(ap, float *);
    float *src1 = va_arg(ap, float *);
    for (i = 0; i < 4; i++)
        dst[i] = src0[i] + src1[i];
    va_end(ap);
}

Listing 3.6: Sequential implementation of the SIMD_ADDPS intrinsic.
3.3.2 Usage Example

The five compilers presented in this document are based on the pyps api. The fact that we were able to write them is a first step toward the validation of the api.
As another example of the api validity, we have used pyps to perform fuzz testing on the pips compiler infrastructure. Fuzz testing is a software testing technique that injects random input into a piece of software to test its behavior. The technique used is described in Algorithm 1, and the equivalent pyps code is given in Listing 3.8.
Data: p ← a program
Data: g ← a function transformation generator
binary ← compile(p);
repeat
    f ← random_function(p);
    p′ ← g(f)(p);
    binary′ ← compile(p′);
until exec(binary) ≠ exec(binary′);
Algorithm 1: Fuzz testing at the pass manager level.
This simple program assumes that the input code has a reproducible output printed on standard output. We have used it in conjunction with a random C code generator by Eric Eide and John Regehr [ER08]. This generator produces random C programs with deep call graphs that print a hash value representing their execution on standard output. We tested 10 transformations with this fuzzer. For each of them, an erroneous instance was found. 8
3.4 Related Work

Compilers have been built for many years. The first complete Fortran compiler was released by an IBM team in 1956 [Bac57]. The first beta release of gcc by Richard M. Stallman dates back to the 22nd of March, 1987. Compilers are known to be complex pieces of software that evolve slowly, while hardware keeps evolving at a steady rate. We have run David A. Wheeler's sloccount [Whe01] on two leading open source compilation projects, gcc and llvm, and reproduce its output in Table 3.2. It shows that a compiler project involves a lot of development skills: many languages are used, and the total number of sloc gets over 2 · 10^6 for gcc and 5 · 10^5 for llvm. These projects present several difficulties for newcomers: low-level languages, a large code base, and the diversity of the languages used.
To tackle this problem, several approaches have been proposed by the research community. M. Zenger and M. Odersky proposed in [ZO01] a compiler framework to quickly
8. As a courtesy to the pips development team, I only tested transformations I contributed to. And I fixed most of the bugs found.
import pyps
import random, sys

while True:  # loop as long as no error is found
    # instantiate a compiler from the source in first argument
    w = pyps.workspace(sys.argv[1])
    # compile it using the default source-to-binary compiler
    b = w.compile()
    # keep output as a reference
    (r_ref, o_ref, e_ref) = w.run(b)
    # pick a random function in the input code
    f = random.choice(w.all_functions)
    # select the transformation given in second argument
    # and apply it
    getattr(f, sys.argv[2])()
    # compile the transformed code
    b = w.compile()
    # get its output
    (r, o, e) = w.run(b)
    # close the compiler instance
    w.close()
    # check output versus reference and eventually raise an error
    if r != r_ref or o != o_ref or e != e_ref:
        sys.exit(1)

Listing 3.8: Fuzz testing with pyps.
3.4. RELATED WORK 65
ansic 2100307 (48.66%)
java 681858 (15.80%)
ada 680664 (15.77%)
cpp 594473 (13.77%)
f90 79927 (1.85%)
sh 47006 (1.09%)
asm 44318 (1.03%)
xml 29271 (0.68%)
exp 18422 (0.43%)
objc 15086 (0.35%)
fortran 9849 (0.23%)
perl 4462 (0.10%)
ml 2814 (0.07%)
pascal 2176 (0.05%)
awk 1706 (0.04%)
python 1486 (0.03%)
yacc 977 (0.02%)
cs 879 (0.02%)
tcl 392 (0.01%)
lex 192 (0.00%)
haskell 109 (0.00%)
(a) gcc sloccount report.
cpp 453835 (88.82%)
ansic 16764 (3.28%)
asm 13711 (2.68%)
sh 12828 (2.51%)
python 4322 (0.85%)
ml 4274 (0.84%)
perl 2093 (0.41%)
pascal 1489 (0.29%)
exp 431 (0.08%)
objc 334 (0.07%)
xml 283 (0.06%)
ada 235 (0.05%)
lisp 187 (0.04%)
csh 117 (0.02%)
f90 36 (0.01%)
(b) llvm sloccount report.
Table 3.2: sloccount reports for the gcc and llvm compilers.
experiment with new language features. They focus on two aspects: the extension of the internal
representation and the composition of compiler components. The former is achieved
through the use of extensible algebraic types, to extend simultaneously the Abstract Syntax
Tree (ast) and the existing phases, and the latter relies on an original design pattern
called Context Component, to provide an extensible and hierarchical component system.
In a keynote talk [Coo04], K. D. Cooper emphasizes the limitations of existing compiler
architectures for iterative compilation at the pass level and favors the use of complex patterns,
beyond list scheduling, supported by a flexible architecture to explore several phase
orderings.
Given the difficulty of mastering existing compiler implementations, several recent papers
have focused on the interaction between enlightened compiler developers and compiler
infrastructures. Lattner and Adve introduced in [LA03] a modular pass manager for
llvm. Later on, gcc overcame its own limitations thanks to a plug-in mechanism described
in the "Plugins" chapter of [Wik09]. Likewise, the extensible micro-architectural optimizer
described in [HRTV11] relies on the flexibility of its pass manager to load additional
passes at run time using a plug-in mechanism.
Rudy et al. presented in [RCH+10] an interactive pass manager based on the lua
language [Ier06]. The pass manager is used to scan various parameters of polyhedral
transformations, such as the loop unrolling rate or the blocking size, and to selectively apply
transformations such as loop interchange. Each resulting code is turned into a problem-specific
cuda kernel, and the most efficient one is selected, together with the associated set of transformations.
The advantage of the approach is that the compiler configuration for the specific
target is stored as the pass-manager script itself, which can be reused without re-evaluation
for further re-compilation.
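The parameter sweep just described can be sketched in a few lines of Python. The `run_time` callback and the candidate unroll and blocking values below are hypothetical stand-ins for the actual measurement harness, not part of [RCH+10]:

```python
import itertools

# Sketch of an autotuning sweep: enumerate candidate (unroll, block)
# pairs and keep the pair with the lowest measured execution time.
# run_time is an assumed callback that builds, runs and times one
# variant of the kernel for the given parameters.
def autotune(run_time, unrolls=(1, 2, 4, 8), blocks=(16, 32, 64)):
    return min(itertools.product(unrolls, blocks),
               key=lambda p: run_time(*p))
```

The selected pair, like the lua script in [RCH+10], records the compiler configuration and can be replayed on later recompilations without re-running the sweep.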
In [Yi11], separation between the analyses and the transformations is enforced at the
pass-manager level: a compiler is used to generate a valid sequence of passes, as a script
written in a dedicated language; this sequence is then executed by the pass manager. This
approach tries to decouple analyses and transformations, but it cannot verify the validity
of the generated script (otherwise it would require the generator to effectively execute
the transformations) and does not favor reuse of the compiler infrastructure, as the pass
manager executes passes written in another compiler.
Our approach basically turns a compiler into a program transformation system, which
is an active research area. FermaT [War99] focuses on program refinement, and composition
is limited to sequences. cil [NMRW02] only provides a flag-based composition
system, that is, the activation or deactivation of passes in a predefined sequence, without taking
into account any feedback from the processing. The stratego [OOVV05] software
transformation framework does not separate concepts as clearly as we do, but it uses the
concept of strategies to describe the chaining of transformations using dedicated operators,
an approach similar to ours (see Section 3.1.2). However, the effective implementation relies
on a new language, while we choose to map the concepts to existing constructs. Work
on optimix by Aßmann [Aß96] proposes considering asts as graphs and passes as graph
transformations, and using a graph rewriting system to specify transformations. All the
transformations and their composition are done at the ast level.
In parallel to the pass-management topic, a growing number of studies have been conducted
over the past few years to tackle heterogeneous platforms. In [ABCR10], a scheme
to couple Just In Time (jit) compilation for multiple targets is described. They designed
a new Object Oriented (oo) language named lime which has the property of being convertible
into two representations: one targets regular Central Processing Units (cpus) and
one targets Field Programmable Gate Arrays (fpgas) through a complex compilation flow
that involves the verilog language. An originality of the approach is that they provide
a runtime that plays a bridging role between the two representations, allowing a so-called
"mixed mode execution" where the best representation is selected at runtime. In fact, the
jit approach is orthogonal to our concern. It tackles a slightly different problem: code
portability. Once a code has been compiled for a target, is it possible to retarget it for
another hardware, without generating another binary? In the static compilation
model, it is not possible 9 to address new hardware, that is, hardware that is not known at
compilation time, while, theoretically, an update of the jit compiler does.
9. pgi's compiler works around this limitation by bundling several binaries, one per target, in the same
executable. It does provide a kind of retargetability, but it does not address the issue of supporting new
architectures.
Ocelot [DKYC10] is a similar project that translates Parallel Thread eXecution (ptx)
code into x86 emulation code, amd gpu code or llvm code for parallel execution on multicores,
thanks to a jit compilation infrastructure. Choosing ptx as a front-end language
provides the advantage of having a direct description of the parallelism, but it does not
address legacy code.
An obvious approach to tackle heterogeneity while taking advantage of existing transformations
is to extend traditional compilers to support heterogeneous targets. For instance,
gomet [BL10] is a gcc-based compiler that proposes the following compilation
flow: first, take advantage of gcc to parse the input code, generate the gimple representation
and apply high-level passes such as Static Single Assignment (ssa) construction or polyhedral
loop transformations; then build an interprocedural hierarchical dependence graph that
is combined with a high-level target description to generate source files to be compiled for
the target. The target architecture description takes into account three characteristics: a
cost model to decide offloading profitability, the parallelism description for simd and/or
mimd code generation, and the current load for runtime scheduling.
This approach shares several aspects with our methodology, but code generation is
performed under the assumption that the target hardware compiler accepts plain C code as
input, which is indeed the case for openmp and the cell engine, but not for gpus or fpga-based
processors. This explains the lack of an Instruction Set Architecture (isa) description
in the architecture model. Moreover, the architecture description consists of a set of C
functions that implement the correct behavior, which lacks abstraction and flexibility.
The compilation process is hard-coded for each target and not retargetable.
3.5 Conclusion
In this chapter, we have presented the impact of heterogeneous computing on compiler
infrastructures. We have pointed out that the use of the C language with the proper
conventions is a good choice of ir. We have described a compilation infrastructure that
takes advantage of this choice, combined with source-to-source capabilities, to enforce code
reuse and make it easier to interact with third-party tools. Based on a new model for
the combination of code transformations, we have specified an api for pass managers,
called pyps, and combined it with a generic programming language to end up with a
flexible way to design compilers for heterogeneous devices. The pyps api and its Python
implementation are publicly available as part of the pips project.
This api is used to combine the code transformations needed to match the hardware
constraints identified in Chapter 2; these transformations are presented in the next three chapters.
Chapter 4
Representing the Instruction Set
Architecture in C
Pont de Sainte Catherine, Plounevezel, Finistère © Jean-Claude Even
In a keynote given at the Fusion Developers Summit 2011, in Bellevue, Washington,
Phil Rogers announced that
The Fusion System Architecture (fsa) is Instruction Set Architecture (isa)
agnostic for both Central Processing Units (cpus) and Graphical Processing
Units (gpus). This is very important because we're inviting partners to join
us in all areas; other hardware companies to implement fsa and join in the
platform. . .
Under the hood, the Fusion System Architecture (fsa) relies on a virtual isa. In Chapter
3, we state that keeping the Internal Representation (ir) independent from the targeted
hardware is a requirement to reach a good level of abstraction. But is it possible to represent
all the refinements of a targeted isa in an ir that stays close to the C language [ISO99]?
Quoting Brian Kernighan [Ker03],
C is perhaps the best balance of expressiveness and efficiency that has ever
been seen in programming languages. (. . . ) It was so close to the machine
that you could see what the code would be (and it wasn't hard to write a good
compiler), but it still was safely above the instruction level and a good enough
match to all machines that one didn't think about specific tricks for specific
machines.
This reminds us that the C language was designed to be "close-to-the-metal". So even if
the chosen ir remains as close as possible to the C language, it can be sufficiently low level
to express some of the specificities of the targeted isa. This is examined in Section 4.1. We
then go through all the aspects of an isa and show that, provided some conventions and
minor transformations, it is possible to adapt a C code to meet isa constraints. Section 4.2
examines native data types; Section 4.3 reviews the use of specific registers; Section 4.4 details
the link between intrinsics and instructions; and Section 4.5 goes through the differences
related to the memory architecture. Issues related to function boundaries are examined in
Section 4.6, and external libraries are examined in Section 4.7.
4.1 C as a Common Denominator
The C language is the de facto standard to program low-level devices. In this section,
we study the relationships between the standard language and the dialects used to program
hardware accelerators, and argue that the C language can be used to embody some aspects of
the isa.
4.1.1 C Dialects and Heterogeneous Computing
A problem raised by heterogeneous architectures from a compiler point of view is the
choice of a suitable ir. Representing all hardware specificities in a unique language is
not a feasible task. Although there has been recent work [PBdD11] to express hardware
constraints at the ir level, representing all of them in a common ir is not easy. Some
compilers have however chosen to extend their ir to represent target-specific features: Low
Level Virtual Machine (llvm) integrates vector types into its basic types. Alternatively,
a target-specific language can be used as a basis for other architectures: fcuda [PGS+09]
translates Compute Unified Device Architecture (cuda) kernels into Field Programmable
Gate Array (fpga) circuits and swan [HF11] translates cuda codes into Open Computing
Language (opencl) ones.
An opposite approach is to use the versatility of the C language to represent both high-level
concepts and low-level concepts, using an ir that matches the initial language without
target-specific extensions. In essence, this is similar to the concept of language virtualization.
The definition of language virtualization given by Hassan Chafi et al. [CDM+10] is
the following:
A programming language is virtualizable with respect to a class of embedded
languages if and only if it can provide an environment to these embedded
languages that makes the embedded implementations essentially identical to
corresponding stand-alone language implementations in terms of expressiveness,
performance and safety—with only modestly more effort than implementing the
simplest possible complete embeddings.
We propose a relaxed definition for a programming language that can be derived from
another programming language through minor rewriting while maintaining equivalent
semantics.
Definition 4.1. A programming language L0 targeting a hardware H0 is embodied by
another programming language L1 targeting a hardware H1 if there exists a code transformation
τ : L0 → L1 such that the execution on H0 of a program P written in L0 yields the
same operational result as the execution of τ(P) on H1.
This definition is deliberately imprecise about "yields the same operational
result", because changes in execution order or in type sizes introduce output changes that may
not be significant to the end user.
Table 4.1 lists a number of languages used to program hardware accelerators and their
target platforms. It clearly shows that C dialects are often chosen as the interface between
the programmer and the hardware.

C dialect           parent language   target
Handel-C [LWFK02]   C                 fpga
Mitrion-C           C++               fpga
c2h                 C                 fpga
cuda                C++               gpu
opencl              C99               gpu & manycore
Table 4.1: C dialects and targeted hardware.
4.1.2 From the ISA to C
In [JRR99], Jones et al. presented a language called C-- that aims to be
The interface between high-level compilers and retargetable, optimizing
code generators.
This language restricts the possibilities of the C language to make it easier to manipulate.
It is quite low level and does not match the requirements of a high-level language to abstract
concepts, but it paves the way for the idea of using a subset of C as an ir.
C extensions such as Embedded-C [ISO08] have also been proposed to handle some
specificities of heterogeneous computing, like dedicated registers, fixed-point arithmetic
and, most importantly for us, the availability of multiple address spaces. In the ISO/IEC
TR 18037:2008 specification, a global address space is assumed; additional ones can
be declared, in a nested fashion, and used as qualifiers. However, operations on disjoint
address spaces are not allowed, thus data communication must be handled through
overlapping address spaces or with intrinsic functions.
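The nesting rule can be made concrete with a toy Python model. The space names and the `may_mix` predicate below are illustrative, not taken from the TR:

```python
# Toy model of Embedded-C's nested address spaces: two spaces may be
# mixed in one operation only when one transitively encloses the other.
# The space names below are assumed for illustration.
PARENT = {"global": None, "device": "global",
          "scratch": "device", "io": "global"}

def ancestors(space):
    # yield the space itself and every enclosing space up to the root
    while space is not None:
        yield space
        space = PARENT[space]

def may_mix(s0, s1):
    # operations on disjoint (non-nested) address spaces are rejected
    return s1 in ancestors(s0) or s0 in ancestors(s1)
```

Here "scratch" may be mixed with "global" (it is nested inside it), but not with the sibling space "io"; such a pair would require an explicit copy through an overlapping space or an intrinsic.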
In the context of heterogeneous computing, an ir must achieve two goals:
1. make it easy for compiler developers to write passes and analyses;
2. make it easy for compiler developers to write back-ends.

__m128 _mm_set1_ps(float);
Listing 4.1: Broadcast a single value in sse.

typedef struct {
    float data[4];
} __m128;
Listing 4.2: Vector type emulation in C.
The Hierarchical Control Flow Graph (hcfg) used in Paralléliseur Interprocedural
de Programmes Scientifiques (pips) [CJIA11] somehow matches the first point under the
constraint of source-to-source compilation, the idea being that the code hierarchy maps
the hardware hierarchy. For instance, if you target Instruction Level Parallelism (ilp),
you can focus on loops and ignore the rest of the program; for task parallelism, you can
focus on functions, etc. The second point raises the question: how to model a hardware-specific
isa at the ir level? As described in [PH09], each piece of hardware has its own
isa, and it is not sustainable to extend the ir to match each new hardware specificity, or
each time a new feature appears. This is however the path taken by some compilers such
as rose [Qui00, SQ03], where Parallel Thread eXecution (ptx) concepts are exhibited at
the ir level. A direct consequence of this approach is that all existing algorithms must be
extended to take these new constructs into account. In this section, we propose an alternate
approach that consists in modeling enough of the isa at the C level to leverage existing
analyses.
Let us take a short example from the C to Streaming simd Extension (sse)
compiler described in Chapter 7. To duplicate a single-precision floating-point value
in all 4 slots of an sse register, an intrinsic is available with the signature given in Listing 4.1.
Instead of adding vector type support to the ir, one could emulate its behavior using
an array type, encapsulated in a structure to be able to use it as a value returned from a
function, as shown in Listing 4.2, and provide a sequential implementation that performs
the same computations and has the same memory effects. That way the same result is
produced and the data dependencies are still correct. In this case, it leads to the code in
Listing 4.3.
To validate our approach, we have written a replacement of the header file xmmintrin.h
that does not make use of vector extensions. In other words, we embody the sse extension
using the C language. An excerpt of this file is shown in Appendix D. It has been
successfully tested on an sse-based implementation of the SHA-1 algorithm: the project
is compiled twice, once using the default configuration and once using the same configuration
plus an additional flag that tells the compiler to use our sequential implementation
of xmmintrin.h. The same inputs are processed, and we verify that we get the same checksum.
__m128 _mm_set1_ps(float v) {
    __m128 res = { { v, v, v, v } };
    return res;
}
Listing 4.3: Sequential implementation of _mm_set1_ps.
Additionally, we measure that the sequential implementation runs 25 times slower than
the optimized version, mainly due to repeated copies between union data types 1, plus the
absence of parallelism. 2
4.2 Native Data Types
As a result of the specialization of hardware devices, it is more common to find a device
that does not support a data type than a device that supports a data type not supported
by the C language. This section examines how to handle languages that do not support
all the type constructions allowed by the C language.
4.2.1 Scalar Types
A common case is the absence of a floating-point type. The usual fall-back is
fixed-point arithmetic. This approach is favored when a Floating-Point Unit (fpu) would
be too expensive, in area or energy consumption, or when the accuracy loss is not a problem.
Kum et al. [KKS00] propose an approach to convert C programs with floating-point types
into equivalent programs with fixed-point arithmetic.
Some hardware supports standard types with unusual sizes. For instance, the terapix
architecture [BLE+08] uses 36-bit integer registers. This kind of situation is easily
emulated using a type definition, much as the C99 standard headers define the int32_t and
int64_t types.
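As a sketch of what such an emulated type must preserve, the wrap-around behaviour of a hypothetical 36-bit signed integer can be modelled as follows (the function name is ours, not terapix's):

```python
# Model of the two's-complement semantics a 36-bit integer type
# definition must preserve when emulated on a wider host type.
BITS = 36

def trunc36(v):
    v &= (1 << BITS) - 1          # keep the low 36 bits
    if v & (1 << (BITS - 1)):     # sign bit set: value is negative
        v -= 1 << BITS
    return v
```

Every arithmetic result is passed through this truncation, so that overflow wraps exactly as it would in a genuine 36-bit register.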
4.2.2 Records
Handling the absence of structures requires more care. The obvious solution is to split
each variable that has a structure type into as many variables as the number of fields.
The first step builds the set of all final types involved in a type definition. To do so we
introduce a function final_types : T × I → P(T × I) characterized by the following
induction rules, defined on the LuC language (see Appendix B):
1. These copies could be avoided by using the C++ language, constant references and return-value
optimization, at the expense of losing C compatibility.
2. Neither gnu C Compiler (gcc), Intel C++ Compiler (icc) nor pips are capable of automatically
revectorizing the sequential code because of the additional structure copies.
final_types(int, id) = {〈int, id〉}
final_types(float, id) = {〈float, id〉}
final_types(complex, id) = {〈complex, id〉}
final_types(type[expr], id) = ⋃_{〈t,i〉 ∈ final_types(type, id)} {〈t[expr], i〉}
final_types(struct id_s { fields }, id) = ⋃_{〈t,i〉 ∈ fields} final_types(t, new_id(id, i))
where new_id is a function that constructs a new identifier unique to the program p,
using a prefix pre such that no variable declared in p is prefixed by "pre":
new_id : I × I → I
(i0, i1) ↦ prei0_i1
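A minimal Python sketch of these rules, over an assumed tuple encoding of LuC types, could read:

```python
# Assumed encoding of LuC types: ("int",), ("float",), ("complex",),
# ("array", elem_type, size) and ("struct", [(field_type, field_name)]).
def new_id(i0, i1, pre="__"):
    # fresh identifier; pre is assumed not to prefix any program variable
    return f"{pre}{i0}_{i1}"

def final_types(ty, ident):
    kind = ty[0]
    if kind in ("int", "float", "complex"):
        return {(ty, ident)}
    if kind == "array":
        _, elem, size = ty
        # an array of a split type becomes one array per final element
        return {(("array", t, size), i) for t, i in final_types(elem, ident)}
    if kind == "struct":
        _, fields = ty
        return {pair for t, i in fields
                     for pair in final_types(t, new_id(ident, i))}
    raise ValueError(f"unknown type kind: {kind}")
```

On the type of Listing 4.4, this yields the pairs 〈int, __var_f0〉, 〈float, __var_f1〉 and 〈float[3], __var_f2〉.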
Then all references are rewritten using the rename : R → R function defined by the
induction rules:
rename(id) = id if id is a scalar
rename(ref[expr]) = rename(ref)[renamee(expr)]
rename(ref.id) = new_id(rename(ref), id)
where renamee : E → E is defined by:
renamee(cst) = cst
renamee(ref) = rename(ref)
renamee(expr0 op expr1) = renamee(expr0) op renamee(expr1)
This transformation is illustrated by Listing 4.4, under the assumption pre = __.
typedef struct { int f0; float f1; float f2[3]; } my;
my var;
var.f0 = 2;
var.f2[var.f0] = 4.2;
⇓
int __var_f0; float __var_f1;
float __var_f2[3];
__var_f0 = 2;
__var_f2[__var_f0] = 4.2;
Listing 4.4: Example of structure removal.
This transformation is problematic in three cases:
– when a variable of structure type is passed as a function parameter;
– when the address of a variable involved in a structure is taken;
– when the sizeof operator is used.
These situations cannot be handled by our toy language but arise in practice. In the
first case, it is still possible to extend the function definition to take one parameter per
final_types(t) element, as illustrated by Listing 4.5. In the second case we cannot do much,
because the memory layout is completely changed by the transformation and we cannot
assume any previous pointer arithmetic is still correct. The third case is easy to handle,
by replacing the sizeof expression with the actual type size. 3
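For the third case, the folding of sizeof to a constant can be sketched as follows; the per-type sizes are assumed for illustration and padding is ignored:

```python
# Sketch of folding sizeof once the target is fixed. Uses the same
# assumed tuple encoding of types as above; the base sizes are
# illustrative, not those of any particular target.
SIZEOF = {"int": 4, "float": 4, "complex": 16}

def type_size(ty):
    kind = ty[0]
    if kind in SIZEOF:
        return SIZEOF[kind]
    if kind == "array":
        _, elem, size = ty
        return size * type_size(elem)
    if kind == "struct":
        _, fields = ty
        return sum(type_size(t) for t, _ in fields)  # padding ignored
    raise ValueError(f"unknown type kind: {kind}")
```

As noted in footnote 3, at this point the code is already specialized for one target, so baking in concrete sizes is harmless.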
4.2.3 Arrays
It is also possible to handle the absence of array types under two conditions: all arrays
must have a fixed size, and access functions must use constant indices. 4 This is simply done by creating
as many variables as the array size and then using a renaming convention for all constant
array indices. The declaration transformation d : T × I → (T × I)+ is given by:
d(int, id) = {〈int, id〉}
d(float, id) = {〈float, id〉}
d(complex, id) = {〈complex, id〉}
d(type[cst], id) = ⋃_{i=1}^{cst} d(type, new_id(id, i))
3. At that point we are already at the specialization step, so portability across hardware platforms is no
longer an issue, and considerations like "the size of a type is platform dependent" are no longer relevant.
4. This situation was found in code automatically generated from the Faust programming language
[GO03].
typedef struct { double re, im; } complex;
void cmul(complex *res, const complex *c0, const complex *c1) {
    res->re = c0->re*c1->re - c0->im*c1->im;
    res->im = c0->re*c1->im + c0->im*c1->re;
}
/* ... */
complex a = {1, 0}, b = {0, 1}, c;
cmul(&c, &a, &b);
⇓
void cmul(double *res_re, double *res_im,
          const double *c0_re, const double *c0_im,
          const double *c1_re, const double *c1_im) {
    *res_re = *c0_re * *c1_re - *c0_im * *c1_im;
    *res_im = *c0_re * *c1_im + *c0_im * *c1_re;
}
/* ... */
double a_re = 1, a_im = 0, b_re = 0, b_im = 1, c_re, c_im;
cmul(&c_re, &c_im, &a_re, &a_im, &b_re, &b_im);
Listing 4.5: Structure removal in the presence of function call.
and the reference renaming r : R → R is similarly given by
r(id) = id if id is a scalar
r(ref[cst]) = new_id(r(ref), cst)
r(ref.id) = r(ref).id
which is similar to the structure case and suffers the same limitations.
A common case is the absence of array types but support for pointers. The transformation
from arrays to pointers requires two steps:
1. a linearization step, where multi-dimensional arrays are converted to uni-dimensional
ones;
2. an array-to-pointer conversion step, using the equivalence between a[i] and *(a+i).
Listing 4.6 illustrates the two steps of this transformation.
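The linearization step amounts to computing a row-major flat index from the access indices and the array dimensions; a minimal sketch:

```python
# Row-major linearization: for float a[n][3], the access a[i][j]
# becomes a[i*3 + j] on the flattened array float a[3*n].
def linearize(indices, dims):
    flat = 0
    for idx, dim in zip(indices, dims):
        flat = flat * dim + idx
    return flat
```

For the access a[2*i][1] on float a[n][3] with i = 2, this computes 4*3 + 1 = 13, matching the rewritten expression a[2*i*3+1] in Listing 4.6.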
4.3 Registers
Depending on the isa, some registers can only be used for particular operations. For
instance, the x86 architecture distinguishes between general-purpose registers and floating
// initial code
float a[n][3];
a[2*i][1]++;
⇓
// after linearization
float a[3*n];
a[2*i*3+1]++;
⇓
// after pointer conversion
float *a = alloca(sizeof(*a)*n*3);
(*(a+2*i*3+1))++;
Listing 4.6: Two-step transformation from multi-dimensional arrays to pointers.
point registers. In the ptx [NVI10], special registers are dedicated to thread identifiers
(%tid) and warp identifiers (%warpid).
According to the compilation scheme proposed in Figure 3.9, low-level transformations
are taken care of by the vendor compiler. However, there are situations where no such
compiler is available and only an assembler is provided; we found ourselves in this situation
for the terapix machine.
In that case, a specific naming scheme is used to distinguish specific registers from
others, and hardware-specific heuristics are used to distinguish general-purpose registers
from others.
Let us take the example of the terapix architecture [BLE+08], which uses three kinds of
registers with three prefixes: im for image pointers, ma for mask pointers and re for scalar
variables. For each variable, depending on its type (array, pointer or scalar), it is possible to
determine whether it needs the re prefix. Then, to distinguish between mask data, stored
in a small read-only memory, and image data, stored in a bigger read-write memory, we
use a heuristic: any array that is written at least once goes to the image memory, and
an array that is only read goes to the mask memory if its memory footprint is statically
known to be less than a given constant. Listing 4.7 illustrates this transformation for a
kernel that lightens up an image by a constant, given in a mask.
4.4 Instructions
Simple instructions usually have a C equivalent: set and move can be represented as
assignments, though more complex functions may be needed to represent loads from remote
memory, as is the case for vector registers. Likewise, reads and writes from external devices
can be abstracted as memcpy using the proper memory location abstractions, as discussed
in Section 4.5.

CHAPTER 4. REPRESENTING THE INSTRUCTION SET ARCHITECTURE IN C

void launcher_0_microcode(int I, int *in, int *out, int *cst)
{
    int j;
    int *out0, *in0;
    in0 = in;
    out0 = out;
    for (j = 0; j < I; j += 1) {
        *out0 = *in0 + *cst;
        in0 = in0 + 1;
        out0 = out0 + 1;
    }
}
⇓
void launcher_0_microcode(int *FIFO2, int *FIFO1, int *FIFO0, int N0)
{
    int *im0, *im1, *im2, *im3, *ma4;
    int re0;
    ma4 = FIFO2;
    im3 = FIFO1;
    im2 = FIFO0;
    im0 = im2;
    im1 = im3;
    for (re0 = 0; re0 < N0; re0 += 1) {
        *im1 = *im0 + *ma4;
        im0 = im0 + 1;
        im1 = im1 + 1;
    }
}
Listing 4.7: Using a naming convention to distinguish registers for the terapix architecture.
Basic arithmetic and logic operations are available in C; those that are not (e.g. Fused
Multiply-Add (fma) or the min and max operators) can be represented by an equivalent function.
Control flow instructions such as branching, looping, conditional branches, indirect
branches, jumps, etc. all have their C equivalent. If a complex control-flow construct is
missing in the targeted language, it is generally possible to lower it to the desired level.
Such transformations include the conversion of for loops to while loops, of repeat until
to while, or even of while to gotos. if then else tests can be split into two if then blocks, etc.
The maximum number of operands can be restricted by the isa. In that case, complex
C expressions can be split to take this constraint into account.
Parallel instructions, as found in Advanced Vector eXtensions (avx) instruction sets,
can be emulated by their sequential counterparts, which correspond to one of the possible
ways of scheduling the parallel operations.
4.4.1 Instruction Selection
Specialized instruction sets are used to speed up execution. Taken to the extreme, this
has led to Complex Instruction Set Computers (cisc). This specialization is often seen in
dsps: fma, saturated arithmetic, min/max, etc. While not mandatory to obtain correct code,
taking advantage of these instructions is mandatory for performance.
The process of mapping the target-independent ir to a target-specific instruction set
is called instruction selection and is one of the last steps performed in a traditional
compiler. The problem is generally solved using dynamic programming on expression
trees [AJ75] or by sub-graph partitioning [API03].
In our case, this process should be delegated to the source-to-binary compiler rather than
to the source-to-source compiler. However, it may be necessary to perform this step:
– before Single Instruction stream, Multiple Data stream (simd) instruction generation.
For instance the neon instruction set supports a vectorized fma, so the fma pattern
must be found prior to simdization;
– when the source-to-binary compiler does not perform complex instruction selection.
Basically, a few instructions have to be converted to function calls at the source level. As
the expression trees of these instructions either do not overlap (e.g. fma and maximum)
or are subsets of one another (e.g. maximum and saturated add), a greedy algorithm
performs well enough.
An instruction is described by its name, the number and type of its operands, and its
expression tree. In a very source-to-source manner, we represent an instruction by a regular
C function, using the function name as the instruction identifier, the parameters as operands
and the body as the expression tree. 5 From a list of patterns, the algorithm iteratively performs a
5. This approach is quite similar to the way C intrinsics move assembly operations to the C function
level. For instance gcc defines the intrinsic __sync_bool_compare_and_swap to issue an atomic
compare-and-swap. Compiling the call with gcc 4.6.1 on an x86 computer translates into the assembly
instruction lock cmpxchgl, which can only be represented in C through an asm(...) statement.
pattern-matching pass for each element of the sequence.
4.4.2 N-Address Code Generation
When a program contains long expressions, it is sometimes necessary to break them
into smaller ones. This transformation is called N-address code generation. It
is generally used for assembly code generation. It can however be useful at the source
level too, either because the back-end compiler takes assembly code as input, or to create
opportunities for further transformations like invariant code motion.
The problem can first be stated as follows: given a statement of the form id = expr,
where id is an identifier, and an integer n ∈ N⁺, transform it into a sequence of assignments
such that no expression on the right-hand side of an assignment involves more than
n references.
Let us define the transformation a : N⁺ × S → S:

    a(n, id = cst) = id = cst                                                   (4.1)
    a(n, id = ref) = id = ref                                                   (4.2)
    a(n, id = e0 op e1) =
        id = e0 op e1                                     if depth(e0) + depth(e1) < n
        a(n, id = e0) ; a(n, id0 = e1) ; id = id op id0   otherwise             (4.3)

where id0 is a new identifier unique to the program and depth : E → N is defined by

    depth(cst) = 1
    depth(id) = 1
    depth(id op expr) = 1 + depth(expr)
    depth(expr0 op expr1) = depth(expr0) + depth(expr1)
Let us prove that this transformation is correct, that is, given n ∈ N⁺, a memory state
σ ∈ Σ and a statement s ∈ S of the form s = (id = expr), we have:

    S(s, σ) = S(a(n, s), σ)

Proof. We proceed by induction. The input statement is unchanged by Equations (4.1)
and (4.2), so the equality holds in these two cases.
Case (4.3), when depth(expr0) + depth(expr1) ≥ n, leads to

    S(a(n, s), σ)
    = S(a(n, id = expr0) ; a(n, id0 = expr1) ; id = id op id0, σ)
    = S(id = id op id0, S(a(n, id0 = expr1), S(a(n, id = expr0), σ)))
    = S(id = id op id0, S(id0 = expr1, S(id = expr0, σ)))             by induction hypothesis
    = S(id = id op id0, S(id0 = expr1, σ[R(id, σ) → E(expr0, σ)]))    writing σ′ for this state
    = S(id = id op id0, σ′[R(id0, σ′) → E(expr1, σ′)])
    = S(id = id op id0, σ′[R(id0, σ) → E(expr1, σ)])                  writing σ″ for this state
    = σ″[R(id, σ″) → E(id op id0, σ″)]
    = σ″[R(id, σ″) → E(id, σ″) op E(id0, σ″)]
    = σ″[R(id, σ″) → σ″(R(id, σ″)) op σ″(R(id0, σ″))]

Because id0 is an identifier, ∀(σ0, σ1) ∈ Σ², R(id0, σ0) = R(id0, σ1), and similarly for id, so

    = σ″[R(id, σ) → σ″(R(id, σ)) op σ″(R(id0, σ))]
    = σ″[R(id, σ) → E(expr0, σ) op E(expr1, σ)]
    = S(id = expr0 op expr1, σ)[R(id0, σ) → E(expr1, σ)]

The update of location R(id0, σ) can be safely ignored as id0 is a new unique identifier.
In the definition of the problem, we stated that an assignment must not contain more
than n references. However, a reference can itself contain expressions. Let us define an
auxiliary function ar : N⁺ × R → (R × S) with the following syntactic rules:

    ar(n, id) = ⟨id, ;⟩
    ar(n, ref[expr]) = ⟨id0[id1], sr ; id0 = rr ; id1 = expr⟩ where ⟨rr, sr⟩ = ar(n, ref)
    ar(n, ref.id) = ⟨id0.id, sr ; id0 = rr⟩ where ⟨rr, sr⟩ = ar(n, ref)

Given n ∈ N⁺, a memory state σ ∈ Σ and a reference r ∈ R, let us prove that, for
⟨rr, sr⟩ = ar(n, r),

    R(r, σ) = R(rr, S(sr, σ))
Proof. We use an inductive reasoning over r. The property is true for the case id, as the
denoted reference is unchanged.
Consider the case r = ref[expr], that is

    ar(n, r) = ⟨rr, sr⟩ = ⟨id0[id1], s′r ; id0 = r′r ; id1 = expr⟩ where ⟨r′r, s′r⟩ = ar(n, ref)

We have

    R(rr, S(sr, σ))
    = R(rr, S(s′r ; id0 = r′r ; id1 = expr, σ))
    = R(rr, S(id1 = expr, S(id0 = r′r, S(s′r, σ))))           writing σ′ for S(s′r, σ)
    = R(rr, S(id1 = expr, σ′[R(id0, σ′) → σ′(R(r′r, σ′))]))   writing σ″ for this state
    = R(rr, σ″[R(id1, σ″) → E(expr, σ″)])                     writing σ‴ for this state
    = R(id0[id1], σ‴)
    = R(id0, σ‴)[E(id1, σ‴)]
    = R(id0, σ‴)[σ‴(R(id1, σ‴))]
    = R(id0, σ‴)[E(expr, σ″)]
    = R(id0, σ‴)[E(expr, σ)]       since id0 and id1 are fresh, expr does not refer to them
    = R(r′r, σ′)[E(expr, σ)]       since id0 holds the location R(r′r, σ′) in σ‴
    = R(r′r, S(s′r, σ))[E(expr, σ)]
    = R(ref, σ)[E(expr, σ)]        by induction hypothesis
    = R(ref[expr], σ)
Next, consider the case r = ref.id, that is

    ar(n, r) = ⟨rr, sr⟩ = ⟨id0.id, s′r ; id0 = r′r⟩ where ⟨r′r, s′r⟩ = ar(n, ref)

We have

    R(rr, S(sr, σ))
    = R(rr, S(s′r ; id0 = r′r, σ))
    = R(rr, S(id0 = r′r, S(s′r, σ)))                writing σ′ for S(s′r, σ)
    = R(rr, σ′[R(id0, σ′) → σ′(R(r′r, σ′))])
    = R(id0.id, σ′[R(id0, σ′) → σ′(R(ref, σ))])     by induction hypothesis
    = R(id0, σ′[R(id0, σ′) → σ′(R(ref, σ))]).id
    = R(ref, σ).id
    = R(ref.id, σ)
    = R(r, σ)
The application of ar generates valid input for a, so that code containing no more
than n identifiers per statement can be generated.
4.5 Memory Architecture
The logical memory model of the C language is flat. Qualifiers can be used to change
some storage properties: const, volatile, register, etc. The heap and stack concepts
are not part of the C language itself.
However, heterogeneous machines have several separate memories. It is still possible
to emulate these different memories by using a different address space for each memory,
much like user space is separated from kernel space. To distinguish between two memory
spaces, we use a naming convention on the variable type name. For instance, variables
allocated on a gpu have their type name prefixed by __gpu_. Another option is the
qualifier extension proposed for Embedded C [ISO08], but it implies an extension of the ir.
4.6 Function calls
4.6.1 Removing Function Calls
Some isas simply do not support function calls. At the source level, inlining [CH89,
JM99] is a convenient way to sidestep the issue without breaking the code structure,
as stack emulation or indirect gotos would, as long as recursive calls are not involved.
Although inlining seems a straightforward transformation, several details must be
taken care of in the context of source-to-source compilation, because the generated
code must still be correct:
1. if the inlined function contains static declarations, these declarations must be made
global;
2. if the inlined function contains references to global variables, they must be declared
with external storage at the call site; 6
3. if the inlined function contains references to enumerations or structure fields that
are not visible from the caller's compilation unit, they must be redeclared in this
compilation unit; 7
4. most naming conflicts can be dealt with using an extra block declaration, but not
when static variables are promoted to globals, nor for label names or effective
parameter names;
5. return instructions must be replaced by a goto, possibly preceded by an assignment.
A basic inlining simulates by-copy parameter passing through additional assignments
and generates a lot of gotos. Yet it is possible to use forward substitution [Muc97] to
remove the spurious assignments and goto elimination [Ero95] to restructure the control
flow.
4.6.2 Outlining
Most languages dedicated to hardware accelerators use functions to separate the host
code from the accelerator code, generally using a new qualifier (e.g. __kernel__) to identify
accelerator functions. However, the code fragment that is to be promoted as a kernel is
usually part of another statement, so it is necessary to extract a function from a code
fragment, that is, a statement. This transformation is called outlining.
4.6.2.1 Outlining Algorithm
There exist three main ways of passing parameters:
by copy: during the function call, the formal parameter is bound to a copy of the actual
parameter. It holds the same value but has a (possibly) different name and a different
location. This is the passing mode used in C.
by reference: during the function call, the formal parameter is directly replaced by the
actual parameter. It holds the same value and has the same location. It is emulated in C
by passing the address of a variable instead of the variable itself. This is the
passing mode used in Fortran.
6. This also includes function calls.
7. This also includes calls to static functions.
by constant reference: during the function call, the formal parameter is directly replaced
by the actual parameter, as in by-reference parameter passing, but it is guaranteed
that the corresponding memory locations are not written in the function body.
Passing a parameter by copy generates a copy of the full variable. A common optimization
for a variable with a non-trivial memory footprint is to pass it by reference
in order to avoid the extra copy. To make it clear that the variable is read-only,
it is common practice to pass it as a constant reference. 8
The goal when designing the outlining transformation is to pass as few parameters
as possible to the generated functions, with the most restrictive and efficient parameter
passing mode.
Let outline : S × Σ → P(R) × P(R) × P(R) be a function that maps a statement
in a memory state to a triplet containing the parameters passed by copy, the parameters
passed by reference and the parameters passed by constant reference. It is defined by:

    outline(s, σ) = ⟨ {r ∈ Ri(s, σ) − Ro(s, σ) | typeof(r) ∈ Tscalar},
                      Ro(s, σ),
                      {r ∈ Ri(s, σ) − Ro(s, σ) | typeof(r) ∉ Tscalar} ⟩

where Tscalar is the set of all scalar types.
Each reference gathered by the function is textually used as an effective parameter
and replaced in s by a new identifier of the corresponding type. Note that Ri(s, σ) and
Ro(s, σ) automatically filter out private variables and locally declared variables.
Listing 4.8 illustrates the outlining process, in which the internal loop is outlined as a
new function kernel.
The example from Listing 4.8 can be further improved using a variant of common
subexpression elimination: the variable i is only used to compute the sub-arrays in[i]
and out[i]. It is possible to detect this situation and generate code as in Listing 4.9.
The basic idea behind this transformation is to scan all statement references and
look for a constant prefix. Let us first introduce the concept of reference prefix.

Definition 4.2. Given (r, r′) ∈ R², r′ is a prefix of r, denoted r′ ≺ r, if and only if
∃n ∈ N⁺, ∃(e1, . . . , en) ∈ Eⁿ : r = r′[e1][. . . ][en].

For instance a[2*i] is a prefix of a[2*i][k+1].
Given a set of references R ∈ outline(s, σ) and a reference r ∈ R, if ∃r′ ∈ R such that
r′ ≺ r, and if ∀k ∈ 1..n, no reference x appearing in ek is written in s (x ∉ Rw(s, σ)),
then the prefix is constant with respect to statement s and a constant prefix reference is
found. r′ is then used as the effective parameter instead of r and the substitution in s is
performed accordingly.
8. This practice is extensively used in C++.
void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i < n; i++)
        ...
}
Listing 4.8: The internal loop of erode is outlined into a new function kernel.
4.6.2.2 Using Outlining to Reduce Compilation Time
Outlining can also be used to reduce compilation time. Indeed, many compiler algorithms
have polynomial if not exponential complexity. 9 In that case it is beneficial to split
a complex function into several parts that can be considered independently: if the complexity
c(s) of a statement transformation satisfies c(s0; s1) > c(s0) + c(s1),
then it is profitable to split computations.
For instance a sequence of two loops can be outlined into two distinct functions and
compiled separately. Of course this decision can kill some optimization opportunities, and
thus must be used with care. We have however found it especially useful in some situations,
like the generation of vector instructions in a function that contains two loop nests that
cannot be merged.
As many transformations focus on loop nests, we propose to isolate each loop nest in
a single function using outlining and to apply further transformations on these functions.
Note that loop fusion is tried before outlining, because it would be difficult to apply it after
outlining. This process is described in Algorithm 2.
Data: f ← a function
Result: L, a list of functions
loop_fusion(f);
k ← 0;
L ← ∅;
for s ∈ outer_loops(f) do
    outline(s, fk);
    L ← L ∪ {fk};
    k ← k + 1;
end
return L;
Algorithm 2: Compilation complexity reduction with outlining.
A variant of Algorithm 2 is used for Multimedia Instruction Set (mis) generation: after
some loop tiling to improve locality, the code that performs the computations on a tile
is outlined to a new function, because each tile can then be considered independently of the
others.
We carried out the following experiment on the pips convex array region analysis: starting
from two consecutive matrix multiplications, each loop nest is outlined to a new function
and its innermost loop is unrolled by a given rate. This results in several code versions with
an increasing number of statements. We then performed the computation of the convex
array regions on each version.
Figure 4.1 shows that the inter-procedural analysis introduces an overhead, but
when the loop body holds enough statements, it is beneficial to apply the analysis on
separate functions.
9. Some cases of loop fusion are solvable in polynomial time and others are NP-complete [Dar99].
Figure 4.1: Using outlining to reduce analysis time on an unrolled sequence of matrix
multiplications. The plot reports the analysis time in seconds against the unroll rate,
with and without outlining.
4.7 Library Calls
The C language is minimalist: little functionality is present in the language itself and many
concepts are implemented in libraries (e.g. threading support, logging, etc.).
As a consequence, the use of libraries is very common, and we cannot assume that the
input code contains no library calls.
The problem with libraries is that their source code may not be available on the targeted
platform, or a similar library exists but with a slightly different Application
Programming Interface (api). This raises two issues:
1. How do we perform an inter-procedural analysis on external calls?
2. How can accelerator code call an external library?
To handle this problem, we introduce the concept of a stub broker. A stub broker is a
runtime library that interacts with the compiler to manage a collection of functions,
representing each function as a 3-tuple ⟨stub, seq, {⟨archi, impl⟩}⟩, where stub is a C stub of
the function that has the same memory effects and is analyzable by a compiler, and seq is a
sequential version of the function that has the same semantics as the original function. 10 A
couple ⟨archi, impl⟩ stores the implementation of the function for a particular architecture.
The compiler infrastructure performs requests to the function broker, asking for a function
10. It may be similar to stub but can also forward the call to external libraries, something that a stub
cannot do because it must be analyzable by the compiler infrastructure, and thus self-contained.
and either a stub, a sequential version or a particular architecture version. The broker
answers with the appropriate code, or with an error to notify the infrastructure that the
request cannot be fulfilled.
During parsing and analysis of its input, the compiler infrastructure asks for stubs.
During translation from the ir to the Textual Representation (tr), it asks for a sequential
version, and during specialization, i.e. during the post-processing step described in
Figure 3.9, the target-specific implementation is used.
A particular case of external calls is I/O. Depending on the hardware, I/O may be
limited to data transfers through a Direct Memory Access (dma) call, or it can be done
through different devices (screen/printer/socket/. . . ). As the input code is not hardware-specific,
it cannot be aware of these facilities. However, the stub broker abstraction can
benefit from them to provide working implementations of external calls using
hardware-specific calls.
4.8 Conclusion
In this chapter, we have enumerated the different aspects of an isa and we have shown
that most characteristics can be represented at the C level using proper conventions.
To this end, we have listed basic transformations that can be incrementally used to lower
C code down to assembly-like code while remaining compatible with a C compiler:
conversion from arrays to pointers, structure removal, constant array scalarization,
n-address code generation, inlining, outlining, instruction selection, etc. have been detailed
as source-to-source transformations. This set of fine-grain transformations enforces reuse
and adaptability to the target.
This approach enforces the principle of "C as an Internal Representation" and is the
key to using a source-to-source compiler as a bridge between regular C code and the
C dialects used to program many hardware accelerators.
Experimental results are given in Chapter 7, Sections 7.2, 7.3 and 7.4. The validity
of the proposed transformations is illustrated in Chapter 7. The next chapter discusses the
impact of parallelism constraints on compilers for heterogeneous platforms.
Chapter 5
Parallelism with Multimedia Instructions
A recurring feature of hardware accelerators is their use of parallelism to provide
speedup. This parallelism can take various forms and levels, generally a mixture of Single
Instruction stream, Multiple Data stream (simd) and Multiple Instruction stream, Multiple
Data stream (mimd) parallelism, as found in General Purpose gpus (gpgpu). Both
kinds of parallelism have been studied for a long time, with a focus on loop parallelization:
hyperplane loop transformation [Lam74], handling of control dependence [AKPW83], loop
vectorization [AK87], parallelism extraction [WL91b], supernode partitioning [IT88],
communication optimizations [DUSsH93], interaction with caching [KK92] and tiling [DSV96,
AR97, YRR+10]. David F. Bacon et al. wrote an interesting survey [BGS94] on compiler
transformations for High Performance Computing (hpc) that includes many loop
transformations. Vivek Sarkar studied the automatic selection of transformations based on a
cost model [Sar97]. These techniques have been applied successfully both in research compilers
(SUIF [WFW+94], Polaris [PEH+93], Paralléliseur Interprocédural de Programmes
Scientifiques (pips) [IJT91, AAC+11], Rose [Qui00], Pocc [PBB10]) and in production
compilers (IBM XL [Sar97], Low Level Virtual Machine (llvm) [GZA+11], gnu C Compiler
(gcc) [TCE+10], Intel C++ Compiler (icc) [DKK+99], pgi [Wol10]), where they are used to
detect or extract parallelism and overcome some parallelization issues.
In this chapter, we focus on two aspects of code parallelization: Instruction Level
Parallelism (ilp) and reduction parallelization. The former takes advantage of intra-loop or
intra-sequence parallelization opportunities using the Multimedia Instruction Sets (miss)
available in most modern processors, and is detailed in Section 5.1. The latter is a critical
concern when a code involving a reduction must be mapped on pure simd hardware,
and is addressed in Section 5.2. Section 5.3 proposes a simple model based on the
parallelism found in remote accelerators to decide whether or not it is profitable to offload a
computation.
5.1 Super-word Level Parallelization
Many processors now have a small vector unit, used by the main processor as a small
accelerator to speed up regular computations, typically multimedia applications. Modern
Central Processing Units (cpus) have 128-bit (e.g. arm Cortex-A), 256-bit (e.g. Intel
Sandy Bridge) or even 512-bit (e.g. Intel Larrabee) vector units. To automatically take
advantage of this extra computing power, there are two approaches: vector parallelism, at
the loop level, and Super-word Level Parallelism, at the block level. This section presents
an algorithm that combines both approaches at the source level while maintaining
retargetability: the proposed algorithm is parametrized by the targeted mis.
5.1.1 Related Work
Several approaches are able to take advantage of miss such as those found in Intel,
amd and arm processors. Writing inline assembly code remains the best option for those
who seek speedup, but prohibitive development costs, difficulty of maintenance and
limited portability all restrict this approach to critical code segments only. For instance, the
source code of the open-source project mplayer contains many multimedia kernels that
use manually tuned assembly code, such as the excerpt in Listing 5.1.
Figure 5.1 illustrates several abstractions that can improve code portability beyond
plain assembly:
intrinsics: C functions that map directly to a sequence of one or more assembly
instructions. This option remains a low-level one, but it is portable across compilers;
vector types: syntactic sugar has been added to gcc with predefined vector types, but it
is not portable to other compilers. Moreover it only deals with arithmetic operators,
so it exposes a limited set of operations. The ArBB library [NSL+11] uses a
similar approach based on C++ templates and operator overloading;
auto-vectorization: for simple cases, an alternative is to let the compiler automatically
vectorize the sequential version. It is the only approach that does not change the
development cost, but it offers few guarantees of performance.
5.1. SUPER-WORD LEVEL PARALLELIZATION 93<br />
__asm__ volatile (
    "movd            %4, %%xmm5 \n"
    "pxor        %%xmm7, %%xmm7 \n"
    "pshuflw $0, %%xmm5, %%xmm5 \n"
    "movdqa          %6, %%xmm6 \n"
    "punpcklqdq  %%xmm5, %%xmm5 \n"
    "movdqa          %5, %%xmm4 \n"
    "1: \n"
    "movq       (%2,%0), %%xmm0 \n"
    "movq       (%3,%0), %%xmm1 \n"
    "punpcklbw   %%xmm7, %%xmm0 \n"

Listing 5.1: Excerpt from the libmpcodecs/vf_gradfun.c file from the mplayer source tree.
Nonetheless, most developers of non-time-critical code rely on automatic vectorization. In that field, proprietary compilers such as icc still outperform open-source ones like gcc or llvm on their processors. This is checked using the linpack [DLP03] benchmark to compare the llvm, gcc and icc vectorization engines. Figure 5.2 shows the score of icc, gcc and llvm on a desktop station running a 2.6.38-2-686 GNU/Linux kernel on an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz. icc version 12.0.3 is run using the -O3 flag, llvm version 2.7 is run using the -O3 -march=native -ffast-math flags, and gcc version 4.6.1 is run using the -O3 -march=native -ffast-math flags.

Of course gcc and, to a lesser extent, llvm support a wider variety of targets than icc, including arm processors. On the other hand, icc achieves better performance. Nonetheless, from the application developer's point of view, auto-vectorization is the way to go, provided the generated code is efficiently vectorized.
In fact, from a compiler developer's point of view, the following points are strong constraints:
– instruction sets are in constant evolution: Figure 5.3 summarizes their evolution over the past ten years and shows, for example, a steady evolution from Matrix Math eXtension (mmx) to Advanced Vector eXtensions (avx) in x86 processors;
– debugging generated code (or intermediate code) is difficult;
– integrating new code transformations is a long-term task;
– dealing with code written in a legacy instruction set is difficult.
This means that in addition to the constraints of auto-vectorization and efficiency of generated code, compilers must also be retargetable to keep up with the hardware design pace.
Several solutions have been proposed to tackle the challenge of efficient simd code generation: the detailed view of icc internals given in [Bik04] shows that performance is reached through the intensive use of loop vectorization techniques [BGS94] and that it relies on re-rolling techniques to vectorize already manually unrolled loops. For obvious economic reasons, it does not focus on retargetability issues, whereas llvm [LA04, CSY10] and gcc [GS04, RNZ07] do. The latter two both provide auto-vectorizers, at an early stage
94 CHAPTER 5. PARALLELISM WITH MULTIMEDIA INSTRUCTIONS<br />
movaps -24(%ebp), %xmm0
movaps -40(%ebp), %xmm1
addps   %xmm1, %xmm0
movaps  %xmm0, -56(%ebp)

#include <xmmintrin.h>
foo() {
    __m128 v0, v1, v2;
    v0 = _mm_add_ps(v1, v2);
}

#include <xmmintrin.h>
foo() {
    __v4sf v0, v1, v2;
    v0 = v1 + v2;
}

foo() {
    float v0[4], v1[4], v2[4];
    for (int i = 0; i < 4; i++)
        v0[i] = v1[i] + v2[i];
}
Figure 5.2: Comparison of llvm, gcc and icc vectorizers using linpack.
Figure 5.3: Multimedia Instruction Set history for x86 processors.
for llvm [CSY10], more advanced for gcc, with two approaches. One [RNZ07] is based on the combination of loop vectorization techniques and Super-word Level Parallelism (slp) [LA00, SCH03, SHC05]; the other relies on the strength of the polyhedral model [TNC+09, GZA+11].

Indeed, the historic approach to simd instruction generation is inherited from loop vectorization techniques, where loop nests are first optimized (e.g. using loop fusion, interchange, skewing, distribution. . . ) then strip-mined by the register width. Larsen et al. introduced in [LA00] the slp algorithm, which uses a pattern matching algorithm to find vector instructions in code sequences, thus capturing potentially more parallelism than the previous method. However, it cannot optimize loop nests with the same efficiency as polyhedron-based approaches. A good illustration of this assessment is the introduction of the unroll-and-jam transformation in the context of slp: it is a particular case of loop tiling combined with loop unrolling [WL91a].
Some papers [ZC98, BGGT02, JMH+05] focus on the discovery of instruction-set-specific patterns, e.g. Fused Multiply-Add (fma) or horizontal add. Finding such patterns provides significant improvement for some kernels, depending on the target architecture.

The growing number of mis and the steady pace at which they evolve have led to several attempts to build a retargetable vectorizer: swarp [PBSB04] used annotated C code to describe simd instruction patterns and combined this representation with a generic pattern matching engine. This approach offers a very flexible way to describe mis, but as register widths grow (as one can expect, considering the forthcoming Larrabee, which contains a 512-bit vector processing unit), pattern matching gets slower and less practical. Moreover, according to its author, the pattern description proved "too hard to maintain". The approach of [HEL+09] achieves retargetability through a detailed description of each instruction at the source level. Pre-processing phases are in charge of applying unrolling and scalar expansion before its vectorization engine runs.
Recently, joint work by Nuzman et al. [NRD+11] has proposed an interesting new approach to the problem. The vectorization engine generates calls to an abstract mis parametrized by the vector length. A Just In Time (jit) compiler is in charge of generating target-dependent code at execution time. Likewise, optimizations related to data alignment can be deferred. That way, there is no need to recompile an application to benefit from the vector instruction unit. But because vectorization is performed in a target-independent way at compile time, this approach misses some optimization opportunities: intra-loop vectorization or slp, cycle shrinking [BGS94], etc. It is also difficult to perform a vectorization profitability estimation without information about the target.

In spite of the many efforts of the research community, production compilers either target a single architecture (relatively) efficiently, e.g. icc, or multiple architectures inefficiently, e.g. gcc.
typedef float v4sf[4];

Listing 5.2: Sample representation of a vector register using a C type.
5.1.2 A Meta-Multimedia Instruction Set

To achieve retargetability, decoupling the code transformations from the targets is necessary: we propose a generic and parametric mis as a single target for the whole transformation process. This kind of meta-mis has already been proposed in the past [Roj04, Sch09]. Our instruction set is parametrized by the vector size but, unlike their approach, this size is set at compile time and not at execution time, which provides more vectorization opportunities. Another difference with existing approaches is that all instructions and vector types are described in C, following the principle given in Section 4.1, so that the compiler infrastructure can analyse them and perform transformations that are correct with respect to the sequential implementation.

The supported simd types are vectors of scalars: single/double precision floating point, and 8- to 128-bit integers. The length of these vectors is a parameter. They are represented internally as plain arrays of the corresponding types and sizes. A typedef is used to wrap these vector types to ease code generation. Listing 5.2 illustrates this approach for a vector of 4 single precision floats; the typedef naming convention is gcc's. Complex numbers are treated as arrays of two elements to be compatible with the vector representation.
The vector operations supported by the meta-mis are taken from the union of the avx and neon mis, restricted to pure simd instructions:
– common mathematical operators: addition, subtraction, multiplication and division;
– trigonometric functions: sine, cosine;
– comparisons: equal, greater/lesser than;
– the multiply-add operation, implemented in neon and proposed in avx, but not yet implemented in the Sandy Bridge architecture;
– logical operations;
– data movement: packed and unpacked loads and stores;
– memory reorganization operations: shuffle and broadcast.
Non-simd instructions such as haddps from Streaming simd Extension (sse) 4.2, a horizontal add on single precision floats, are more difficult to deal with because the pattern matching algorithm has a non-linear complexity. 1
The meta-mis is parametric in the vector length: at code generation time and depending on the targeted mis, all operations are generated for the given vector length. Figure 5.4 shows a sample usage of this mis for a vector size of 128 bits.

An n-instance of the meta-mis, denoted n-mis, is the set of all types and operations for a vector of n bits. For instance, Figure 5.4 is an example of the 128-mis. Given an n-mis, it is possible to generate a set of patterns that fully describes the mis in a form easier to

1. The authors of the paper on swarp gave up on the generic approach when avx was released because the pattern matching algorithm did not scale well for 256-bit vector registers.
v4sf vec0, vec1, vec2;
/*...*/
for (i0 = 0; i0
void SIMD_MULADD_PS(float w[4], float x[4], float y[4], float z[4]) {
    for (int i = 0; i < 4; i++)
        w[i] = x[i]*y[i] + z[i];
}
for (i0 = 0; i0
a[0] = a[0] + b[0] * c[0]; (s0)
a[1] = a[1] + b[1] * c[1]; (s1)
a[1] = a[1] + b[1] * c[2]; (s2)

Listing 5.7: C excerpt to illustrate statement closeness.
5.1.4 Generation of Optimized simd Instructions

5.1.4.1 Statement Closeness

An important aspect of the generation of efficient simd code is the usage of packed data. To create such packs, we introduce the notion of closeness between two statements that match the same pattern. It represents the likelihood of forming a perfectly packed operation from those statements. Intuitively, two statements are close if they share the same pattern and each array reference involved in one statement is close to an array reference in the other statement. For instance, in Listing 5.7, statement s1 is closer to s0 than s2 is, because it has more references close to those of s0.
Let $s_0 = \langle p, r^0_0, \ldots, r^0_{n-1} \rangle$ and $s_1 = \langle p, r^1_0, \ldots, r^1_{n-1} \rangle$ be two statements that match the same pattern $p$. The statement closeness $c(s_0, s_1)$ is given by

$$c(s_0, s_1) = \sum_{k=0}^{n-1} \bar{d}(r^0_k, r^1_k)^2$$

where

$$\bar{d}(r^0_i, r^1_i) = \begin{cases} c_{\max} & : r^0_i = r^1_i \\ d(r^0_i, r^1_i) & : \text{otherwise} \end{cases}$$

and $c_{\max} \in \mathbb{N}$ is chosen so that:

$$\forall r^0_i, r^1_j,\; d(r^0_i, r^1_j) \neq \infty \Rightarrow d(r^0_i, r^1_j) < c_{\max}$$

Given a statement $s_o$, a set of statements $\{s_0, \ldots, s_n\}$ that share the same pattern as $s_o$ can be ordered using the following comparison function:

$$\mathrm{cmp}(s_i, s_j) = \begin{cases} -1 & : c(s_o, s_i) < c(s_o, s_j) \\ 0 & : c(s_o, s_i) = c(s_o, s_j) \\ 1 & : c(s_o, s_i) > c(s_o, s_j) \end{cases}$$

which ensures that the statements with the most and closest memory references to $s_o$ are ranked first. This method is used in the "select_closest" function below.
5.1.4.2 Parametric Vector Instruction Generation Algorithm

The algorithm presented in this section generates optimized vector instructions given a register width w, a basic block denoted b and a set of patterns denoted patterns. It is inspired by the preliminary work of François Ferrand [Fra03]. The originality of this algorithm, with respect to the original version of Samuel Larsen and Saman P. Amarasinghe [LA00], lies in the generation of load, store and shuffle operations. It is presented in Algorithm 3 and makes extensive use of the "statement closeness" between two statements that share the same pattern (see Section 5.1.4.1).

A block of statements, b, is processed statement by statement. For each statement s that matches a simd pattern from the input patterns, the statements that can be moved right after it, according to the dependence graph, are extracted by the function extract_no_conflict. Among them, those that match the same pattern are selected by the function extract_isomorphics. They are then ordered using the comparison function introduced in Section 5.1.4.1 and the first w − 1 elements are extracted to form a pack. A set of loaded vectors and stored vectors is then derived from this pack.

Let $v_i$ denote the i-th vector of the pack, that is

$$v_i = \langle r^i_0, \ldots, r^i_{w-1} \rangle$$

If the corresponding memory locations are written, then a store from the vector to the memory locations is unconditionally generated, a binding between the vector and the memory locations is added to live_registers, and all the previous vectors referencing these locations are removed from live_registers. If the memory locations are read, live_registers is scanned for an existing vector that already holds their values in the same order. If there is one, no load is generated and the previous vector is reused. If $\forall r \in v_i, r = r^i_0$, a broadcast operation is generated. Otherwise all permutations of the memory locations are checked: if a binding exists, then a shuffle operation, i.e. an operation that performs a permutation of the register content, is generated instead of a load. In all cases, the association between the vector and the memory references is stored in live_registers.
5.1.5 Pattern Discovery

The vectorization algorithm proposed in Section 5.1.4.2 is only efficient for sequences that contain enough statements to reveal patterns. This is sometimes the case for manually unrolled loops, such as the one found in the linpack benchmark. 3 To obtain more patterns, we successively apply well-known loop vectorization techniques: loop interchange to improve locality, loop tiling to favor data packing, and finally loop unrolling. Data dependences inherited from reductions are removed through expansion of the reduction variables.

5.1.6 Loop Tiling

The loop tiling strategy used for vectorization is rather simple: given a loop nest L of depth n with an innermost loop body B_L, the tiling matrix is chosen as a diagonal matrix,

3. In that case, the unroll rate, 5, is not a power of 2 and offers very poor intra-loop parallelism.
Data: w ← width of vector register
Data: patterns ← set of patterns characterizing the instruction set
Data: b ← list of statements
Result: list of potentially vectorized statements
visited ← ∅;
new_b ← ∅;
live_registers ← ∅;
while b ≠ ∅ do
    s ← head(b);
    if s ∉ visited then
        visited ← visited ∪ {s};
        if match(s, patterns) then
            nconflict ← extract_no_conflict(tail(b), s);
            iso_stats ← extract_isomorphics(nconflict, s);
            if iso_stats ≠ ∅ then
                simd_s ← select_closest(iso_stats, s, w);
                load_s ← gen_load(simd_s, live_registers);
                store_s ← gen_store(simd_s, live_registers);
                update_live_registers(simd_s, live_registers);
                new_b ← new_b ; load_s;
                new_b ← new_b ; simd_s;
                new_b ← new_b ; store_s;
                for s′ ∈ simd_s do
                    visited ← visited ∪ {s′};
                end
            else
                new_b ← new_b ; s;
                update_live_registers(s, live_registers);
            end
        else
            new_b ← new_b ; s;
            update_live_registers(s, live_registers);
        end
    end
    b ← tail(b);
end
Algorithm 3: Parametric vector instruction generation algorithm.
for (it = 0; it 4*it+3)
for (jt1 = 0; jt1 4*jt1+3)
for (i11 = 4*it; i11
Data: w ← width of vector register
Data: prog ← whole program
Result: vectorized program
for f ∈ functions(prog) do
    if_conversion(f) ([AKPW83])
    n_address_code_generation(3, f)
    for l ∈ loops(f) do
        loop_interchange(f, l) (if profitable)
        loop_tiling(f, l) (see § 5.1.6)
    end
    for l ∈ innermost_loops(f) do
        unroll(f, l, w)
    end
    reduction_parallelization(f) (see § 5.2)
    for b ∈ basic_blocks(f) do
        scalar_renaming(f, b) ([ASU86])
        slp(f, b, w) (see § 5.1.4.2)
    end
    dead_code_elimination(f)
    redundant_load_store_elimination(f) (see § 6.3)
end
Algorithm 4: Hybrid vectorization at the pass manager level.
5.2 Reduction Parallelization

A reduction is informally defined as the processing of a data structure of n elements to compute n − 1 or fewer values. For instance the computation of a histogram over a dataset of n elements split into k < n categories is a reduction. Formally, a reduction operation occurs when an associative operator, say ⊗, operates on a variable x as in x = x ⊗ expression and x is not referenced in expression.
Code with reductions is not parallel and is indeed a bottleneck for many scientific applications. For instance [GPZ+01] reported the presence of reductions in several hpc benchmarks and measured that the parallelization of those reductions led to an average speedup of ×2.7 on 16 processors. For that reason, techniques to parallelize reductions have been developed.
Parallelization of the reduction algorithms themselves has been well studied and efficient algorithms are available [KRS90, Lei92]. The challenge is first to detect the reduction, then to parallelize it depending on the hardware features. The former is a well-known subject [JD89, ZC91] and involves the detection of the reduction pattern, a check of the reduction operator properties and a check of the data dependencies with the loop body: reduction parallelization is only valid if the reduction variable is not used outside of the reduction. The latter requires more attention. A common way to parallelize reductions is to place them into a critical section, as proposed in Open Multi Processing (openmp) [Ope11] for non-atomic reductions, or to use atomic versions of the reduction operators when they are available, but this puts all the contention in a single place. To overcome this, parallel prefixes [LF80, Ble89] are generally used. They rely on the associativity of the reduction operator to perform partial reductions in parallel.

However, these generic algorithms do not take advantage of the specific hardware features that may optimize the parallel reduction. In [GPZ+01], María Jesús Garzarán proposed a hardware design that makes it possible to perform reductions efficiently thanks to the delegation of the merging phase to the hardware. Field Programmable Gate Array (fpga) designs for such algorithms exist [Zim97] and take into account the speedup/area ratio. More recently, versions of parallel prefix have been implemented for nVidia Graphical Processing Units (gpus) [SHG08], while the Brook language [BFH+04] provides built-in support for reductions on gpus.

This leads to the idea that performing a reduction efficiently on specific hardware requires taking the hardware specificities into account. However, it is difficult for a compiler to automatically generate an optimized, target-dependent reduction algorithm for non-trivial cases. It is more practical to call a generic routine or use a pre-defined stub instead. Two strategies are explored: Section 5.2.1 details a template-based approach and Section 5.2.2 details how to delegate reduction handling to a third-party function.
5.2.1 Reduction Detection Inside a Sequence

The slp algorithm presented in Section 5.1 only works on sequences. Because reductions introduce data dependencies that prevent vectorization, they need to be removed,
5.2. REDUCTION PARALLELIZATION 107<br />
something typically done in sse using a partial sum vector on for loops. We have extended this approach to process sequences, as shown in Figure 5.5.
int a, b, c, d;
int r = 0;
r += a;
r += b;
r += c;
r += d;

(a) Reduction in a sequence before parallelization.

int a, b, c, d;
// PIPS generated variable
int RED0[4];
int r = 0;
RED0[0] = 0;
RED0[1] = 0;
RED0[2] = 0;
RED0[3] = 0;
RED0[0] = RED0[0] + a;
RED0[1] = RED0[1] + b;
RED0[2] = RED0[2] + c;
RED0[3] = RED0[3] + d;
r = RED0[3] + RED0[2] + RED0[1] + RED0[0] + r;

(b) Reduction in a sequence after parallelization.

Figure 5.5: Parallelizing reductions in a sequence.
To achieve this goal, we first use the reduction analysis presented in [JD89] to perform a semantic detection of reduction statements, which associates to each statement a set of couples ⟨reduction, operator⟩. Once all statements holding a reduction are flagged, these reductions are aggregated at the sequence level to form a set of pairs {⟨⟨reduction_i, operator_i⟩, n_i⟩} where n_i is the number of times reduction_i is performed in the sequence. During the aggregation process, any reduction that is referenced by a non-reduction statement is pruned out. Then for each reduction_i, an array ared_i of n_i elements is created to hold the intermediate values. A prelude fills the array with the neutral value of the reduction operator operator_i, and a postlude performs the reduction using the same operator. They are added before and after the sequence, respectively.

If the statement block is the body of a loop and there is no data dependency between the loop index and the reduction variable, then the prelude and postlude can be moved out of the surrounding loops.
This behavior is shown in Figure 5.6, starting from an extract of the ddot_r function
for (LU_IND0 = 0; LU_IND0
__m128d xres = _mm_setzero_pd();
for (i = 0; i
void erode(int n, int m, int in[n][m], int out[n][m]) {
    for (int i = 0; i
5.4. CONCLUSION 111<br />
– an execution time on the accelerator, given as the ratio between the sequential execution time on the host processor, t_h, and the average relative speedup provided by the accelerator, a_th.

This model is based on two assumptions: the data transfer time is a linear function of the amount of data transferred, and the accelerator execution time is proportional to the host execution time. These assumptions are discussed in Section 5.3.2.

The profitability of the offloading can be expressed as Inequality (5.2), which turns Equation (5.1) into Inequality (5.3).

$$t_a(s, \sigma) < t_h(s, \sigma) \quad (5.2)$$

$$\tau_0 + \frac{V(s, \sigma)}{B} < t_h(s, \sigma) \times \frac{a_{th} - 1}{a_{th}} \quad (5.3)$$

τ_0, B, a_th and t_h(s, σ) are parameters that depend on the hardware and host target. On the other hand V(s, σ) is program-dependent and can be computed at compile time. As a consequence, the offloading decision is postponed to runtime. For instance, in the case of a matrix multiplication between two n × n matrices, t_h = O(n³), while V(s, σ) = O(n²); in the absence of constraints on n, an off-line asymptotic decision would unconditionally offload the kernel to an accelerator, whereas a high τ_0 value should prevent the offloading for small values of n.
5.3.2 Limitations of the Model

The model described in this section does not take into account several aspects of data transfers. Data transfer time cannot solely be represented by τ_0 + V(s, σ)/B. In the case of gpu boards, data alignment has a significant impact on performance, and zero-copy mechanisms can be used for data that are read/written only once, but these aspects are ignored. Asynchronous transfers are often used to overlap communications and hide data transfer cost, which makes our approach over-pessimistic. In a similar manner, a succession of kernels can result in redundant data transfers. Our method only takes local information into account.
5.4 Conclusion

In this section, we have focused on Super-word Level Parallelism and proposed an original algorithm that combines the traditional loop-based approach with the more recent sequence-based pattern matching, parametrized by the Multimedia Instruction Set description and without the need for loop re-rolling. This combination makes it possible to discover parallelism outside of loops or in manually unrolled loops, while still benefiting from the research led over the past decades in loop parallelization. Its validity is examined on several linpack kernels in Chapter 7.
In a similar manner, we have extended reduction parallelization to sequences, where it previously only held for loops, which leads to more parallelization opportunities when combined with the slp algorithm. We also propose a methodology to ignore hardware-specific mechanisms for reductions at compilation time.
In the next chapter, we examine one final category of hardware constraints: distributed memory.
Chapter 6<br />
Trans<strong>for</strong>mations <strong>for</strong> Memory Size and<br />
Distribution<br />
Pont de Pacé, Ille-et-Vilaine © Pymouss / Wikipedia
Wm. A. Wulf and Sally A. McKee concluded their article [WM95] “Hitting the Memory Wall: Implications of the Obvious”, published in 1995, with the following sentence:
The most “convenient” resolution <strong>to</strong> the problem would be the discovery of a<br />
cool, dense memory technology whose speed scales with that of processors. We<br />
are not aware of any such technology (. . . ).<br />
Fifteen years later, we are still not aware of any such technology, and memory remains a critical issue for many parallel applications. In the context of heterogeneous computing, where host and accelerator memory spaces are often separate, it is important to handle this hardware constraint with care. To this end, we introduce three generic transformations: statement isolation, which separates the accelerator memory space from the host memory space, presented in Section 6.1; memory footprint reduction, which finds tiling parameters for a loop nest so that the inner loops fit into the target memory, presented in Section 6.2; and redundant load-store elimination, presented in Section 6.3.
void foo(int i) {
  int j;
  j = i * i; //
where i′ are new identifiers unique to the program, idss is a function that collects all identifiers syntactically used by a given statement and si→i′ is a statement where the identifier i is syntactically changed into i′.

Definition 6.1. The idss : S → P(I) function is defined by the syntactic rules:

idss(;) = ∅
idss(stat0 ; stat1) = idss(stat0) ∪ idss(stat1)
idss({ type id ; stat }) = idst(type) ∪ idss(stat)
idss(ref = expr) = idsr(ref) ∪ idse(expr)
idss(ref = read) = idsr(ref) ∪ {istdin}
idss(write expr) = {istdout} ∪ idse(expr)
idss(f(ref)) = idsr(ref)
idss(if( expr ) { stat0 } else { stat1 }) = idse(expr) ∪ idss(stat0) ∪ idss(stat1)
idss(while( expr ) { stat }) = idse(expr) ∪ idss(stat)

where idst : T → P(I) is given by:

idst(int | float | complex) = ∅
idst(struct id { fields }) = ∅
idst(type [ expr ]) = idst(type) ∪ idse(expr)

where idse : E → P(I) is given by:

idse(cst) = ∅
idse(ref) = idsr(ref)
idse(expr0 op expr1) = idse(expr0) ∪ idse(expr1)

and idsr : R → P(I) is given by:

idsr(id) = {id}
idsr(ref [ expr ]) = idsr(ref) ∪ idse(expr)
idsr(ref . fieldname) = idsr(ref)

Definition 6.2. We denote si→i′ the statement where the identifier i is syntactically changed into i′, where i′ ∈ I \ idss(s) ∧ ∀σ ∈ Σ, σ(i′) = unbound. ei→i′ and ri→i′ have a similar meaning in the context of expressions and references.
A lesser version of Theorem (6.1) is given in Theorem (6.2).

Theorem 6.2. The evaluation of a statement where one identifier has been isolated yields the same memory state as the evaluation of the original statement.
Given a statement s ∈ S, a memory state σ ∈ Σ and i ∈ idss(s) s.t. ∀j ∈ idss(s), I(j) ∉ {I(istdin), I(istdout)},

S({typeof(i) i′ ; i′ = i ; si→i′ ; i = i′ ; }, σ) = S(s, σ)

Theorem (6.1) results from the iterative application of Theorem (6.2) on all variables referenced by s. The remainder of the section is dedicated to the proof of Theorem (6.2).
6.1.1.1 Expression Renaming<br />
Lemma 6.3. The evaluation of an expression e ∈ E in state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idse(e), ∀i′ ∉ idse(e) s.t. typeof(i′) = typeof(i),

E(e, σ) = E(ei→i′, σ[I(i′) → σ(I(i))])
Proof. Let us prove this lemma by induction on the syntactic elements of the expression domain. Let e be an expression, and σ a memory state. We choose i ∈ idse(e) and i′ ∉ idse(e), s.t. typeof(i) = typeof(i′).

Constants If e = cst, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(cst i→i′, σ[I(i′) → σ(I(i))])
= E(cst, σ[I(i′) → σ(I(i))])
= cst
= E(e, σ)
Identifiers If e = id, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(id i→i′, σ[I(i′) → σ(I(i))])
if i = id,

= E(i′, σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(I(i′))
= σ(I(i))
= E(e, σ)

otherwise we have id i→i′ = id and

= E(id, σ[I(i′) → σ(I(i))])
= E(id, σ)

which terminates the induction proof for the initial elements of E.

References We now consider non-initial elements. If e = ref . fieldname, we have:

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(ref i→i′ . fieldname, σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))])).fieldname
= σ(R(ref, σ)).fieldname   from induction hypothesis
= σ(R(ref, σ).fieldname)
= E(e, σ)
If e = ref [ expr ], we have:

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(ref i→i′ [ expr i→i′ ], σ[I(i′) → σ(I(i))])
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))]))[E(expr i→i′, σ[I(i′) → σ(I(i))])]
= (σ[I(i′) → σ(I(i))])(R(ref i→i′, σ[I(i′) → σ(I(i))]))[E(expr, σ)]   from induction hypothesis
= (σ(R(ref, σ)))[E(expr, σ)]   from induction hypothesis
= E(e, σ)
Arithmetic Operations In the case of e = expr0 op expr1, we have

E(ei→i′, σ[I(i′) → σ(I(i))])
= E(expr0 i→i′ op expr1 i→i′, σ[I(i′) → σ(I(i))])
= E(expr0 i→i′, σ[I(i′) → σ(I(i))]) op E(expr1 i→i′, σ[I(i′) → σ(I(i))])
= E(expr0, σ) op E(expr1, σ)   from induction hypothesis
= E(e, σ)
6.1.1.2 Type Renaming<br />
We state a similar lemma for type evaluation:

Lemma 6.4. The evaluation of a type t ∈ T in state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idst(t), ∀i′ ∉ idst(t) s.t. typeof(i) = typeof(i′),

T(t, σ) = T(ti→i′, σ[I(i′) → σ(I(i))])
Proof. We use an induction proof on the type definition.

Scalar Types The equality is direct for int, float and complex, which are independent from the memory state and unchanged by the renaming rule.

Structures In the case of t = struct id { fields }, we have

T(ti→i′, σ[I(i′) → σ(I(i))])
= ∏⟨f,i⟩∈fields T(fi→i′, σ[I(i′) → σ(I(i))])
= ∏⟨f,i⟩∈fields T(f, σ)   from induction hypothesis
= T(t, σ)

Arrays In the case of t = type [ expr ], we get

T(ti→i′, σ[I(i′) → σ(I(i))])
= T(type i→i′, σ[I(i′) → σ(I(i))]) × E(expr i→i′, σ[I(i′) → σ(I(i))])
= T(type, σ) × E(expr, σ)   from induction hypothesis and Lemma 6.3
= T(t, σ)
6.1.1.3 Statement Renaming

Lemma (6.3) and Lemma (6.4) can be extended to the statement domain:

Lemma 6.5. The evaluation of a statement s ∈ S in memory state σ ∈ Σ is not changed by the renaming of an identifier.
∀i ∈ idss(s) s.t. I(i) ∉ {I(istdin), I(istdout)}, ∀i′ ∉ idss(s) s.t. typeof(i) = typeof(i′),

S(si→i′, σ[I(i′) → σ(I(i))]) = S(s, σ)[I(i) → σ(I(i)), I(i′) → (S(s, σ))(I(i))]
We were not able to prove this lemma formally. Informally, it describes the result of the evaluation of a statement where variable i is syntactically changed into i′, in a memory state where each memory location associated to i′ holds the value of the corresponding memory location of i. It states that this new memory state is the same as the one resulting from the evaluation of the initial statement in the initial memory state, with the memory locations associated to i unchanged, and the memory locations associated to i′ holding the updated values.
6.1.1.4 Restricted Statement Isolation

We can now prove Theorem (6.2).

Proof.

S({typeof(i) i′ ; i′ = i ; si→i′ ; i = i′ ; }, σ)
= unbind(S(i′ = i ; si→i′ ; i = i′, σ′), i′)   with σ′ = loc(i′, T(typeof(i), σ), σ)
= unbind(S(i = i′, S(si→i′, S(i′ = i, σ′))), i′)
= unbind(S(i = i′, S(si→i′, σ′[I(i′) → σ′(I(i))])), i′)

We can apply Lemma 6.5 to the evaluation of si→i′ to get

= unbind(S(i = i′, S(s, σ′)[I(i) → σ′(I(i)), I(i′) → (S(s, σ′))(I(i))]), i′)
= unbind(S(s, σ′)[I(i) → σ′(I(i)), I(i′) → (S(s, σ′))(I(i))][I(i) → (S(s, σ′))(I(i))], i′)
= unbind(S(s, σ′)[I(i′) → (S(s, σ′))(I(i)), I(i) → (S(s, σ′))(I(i))], i′)
= S(s, σ)
6.1.2 Statement Isolation and Convex Array Regions<br />
The application of Theorem (6.1) leads to correct but very inefficient code. Indeed, any variable referenced in the isolated statement is transferred back and forth, even if it is only used, or only defined, by the statement. In a similar manner, arrays are transferred as a whole, while only a sub-array may be needed. This section studies the interaction between convex array regions and statement isolation, using the former to reduce the data transfers generated by the latter. 1
In a nutshell, the approach is similar to using Theorem 6.1 for all variables, then calling an enhanced version of dead code elimination to remove all the “dead” data transfers.
Given a statement s, it is possible to compute an estimate of the array regions imported or exported by this statement for each array reference r referenced by s. These regions are denoted Ri(s, σ)[r] and Ro(s, σ)[r], respectively. Depending on the accuracy of the analysis, these regions are either exact, denoted R=, or over-estimated. There is a strong relationship between these array regions and the data to be transferred. Considering a statement s ∈ S:

Transfers from the accelerator All data that may be exported by s must be copied back to the host from the accelerator:

TH←A : S × Σ → P(R) = (s, σ) ↦ Ro(s, σ)   (6.1)

Transfers to the accelerator All data that may be imported by s must be copied from the host to the accelerator:

TH→A : S × Σ → P(R) = (s, σ) ↦ Ri(s, σ)

Indeed, all data for which we have no guarantee of a preliminary write by s must be copied in. Otherwise, uninitialized data may be transferred back to the host. So the extended formula is:

TH→A : S × Σ → P(R) = (s, σ) ↦ Ri(s, σ) ∪ (Ro(s, σ) − R=o(s, σ))   (6.2)
Based on Equations (6.1) and (6.2), it is possible to allocate new variables on the accelerator, to generate copy operations from the old variables to the newly allocated ones and to perform the required change of frame on s. Listing 6.2 illustrates this transformation on the running example from Listing 5.9. It presents the variable replacement, the data allocation and the 2D data transfers. Thanks to region analysis, in0 is not copied out and out0 is not copied in. The generated data transfers are target-independent and their implementation is specialized depending on the targeted accelerator.
1. Statement isolation can also be used <strong>to</strong> generate thread local s<strong>to</strong>rage, or improve cache behavior.
void erode(int n, int m, int in[n][m], int out[n][m]) {
  int (*out0)[n][m] = 0, (*in0)[n][m+1] = 0;
  P4A_accel_malloc((void **) &in0, sizeof(int)*n*(m+1));
  P4A_accel_malloc((void **) &out0, sizeof(int)*n*m);
  P4A_copy_to_accel_2d(sizeof(int), n, m, n, m+1, 0, 0, &in[0][0], *in0);
  P4A_copy_to_accel_2d(sizeof(int), n, m, n, m, 0, 0, &out[0][0], *out0);
  for (int i = 0; i
To compute Vl we gather all the identifiers syntactically found in s and sum up their type sizes.

Vl : S → E = s ↦ Σ i∈decls(s) sizeof(i)

where decls : S → P(I) is a function similar to idss : S → P(I) that collects all identifiers declared in a statement.
This approach would be too naive for variables declared outside of s, as it includes all array elements, even those that are never written or read. Convex array regions are useful here, and we can use the formulae:
Vo : S × Σ → E = (s, σ) ↦ |Rr(s, σ) ∪ Rw(s, σ)|

where the cardinal operator counts the number of elements in the resulting set. It can be split in a per-variable form:

Vo(s, σ) = Σ i∈decls(s) |Rr(s, σ)[i] ∪ Rw(s, σ)[i]|

where the [·] operator selects all the references prefixed by the given identifier. If we compute the convex hull of the region union, it is possible to count its cardinal symbolically using Ehrhart polynomials [Cla96].

Vo(s, σ) ≤ Σ i∈decls(s) |Rr(s, σ)[i] ∪̄ Rw(s, σ)[i]|

where · ∪̄ · is the convex union.
Because arrays in C have rectangular shapes 3, it is more realistic to consider the rectangular hull of the regions, which leads to Equation 6.3.

V(s, σ) ≤ Vl(s) + Σ i∈decls(s) |⌈Rr(s, σ)[i] ∪̄ Rw(s, σ)[i]⌉|   (6.3)
where ⌈·⌉ is the rectangular hull. When s is surrounded by loops and the above expression<br />
depends on the loop indices, it is possible <strong>to</strong> trans<strong>for</strong>m these loops <strong>to</strong> change the memory<br />
footprint.<br />
6.2.2 Symbolic Rectangular Tiling

Let us consider a perfectly nested loop s of depth n. In order to work out the tiling parameters, a two-step process is used: n symbolic values denoted p1, . . . , pn are introduced to represent the computational blocks, and a symbolic tiling, parameterized by these values, is performed. It generates n outer loops and n inner loops. The statement carrying the inner loops is denoted sinner and the memory state before its execution is denoted σinner.

2. … compute their exact value. It is however possible to compute an over-estimation of this volume thanks to statement preconditions, but this dissertation does not dive into these details.
3. It is possible to allocate non-rectangular convex shapes using pointer arrays. . .
The idea is to run the inner loops on the accelerator once the pk are chosen so that the memory footprint of sinner does not exceed a threshold defined by the hardware. To this end, the memory footprint V(sinner, σinner) is computed and one of the solutions satisfying Condition (6.4) is searched for.

V(sinner, σinner) ≤ Vmaxa   (6.4)

Vmaxa is the memory size of the considered accelerator a. This gives one inequality over the pk. Other constraints are derived from the accelerator model specified in Section 2.1: e.g. a vector accelerator requires p1 to be set to the vector size. The algorithm is given in a synthetic form in Algorithm 5:
Data: ln ← a perfect loop nest of depth n
Data: Vmax a maximum memory footprint
Data: c an additional system of linear inequalities
Result: a statement that matches c and the volume constraint
l2n ← rectangular_tiling(ln, ⟨x1, . . . , xn⟩);
l′ ← inner_loop(l2n, n);
p ← memory_footprint(l′);
⟨p1, . . . , pn⟩ ← solve(c ∧ (p ≤ Vmax), ⟨x1, . . . , xn⟩);
return int x1 = p1 ; . . . ; int xn = pn ; l2n;
Algorithm 5: Memory footprint reduction algorithm.
Listing 6.3 shows the effect of symbolic tiling and the resulting array region analysis on the running example. As a result, the memory footprint of sinner is given as a function of p1, p2 in Equation (6.5).

V(sinner, σinner) = 2 × p1 × p2   (6.5)

For terapix, the constraint system is

x1 ≤ 128
2 × x2 ≤ 1024

and the tuple ⟨128, 512⟩ is the maximal solution.
6.3 Redundant Load Store Optimization
At every parallelism level, be it the node, cpu or instruction level, data transfers are often the performance bottleneck. The time spent transferring data does not contribute directly to the computation. There are two complementary approaches to limit this loss:
void erode(int n, int m, int in[n][m], int out[n][m]) {
  int p_1, p_2;
  for (int it = 0; it
We also introduce a function that checks whether two statements satisfy Bernstein's conditions [Ber66], B : S × S → {true, false}.
Characterizations of Direct Memory Access (dma) are used in the form of load and store statements.
Definition 6.3. A statement s ∈ S in memory state σ ∈ Σ is a dma statement if it verifies the following properties:
1. s is a function call;
2. s writes a single convex array region: ∃i ∈ I s.t. Rw(σ, s) = {i[φ0, . . . , φk]}

A dma statement is a function call statement that writes data to a single location. As such, the assignment operator “=” is a form of dma. loads and stores are distinguished by their name; the Internal Representation (ir) does not distinguish between host and remote memory.
Given a dma statement, we define its reciprocal as follows.

Definition 6.4. The reciprocal of a dma statement d is a statement denoted d−1 that verifies the following property:

∀σ ∈ Σ, ∀l ∈ (L \ R(Rw(σ, d), σ)), S(d ; d−1, σ)(l) = S(d, σ)(l)

For instance, the statement denoted by “memcpy(a,b,10*sizeof(int));” is a dma and its reciprocal is denoted by “memcpy(b,a,10*sizeof(int));”. The idea is that in the sequence memcpy(a,b,10*sizeof(int)); memcpy(b,a,10*sizeof(int));, the second call is useless.
6.3.1 Redundant Load Elimination

The algorithm used to move load statements upward is based on a simple idea: step by step, move load operations upward in the hcfg so that they are executed as soon as possible. Combined with the redundant store elimination transformation described in Section 6.3.2, it can lead to two optimizations:
– Move load operations outside of loops, leading to an optimization related to invariant code motion;
– Remove load and store operations when they meet.
The next sections define the legality conditions for moving a statement in the three most common control flow constructs (sequences, tests and loops) and how this can be done interprocedurally.
6.3.1.1 Sequences<br />
Let us consider a statement sequence where sl is a load statement:

s = s0 ; sl
Bernstein's conditions give us a condition under which it is valid to swap them, as shown in Equation (6.6).

Rl(s) = { sl ; s0   if Bern(s0, sl)
        { s         otherwise          (6.6)
6.3.1.2 Tests

Let us consider a branch statement:

s = if(ec) { s0 ; st } else { s1 ; sf }

Depending on the nature of s0 and s1, it may be possible and profitable to move them before the condition. If s0 and s1 are similar, there is an opportunity to merge both statements into a single one.
Let σc denote the memory state after the evaluation of ec. Both s0 and s1 are evaluated in the same memory state σc. If they are both load statements, satisfy Bernstein's conditions and are textually equal, it is possible to move them upward as a single statement, as summarized by Equation (6.7).

Rl(s) = { s0 ; if(ec) { st } else { sf }   if Bern(s0, ec) ∧ s0 =t s1
        { s                                otherwise                    (6.7)

where =t denotes textual equality. Similarly, if only s0 or s1 is a load statement, and it satisfies Bernstein's conditions, then it can be moved outside the test.
6.3.1.3 Loops

Let us consider a loop statement:

s = do { sl ; s0 } while(ec);

A sufficient condition to move sl out of the loop is that sl satisfies Bernstein's conditions with s0 and ec, and is idempotent, leading to Equation (6.8).

Rl(s) = { sl ; do { s0 } while(ec)   if Bern(s0, sl) ∧ Bern(sl, ec) ∧ S(sl ; sl) = S(sl)
        { s                          otherwise                                          (6.8)
Proof. We use a recursive proof on the number of iterations of loop s. Let s^n denote s when the loop body executes n times. The property to prove is that Equation (6.8) holds ∀s^n, n ∈ N∗.
For n = 1,

s^1 = sl ; s0 ; ec
    = Rl(s^1)
Assume the property is true for n ∈ N∗:

s^{n+1} = do { sl ; s0 } while(ec)
        = sl ; s0 ; ec ; s^n
        = sl ; s0 ; ec ; Rl(s^n)   from induction hypothesis
        = sl ; s0 ; ec ; sl ; do { s0 } while(ec)   from definition
        = sl ; sl ; s0 ; ec ; do { s0 } while(ec)   from Bern(sl, s0) ∧ Bern(sl, ec)
        = sl ; s0 ; ec ; do { s0 } while(ec)   since sl is idempotent
        = Rl(s^{n+1})
6.3.1.4 Interprocedurally<br />
As a result of moving load statements upward in the hcfg, a load can end up at the entry point of a function. In that case it may be interesting to move the load to the call sites. To do so, one must first ensure that the memory state before the call site is the same as the memory state at the function entry point. This is the case if there is no write effect on the function parameters. In that situation, the load statement can be moved before the call site after backward translation from formal parameters to effective parameters.
6.3.2 Redundant Store Elimination

This section describes the conditions for moving store statements downward in the hcfg. The equations are similar to redundant load elimination's.

6.3.3 Sequences

This problem is quite similar to its load counterpart from Section 6.3.1.1.

s = ss ; s0

Bernstein's conditions give us a condition under which it is valid to swap them, as shown in Equation (6.9).

Rs(s) = { s0 ; ss   if Bern(s0, ss)
        { s         otherwise          (6.9)

6.3.4 Tests

Let us consider a branch statement:
s = if (ec) { st ; s0 } else { sf ; s1 }

We get an equation that mirrors Equation (6.7), except for the condition over ec.

Rs(s) = { if (ec) { st } else { sf } ; s0   if s0 =t s1
        { s                                 otherwise     (6.10)

6.3.5 Loops

Let us consider a loop statement:

s = do { s0 ; ss } while (ec)

The store version is given by Equation (6.11).
Rs(s) = { do { s0 } while (ec) ; ss   if Bern(ss, s0) ∧ Bern(ss, ec) ∧ S(ss ; ss) = S(ss)
        { s                           otherwise                                            (6.11)

The proof follows the same idea as for Equation (6.8).
6.3.6 Interprocedurally

If the same store statement is found at each exit point of a function, it may be possible to move it past its call sites. To do so, one must ensure that the store statement only depends on formal parameters and that these parameters are not written by the function. If this is the case, the store statement can be removed from the function body and added after each call site, after backward parameter translation.
6.3.7 Combining Load and Store Elimination

This section examines the interaction between loads and stores in two situations: in a sequence, when a load is followed by a store, and in loops, when the loop body is surrounded by a load and a store. These two situations may be produced by the upward motion of dma statements in the hcfg.

6.3.7.1 Sequence

Let us consider a simple sequence of two statements:

s = s0 ; s1
By definition, if s0 is a dma and s1 its reciprocal, then we have:
s = s0 ; s0−1
  = s0

which eliminates the second call and may make it possible to continue the upward propagation.
6.3.7.2 Loops

Let us consider a loop statement whose body is surrounded by dma calls:

s = do { sl ; s0 ; ss } while (ec)

It can be rewritten into Equation (6.12)

R(s) = sl ; do { s0 } while (ec) ; ss   (6.12)

under the following conditions:

sl = ss−1   (6.13)
Bern(ss, s0)   (6.14)
Bern(ss, ec)   (6.15)
Proof. We use a recursive proof on the number of iterations of loop s. Let s^n denote s when the loop body executes n times.
Equation (6.12) is true when n = 1:

s^1 = sl ; s0 ; ss ; ec
    = sl ; s0 ; ec ; ss   from hypothesis 6.15
    = R(s^1)

Let us assume it is true if the loop iterates n times. In that case a loop that iterates n + 1 times can be decomposed as follows:

s^{n+1} = s^n ; sl ; s0 ; ss ; ec
        = sl ; do { s0 } while (ec) ; ss ; sl ; s0 ; ss ; ec   from the recursion hypothesis
        = sl ; do { s0 } while (ec) ; ss ; s0 ; ss ; ec   from hypothesis 6.13
        = sl ; do { s0 } while (ec) ; s0 ; ss ; ec   from hypothesis 6.14
        = sl ; do { s0 } while (ec) ; s0 ; ec ; ss   from hypothesis 6.15
        = R(s^{n+1})
6.3.8 Main Algorithm

Iteratively applying redundant load elimination, redundant store elimination and load-store combination may lead to fewer data communications. This process is detailed in Algorithm 6.

Data: p ← a program
repeat
    p′ ← p;
    p ← redundant_load_elimination(p);
    p ← redundant_store_elimination(p);
    p ← combine_load_store(p);
    p ← dead_code_elimination(p);
until p = p′;
Algorithm 6: Redundant load store elimination algorithm at the pass manager level.
Listing 6.4 illustrates the result of this algorithm on an example taken from the Paralléliseur Interprocédural de Programmes Scientifiques (pips) validation suite. It demonstrates the interprocedural elimination of data communications represented by the load and store functions. These functions are first moved outside of the loop, then outside of the function a; then redundant loads are eliminated.
6.4 Conclusion<br />
In this chapter, we have presented and proved Theorem (6.1) to completely isolate a statement from its original memory. This transformation is the basic building block for many transformations related to heterogeneous computing, as such targets usually use a separate memory space.
The generated data transfers are not optimized globally. Hence we have proposed Algorithm 6 to iteratively merge these transfers in order to suppress redundant ones. This algorithm is independent from the previous one and also works with the dma operations generated by Algorithm 3.
We have also developed Algorithm 5 to take into account the limited memory size of the targeted hardware, based on loop tiling and memory footprint estimation.
The experiments related to the usage of these transformations are presented with the compiler implementations in Chapter 7.
void a(int i, int j[2], int k[2]) {
  while (i-- >= 0) {
    load(k, j); //
Chapter 7<br />
Compiler Implementations and<br />
Experiments<br />
Pont de Bruz, Ille-et-Vilaine © Pymouss / Wikipedia
This thesis introduces and describes a methodology to customize compilers for different heterogeneous platforms, building on a rich toolbox of source-to-source transformations, a programmable pass-manager Application Programming Interface (api) and a simple hardware description. It would not be complete without an experimental validation.
The methodology claims to make it easier to assemble compilers. To validate it, we have chosen five different targets: three general purpose Central Processing Units (cpus) with different vector instruction units, a Field Programmable Gate Array (fpga)-based image processor [BLE+08] and an nVidia Graphical Processing Unit (gpu). For each of them, we have developed a compiler prototype using the techniques presented in Chapters 4, 5 and 6. The efficiency of the code generated by these research compilers is measured using benchmarks or applications from the relevant domain.
This chapter begins with a simple Open Multi Processing (openmp) directive genera<strong>to</strong>r<br />
in Section 7.1 <strong>to</strong> show how <strong>to</strong> apply the principles discussed in this thesis <strong>to</strong> a simple, yet<br />
real, example. The compiler <strong>for</strong> gpus implemented by hpc project based on our work is<br />
detailed in Section 7.2. Section 7.3 presents terapyps, a compiler from C <strong>to</strong> terasm, the<br />
assembly language <strong>for</strong> the terapix image processor. Finally, a retargetable compiler <strong>for</strong><br />
Multimedia Instruction Set (mis) is described in Section 7.4 <strong>for</strong> three targets: Streaming<br />
simd Extension (sse), Advanced Vec<strong>to</strong>r eXtensions (avx) and neon.<br />
[Feature diagram: a Multicore device with the features memory (ram, shared), isa, Acceleration and Parallelism (mimd); mimd parallelism is the only optional feature, the others are mandatory.]
Figure 7.1: Multicore hardware feature diagram.
7.1 A Simple OpenMP Compiler
The goal of this section is to illustrate the ideas developed in this thesis on a simple example: a multicore machine.
7.1.1 Architecture Description
The first step is to list the hardware constraints of the target machine. The (simple) hardware feature diagram is given in Figure 7.1. The only constraint is Multiple Instruction stream, Multiple Data stream (mimd) parallelism, and it is optional. As a consequence, the only required transformation is mimd parallelism detection/extraction. Optional features and optimizations are not taken into account.
7.1.2 Compiler Implementation
The input language is C and the output language is C with openmp directives. As directives can be represented in the Internal Representation (ir), no post-processor is needed. We thus have a very classical source-to-source compilation flow, detailed in Figure 7.2.
Algorithm 7 is used by the source-to-source compiler. It involves privatization, parallelism detection, reduction detection and directive generation. Additionally, loop fusion is used to improve locality. If parallelism detection fails, the loops are distributed using the Allen & Kennedy algorithm [AK87], and the detection is tried again.
For reference, the Pythonic PIPS (pyps) script executed by the pass manager is given in Listing 7.1.
[Diagram: Sequential Code → Translator → Sequential Code + directives → openmp Compiler → Binary.]
Figure 7.2: Source-to-source compilation scheme for openmp.
Data: s ← a statement
Result: a statement with openmp directives
s ← loop_fusion(s);
privatization(s);
reduction_detection(s);
if parallelism_detection(s) then
    s ← directive_generation(s);
else
    s ← loop_distribution(s);
    if parallelism_detection(s) then
        s ← directive_generation(s);
    end
end
return s
Algorithm 7: Parallel loop generation algorithm for openmp.
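To make the target of Algorithm 7 concrete, here is a hand-written sketch of the kind of annotated C it aims to produce on a dot-product kernel. The directive placement is an illustration of privatization and reduction detection, not verbatim PIPS output.

```c
#include <stddef.h>

/* Hypothetical output of Algorithm 7 on a dot product: the scalar
   `tmp` is privatized (declared inside the loop body) and the
   accumulation into `sum` is recognized as a reduction, so the
   loop can carry an openmp directive. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        double tmp = x[i] * y[i];  /* privatized scalar */
        sum += tmp;                /* detected reduction */
    }
    return sum;
}

/* Small sequential driver used to check the kernel. */
double dot_demo(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    return dot(x, y, 3);           /* 1*4 + 2*5 + 3*6 = 32 */
}
```

Without the `-fopenmp` flag the pragma is ignored and the code still compiles and runs sequentially, which is precisely why the directive-based output format is convenient.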
def openmp(m, verbose=False, **props):
    """ parallelize function with openmp """
    ...  # initialization stuff
    m.loop_fusion()
    # some analyses perform better after this
    m.split_initializations(**props)
    # privatize scalar variables
    m.privatize_module(**props)
    # try coarse grain openmp parallelization
    # (custom functions)
    try:
        m.coarse_grain_parallelization(**props)
    except:
        m.internalize_parallel_code(**props)
    # directive generation
    m.ompify_code(**props)
    m.omp_merge_pragma(**props)
    # eventually print the resulting code
    if verbose:
        m.display(**props)
Listing 7.1: Original PyPS script for openmp code generation.
CFLAGS += -fopenmp
LIBS += -fopenmp
## pipsrules ##
Listing 7.2: Makefile stub for openmp compilation.
       translator   post-processor   maker   #passes involved
SLOC   41           0                2       8
Table 7.1: sloccount report for an openmp directive generator prototype written in pyps.
The build process is mostly unchanged, except that an additional flag is needed to tell the compiler to interpret openmp directives. The makefile stub is given in Listing 7.2. No additional rules are provided, but the compiler and linker flags are changed.
7.1.3 Experiments & Validation
The aim of this section is not to build an efficient openmp code generator, but to provide a sample example as an introduction to the next sections. As a consequence, we do not focus on getting impressive speedups on real-world applications, but rather on giving evidence that a compiler prototype can achieve reasonable results in spite of the little amount of work dedicated to its construction.
The benchmark suite used is polybench. Although this benchmark is intended for testing polyhedral transformations, it contains numerous kernels that are easily automatically parallelized, so they do not stress our naïve implementation, while showing the relevancy of the approach. Figure 7.3 shows the speedup of the accelerated version, measured as the median over 100 executions with the default benchmark sizes.
The reference timings are obtained on a laptop running a 2.6.38-2-686 GNU/Linux kernel. It has an Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz (2 cores). The code is compiled with the gnu C Compiler (gcc) version 4.6.1 and the -O3 -ffast-math flags. The accelerated code is obtained by running Algorithm 7 on each function of the program. It is compiled with the same compiler and the same flags, plus the openmp flag.
To obtain this result, we have directly scripted the compiler with pyps. This script is a good illustration of pyps flexibility. It is reproduced in Appendix C.
For each compiler we have implemented, we issue a small report that states the number of SLOC for the source-to-source compiler, the post-processor and the maker. We also compute the number of passes and analyses involved in the whole compilation process. The result for the openmp prototype is given in Table 7.1. It shows that the prototype is simple and really capitalizes on existing transformations. The assembling is done in a very lightweight way.
[Bar chart: relative execution time of the generated OpenMP code (y-axis from 0 to 2.5) for each polybench kernel: lu, covariance, correlation, jacobi-2d-imper, fdtd-2d, jacobi-1d-imper, adi, seidel, fdtd-apml, gauss-filter, reg-detect, durbin, symm, symm.exp, gemm, mvt, bicg, trisolv, trmm, 2mm, syrk, gemver, cholesky, atax, syr2k, doitgen, gesummv, 3mm, ludcmp, dynprog, gramschmidt.]
Figure 7.3: Performance of an openmp directive generator prototype on the polybench benchmark.
7.2 A GPU Compiler
This section describes a prototype compiler for machines with nVidia gpus. It is not an optimizing compiler: it does not take advantage of some hardware capabilities, but it still generates speedups for computationally intensive applications.
7.2.1 Architecture Description
A gpu is an accelerator that does not share a memory space with its host. It couples mimd parallelism at coarse grain with Single Instruction stream, Multiple Data stream (simd) parallelism at fine grain. Two characteristics are important: a huge number of cores that provide an important theoretical speedup, and a constrained memory: a shared memory of limited size, and a low transfer rate between the gpu and the cpu compared with the transfer rate between the cpu and main memory, not to mention that coalesced accesses are critical to reach high throughput.
An nVidia gpu board is a set of multiprocessors. For instance, a GTX580 board has 16 multiprocessors containing up to 32 thread processors each, that is 512 Compute Unified Device Architecture (cuda) cores as a whole. It can run a maximum of 1536 threads per multiprocessor. The blocks of threads are scheduled transparently by the hardware, which assumes each multiprocessor is independent.
The memory hierarchy is quite complex:
– gpus have registers for fast 1-cycle thread-local access, so they are to be privileged for computations. Unfortunately, with only 32 KB of registers per multiprocessor and thousands of threads on an nVidia GTX580, this is a scarce resource;
– each thread has a local memory;
– the shared memory is local to a multiprocessor and global for each thread of the multiprocessor. It is 16 or 48 KB large but is accessed as fast as registers. It is often used as a scratchpad memory to drastically increase performance;
– the extended memory, called global memory (typically 1 to 6 GB), can be accessed by each thread but with a far bigger latency (800–1000 cycles). Accesses must be coalesced by the use of a large number of threads per block;
– the texture cache uses a small portion of the global memory. With a 50-cycle access time, it must be privileged over the global memory when possible. It is read-only, and coherence with the global memory is not ensured; moreover, it is only 8 KB per multiprocessor;
– the 64 KB constant memory.
The different memory levels and the Processing Element (pe) layout are visible on the Fermi chip shown in Figure 7.4. They are synthesized in the form of a hardware feature diagram in Figure 7.5. In this diagram, many features are flagged as optional. This allows an incremental compiler development: first, mandatory constraints are taken into account; then, optional constraints are integrated to enhance performance. In this thesis, we focus on building a prototype that works for the mandatory features. Building a compiler that takes into account all gpu features is a PhD subject on its own!
Figure 7.4: nVidia Fermi architecture.<br />
[Feature diagram: a gpu device with the features memory (rom, ram; distributed, shared), isa, Acceleration and Parallelism (simd, mimd); most features are optional, few are mandatory.]
Figure 7.5: gpu hardware feature diagram.
7.2.2 Compiler Implementation
The hardware feature diagram of a gpu has some similarities with the hardware feature diagram of terapix (see Figure 7.10): both have their own memory and benefit from simd acceleration. As a consequence, the two compilers roughly use the same algorithm to turn the input code into a host part and an accelerator part. In a similar manner, Direct Memory Access (dma) generation uses the same analyses, although the api naturally differs.
The main difference lies in the kernel Instruction Set Architecture (isa). Firstly, unlike for terapix, a compiler from a C++ dialect, cuda, to the gpu isa, Parallel Thread eXecution (ptx), already exists and takes care of low-level transformations. However, some constraints, such as the lack of support for variable-length arrays or the normalization of the iteration spaces, require additional transformations.
Secondly, cuda extends the C89 syntax with function qualifiers (e.g. __global__) and kernel calls, using the triple chevron syntax (<<< >>>). We do not extend the Paralléliseur Interprocédural de Programmes Scientifiques (pips) ir to cover these extensions but use a macro-based compatibility header.
The compilation scheme for gpu code generation is given in Figure 7.6. The new parts are the compatibility headers, the source-to-source compiler and the cuda translator. Other modules are simply reused.
Compatibility header A compatibility header is a set of macro functions that performs a translation from C syntax to cuda syntax. For instance, the return type of a kernel can be written __global__void, a name that is correct C, and that is defined as #define __global__void __global__ void in the compatibility header. 1
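The trick can be sketched as follows. The `KERNEL_CALL` macro and the sequential fallback branch are illustrative assumptions, not the actual header shipped with the prototype; only the `__global__void` definition is taken from the text. The point is that the same generated source is valid cuda under nvcc and valid plain C under a host compiler.

```c
/* Sketch of a C/cuda compatibility header (hypothetical names).
   Under nvcc (__CUDACC__ defined) the macros expand to real cuda
   syntax; under a host C compiler they vanish, so the generated
   code remains compilable and testable sequentially. */
#ifdef __CUDACC__
#define __global__void __global__ void
#define KERNEL_CALL(k, grid, block, args) k<<<grid, block>>> args
#else
#define __global__void void
#define KERNEL_CALL(k, grid, block, args) k args /* sequential fallback */
#endif

__global__void add1(int *p) { *p += 1; }

/* Host-side code written once against the macros. */
int launch(int v) {
    KERNEL_CALL(add1, 1, 1, (&v));
    return v;
}
```

Compiled as plain C, `KERNEL_CALL(add1, 1, 1, (&v))` degenerates to an ordinary call `add1(&v)`, which is exactly what makes the generated code debuggable without a gpu.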
Initial translator The initial translator takes a sequential code, detects parallel loops and splits each of them into two parts: a sequential C code that contains a call to a kernel, and a sequential code that embodies the kernel. A loop proxy code applies the kernel to each element of the loop iteration space. This loop proxy code is needed for the sequential semantics but appears neither in the final code nor in Figure 7.6. Let us take a simple example, the sum of two arrays: the two arrays are initialized, then a loop over the array elements performs the addition and stores the result in a third array. The initial translator splits this code into three parts: the host code that performs the initializations and calls a kernel; the loop proxy code that iterates over all the elements and calls the kernel code for each; the kernel code that performs the addition. Figure 7.7 illustrates this structure.
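For the array-sum example, the three-way split can be sketched in plain C as follows. The function names are illustrative; in the final cuda code the loop proxy disappears, its iteration space being carried by the thread grid.

```c
#include <stddef.h>

/* Kernel: one point of the iteration space. */
static void kernel_add(const double *in0, const double *in1,
                       double *out, size_t i) {
    out[i] = in0[i] + in1[i];
}

/* Loop proxy: applies the kernel to every point; kept only for
   the sequential semantics, absent from the generated cuda. */
static void proxy_add(const double *in0, const double *in1,
                      double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        kernel_add(in0, in1, out, i);
}

/* Host side: initializations followed by a single kernel "call";
   statement isolation later inserts the data transfers here. */
void host_add(double *in0, double *in1, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) { in0[i] = (double)i; in1[i] = 1.0; }
    proxy_add(in0, in1, out, n);
}

/* Check: after host_add, out[i] == i + 1. */
double host_demo(void) {
    double in0[4], in1[4], out[4];
    host_add(in0, in1, out, 4);
    return out[3];
}
```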
Once the code is split into host and kernel parts, statement isolation is used to generate data transfers in the host part.
cuda translator The cuda dialect is very close to the C language. The cuda translator takes care of the following syntactic changes: convert variable-length arrays to pointers
1. Sometimes, the C preprocessor is not sufficient. In that case, regular expressions or (better) C++ templates with type inference can come in handy. The reader may doubt the relevancy of using third-party tools instead of a large and monolithic ir. However, we are confident that using a wide range of specialized tools is more flexible.
[Diagram: the sequential code goes through the initial translator, producing (i) a sequential C code with a kernel call that, preprocessed with the compatibility header, becomes cuda host code compiled by the cuda compiler into the host binary, and (ii) a sequential C code embodying the kernel, which the cuda translator turns into C code plus the compatibility layer, preprocessed in turn and compiled as the cuda kernel code.]
Figure 7.6: Source-to-source compilation scheme for gpu.
double in0[n], in1[n], out[n];
for (int i = 0; i < n; i++)
    out[i] = in0[i] + in1[i];
       translator   post-processor   maker   #passes involved
SLOC   3836         0                N/A     20
Table 7.2: sloccount report for a cuda generator prototype written in pyps.
using array linearization, normalize the iteration space using loop normalization, make sure no additional iterations are performed using iteration clamping, convert C99 complex types to cuda complex types and finally take care of the cuda-specific syntax. Language differences are handled by the compatibility header.
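The combined effect of array linearization and iteration clamping can be sketched in plain C. The function below is a hand-written approximation of the translator's output on a 2D scaling kernel, with the thread grid modeled by an outer loop; it is an illustration, not actual generated code.

```c
#include <stddef.h>

/* After the cuda translator (sketched sequentially): the 2D
   variable-length array a[h][w] is linearized into a flat pointer
   (a[i][j] becomes a[i*w + j]), the loop is normalized to start at
   0 with unit stride, and a clamp guards against the extra
   iterations that a fixed-size thread grid would introduce. */
void scale_after(size_t h, size_t w, double *a, size_t nthreads) {
    for (size_t t = 0; t < nthreads; t++) {  /* stands for the thread grid */
        if (t < h * w)                       /* iteration clamping */
            a[t] = 2.0 * a[t];
    }
}

double clamp_demo(void) {
    double a[6] = {1, 2, 3, 4, 5, 6};        /* a 2x3 array, flattened */
    scale_after(2, 3, a, 8);                 /* 8 "threads" for 6 points */
    return a[5];                             /* last element, doubled */
}
```

Without the clamp, the two surplus "threads" would write out of bounds, which is exactly the situation the translator must prevent when the grid size is rounded up.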
7.2.3 Experiments & Validation
We have validated the tool on a set of image processing kernels: a convolution, with a window size of 5 × 5, and a finite impulse response filter, with a window size of n/1000. The erode used as a running example so far does not pass the computational intensity test (see Section 5.3) on the considered machine and is not included in the benchmark.
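A computational intensity test of this kind can be sketched as a simple flop-per-byte ratio check. The threshold and the cost model below are illustrative assumptions, not the actual criterion of Section 5.3.

```c
#include <stddef.h>

/* Crude offloading criterion: offload only if the kernel performs
   enough operations per byte moved over the host/gpu link to
   amortize the transfer cost. The threshold is an assumed constant,
   to be calibrated per machine. */
int worth_offloading(size_t flops, size_t bytes_transferred) {
    const double min_intensity = 8.0;  /* assumed flop/byte threshold */
    return (double)flops / (double)bytes_transferred >= min_intensity;
}
```

Under such a test, a 5 × 5 convolution (tens of operations per transferred pixel) passes, while a 3 × 3 erode (a handful of comparisons per pixel) may not, matching the observation above.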
Measurements have been made using a desktop station hosting a 64-bit Debian/testing with gcc 4.3.5 and a 2-core 2.4 GHz Intel Core2 cpu. The cuda 3.2 compiler is used and the generated code is executed on a Quadro FX 2700M card. Compilation is fully automatic. The whole run is measured, i.e. timings include gpu initialization, data transfers, kernel calls, etc. The median over 100 runs is taken. Figure 7.8 shows additional results for digital signal processing kernels extracted from [Orf95] and available on the website http://www.ece.rutgers.edu/~orfanidi/intro2sp: an N-Discrete Time Fourier Transform and a sample cross-correlation.
The sloccount report for each part of the prototype is given in Table 7.2.
7.3 An FPGA Image Processor Accelerator Compiler
Heterogeneous computing is all about balancing the hardware and the associated costs, say intellectual property rights, energy consumption, volume, throughput, maintenance or development costs. For embedded devices, the balance is all the more difficult to find as the constraints are tighter. As a consequence, the hardware is likely to be highly specialized, which often means difficult to program. The terapix platform is a good illustration of this phenomenon: it is a low-power, high-throughput device specialized for image processing, based on an fpga, and developed by thales. There are two main motivations for this machine:
1. to be able to process a stream of images directly on the camera that generates them, the so-called "intelligent camera". In the context of event recognition, if the events are scarce, it is too expensive to transfer all data to a remote processing engine. Performing the detection in place allows transferring only valuable data;
2. to be independent of a circuit provider. For long-term maintenance, it is not acceptable to depend on third-party, closed-source hardware. Choosing an fpga-based circuit unties the machine from the hardware.
[Four plots of execution time (s) versus input size, each comparing "on GPU" and "on CPU": (a) Convolution, (b) fir, (c) N-Discrete Time Fourier Transform, (d) Correlation.]
Figure 7.8: Median execution time on a gpu for dsp kernels.
[Figure omitted: block diagram of the terapix architecture.]
Figure 7.9: terapix architecture.
This section presents the compilation chain for this hardware and its results on a few benchmarks. Section 7.3.1 describes the architecture and models it as a feature diagram. Section 7.3.2 proposes a compilation flow based on this model and the transformations presented in this thesis. Section 7.3.3 validates the approach on various image processing algorithms and compares manually compiled code to automatically generated code.
7.3.1 Architecture Description
In this section, we give a quick summary of the terapix architecture and emphasize the hardware constraints. The reader interested in more details is referred to [BLE+08].
The terapix architecture is an fpga-based circuit implemented on a Virtex-4 SX-55 from Xilinx. A general-purpose softcore microprocessor, the µP, implements the control part, and a simd Processing Unit (pu) is used for the image kernels that require high processing power. This pu consists of 128 pes that run at 150 MHz. The interconnect between pes follows a ring topology, so that each pe has access to its neighbours' memory and to its local Random Access Memory (ram) of 512 × 36b used for registers. A Read Only Memory (rom) of limited size can be accessed by all pes.
The isa is dedicated to image processing: it uses a Very Long Instruction Word (vliw) instruction set that provides arithmetic operations over integers and a conditional assignment. Neither division nor floating-point operations are available. Direct and indirect addressing modes are supported, as well as a special pattern addressing mode to describe complex memory access patterns. Using a vertical pattern, a pe that accesses a[i] retrieves the a[#PE][i] element of the 2D global memory, while using a diag2 pattern, it retrieves the a[#PE][i+#PE] element. The sequencer only provides three control operations: a counter-based loop, a continue and a return.
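The pattern addressing modes can be described functionally: given a pe index, a pattern maps a local reference a[i] to a location in the 2D global memory. The following small simulation is an illustration of the text above, not terasm semantics verbatim.

```c
#include <stddef.h>

/* Simulated terapix pattern addressing for pe `p` referencing a[i].
   Returns the linearized index into the 2D global memory (row-major,
   `width` columns): vertical -> a[p][i], diag2 -> a[p][i + p]. */
typedef enum { VERTIC, DIAG2 } pattern_t;

size_t pattern_addr(pattern_t pat, size_t p, size_t i, size_t width) {
    size_t col = (pat == DIAG2) ? i + p : i;
    return p * width + col;
}
```

For instance, pe 3 referencing a[5] in a 100-column memory reads element (3, 5) under a vertical pattern and element (3, 8) under diag2.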
A vliw instruction consists of 5 fields, given in Table 7.3. The image field manipulates
32b     26b    32b        13b   5b
Image   Mask   Register   Alu   Sequence
Table 7.3: Description of a terapix microinstruction.
pointers to the global ram and to neighbouring pes, the mask field manipulates pointers to the global rom, the register field manipulates pointers to the local ram, the alu field selects arithmetic operations and operators, and the sequence field is used for the program counter. An example of vliw assembly code is given in Listing 7.3.
The set of hardware features is described in Figure 7.10, using the methodology proposed in Chapter 2.
This hardware is currently only programmed by hand: the developer writes the C code for the host side and the microcode, i.e. the assembly code for the accelerator. Three tools are provided to the developer: a compiler from the assembly code that generates a microcode image in the form of an array of bytes defined in a C header for inclusion on the host side, a cycle-accurate simulator to test the resulting code, and a code compactor to pack vliw instructions when possible.
In addition to the restricted isa, programming such a machine is difficult because of the combination of a large simd unit and a limited dma. For instance, to perform a point-to-point operation on a vector of 130 elements, one must load the first 128 elements, perform the computation and copy back the result, then load the elements from the third to the 130th, perform the computation and copy the result back, leading to 126 computations being performed twice. Figure 7.11 illustrates this troublesome behavior.
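The amount of recomputation follows directly from the tile size: when an n-element vector with pe < n ≤ 2·pe is processed in two full-width simd steps, with the second load anchored at the vector's tail, the two loads overlap on 2·pe − n elements. A one-line check of the 130-element example:

```c
#include <stddef.h>

/* Elements computed twice when an n-element point-to-point operation
   (pe < n <= 2*pe) is done in two full-width simd steps: the second
   load covers the last pe elements, so it overlaps the first load on
   2*pe - n elements. */
size_t redundant_elements(size_t n, size_t pe) {
    return 2 * pe - n;
}
```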
7.3.2 terapix Compiler Implementation
Compiling for terapix requires two steps: the input code is scanned for parallel loops and each of them is split into a host part and an accelerator part; then the accelerator part is translated into terapix assembly code.
A compilation scheme that takes into account the terapix specificities is given in Figure 7.12.
7.3.2.1 Input Code Splitting
The separation of the kernel from its caller is performed by Algorithm 8, based on the transformations presented in Sections 6.1 and 6.2.
Once a kernel has been extracted, it must be converted to meet the hardware constraints. However, no C-to-assembly compiler is available, and we are left with an assembler and a code compactor. As a consequence, we first perform as many refinements as possible at the source level, using the ideas developed in Chapter 4. Then we use an ad hoc C-to-terasm tool developed for this purpose. It generates uncompacted code and pipes it through the code compactor to generate the final assembly code.
prog convol<br />
sub convol<br />
pattern vertic || || || ||<br />
im ,i1= FIFO1 +NN ||ma ,m1= FIFO3 || || ||<br />
im ,i2= FIFO2 +NN || || || do_N1 ||<br />
im ,i1=i1+SS || || || ||<br />
im ,i2=i2+SS || || || ||<br />
im ,i3=i1+W || || || ||<br />
im ,i4=i2+W || || || do_N2 ||<br />
im ,i3=i3+E || ma=m1 ||P=im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im=im+S || ma=ma+S ||P=P || ||<br />
|| ||P=N+im*ma || ||<br />
im=im+S || ma=ma+S ||P=P || ||<br />
|| ||P=N+im*ma || ||<br />
im=im+W || ma=ma+W ||P=P+im*ma || ||<br />
im=im+W || ma=ma+W ||P=P+im*ma || ||<br />
im=im+N || ma=ma+N ||P=P || ||<br />
|| ||P=S+im*ma || ||<br />
im=im+E || ma=ma+E ||P=P+im*ma || ||<br />
im ,i4=i4+E || ||P,im=P || loop ||<br />
|| || || loop ||<br />
|| || || || return<br />
endsub<br />
endprog<br />
Listing 7.3: terapix assembly <strong>for</strong> a 3 × 3 convolution kernel.<br />
[Feature diagram: terapix with the mandatory features memory (rom, ram; distributed), isa, Acceleration and Parallelism (simd).]
Figure 7.10: terapix hardware feature diagram.
[(a) First step: no redundant computations. (b) Second step: redundant computations.]
Figure 7.11: terapix redundant computations.
[Diagram: the sequential code goes through the translator, producing a sequential C code with a kernel call, compiled by a C compiler into the host binary, and a sequential C code embodying the kernel, which the terapix post-processor turns into uncompacted assembly; the compactor then produces compacted microcode assembly, which the assembler turns into the accelerator binary.]
Figure 7.12: Source-to-source compilation scheme for terapix.
Data: s ← a statement
Data: pe ← the number of Processing Elements
Data: m ← the accelerator memory size
Result: k, a set of kernel codes
for l ∈ loops(s) do
    if depth(l) = 2 then
        declare_variable(s, size_t height);
        declare_variable(s, size_t width);
        l′ ← symbolic_tiling(l, 〈height, width〉);
        solve_linear_system(l′, pe, m);
        generate_rom(l′);
        s′ ← isolate_statement(s, l′);
        k ← k ∪ {outline(s, s′)};
    end
end
return k
Algorithm 8: terapix kernel extraction algorithm at the pass manager level.
Algorithm 9 details the steps involved in assembly code generation. It first processes each loop to normalize its iteration space and converts each do-loop into its while-loop counterpart. Declaration blocks are removed by flatten code and all array references are replaced by their pointer equivalents using array linearization. Strength reduction transforms pointers into iterators whenever possible. The granularity of the C code is then lowered by split update operator and n-address code generation.
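On a toy kernel, the effect of these source-level refinements can be sketched as the following before/after pair. This is hand-written C approximating what the passes produce, not actual terapyps output.

```c
/* Before: array reference, update operator, compound expression. */
void before(int *a, int n, int k) {
    for (int i = 0; i < n; i++)
        a[i] += 2 * k + 1;
}

/* After (approximately): linearize_array and strength_reduction
   replace the indexed access by a moving pointer; the do-loop
   becomes a while-loop; split_update_operator and
   n_address_code_generation lower each statement to at most two
   operands per operation. */
void after(int *a, int n, int k) {
    int *p = a;            /* strength-reduced iterator */
    int t0 = 2 * k;        /* two-address temporaries */
    int t1 = t0 + 1;
    int i = 0;
    while (i < n) {        /* do_loop_to_while_loop */
        int t2 = *p;       /* split update: load... */
        t2 = t2 + t1;      /* ...add... */
        *p = t2;           /* ...store */
        p = p + 1;
        i = i + 1;
    }
}

/* Check that the lowered version computes the same result. */
int lower_demo(void) {
    int a[4] = {0, 1, 2, 3}, b[4] = {0, 1, 2, 3};
    before(a, 4, 3);
    after(b, 4, 3);
    return a[2] == b[2] && a[2] == 9;  /* 2 + (2*3 + 1) = 9 */
}
```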
7.3.3 Experiments & Validation
We categorize the image operators found in terapix's application domain as either point-to-point, vertical, horizontal or stencil operators. This leaves aside operators such as histograms, which are not covered by our compilation scheme because of their more complex parallelization scheme. For each category, we choose a specific operator, namely brightness, vertical erode, horizontal convolution and convolution. A terapix expert manually wrote an optimized assembly version of these kernels, and we wrote the textbook version of these algorithms in C and piped it through our automatic compiler. Table 7.4 gives the ratio between microcode cycle counts for automatic and manual code generation. It shows that the automatically generated code's execution time is close to the manual one. The slowdown of the vertical erode is due to a naïve register allocation scheme that fails to exploit the low-latency terapix registers.
The sloccount report for each part of the prototype is given in Table 7.5. This prototype is far more complex than the previous ones, but so is the target.
Listing 7.4 illustrates the behavior of Algorithm 9 on a horizontal erosion for the host side. Listing 7.5 illustrates the accelerator side. Listing 7.6 shows the generated assembly
7.3. AN FPGA IMAGE PROCESSOR ACCELERATOR COMPILER 151<br />
Data: k ← a kernel from Algorithm 8
Data: I ← {f | f is an instruction in Terasm}
Result: k formatted as a terapix microcode
for l ∈ loops(k) do
    k ← loop_normalize(l, lower_bound=0);
    k ← do_loop_to_while_loop(l);
end
k ← flatten_code(k);
k ← linearize_array(k, pointer_conversion=True);
k ← strength_reduction(k);
k ← split_update_operator(k);
k ← n_address_code_generation(k, 2);
k ← normalize_terapix_microcode(k);
k ← dead_code_elimination(k);
for i ∈ I do
    k ← instruction_selection(k, i);
end
return k

Algorithm 9: C-to-terapix translation algorithm at the pass manager level.
                     brightness   horizontal convolution   vertical erode   convolution
automatic / manual       ×1               ×1.31                ×2.12           ×1.31

Table 7.4: Ratio between terapix microcode cycle counts for automatic and manual code generation.
          translator   post-processor   maker   #pass involved
SLOC          211            218          18          32

Table 7.5: sloccount report for a terapix assembly generator prototype written in pyps.
152 CHAPTER 7. COMPILER IMPLEMENTATIONS AND EXPERIMENTS<br />
code after compaction. The comparison with the initial listing demonstrates the need <strong>for</strong><br />
an au<strong>to</strong>matic generation <strong>to</strong>ol.<br />
7.4 A Retargetable Multimedia Instruction Set Compiler<br />
This section presents a retargetable compiler for mis built on pips. It relies on the work on Super-word Level Parallelism (slp) presented in Section 5.1 and on the communication optimization from Section 6.3. It targets three different instruction sets: sse, avx and neon.
7.4.1 Architecture Description<br />
mis rely on the vector units found in most modern processors. The hardware feature set of such processors is given in Figure 7.13. Each mis has a specific isa, but we have already shown in Section 5.1.2 how to represent them using a generic instruction set. As a consequence, an instruction set is mainly characterized by the number of bits per vector register.
7.4.2 Compiler Implementation<br />
The input language is C and the output language is C with mis intrinsics. As intrinsics<br />
are C functions, there is no need <strong>for</strong> post-processing. However, header substitution is<br />
required <strong>to</strong> specialize the generic mis. Figure 7.14 summarizes the compilation flow. The<br />
source-<strong>to</strong>-source transla<strong>to</strong>r relies on Algorithm 3 from Chapter 5 <strong>to</strong> generate vec<strong>to</strong>r code.<br />
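The specialization step can be sketched as follows (a Python model of ours, not the actual pyps headers): the translator emits code against generic names only, and a per-target "header" fixes the vector width, which is the main parameter distinguishing the instruction sets.

```python
# Generic SIMD operations, as emitted by the translator: the generated
# code only ever calls these names; a target header binds them.
def simd_load(mem, i, vlen):
    return mem[i:i + vlen]

def simd_add(u, v):
    return [a + b for a, b in zip(u, v)]

def simd_store(mem, i, vec):
    mem[i:i + len(vec)] = vec

# Vector widths in 32-bit floats: 4 for 128-bit sse, 8 for 256-bit avx.
# These dictionary entries stand in for the real headers that would map
# the generic calls onto target intrinsics.
TARGETS = {"sse": 4, "avx": 8}

def vector_add(dst, a, b, target):
    vlen = TARGETS[target]
    n = len(a) - len(a) % vlen          # vectorizable prefix
    for i in range(0, n, vlen):
        simd_store(dst, i, simd_add(simd_load(a, i, vlen),
                                    simd_load(b, i, vlen)))
    for i in range(n, len(a)):          # scalar epilogue
        dst[i] = a[i] + b[i]
```

Retargeting amounts to swapping the header, not regenerating the code.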
7.4.3 Multimedia Instruction Set on Desk<strong>to</strong>p and Embedded Processors<br />
Three sets of experiments have been carried out. They all use the same set of C source files. The sse mis was tested on a Core2 Duo running at 2.2 GHz with a 2.6.34 Linux kernel. A board with an ARMv7 processor and a 2.6.28 Linux kernel was used for the neon mis. A machine with a 2.6.32 Linux kernel and an Intel SandyBridge (running at 2.6 GHz) executed the avx tests.
Applications have been chosen to point out limitations of compilers (including ours). daxpy_u?r.c, ddot_u?r.c and dscal_u?r.c are taken from the linpack [DLP03] benchmark and illustrate the impact of manual unrolling on vectorization. matrix_*.c are taken from the Coremark [Con] benchmark and show the impact of tiling. stencil.c is a typical stencil application and a good candidate for vectorization.

Other benchmarks are textbook versions of well-known computation kernels (Finite Impulse Response filter, average power, alpha-blending, convolution with a 3 × 3 kernel) taken from a dsp manual [Orf95].
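For reference, the structure of the rolled and unrolled linpack variants looks as follows (a Python paraphrase of ours; the benchmarks themselves are C, with names following the daxpy_u?r.c convention). The unrolled body exposes the isomorphic statement sequence that the slp approach packs into vectors.

```python
def daxpy_r(n, da, dx, dy):
    """Rolled textbook daxpy: dy += da * dx."""
    for i in range(n):
        dy[i] += da * dx[i]

def daxpy_ur(n, da, dx, dy):
    """Manually 4-way unrolled variant, in the style of linpack:
    a scalar prologue handles n % 4 elements, then the unrolled loop
    processes 4 isomorphic statements per iteration."""
    m = n % 4
    for i in range(m):
        dy[i] += da * dx[i]
    for i in range(m, n, 4):
        dy[i]     += da * dx[i]
        dy[i + 1] += da * dx[i + 1]
        dy[i + 2] += da * dx[i + 2]
        dy[i + 3] += da * dx[i + 3]
```

Both variants compute the same result; they differ only in how much instruction-level parallelism is visible to a sequence-based vectorizer.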
7.4. A RETARGETABLE MULTIMEDIA INSTRUCTION SET COMPILER 153<br />
void runner(int n, int img_out[n][n-4], int img[n][n]) {
    for (int y = 0; y < n; y++)
        /* ... remainder of the listing ... */

Listing 7.4: Host-side code for a horizontal erosion (excerpt).
void launcher_0_microcode(int I_29, int img00[258], int img_out00[254]) {
    for (int x = 0; x /* ... remainder of the listing ... */

Listing 7.5: Accelerator-side kernel for a horizontal erosion (excerpt).
prog launcher_0_microcode<br />
sub launcher_0_microcode<br />
im , i12 = FIFO2 ||||P,re (0)=1 || ||<br />
im , i11 = FIFO1 |||| P=P || ||<br />
im , i10 = i11 +1* E |||| P=P || ||<br />
im ,i9=i11 +2* E |||| P=P || ||<br />
im ,i8=i11 +3* E |||| || ||<br />
im ,i7=i11 +4* E |||| || ||<br />
im ,i1=i7 |||| || ||<br />
im ,i2=i8 |||| || ||<br />
im ,i3=i9 |||| || ||<br />
im ,i4=i10 |||| || ||<br />
im ,i5=i11 |||| || ||<br />
im ,i6=i12 |||| || ||<br />
im ,i6=i6 +1* W |||| || ||<br />
im ,i5=i5 +1* W |||| || ||<br />
im ,i4=i4 +1* W |||| || ||<br />
im ,i3=i3 +1* W |||| || ||<br />
im ,i2=i2 +1* W |||| || ||<br />
im ,i1=i1 +1* W |||| || do_N1 ||<br />
im=i5 +1* E ||||P,re (1)= im*re (0) || ||<br />
im=i4 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=P-im*re (0) || ||<br />
||||P,re (2)= im*re (0) || ||<br />
im=i3 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (1) || ||<br />
||||P,re (8)= if(As =1 ,P,re (2))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (8) - im*re (0) || ||<br />
||||P,re (7)= im*re (0) || ||<br />
im=i2 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (8) || ||<br />
||||P,re (7)= if(As =1 ,P,re (7))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (7) - im*re (0) || ||<br />
||||P,re (6)= im*re (0) || ||<br />
im=i1 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (7) || ||<br />
||||P,re (6)= if(As =1 ,P,re (6))|| ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
||||P,As=re (6) - im*re (0) || ||<br />
||||P,re (2)= im*re (0) || ||<br />
im=i6 +1* E |||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=P || ||<br />
|||| P=re (6) || ||<br />
||||P,im=if(As =1 ,P,re (2)) || loop ||<br />
|||| || || return<br />
endsub<br />
endprog<br />
Listing 7.6: Illustration of terapix compacted assembly.
[Diagram omitted: mis node linked by "mandatory feature" edges to memory (ram), isa (simd) and Acceleration (Parallelism).]

Figure 7.13: mis hardware feature diagram.
[Diagram omitted: Sequential Code → Translator → Sequential Code + Intrinsics → Source-to-Binary Compiler → Binary, with a specialization header feeding the source-to-binary compiler.]

Figure 7.14: Source-to-source compilation scheme for mis.
The experiments consist of measuring the execution time of each kernel, using either the initial version or the optimized version generated by our compiler. We used gcc 4.4.5 for both the i386 and ARM architectures, with the -O3 -ffast-math flags, and measured the median over 150 runs of each program.

The challenge is to reach the same level of performance as the Intel C++ Compiler (icc) for sse and avx, while supporting another architecture, arm, in the same unified infrastructure, pips, to provide performance portability.
7.4.4 Results & Analyses<br />
Figures 7.15a, 7.15b and 7.15c show the results of the experiments, giving the speedup of the vectorized version compared to the reference sequential version. The reference run is the original source compiled with gcc with -O3 -ffast-math and -fno-tree-vectorize. These experiments lead to the following assessments:

1. gcc's vectorization engine hardly achieves any speedup: only rather simple kernels get a 2× speedup, and fir even suffers a significant slowdown, while icc gets very good speedups for all kernels;

2. using our vectorization engine and running gcc on the output is almost always beneficial (green bars are above red bars). This is especially visible on the arm processor;

3. we outperform icc (pink bars above blue bars) for matrix-mul-* on sse, thanks to the combination of tiling and vectorization, but always lose to it for the same kernels on avx;

4. the Fused Multiply-Add (fma) operation is available in the neon mis and gcc does not use it. This explains the super-ideal speedups and shows the benefit of using target-specific instructions;

5. the unrolled versions of the linpack kernels are better vectorized by pips, thanks to the slp approach;

6. pips output behaves better when compiled by icc than when compiled by gcc. This illustrates the source-to-source approach, which hooks into the compilation flow to add a feature, here vectorization, and delegates the remaining work to other compilers. In this case, icc performs additional optimizations on the vector code that gcc is not aware of.

These experiments validate the approach: performance within the reach of icc is achieved, and this performance is portable across architectures.
The sloccount report for each part of the prototype is given in Table 7.6. The SLOC for the post-processor and the maker are given for the avx driver; the sse and neon drivers have similar values.
[Bar charts omitted: speedup vs. sequential execution for gcc+nopips, gcc+pips, icc+nopips and icc+pips (gcc and gcc+pips only for neon) over the kernels fir, matrix-mul-matrix, corr, convol3x3, ddot-r, dscal-r, matrix-mul-vect, ddot-ur, alphablending, stencil, matrix-mul-const, dscal-ur, daxpy-r and daxpy-ur.]

(a) Vectorization using the sse mis with 32-bit OS.
(b) Vectorization using the avx mis with 64-bit OS.
(c) Vectorization using the neon mis.
7.5. CONCLUSION 159<br />
          translator   post-processor   maker   #pass involved
SLOC          223            166           1          30

Table 7.6: sloccount report for an avx intrinsic generator prototype written in pyps.
7.5 Conclusion

[Chart omitted: pass count (0 to 30) against the number of compilers (1 to 4) in which a pass is found.]

Figure 7.15: Pass reuse among 4 pyps-based compilers.
We claim in Chapter 3 that an important characteristic of a compiler infrastructure is pass reuse among compilers. To verify that our framework matches this expectation, we ran the following experiment over the six compilers presented in this chapter: for each pass and analysis available in pips, we count the number of compilers it is used in. The mis compiler is counted only once. The result of this analysis is given in Figure 7.15. This chart does not take into account parsers and pretty-printers. Although the targets are very different (a multicore device, a mis, a gpu and an embedded accelerator), we still have good pass reuse, as more than 50% of the passes are used in at least two compilers, if they are used at all. The high number of passes used in only one compiler is linked to the fact that each target is subject to very different hardware constraints, especially for the isa. As a consequence, each compiler has many small passes to take target-specific aspects into account. The passes that are reused the most are also the most complex ones: symbolic tiling, outlining, statement isolation, etc. The addition of an Open Computing Language (opencl) compiler would certainly share a lot with the existing cuda compiler.

Table 7.7 summarizes the information presented for each compiler prototype. It shows that each compiler was assembled with a reasonable development effort, but also that the
          translator   post-processor   maker   #pass involved
openmp         41             0            2           8
terapix       211           218           18          32
mis           223           166            1          30
cuda          597             0          N/A          20

Table 7.7: Summary of the sloccount reports for the compiler prototypes written in pyps.
more complex the target is, the more complex the pass manager is. However, pass reuse<br />
makes it possible <strong>to</strong> keep this complexity low.<br />
Another point to consider about pass reuse is compiler composition. Some compilers may not share many passes, but we have provided a way to compose them using multiple inheritance. The consequence of this modular design is that each compiler focuses on a specific task rather than serving an all-around purpose.
This chapter presents the design and implementation of four compilers for heterogeneous devices, based on the heterogeneous model analyses from Chapter 2, the compiler infrastructure described in Chapter 3 and the transformations described in Chapters 4, 5 and 6.

It first describes how to build a basic openmp compiler using the ideas developed in this thesis: model the architecture, identify the output language and reuse existing transformations.

Using the same methodology, we present three other compiler prototypes built during this PhD: a retargetable compiler for mis that targets sse, avx and neon, a compiler for the terapix image processor that goes from C down to assembly, and a C-to-cuda compiler for nVidia gpus.

Each prototype is validated on a set of benchmarks to ensure that it generates valid and reasonably efficient code. We also provide an analysis of each compiler in terms of Source Lines Of Code (sloc), pass usage and pass reuse.
Chapter 8<br />
Conclusion<br />
Vieux pont du Bono, Morbihan © Pierre Yves Sabas
The path toward performance is led by heterogeneous devices: even the laptop used to write this dissertation can use the processing power of two gpps, the two associated sse vector units, and a gpgpu. The main concern with such devices is programmability. In this thesis, we have taken the path of compilation to automate the production of code for hardware accelerators. We focused on the ability to produce different compilers for different hardware at low cost. As modern hardware is usually programmable in a C dialect, we set the goal of automatically translating textbook algorithms written in C into several target-dependent kernels written in a C dialect, and of generating the glue code between the host and the accelerator.

The advantage of this approach is its modularity: many transformations can be reused from one accelerator to the other. It reduces the cost of producing compilers as new targets become available. Moreover, using a source-to-source infrastructure makes it possible to interact with existing tools, especially compilers that generate binary code from C dialects.
162 CHAPTER 8. CONCLUSION<br />
To avoid the pitfall of the parallel compilers designed in the 1980s, we deliberately chose to take as inputs programs for which the sequential and parallel algorithms are similar. This made it possible to focus on the translation task rather than on parallelism extraction.
Contributions<br />
Methodology <strong>to</strong> Build <strong>Source</strong>-<strong>to</strong>-<strong>Source</strong> <strong>Compilers</strong><br />
We have proposed <strong>to</strong> model hardware devices with a hardware constraint diagram. This<br />
diagram identifies the manda<strong>to</strong>ry and optional features of the hardware, and the manual<br />
association between constraints and code trans<strong>for</strong>mations guides the compiler developer<br />
through the compiler development process.<br />
Generic Compiler Infrastructure Design<br />
The heterogeneity of accelerating devices makes it difficult to build a unique compiler to target them all. Moreover, pieces of software already exist, at different levels, to program these machines. In Chapter 3, we have proposed a compilation flow that combines a comprehensive source-to-source transformation toolbox, an api for pass management and a heterogeneous machine model. This methodology is validated in Chapter 7 on four different targets. It is used in the Par4All tool developed by hpc project.
Trans<strong>for</strong>mations <strong>for</strong> isa Constraints<br />
<strong>Heterogeneous</strong> devices provide acceleration through specialization: basically, they per<strong>for</strong>m<br />
better on a narrower set of applications. The direct consequence is a specialization of<br />
the isa. This specialization is visible in the C dialects proposed <strong>to</strong> program these devices.<br />
Chapter 4 proposes a set of source-<strong>to</strong>-source trans<strong>for</strong>mations <strong>to</strong> lower the level of the input<br />
language, including an original algorithm <strong>for</strong> outlining based on convex array regions.<br />
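The idea of outlining guided by convex array regions can be sketched as follows (our Python illustration, not the pips algorithm): the outlined kernel receives only the region its body accesses, here a[y-1 .. y+1], rather than the whole array.

```python
def kernel(window):
    """Outlined body: works only on the 3-element region it was given,
    which is the convex array region read by the original statement."""
    return window[0] + window[1] + window[2]

def smooth(a):
    """Caller: for each point, extract the accessed region and hand it
    to the outlined kernel (the region bounds come from the analysis)."""
    out = []
    for y in range(1, len(a) - 1):
        region = a[y - 1:y + 2]   # convex region [y-1, y+1] of a
        out.append(kernel(region))
    return out
```

Because the kernel's interface names exactly the data it touches, the same mechanism later drives the host/accelerator data transfers.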
Hybrid slp Algorithm<br />
Multimedia instructions are now commonly found in gpps and even provide the basis for acceleration in hybrid cpu/gpu chips. We have developed an original algorithm, based on existing work on loop vectorization and Super-word Level Parallelism, that combines the benefits of loop-based and sequence-based approaches in a unified algorithm. It is parametrized by a C description of the isa and thus respects the retargetability criterion raised in Chapter 3. The algorithm has been tested on three Multimedia Instruction Sets: sse, avx and neon. This work received the third best poster award at PACT 2011.
Trans<strong>for</strong>mations <strong>to</strong> Meet Memory Constraints<br />
Memory is critical for many heterogeneous systems: when the accelerator does not share memory with its host, rpc and dma are needed, and the programming model is much more complex than classical ones. We have presented in Chapter 6 three transformations to take this into account: statement isolation, which separates accelerator memory from host memory; memory footprint reduction, which finds a tiling matrix that ensures there is enough memory on the accelerator to run the tiled code; and redundant load-store elimination, which removes redundant data transfers.
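Statement isolation can be mimicked on a toy example (a Python sketch of ours; real targets use dma instead of list copies): the statement is rewritten so that it only touches its own buffers, and explicit copy-in/copy-out operations stand for the generated transfers. Redundant load-store elimination would then remove any transfer whose data is already present on the accelerator.

```python
def isolate_and_run(host_img, lo, hi):
    """Run `out[i] = 2 * in[i]` on the region [lo, hi) as an isolated
    statement: the loop below never reads or writes host memory."""
    accel_in = host_img[lo:hi]        # copy-in: host -> accelerator
    accel_out = [0] * (hi - lo)       # accelerator-local output buffer
    for i in range(hi - lo):          # isolated statement
        accel_out[i] = 2 * accel_in[i]
    host_out = host_img[:]
    host_out[lo:hi] = accel_out       # copy-out: accelerator -> host
    return host_out
```

The kernel body is oblivious to the host layout; only the copy-in/copy-out code knows the region bounds, which is what makes footprint reduction by tiling possible.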
Implementation<br />
All the transformations presented in this thesis have been developed in the pips source-to-source compiler infrastructure for the C language and assembled using the pyps pass manager. They have led to the implementation of four compilers: a prototype Open Multi Processing (openmp) directive generator, a retargetable compiler for mis, a kernel generator for an fpga-based image processor, terapix, and a gpu code generator developed by hpc project. These prototypes validate both the overall compiler infrastructure design and the algorithms proposed in the thesis. Experiments and compilation flows are detailed in Chapter 7.
Contributions <strong>to</strong> the pips community<br />
It is difficult to untie a PhD in applied computer science from the development task. Developing new passes, and extending some of the existing passes originally designed for Fortran to the C language, has taken a significant amount of time. As a member of the pips team, I have managed the modernization of the build system and the rationalization of the software packaging.

I have supervised five trainees at Télécom Bretagne during internships related to the pips project and contributed to the scientific dissemination of our work through two tutorials.
Future Work<br />
hpc is steadily changing. Sparc64 ranked first in the Top500 of June 2011, while nVidia's gpus led the race six months before. In this moving environment, nothing is ever settled, and hardware vendors keep pushing their standards to obtain a common programming model supported by efficient engineering tools. This requires cooperation and interoperability between tools. To that extent, bridging the gap between opencl and existing vhdl generators is an interesting challenge and remains an open research field. However, hpc is still a niche market compared to embedded systems and smartphones. In these fields, the hardware constraints are all the more important due to limited battery
capacity, weight, space, etc. The transformations and the approach studied during this PhD can certainly be used in those fields.
mis are becoming more and more flexible, and it is common to find non-simd instructions in their C apis. These instructions allow more load/store patterns (e.g. strided loads) and help achieve higher throughput in memory-constrained applications. The incremental addition of transformations that handle these instructions in our mis compiler is a promising subject.
We see two possible extensions to the work on pass managers. First, the combination of operators creates a directed graph that presents interesting parallelization opportunities at the pass manager level. It would make the compilation process itself parallel and improve compilation time. Second, some phase combinations are known to be redundant, meaningless, etc. Adding semantics to code transformations would enable interesting graph pruning, for instance in the context of iterative compilation.
Appendix A<br />
The PIPS Compiler Infrastructure<br />
Pont de Saint Goustan, Morbihan © Gwenael AB / flickr
Paralléliseur Interprocédural de Programmes Scientifiques (pips) [IJT91, ISCKG10, AAC+11] is a source-to-source compiler infrastructure started in 1988 at MINES ParisTech, when parallel architectures were preeminent. Since then it has been successfully used to analyse, check or parallelize industrial Fortran codes. C support started ten years ago, bringing both new challenges and new applications. The key ideas of the framework that make it still relevant in 2011 are: a minimalistic Internal Representation (ir), interprocedural analyses, and abstract interpretation on polyhedral lattices. The compiler infrastructure for heterogeneous targets, the algorithms and the passes described in this thesis have been fully implemented on top of pips. Finally, the compilers described in Chapter 7 are also based on pips. The short overview given in this appendix should help readers not accustomed to pips understand some technical parts. A more detailed overview is given in [AAC+11], and the interested reader is advised to refer to the theses of Nga Nguyen [Ngu02] and Béatrice Creusillet [Cre96] for a detailed description of the underlying mathematical framework.
166 APPENDIX A. THE PIPS COMPILER INFRASTRUCTURE<br />
void foo(int n, int threshold, int a[n], int b[n]) {
    int k = 0;
    for (int h = 0; h < n; h++) {
        if (a[h] > threshold)
            b[k++] = a[h];
    }
}

Listing A.1: A simple loop to illustrate pips analyses.

Available Analyses
In addition <strong>to</strong> classical compiler analyses such as use-def chains, dependence graph or<br />
read-write effects, pips provides accurate interprocedural analyses. We illustrate each of<br />
them based on the loop from Listing A.1.<br />
Preconditions and Postconditions are affine predicates over scalar variables that are<br />
proved <strong>to</strong> hold be<strong>for</strong>e or after the execution of a given statement, respectively (see<br />
Listing A.2).<br />
// P() {}
void foo(int n, int threshold, int a[n], int b[n]) {
   // P() {}
   int k = 0;
   {
      // P(k) {k==0}
      // P(h,k) {k==0}
      for (int h = 0; h < n; h++)
         /* ... remainder of the listing ... */

Listing A.2: Example of precondition analysis (excerpt).
// T() {}
void foo(int n, int threshold, int a[n], int b[n]) {
   // T(k) {k==0}
   int k = 0;
   {
      // T(h) {}
      // T(h,k) {0 /* ... remainder of the listing ... */

Listing A.3: Example of transformer analysis (excerpt).
//  <may be read   >: a[*] threshold
//  <may be written>: b[*]
//  <    is read   >: n
void foo(int n, int threshold, int a[n], int b[n]) {
   //  <    is written>: k
   int k = 0;
   {
      //  <may be read   >: a[*] h k threshold
      //  <may be written>: b[*] k
      //  <    is read   >: n
      //  <    is written>: h
      for (int h = 0; h < n; h++)
         //  <    is read   >: h n threshold
         if (a[h] > threshold)
            //  <may be read   >: a[*]
            //  <may be written>: b[*]
            //  <    is read   >: h k n
            //  <    is written>: k
            b[k++] = a[h];
   }
}

Listing A.4: Example of cumulated memory effects analysis.
Appendix B<br />
The LuC language<br />
The LuC language is used in some proofs of this dissertation. This language is similar<br />
<strong>to</strong> Fortran with a C syntax. Redundant constructs such as the += opera<strong>to</strong>r are not<br />
represented <strong>to</strong> keep proofs simple. The main differences with the C language are the<br />
removal of recursive calls, unions and pointers and the addition of reference passing mode<br />
<strong>for</strong> function parameters. Global variables are not allowed. A short reference of its syntax<br />
is given here, using typing conventions <strong>to</strong> differentiate non-terminal symbols and terminals<br />
from language constructs.<br />
B.1 Syntactic Clauses<br />
prog : fdecls<br />
fdecls : ∅ | fdecl fdecls<br />
fdecl : void id ( param ) { stat }<br />
type : int | float | complex | struct id { fields } | type [ expr ]<br />
param : type id<br />
fields : ∅ | field fields<br />
field : type id<br />
expr : cst | ref | expr op expr<br />
ref : id | ref [ expr ] | ref . fieldname<br />
stat : ∅ | ; | { type id ; stat }<br />
| ref =expr ; | ref =read ;<br />
| write expr ;<br />
| ref (ref ) ;<br />
| if( expr ) { stat } else { stat }<br />
| while( expr ) { stat }<br />
| stat ; stat<br />
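As an illustration of this syntax, here is a small LuC program (our own example, not taken from the thesis) that reads a count followed by that many values and writes their sum; it uses the declaration-block, read/write, and while constructs of the grammar above:

```
void sum(int unused) {
    { int n ;
        { int s ;
            n = read ;
            s = 0 ;
            while (n) {
                { int x ;
                    x = read ;
                    s = s + x ;
                    n = n - 1 ;
                }
            }
            write s ;
        }
    }
}
```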
170 APPENDIX B. THE LUC LANGUAGE<br />
B.2 Semantic Clauses<br />
T(int | float, σ) = 1
T(complex, σ) = 2
T(struct id { fields }, σ) = Σ_{〈t,id〉 ∈ fields} T(t, σ)
T(type [ expr ], σ) = T(type, σ) × E(expr, σ)

R(id, σ) = I(id)
R(ref [ expr ], σ) = R(ref, σ)[E(expr, σ)]
R(ref . fieldname, σ) = R(ref, σ).fieldname

E(cst, σ) = cst
E(ref, σ) = σ(R(ref, σ))
E(expr₁ op expr₂, σ) = E(expr₁, σ) op E(expr₂, σ)

S(∅, σ) = σ
S(;, σ) = σ
S({ type id ; stat }, σ) = unbind(S(stat, loc(id, T(type, σ), σ)), id)
S(ref = expr, σ) = σ[R(ref, σ) → E(expr, σ)]
S(write expr, σ) = push(σ(istdout), E(expr, σ)); σ
S(ref = read, σ) = σ[R(ref, σ) → pop(σ(istdin))]
S(id(ref), σ) = σ[lₖ → vₖ | lₖ ∈ R(ref, σ) ∧ vₖ = S(body(id), {I(formal(id)) → E(ref, σ)})(lₖ)]
S(if(expr){ stat₀ }else{ stat₁ }, σ) = if E(expr, σ) then S(stat₀, σ) else S(stat₁, σ)
S(while(expr){ stat }, σ) = if E(expr, σ) then S(while(expr){ stat }, S(stat, σ)) else σ
S(stat₀ ; stat₁, σ) = S(stat₁, S(stat₀, σ))
Appendix C

Using PyPS to Drive a Compilation Benchmark

This verbatim copy of the script used to benchmark our Open Multi Processing
(openmp) translator on the PolyBench suite is a good illustration of the flexibility of Pythonic
PIPS (pyps). It instantiates a compiler for each application found in the PolyBench source
tree, turns each sequential kernel into a parallel kernel, and instruments it to gather
execution time information. This is achieved by composing the openmp compiler with
an instrumentation compiler and an abstraction—pyrops.pworkspace—that runs each
compiler in a new process.
import pyrops
import workspace_gettime
import openmp
from glob import glob
import shutil
from os.path import basename

map(shutil.rmtree, glob("PYPS*"))
map(shutil.rmtree, glob(".*.tmp"))

class workspace(workspace_gettime.workspace, pyrops.pworkspace):
    pass

ITER = 10
result = list()
for src in glob("polybench-2.0/*/*/*.c") + glob("polybench-2.0/*/*/*/*.c"):
    if src[-6:] != "pocc.c":
        name = basename(src).replace("_", "-")
        # workspace.delete(name)
        w = workspace(src, cppflags="-Ipolybench-2.0/utilities/", verbose=False)
        w.fun.main.benchmark_module()
        times0 = w.benchmark(iterations=ITER, LDFLAGS="-lm", CFLAGS="-O3 -ffast-math")
        w.fun.main.openmp(internalize_parallel_code=False)
        times1 = w.benchmark(openmp.ompMaker(), iterations=ITER, LDFLAGS="-lm", CFLAGS="-O3 -ffast-math")
        count = 0
        for line in w.fun.main.code.split('\n'):
            if line.find("#pragma omp ") != -1:
                count += 1
        result.append((name, count, times0['main'][0], times1['main'][0]))
        w.close()

fout = file("polybench-openmp.dat", "w")
for r in result:
    print >> fout, r[0], r[1], r[2], r[3]
fout.close()
Appendix D

Using C to Emulate sse Intrinsics

An excerpt of the header used as a sequential replacement for the Streaming simd
Extension (sse) header xmmintrin.h is reproduced here for the interested reader. It shows
how the sse Instruction Set Architecture (isa) can be emulated using pure C code. This
header was used to compile sse-enabled applications on processors that do not have sse
vector units.
#include <stdint.h>
#include <stddef.h>

/* Some macros from xmmintrin.h */
#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* the 128-bit vector type, emulated as a union of scalar arrays */
typedef union {
    uint64_t u64[2];
    uint32_t u32[4];
    uint16_t u16[8];
} __m128i;
/* data reorganization */
inline __m128i _mm_unpacklo_epi64(__m128i v0, __m128i v1) {
    __m128i ov = { .u64 = { v0.u64[0], v1.u64[0] } };
    return ov;
}

inline __m128i _mm_shufflehi_epi16(__m128i v, int mask) {
    __m128i ov = { .u16 = { v.u16[0], v.u16[1], v.u16[2], v.u16[3],
                            v.u16[4 + ((mask >> 0) & 3)], v.u16[4 + ((mask >> 2) & 3)],
                            v.u16[4 + ((mask >> 4) & 3)], v.u16[4 + ((mask >> 6) & 3)] } };
    return ov;
}

inline __m128i _mm_shufflelo_epi16(__m128i v, int mask) {
    __m128i ov = { .u16 = { v.u16[(mask >> 0) & 3], v.u16[(mask >> 2) & 3],
                            v.u16[(mask >> 4) & 3], v.u16[(mask >> 6) & 3],
                            v.u16[4], v.u16[5], v.u16[6], v.u16[7] } };
    return ov;
}

inline __m128i _mm_shuffle_epi32(__m128i v, int mask) {
    __m128i ov = { .u32 = { v.u32[(mask >> 0) & 3], v.u32[(mask >> 2) & 3],
                            v.u32[(mask >> 4) & 3], v.u32[(mask >> 6) & 3] } };
    return ov;
}

inline __m128i _mm_unpackhi_epi64(__m128i v0, __m128i v1) {
    __m128i ov = { .u64 = { v0.u64[1], v1.u64[1] } };
    return ov;
}
/* pure vector operations */
inline __m128i _mm_or_si128(__m128i v0, __m128i v1) {
    __m128i ov;
    for (size_t i = 0; i < 2; i++)
        ov.u64[i] = v0.u64[i] | v1.u64[i];
    return ov;
}

inline __m128i _mm_add_epi32(__m128i v0, __m128i v1) {
    __m128i ov;
    for (size_t i = 0; i < 4; i++)
        ov.u32[i] = v0.u32[i] + v1.u32[i];
    return ov;
}

/* bit operations on the full vector */
inline __m128i _mm_slli_si128(__m128i v, int count) {
    __m128i ov;
    count *= 8; /* the byte count is expressed in bits */
    if (count == 0) return v;
    if (count >= 64) {
        ov.u64[1] = v.u64[0] << (count - 64);
        ov.u64[0] = 0;
    } else {
        ov.u64[1] = (v.u64[1] << count) | (v.u64[0] >> (64 - count));
        ov.u64[0] = v.u64[0] << count;
    }
    return ov;
}

inline __m128i _mm_srli_si128(__m128i v, int count) {
    __m128i ov;
    count *= 8;
    if (count == 0) return v;
    if (count >= 64) {
        ov.u64[0] = v.u64[1] >> (count - 64);
        ov.u64[1] = 0;
    } else {
        ov.u64[0] = (v.u64[0] >> count) | (v.u64[1] << (64 - count));
        ov.u64[1] = v.u64[1] >> count;
    }
    return ov;
}
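As a quick sanity check of the emulation approach, the shuffle logic above can be exercised on any host. The following self-contained sketch re-declares the union and one intrinsic under local stand-in names (emu_m128i, emu_shuffle_epi32 — hypothetical names, chosen to avoid clashing with the real intrinsics) and verifies that _MM_SHUFFLE selects the expected 32-bit lanes:

```c
#include <assert.h>
#include <stdint.h>

#define _MM_SHUFFLE(z, y, x, w) (((z) << 6) | ((y) << 4) | ((x) << 2) | (w))

/* minimal stand-in for the emulated 128-bit vector type */
typedef union {
    uint64_t u64[2];
    uint32_t u32[4];
    uint16_t u16[8];
} emu_m128i;

/* same lane-selection logic as the emulated _mm_shuffle_epi32 */
static emu_m128i emu_shuffle_epi32(emu_m128i v, int mask) {
    emu_m128i ov = { .u32 = { v.u32[(mask >> 0) & 3], v.u32[(mask >> 2) & 3],
                              v.u32[(mask >> 4) & 3], v.u32[(mask >> 6) & 3] } };
    return ov;
}
```

With v = {10, 11, 12, 13}, calling emu_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)) reverses the four lanes.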
Code Transformation Glossary

Transformations marked with a † are passes implemented in pips during the PhD, or
passes I made significant contributions to.

array linearization † is the process of converting multidimensional arrays into unidimensional
arrays, possibly with a conversion from arrays to pointers.
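A hand-written before/after sketch of this transformation on a toy kernel (not pips output; names are illustrative):

```c
#include <assert.h>
#define N 4

/* before: multidimensional access */
static int sum2d(int a[3][N]) {
    int s = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* after array linearization: a single dimension and an explicit
   row-major index computation, with the array decayed to a pointer */
static int sum2d_linear(int *a) {
    int s = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}
```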
common subexpression elimination † is the process of replacing identical expressions by
a variable that holds the result of their evaluation.
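For instance (an illustrative sketch, not pips output), the repeated subexpression a + b is computed once and reused:

```c
#include <assert.h>

/* before: (a + b) is evaluated twice */
static int f(int a, int b, int c) {
    return (a + b) * c + (a + b);
}

/* after common subexpression elimination: evaluated once */
static int f_cse(int a, int b, int c) {
    int t = a + b;
    return t * c + t;
}
```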
constant propagation is a pass that replaces a variable by its value when that value is
known at compile time.

dead code elimination is the process of pruning from a function all the statements whose
results are never used.
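The two passes often combine: in this hand-written sketch (not pips output), propagating the known value of debug makes the guarded statement unreachable, and dead code elimination then removes both it and the unused computation.

```c
#include <assert.h>

/* before the two passes */
static int g(int x) {
    int debug = 0;
    int unused = x * x;      /* result never used once the branch is gone */
    if (debug)               /* debug is known to be 0 at compile time */
        x += unused;
    return x + 1;
}

/* after constant propagation and dead code elimination */
static int g_opt(int x) {
    return x + 1;
}
```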
directive generation is a common name for code transformations that annotate the code
with directives.

flatten code is the process of pruning declaration blocks from a function body so that all
declarations are made at the top level.

forward substitution † is the process of replacing a reference read in an expression by
the latest expression assigned to it.
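A minimal hand-written illustration of forward substitution (not pips output): the defining expression of t replaces its read.

```c
#include <assert.h>

/* before: t is assigned, then read in the next expression */
static int h(int a, int b) {
    int t = a * 2;
    return t + b;
}

/* after forward substitution: the defining expression replaces the read */
static int h_fs(int a, int b) {
    return a * 2 + b;
}
```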
goto elimination is the process of replacing goto instructions by a hierarchical control
flow graph.

inlining † is a function transformation. Inlining a function foo in its caller bar consists
in substituting the calls to foo in bar by the function body, after replacement
of the formal parameters by their effective parameters.

instruction selection † is the process of mapping parts of the Internal Representation to
machine instructions.

invariant code motion is a loop transformation that moves out of the loop the code
from its body that is independent of the iteration.

iteration clamping is a loop transformation that extends the loop range but guards the
loop body with the former range.

loop fusion is a loop transformation that replaces two loops by a single loop whose body
is the concatenation of the bodies of the two initial loops.
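A hand-written sketch of loop fusion (not pips output); fusion is legal here because the second body only reads values the first body produced in the same iteration:

```c
#include <assert.h>
#define N 8

/* before: two loops over the same range */
static void scale_then_shift(int *a, int *b) {
    for (int i = 0; i < N; i++) a[i] = 2 * a[i];
    for (int i = 0; i < N; i++) b[i] = a[i] + 1;
}

/* after loop fusion: one loop whose body concatenates both bodies */
static void scale_then_shift_fused(int *a, int *b) {
    for (int i = 0; i < N; i++) {
        a[i] = 2 * a[i];
        b[i] = a[i] + 1;
    }
}
```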
loop interchange is a loop transformation that permutes two loops of a loop nest.

loop normalization is a loop transformation that changes the loop initial value, increment
or range to enforce certain values, generally an increment of 1.

loop rerolling finds manually unrolled loops and replaces them by their non-unrolled
version.

loop tiling is a loop nest transformation that changes the loop execution order through a
partition of the iteration space into chunks, so that iteration is performed over
the chunks and inside each chunk.
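A one-dimensional hand-written sketch of loop tiling (not pips output): the outer loop enumerates tiles, the inner loop iterates inside each tile.

```c
#include <assert.h>
#define N 16
#define T 4   /* tile size; N is assumed to be a multiple of T */

/* before: a single loop over the iteration space */
static long total(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i++) s += a[i];
    return s;
}

/* after loop tiling: iterate over tiles, then inside each tile */
static long total_tiled(const int *a) {
    long s = 0;
    for (int it = 0; it < N; it += T)
        for (int i = it; i < it + T; i++)
            s += a[i];
    return s;
}
```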
loop unrolling is a loop transformation. Unrolling a loop by a factor of n consists in
substituting the loop body by itself, replicated n times. A prelude and/or postlude
are added to preserve the number of iterations.
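A hand-written sketch of unrolling by a factor of 2 (not pips output), with a postlude that preserves the iteration count when n is odd:

```c
#include <assert.h>

/* before unrolling */
static int dot(const int *a, const int *b, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* after unrolling by a factor of 2 */
static int dot_unrolled(const int *a, const int *b, int n) {
    int s = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s += a[i] * b[i];
        s += a[i + 1] * b[i + 1];
    }
    for (; i < n; i++)  /* postlude preserves the number of iterations */
        s += a[i] * b[i];
    return s;
}
```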
loop unswitching † is a loop transformation that replaces a loop containing a test independent
of the loop execution by a test containing the loop, without the test, in
both the true and false branches.
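In this hand-written sketch (not pips output), the test on offset does not depend on the loop, so it is hoisted and the loop duplicated in each branch:

```c
#include <assert.h>
#define N 8

/* before: the test on 'offset' is independent of the loop */
static void add(int *a, int offset) {
    for (int i = 0; i < N; i++) {
        if (offset)
            a[i] += offset;
        else
            a[i] += i;
    }
}

/* after loop unswitching: the loop appears, without the test,
   in both branches */
static void add_unswitched(int *a, int offset) {
    if (offset)
        for (int i = 0; i < N; i++) a[i] += offset;
    else
        for (int i = 0; i < N; i++) a[i] += i;
}
```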
memory footprint reduction † is the process of tiling a loop to make sure the iteration
over each tile has a memory footprint bounded by a given value.

n-address code generation † is the process of splitting complex expressions into simpler
ones that take at most n operands.

outlining † is the process of extracting part of a function body into a new function and
replacing it in the initial function by a function call.

parallelism detection is a common name for analyses that detect whether a loop can be run in
parallel.

parallelism extraction is a common name for code transformations that modify loop
nests to make it legal to run them in parallel.

privatization is the process of detecting variables that are private to a loop body, i.e.
written first, then read.

reduction detection is an analysis that identifies statements that perform a reduction
over a variable.

redundant load-store elimination † is an interprocedural transformation that optimizes
data transfers by delaying and merging them.

scalar renaming † is the process of renaming scalar variables to suppress false data dependencies.
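A hand-written sketch of scalar renaming (not pips output): the reuse of t creates a false dependence between two otherwise independent computations, which renaming removes.

```c
#include <assert.h>

/* before: t is reused, creating a false (output) dependence
   between the two computations */
static int twice(int a, int b) {
    int t = a * 2;
    int x = t + 1;
    t = b * 3;        /* same scalar reused */
    int y = t + 1;
    return x + y;
}

/* after scalar renaming: each value gets its own scalar,
   so the two computations are independent */
static int twice_renamed(int a, int b) {
    int t0 = a * 2;
    int x = t0 + 1;
    int t1 = b * 3;
    int y = t1 + 1;
    return x + y;
}
```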
split update operator † is the process of replacing an update operator by its expanded
form.

statement isolation † is the process of replacing all variables referenced in a statement by
newly declared variables. A prologue and an epilogue are added to copy old variable
values to the new variables, back and forth.

strength reduction † is the process of replacing an operation by an operation of lower
cost.
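A classic hand-written instance of strength reduction (not pips output): the per-iteration multiplication 8 * i is replaced by a cheaper running addition.

```c
#include <assert.h>

/* before: one multiplication per iteration */
static long ramp(int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += 8 * i;
    return s;
}

/* after strength reduction: the multiplication becomes
   an induction variable updated by addition */
static long ramp_sr(int n) {
    long s = 0, t = 0;
    for (int i = 0; i < n; i++) {
        s += t;
        t += 8;
    }
    return s;
}
```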
Acronyms<br />
api Application Programming Interface

asic Application-Specific Integrated Circuit

asip Application-Specific Instruction set Processor

ast Abstract Syntax Tree

avx Advanced Vector eXtensions

cisc Complex Instruction Set Computer

cli Command Line Interface

cpu Central Processing Unit

cri Centre de Recherche en Informatique

cuda Compute Unified Device Architecture

dma Direct Memory Access

dsp Digital Signal Processing

flops FLoating point Operations per Second

fma Fused Multiply-Add

fpga Field Programmable Gate Array

fpu Floating-Point Unit

fsa Fusion System Architecture

gcc gnu C Compiler

gpgpu General Purpose gpu

gpp General Purpose Processor
gpu Graphical Processing Unit

hcfg Hierarchical Control Flow Graph

hdl Hardware Description Language

hmpp Hybrid Multicore Parallel Programming

hpc High Performance Computing

hpec High Performance Embedded Computing

icc Intel C++ Compiler

ilp Instruction Level Parallelism

ir Internal Representation

isa Instruction Set Architecture

jit Just In Time

llvm Low Level Virtual Machine

mimd Multiple Instruction stream, Multiple Data stream

mis Multimedia Instruction Set

mkl Math Kernel Library

mmx Matrix Math eXtension

mp-soc MultiProcessor System-on-Chip

mpi Message Passing Interface

oop Object Oriented Programming

oo Object Oriented

opencl Open Computing Language

opengl Open Graphics Library

openmp Open Multi Processing

pci Peripheral Component Interconnect

pe Processing Element

pips Paralléliseur Interprocédural de Programmes Scientifiques
ps3 PlayStation 3

ptx Parallel Thread eXecution

pu Processing Unit

pocc Polyhedral Compiler Collection

pyps Pythonic PIPS

ram Random Access Memory

rom Read Only Memory

rpc Remote Procedure Call

sdk Software Development Kit

simd Single Instruction stream, Multiple Data stream

sisd Single Instruction stream, Single Data stream

sloc Source Lines Of Code

slp Super-word Level Parallelism

soc System on Chip

ssa Static Single Assignment

sse Streaming simd Extension

tr Textual Representation

ulp Unit in the Last Place

vhdl vhsic Hardware Description Language

vliw Very Long Instruction Word
Bibliography<br />
[AAC + 11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge<br />
Guel<strong>to</strong>n, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon.<br />
PIPS Is not (only) Polyhedral Software. In First International Workshop on<br />
Polyhedral Compilation Techniques, IMPACT, April 2011.<br />
[ABC + 06] Krste Asanovic, Ras Bodik, Bryan Chris<strong>to</strong>pher Catanzaro, Joseph James<br />
Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester<br />
Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The<br />
landscape of parallel computing research: A view from berkeley. Technical<br />
Report UCB/EECS-2006-183, EECS Department, University of Cali<strong>for</strong>nia,<br />
Berkeley, 2006.<br />
[ABCR10] Joshua S. Auerbach, David F. Bacon, Perry Cheng, and Rodric M. Rabbah.<br />
Lime: a Java-compatible and synthesizable language <strong>for</strong> heterogeneous<br />
architectures. In Proceedings of the 25th Annual SIGPLAN Conference on<br />
Object-Oriented Programming, Systems, Languages, and Applications, OOP-<br />
SLA, pages 89–108, New York, NY, USA, Oc<strong>to</strong>ber 2010. ACM.<br />
[ACIK97] Corinne Ancourt, Fabien Coelho, François Irigoin, and Ronan Keryell. A linear<br />
algebra framework <strong>for</strong> static High Per<strong>for</strong>mance Fortran code distribution.<br />
Scientific Programming, 6(1):3–27, 1997.<br />
[AJ75] Alfred V. Aho and Stephen C. Johnson. Optimal code generation <strong>for</strong> expression<br />
trees. In Proceedings of seventh annual symposium on Theory of<br />
computing, STOC, pages 207–217, New York, NY, USA, 1975. ACM.<br />
[AK87] Randy Allen and Ken Kennedy. Au<strong>to</strong>matic translation of FORTRAN programs<br />
<strong>to</strong> vec<strong>to</strong>r <strong>for</strong>m. Transactions on Programming Languages and Systems,<br />
9:491–542, 1987.<br />
[AKPW83] John R. Allen, Ken Kennedy, Carrie Porterfield, and Joe D. Warren. Conversion<br />
of control dependence <strong>to</strong> data dependence. In Principles of Programming<br />
Languages, POPL, pages 177–189, 1983.<br />
[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. <strong>Compilers</strong>:<br />
Principles, Techniques, and Tools (2nd Edition). Addison Wesley, 2006.<br />
[Amd67] Gene M. Amdahl. Validity of the single processor approach <strong>to</strong> achieving<br />
large scale computing capabilities. In Proceedings of the spring joint computer<br />
conference, AFIPS, pages 483–485, New York, NY, USA, April 1967. ACM.<br />
183
184 BIBLIOGRAPHY<br />
[AMG + 99] Eduard Ayguade, MarcGonzalez, Marc Gonzalez, Jesus Labarta, Xavier Mar<strong>to</strong>rell,<br />
Nacho Navarro, and Jose Oliver. NanosCompiler: A research plat<strong>for</strong>m<br />
<strong>for</strong> OpenMP extensions. In In First European Workshop on OpenMP, pages<br />
27–31, 1999.<br />
[API03] Kubilay Atasu, Laura Pozzi, and Paolo Ienne. Au<strong>to</strong>matic application-specific<br />
instruction-set extensions under microarchitectural constraints. International<br />
Journal of Parallel Programming, 31(6):411–428, 2003.<br />
[AR97] Rumen Andonov and Sanjay V. Rajopadhye. Optimal orthogonal tiling of 2-d<br />
iterations. Journal of Parallel Distributed Computing, 45(2):159–165, September<br />
1997.<br />
[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. <strong>Compilers</strong>: Princiles,<br />
Techniques, and Tools. Addison-Wesley, 1986.<br />
[ATNW09] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André<br />
Wacrenier. StarPU: A unified plat<strong>for</strong>m <strong>for</strong> task scheduling on heterogeneous<br />
multicore architectures. In European Conference on Parallel Processing, Euro-<br />
Par, pages 863–874, 2009.<br />
[Aß96] Uwe Aßmann. How <strong>to</strong> uni<strong>for</strong>mly specify program analysis and trans<strong>for</strong>mation<br />
with graph rewrite systems. In Proceedings of the 6th International Conference<br />
on Compiler Construction, CC, pages 121–135, London, UK, 1996.<br />
Springer-Verlag.<br />
[Bac57] John W. Backus. The FORTRAN Au<strong>to</strong>matic Coding System <strong>for</strong> the IBM 704<br />
EDPM. International Business Machines Corporation (IBM), 1957.<br />
[Bas04] Cédric Bas<strong>to</strong>ul. Code generation in the polyhedral model is easier than you<br />
think. In International Conference on Parallel Architecture and Compilation<br />
Techniques, PACT, pages 7–16, Juan-les-Pins, France, 2004. IEEE Computer<br />
Society Press.<br />
[BB09] Francois Bodin and Stephane Bihan. <strong>Heterogeneous</strong> multicore parallel programming<br />
<strong>for</strong> graphics processing units. Scientific Programming, 17:325–336,<br />
December 2009.<br />
[BBK + 08] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy,<br />
J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework<br />
<strong>for</strong> optimization of affine loop nests <strong>for</strong> GPGPUs. In Proceedings of the 22nd<br />
annual international conference on Supercomputing, ICS, pages 225–234, New<br />
York, NY, USA, 2008. ACM.<br />
[BDH + 10] André Rigland Brodtkorb, Chris<strong>to</strong>pher Dyken, Trond Runar Hagen, Jon M.<br />
Hjelmervik, and Olaf O. S<strong>to</strong>raasli. State-of-the-art in heterogeneous computing.<br />
Scientific Programming, 18(1):1–33, 2010.<br />
[Ber66] Arthur J. Bernstein. Analysis of programs <strong>for</strong> parallel processing. Transactions<br />
on Electronic Computers, pages 757 –762, 1966.
BIBLIOGRAPHY 185<br />
[BFH + 04] Ian Buck, Tim Foley, Daniel Reiter Horn, Jeremy Sugerman, Kayvon Fatahalian,<br />
Mike Hous<strong>to</strong>n, and Pat Hanrahan. Brook <strong>for</strong> GPUs: stream computing<br />
on graphics hardware. Transactions on Graphics, 23(3):777–786, 2004.<br />
[BGGT02] Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Au<strong>to</strong>matic detection<br />
of saturation and clipping idioms. In International Workshop on Languages<br />
and <strong>Compilers</strong> <strong>for</strong> Parallel Computing, LNCS, pages 61–74. Springer-<br />
Verlag, 2002.<br />
[BGS94] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler trans<strong>for</strong>mations<br />
<strong>for</strong> high-per<strong>for</strong>mance computing. Computing Surveys, 26(4):345–420,<br />
1994.<br />
[Bik04] Aart J. C. Bik. The Software Vec<strong>to</strong>rization Handbook: Applying Intel Multimedia<br />
Extensions <strong>for</strong> Maximum Per<strong>for</strong>mance. Intel Press, 2004.<br />
[BJK + 95] Robert D. Blumofe, Chris<strong>to</strong>pher F. Joerg, Bradley C. Kuszmaul, Charles E.<br />
Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded<br />
runtime system. In Journal of Parallel and Distributed Computing, JPDC,<br />
pages 207–216, New York, NY, USA, 1995. ACM.<br />
[BL10] Nicolas Benoit and Stéphane Louise. Extending GCC with a multi-grain<br />
parallelism adaptation framework <strong>for</strong> MPSoCs. In GCC <strong>for</strong> Research Opportunities<br />
Workshop, January 2010.<br />
[Ble89] Guy E. Blelloch. Scans as primitive parallel operations. Transactions on<br />
Computers, 38(11):1526–1538, November 1989.<br />
[BLE + 08] Philippe Bonnot, Fabrice Lemonnier, Gilbert Edelin, Gérard Gaillat, Olivier<br />
Ruch, and Pascal Gauget. Definition and SIMD implementation of a multiprocessing<br />
architecture approach on FPGA. In Design Au<strong>to</strong>mation and Test<br />
in Europe, DATE, pages 610–615. IEEE Computer Society Press, 2008.<br />
[BN84] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure<br />
calls. Transactions on Computer Systems, 2:39–59, 1984.<br />
[Bre10] Tony M. Brewer. Instruction set innovations <strong>for</strong> the Convey HC-1 computer.<br />
IEEE Micro, 30(2):70–79, 2010.<br />
[BSB + 01] Tracy D. Braun, Howard Jay Siegel, Noah Beck, Ladislau L. Boloni, Muthucumaru<br />
Maheswaran, Albert I. Reuther, James P. Robertson, Mitchell D.<br />
Theys, Bin Yao, Debra Hensgen, and Richard F. Freund. A comparison of<br />
eleven static heuristics <strong>for</strong> mapping a class of independent tasks on<strong>to</strong> heterogeneous<br />
distributed computing systems. Journal of Parallel Distributed<br />
Computing, 61:810–837, June 2001.<br />
[CDL11] Alexandre Cornu, Steven Derrien, and Dominique Lavenier. HLS <strong>to</strong>ols <strong>for</strong><br />
FPGA: Faster development with better per<strong>for</strong>mance. In Reconfigurable Computing:<br />
Architectures, Tools and Applications - 7th International Symposium,<br />
volume 6578 of LNCS, pages 67–78, Belfast, UK, March 2011. Springer.
186 BIBLIOGRAPHY<br />
[CDM + 10] Hassan Chafi, Zach DeVi<strong>to</strong>, Adriaan Moors, Tiark Rompf, Arvind K. Sujeeth,<br />
Pat Hanrahan, Martin Odersky, and Kunle Olukotun. Language virtualization<br />
<strong>for</strong> heterogeneous parallel computing. In Proceedings of the international<br />
conference on Object oriented programming systems languages and applications,<br />
OOPSLA, pages 835–847, New York, NY, USA, 2010. ACM.<br />
[CDMC + 05] Cristian Coarfa, Yuri Dotsenko, John M. Mellor-Crummey, François Can<strong>to</strong>nnet,<br />
Tarek A. El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel G.<br />
Chavarría-Miranda. An evaluation of global address space languages: coarray<br />
Fortran and unified parallel C. In SIGPLAN Annual Symposium on<br />
Principles and Practice of Parallel Programming, PPOPP, pages 36–47, New<br />
York, NY, USA, 2005. ACM.<br />
[CGO11] Proceedings of the International Symposium on Code Generation and Optimization<br />
(CGO), New York, NY, USA, April 2011. ACM.<br />
[CH89] Pohua P. Chang and W.-W. Hwu. Inline function expansion <strong>for</strong> compiling<br />
C programs. In Proceedings of the SIGPLAN Conference on Programming<br />
language design and implementation, PLDI, pages 246–257, New York, NY,<br />
USA, June 1989. ACM.<br />
[Che94] Wasel Chemij. Parallel Computer Taxonomy. PhD thesis, Aberystwyth University,<br />
1994.<br />
[CJIA11] Fabien Coelho, Pierre Jouvelot, François Irigoin, and Corinne Ancourt. Data<br />
and process abstraction in PIPS internal representation. In Workshop on<br />
Internal Representations, WIR, Chamonix, France, April 2011.<br />
[Cla96] Philippe Clauss. Counting solutions <strong>to</strong> linear and nonlinear constraints<br />
through ehrhart polynomials: applications <strong>to</strong> analyze and trans<strong>for</strong>m scientific<br />
programs. In Proceedings of the 10th international conference on Supercomputing,<br />
ICS, pages 278–285, New York, NY, USA, May 1996. ACM.<br />
[Coe93] Fabien Coehlo. Étude de la Compilation du High Per<strong>for</strong>mance Fortran. PhD<br />
thesis, Université Paris VI, 1993.<br />
[Con] Embedded Microprocessor Benchmark Consortium. Coremark. http://www.<br />
coremark.org.<br />
[Coo04] Keith D. Cooper. Evolving the next generation of compilers. keynote talk at<br />
CGO’04, 2004.<br />
[Cre96] Béatrice Creusillet. Array Region Analyses and Applications. PhD thesis,<br />
MINES ParisTech, 1996.<br />
[CSY10] Kuan-Hsu Chen, Bor-Yeh Shen, and Wuu Yang. An au<strong>to</strong>matic superword<br />
vec<strong>to</strong>rization in LLVM. In 16th Workshop on Compiler Techniques <strong>for</strong> High-<br />
Per<strong>for</strong>mance and Embedded Computing, pages 19–27, Taipei, 2010.<br />
[Dal09] William J. Dally. The end of denial architecture and the rise of throughput<br />
computing. In Design Au<strong>to</strong>mation Conference, San Francisco, CA, USA, July<br />
2009.
BIBLIOGRAPHY 187<br />
[Dar99] Alain Darte. On the complexity of loop fusion. In Proceedings of the 1999 International<br />
Conference on Parallel Architectures and Compilation Techniques,<br />
PACT, pages 149–, Washing<strong>to</strong>n, DC, USA, September 1999. IEEE Computer<br />
Society Press.<br />
[DKK + 99] Carole Dulong, Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei<br />
Li, John Ng, and David Sehr. An overview of the Intel IA-64 compiler. Intel<br />
Technology Journal, 1999.<br />
[DKYC10] Gregory Frederick Diamos, Andrew Kerr, Sudhakar Yalamanchili, and<br />
Nathan Clark. Ocelot: a dynamic optimization framework <strong>for</strong> bulksynchronous<br />
applications in heterogeneous systems. In 19th International<br />
Conference on Parallel Architecture and Compilation Techniques, PACT,<br />
pages 353–364. ACM, September 2010.<br />
[DLP03] Jack Dongarra, Piotr Luszczek, and An<strong>to</strong>ine Petitet. The LINPACK benchmark:<br />
past, present and future. Concurrency and Computation: Practice and<br />
Experience, 15(9):803–820, 2003.<br />
[DMM + ] Steven Derrien, Daniel Ménard, Kevin Martin, An<strong>to</strong>ine Floch, An<strong>to</strong>ine Morvan,<br />
Adeel Pasha, Patrice Quin<strong>to</strong>n, Amit Kumar, and Loïc Cloatre. GeCoS:<br />
Generic compiler suite. http://gecos.g<strong>for</strong>ge.inria.fr.<br />
[DSV96] Alain Darte, Georges-André Silber, and Frédéric Vivien. Combining retiming<br />
and scheduling techniques <strong>for</strong> loop parallelization and loop tiling, 1996.<br />
[Dun90] R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5–<br />
16, February 1990.<br />
[DUSsH93] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan shin Hwang. Communication<br />
optimizations <strong>for</strong> irregular scientific computations on distributed memory architectures.<br />
Journal of Parallel and Distributed Computing, 22:462–479, 1993.<br />
[ER08] Eric Eide and John Regehr. Volatiles are miscompiled, and what <strong>to</strong> do about<br />
it. In International Workshop on Embedded Systems, pages 255–264, 2008.<br />
[Ero95] Ana Maria Erosa. A go<strong>to</strong>-elimination method and its implementation <strong>for</strong> the<br />
McCat C compiler. Thesis (m.s.), McGill University, Montreal, Canada, May<br />
1995.<br />
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation<br />
of the Cilk-5 multithreaded language. In Proceedings of the SIGPLAN<br />
Conference on Program Language Design and Implementation, PLDI, pages<br />
212–223, 1998.<br />
[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness.<br />
Transactions on Computers, C-21(9):948–960, 1972.<br />
[FO03] Björn Franke and Michael F. P. O’Boyle. Array recovery and high-level trans<strong>for</strong>mations<br />
<strong>for</strong> DSP applications. Transactions in Embedded Computing Systems,<br />
2(2):132–162, 2003.
188 BIBLIOGRAPHY<br />
[Fra03] cois Ferrand Fran˙ Optimization and code parallelization <strong>for</strong> processors with<br />
multimedia SIMD instructions. Technical report, Télécom Bretagne, August<br />
2003. master thesis report.<br />
[GCB07] Gildas Genest, Richard Chamberlain, and Robin J. Bruce. Programming an<br />
FPGA-based super computer using a C-<strong>to</strong>-VHDL compiler: DIME-C. In<br />
Adaptive Hardware and Systems (AHS), pages 280–286, 2007.<br />
[GG] J. L. Gustafson and B. S. Greer. ClearSpeed whitepaper: accelerating the<br />
Intel Math Kernel Library. http://www.clearspeed.com/docs/resources/<br />
ClearSpeedIntelWhitepaperFeb07.pdf.<br />
[GLGP06] Gert Goossens, Dirk Lanneer, Werner Geurts, and Johan Van Praet. Design<br />
of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable <strong>to</strong>ol<br />
suite. International Symposium on System-on-Chip, 2006.<br />
[GNB08] Zhi Guo, Walid Najjar, and Betul Buyukkurt. Efficient hardware code generation<br />
<strong>for</strong> FPGAs. Transactions on Architecture and Code Optimization,<br />
5(1):1–26, 2008.<br />
[GO03] Etienne Gaudrain and Yann Orlarey. A Faust tutorial. Technical report,
GRAME, September 2003.<br />
[GPZ+01] María Jesús Garzarán, Milos Prvulovic, Ye Zhang, Josep Torrellas, Alin Jula, Hao Yu, and Lawrence Rauchwerger. Architectural support for parallel reductions in scalable shared-memory multiprocessors. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, PACT, pages 243–, Washington, DC, USA, 2001. IEEE Computer Society Press.
[GS04] Brian J. Gough and Richard M. Stallman. An Introduction to GCC. Network
Theory Ltd., 2004.<br />
[GZA + 11] Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin<br />
Größlinger, and Louis-Noël Pouchet. Polly - polyhedral optimization in<br />
LLVM. In First International Workshop on Polyhedral Compilation Techniques,<br />
IMPACT, 2011.<br />
[HEL + 09] Manuel Hohenauer, Felix Engel, Rainer Leupers, Gerd Ascheid, and Heinrich<br />
Meyr. A SIMD optimization framework for retargetable compilers. Transactions
on Architecture and Code Optimization, 6(1), 2009.<br />
[HF11] Matt J. Harvey and Gianni De Fabritiis. Swan: A tool for porting CUDA programs to OpenCL. Computer Physics Communications, 182(4):1093–1099,
2011.<br />
[HP06] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth<br />
Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San<br />
Francisco, CA, USA, 2006.<br />
[HRF + 10] Ever<strong>to</strong>n Hermann, Bruno Raffin, François Faure, Thierry Gautier, and<br />
Jérémie Allard. Multi-GPU and multi-CPU parallelization <strong>for</strong> interactive
physics simulations. In Euro-Par, volume 6272 of LNCS, pages 235–246.<br />
Springer, 2010.<br />
[HRTV11] Robert Hundt, Easwaran Raman, Martin Thuresson, and Neil Vachharajani.<br />
MAO – an extensible micro-architectural optimizer. In CGO [CGO11].<br />
[Ier06] Rober<strong>to</strong> Ierusalimschy. Programming in Lua, Second Edition. Lua.Org, 2006.<br />
[IJT91] François Irigoin, Pierre Jouvelot, and Rémi Triolet. Semantical interprocedural<br />
parallelization: an overview of the PIPS project. In International<br />
Conference on Supercomputing, ICS, pages 244–251, 1991.<br />
[iLJE03] Sang-Ik Lee, Troy A. Johnson, and Rudolf Eigenmann. Cetus - an extensible compiler infrastructure for source-to-source transformation. In 16th International Workshop on Languages and Compilers for Parallel Computing, volume 2958 of LNCS, pages 539–553, College Station, TX, USA, 2003.
[INR] INRIA. Aladdin-g5k. https://www.grid5000.fr.<br />
[IS99] Liviu If<strong>to</strong>de and Jaswinder Pal Singh. Shared Virtual Memory: Progress<br />
and Challenges. In Proceedings of the IEEE, pages 498–507. IEEE Computer<br />
Society Press, 1999.<br />
[ISCKG10] François Irigoin, Frédérique Silber-Chaussumier, Ronan Keryell, and Serge Guelton. PIPS tutorial at PPoPP 2010. http://pips4u.org/doc/tutorial, 2010.
[ISO99] ISO. ISO/IEC 9899 Programming languages — C. ISO, 1999.
[ISO08] ISO. ISO/IEC TR 18037:2008 Programming languages — C — Extensions to support embedded processors. ISO, 2008.
[IT88] François Irigoin and Rémi Triolet. Supernode partitioning. In Proceedings<br />
of the 15th SIGPLAN-SIGACT symposium on Principles of programming<br />
languages, POPL, pages 319–329, New York, NY, USA, 1988. ACM.<br />
[JD89] Pierre Jouvelot and Babak Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In International
Conference on Supercomputing, ICS, pages 186–194, 1989.<br />
[JM99] Simon Pey<strong>to</strong>n Jones and Simon Marlow. Secrets of the Glasgow Haskell<br />
compiler inliner. In Journal of Functional Programming, page 2002, 1999.<br />
[JMH + 05] Weihua Jiang, Chao Mei, Bo Huang, Jianhui Li, Jiahua Zhu, Binyu Zang,<br />
and Chuanqi Zhu. Boosting the performance of multimedia applications using
SIMD instructions. In International Conference on Compiler Construction,<br />
CC, pages 59–75, 2005.<br />
[JPJ + 11] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson,<br />
Stephen R. Beard, and David I. August. Automatic CPU–GPU communication
management and optimization. In Proceedings of the 32nd SIGPLAN<br />
conference on Programming language design and implementation, PLDI,<br />
pages 142–151, New York, NY, USA, June 2011. ACM.
[JRR99] Simon L. Peyton Jones, Norman Ramsey, and Fermin Reig. C--: A portable
assembly language that supports garbage collection. In Principles and Practice<br />
of Declarative Programming, International Conference, LNCS, pages 1–<br />
28, Paris, France, September 1999. Springer.<br />
[KBM07] Volodymyr V. Kindratenko, Robert J. Brunner, and Adam D. Myers. Mitrion-<br />
C application development on SGI Altix 350/RC100. In International Symposium<br />
on Field-Programmable Custom Computing Machines, FCCM, pages
239–250. IEEE Computer Society Press, 2007.<br />
[Ker03] Brian Kernighan. Interview with Brian Kernighan. Linux Journal, July 2003.
http://www.linuxjournal.com/article/7035.<br />
[KK92] Ken Kennedy and Kathryn S. McKinley. Optimizing for parallelism and data
locality. In International Conference on Supercomputing, ICS, pages 323–334,<br />
New York, NY, USA, 1992. ACM.<br />
[KKS00] Ki-Il Kum, Jiyang Kang, and Wonyong Sung. AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital
signal processors. Circuits and Systems II: Analog and Digital Signal<br />
Processing, 47(9):840–848, September 2000.<br />
[KOWG10] Khronos OpenCL Working Group. The OpenCL Specification, version 1.1,<br />
2010.<br />
[KRS90] Clyde Kruskal, Larry Rudolph, and Marc Snir. Efficient parallel algorithms<br />
<strong>for</strong> graph problems. Algorithmica, 5:43–64, 1990.<br />
[KS99] Kazuhiro Kusano and Mitsuhisa Sato. A comparison of automatic parallelizing compiler and improvements by compiler directives. In International Symposium on High Performance Computing, ISHPC, pages 95–108, London,
UK, 1999. Springer-Verlag.<br />
[KSA+10] Kazuhiko Komatsu, Katsuto Sato, Yusuke Arai, Kentaro Koyama, Hiroyuki Takizawa, and Hiroaki Kobayashi. Evaluating performance and portability of OpenCL programs. In The Fifth International Workshop on Automatic Performance Tuning, June 2010.
[LA00] Samuel Larsen and Saman P. Amarasinghe. Exploiting superword level parallelism<br />
with multimedia instruction sets. In Programming Language Design<br />
and Implementation, PLDI, pages 145–156, 2000.<br />
[LA03] Chris Lattner and Vikram Adve. Architecture for a next-generation GCC.
In Proceedings of First Annual GCC Developers’ Summit, Ottawa, Canada,<br />
May 2003.<br />
[LA04] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, CGO, Palo Alto, California, 2004.
[Lam74] Leslie Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83–93, 1974.
[Lat11] Chris Lattner. LLVM. In Amy Brown and Greg Wilson, editors, The Architecture of Open Source Applications, chapter 11. 2011. http://www.aosabook.org.
[Lei92] F. Thomson Leighton. Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[LF80] Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal<br />
of the ACM, 27:831–838, 1980.<br />
[LRW05] James Lebak, Albert Reuther, and Edmund Wong. Polymorphous computing<br />
architecture kernel-level benchmarks. Technical Report Project Report PCA-<br />
KERNEL-1, MIT Lincoln Laboratory, Lexington, MA, 2005.
[LVM+10] Allen Leung, Nicolas Vasilache, Benoît Meister, Muthu Baskaran, David Wohlford, Cédric Bastoul, and Richard Lethin. A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction. In Proceedings of the 3rd Workshop on General-Purpose Computation
on Graphics Processing Units, GPGPU, pages 51–61, New York, NY, USA,<br />
2010. ACM.<br />
[LWFK02] S.M. Loo, B.E. Wells, N. Freije, and J. Kulick. Handel-C for rapid prototyping
of VLSI coprocessors <strong>for</strong> real time systems. In Proceedings of the Thirty-<br />
Fourth Southeastern Symposium on System Theory, pages 6–10, 2002.<br />
[MAB+10] Harm Munk, Eduard Ayguadé, Cédric Bastoul, Paul Carpenter, Zbigniew Chamski, Albert Cohen, Marco Cornero, Philippe Dumont, Marc Duranton, Mohammed Fellahi, Roger Ferrer, Razya Ladelsky, Menno Lindwer, Xavier Martorell, Cupertino Miranda, Dorit Nuzman, Andrea Ornstein, Antoniu Pop, Sebastian Pop, Louis-Noël Pouchet, Alex Ramírez, David Ródenas, Erven Rohou, Ira Rosen, Uzi Shvadron, Konrad Trifunović, and Ayal Zaks. Acotes project: Advanced compiler technologies for embedded streaming. International
Journal of Parallel Programming, 2010. Special issue on European<br />
HiPEAC network of excellence members projects. To appear.<br />
[Mas92] Vadim Maslov. Delinearization: an efficient way to break multiloop dependence equations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, PLDI, pages 152–161, 1992.
[McB94] Oliver A. McBryan. An overview of message passing environments. Parallel<br />
Computing, 20:417–444, 1994.<br />
[MMG08] Peter Messmer, Paul J. Mullowney, and Brian E. Granger. GPULib: GPU<br />
computing in high-level languages. Computing in Science and Engineering,<br />
10:70–73, 2008.<br />
[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits.
Electronics, 38(8), April 1965.<br />
[Muc97] Steven S. Muchnick. Advanced compiler design and implementation, chapter<br />
13. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[Ngu02] Thi Viet Nga Nguyen. Efficient and Effective Software Verifications for Scientific
Applications using Static Analysis and Code Instrumentation. PhD<br />
thesis, MINES ParisTech, 2002.<br />
[NMRW02] George C. Necula, Scott McPeak, Shree Prakash Rahul, and Westley Weimer.<br />
CIL: Intermediate language and tools for analysis and transformation of C
programs. In Compiler Construction, volume 2304 of LNCS, pages 213–228.<br />
Springer, April 2002.<br />
[Nov06] Diego Novillo. GCC - an architectural overview, current status and future<br />
directions. Ottawa Linux Symposium, July 2006.<br />
[NRD + 11] Dorit Nuzman, Ira Rosen, Sergei Dyshel, Ayal Zaks, Erven Rohou, Kevin<br />
Williams, Albert Cohen, and David Yuste. Vapor SIMD: Auto-vectorize once,
run everywhere. In CGO [CGO11].<br />
[NSL + 11] Chris J. Newburn, Byoungro So, Zhenying Liu, Michael D. McCool, Anwar M.<br />
Ghuloum, Stefanus Du Toit, Zhi-Gang Wang, Zhaohui Du, Yongjian Chen,<br />
Gansha Wu, Peng Guo, Zhanglin Liu, and Dan Zhang. Intel's Array Building
Blocks: A retargetable, dynamic compiler and embedded language. In CGO<br />
[CGO11], pages 224–235.<br />
[NVI10] NVIDIA. PTX: Parallel Thread Execution ISA Version 2.1, NVIDIA compute<br />
edition, April 2010.<br />
[NVI11] NVIDIA. NVIDIA CUDA Reference Manual 3.2. http://www.nvidia.com/<br />
object/cuda_develop.html, 2011.<br />
[OOVV05] Karina Olmos and Eelco Visser. Composing source-to-source data-flow transformations with rewriting strategies and dependent
dynamic rewrite rules. In 14th International Conference on Compiler<br />
Construction, volume 3443 of LNCS, pages 204–220. Springer-Verlag, 2005.<br />
[Ope11] OpenMP Architecture Review Board. OpenMP Application Program Interface,<br />
2011.<br />
[Orf95] Sophocles J. Orfanidis. Introduction to signal processing. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1995.<br />
[PAB + 06] Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay<br />
Gupta, H. Peter Hofstee, Paul E. Harvey, Charles R. Johns, James A. Kahle,<br />
Atsushi Kameyama, John M. Keaty, Bob Le, Sang Lee, Tuyen V. Nguyen,<br />
John G. Petrovick, Mydung Pham, Juergen Pille, Stephen D. Posluszny,<br />
Mack W. Riley, Joseph Verock, James D. Warnock, Steve Weitzel, and Dieter<br />
F. Wendel. Key features of the design methodology enabling a multi-core<br />
SoC implementation of a first-generation CELL processor. In Asia and South<br />
Pacific Design Au<strong>to</strong>mation Conference, ASP-DAC, pages 871–878, 2006.<br />
[Pat10] David A. Patterson. The trouble with multicore. IEEE Spectrum, 2010.<br />
[PBB10] Louis-Noël Pouchet, Cédric Bastoul, and Uday Bondhugula. PoCC: the Polyhedral
Compiler Collection, 2010. http://pocc.sf.net.
[PBdD11] Artur Pietrek, Florent Bouchez, and Benoît Dupont de Dinechin. Tirex: A<br />
textual target-level intermediate representation for compiler exchange. In
Workshop on Intermediate Representations, WIR, Chamonix, France, April<br />
2011.<br />
[PBSB04] Gilles Pokam, Stéphane Bihan, Julien Simonnet, and François Bodin.<br />
SWARP: a retargetable preprocessor for multimedia instructions. Concurrency
and Computation: Practice and Experience, 16(2-3):303–318, 2004.<br />
[PBV06] E. Moscu Panainte, K.L.M. Bertels, and S. Vassiliadis. Interprocedural compiler optimization for partial run-time reconfiguration. Journal of VLSI Signal Processing, pages 161–172, 2006.
[PBV07] Elena Moscu Panainte, Koen Bertels, and Stamatis Vassiliadis. The Molen<br />
compiler <strong>for</strong> reconfigurable processors. Transactions in Embedded Computing<br />
Systems, 2007.<br />
[PEH + 93] David A. Padua, Rudolf Eigenmann, Jay Hoeflinger, Paul Petersen, Peng Tu,<br />
Stephen Weather<strong>for</strong>d, and Keith Faigin. Polaris: A new-generation parallelizing<br />
compiler <strong>for</strong> MPPs. Technical Report 1306, University of Illinois, Center<br />
<strong>for</strong> Supercomputing Research and Development, 1993.<br />
[PGS+09] Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, and Wen-mei W. Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In Proceedings of the 7th Symposium on
Application Specific Processors, pages 35–42. IEEE Computer Society Press,<br />
July 2009.<br />
[PH09] David A. Patterson and John L. Hennessy. Computer organization and design:<br />
the hardware/software interface. Morgan Kaufmann Publishers, 2009.<br />
[Qui00] Daniel J. Quinlan. ROSE: Compiler support for object-oriented frameworks.
Parallel Processing Letters, 10(2/3):215–226, 2000.<br />
[RBR + 05] Noam Rinetzky, Jörg Bauer, Thomas W. Reps, Shmuel Sagiv, and Reinhard<br />
Wilhelm. A semantics for procedure local heaps and its abstractions.
In Proceedings of the 32nd SIGPLAN-SIGACT Symposium on Principles of<br />
Programming Languages, pages 296–309, New York, NY, USA, January 2005.<br />
ACM.<br />
[RCH + 10] Gabe Rudy, Chun Chen, Mary Hall, Malick Murtaza Khan, and Jacqueline<br />
Chame. A programming language interface to describe transformations and code generation. In The 23rd International Workshop on Languages and Compilers for Parallel Computing, LCPC, pages 136–150, Berlin, Heidelberg,
2010. Springer-Verlag.<br />
[RNZ07] Ira Rosen, Dorit Nuzman, and Ayal Zaks. Loop-aware SLP in GCC - two<br />
years later. In GCC summit, 2007.<br />
[Roj04] Juan Rojas. Multimedia Macros for Portable Optimized Programs. PhD
thesis, Northeastern University, 2004.
[SA94] Mark Segal and Kurt Akeley. The OpenGL graphics interface. Technical<br />
report, Silicon Graphics Computer Systems, 1994.<br />
[Sar97] Vivek Sarkar. Automatic selection of high-order transformations in the
IBM XL FORTRAN compilers. IBM Journal of Research and Development,<br />
41(3):233–264, 1997.<br />
[SCH03] Jaewook Shin, Jacqueline Chame, and Mary W. Hall. Exploiting superword-level
locality in multimedia extension architectures. Journal of Instruction-<br />
Level Parallelism, 5, 2003.<br />
[Sch09] David Schleef. Oil Runtime Compiler. http://code.entropywave.com/<br />
projects/orc, 2009.<br />
[SCS + 08] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash,<br />
Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert<br />
Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan.<br />
Larrabee: a many-core x86 architecture <strong>for</strong> visual computing. Transactions<br />
on Graphics, 27:18:1–18:15, August 2008.<br />
[SHC05] Jaewook Shin, Mary W. Hall, and Jacqueline Chame. Superword-level parallelism<br />
in the presence of control flow. In International Symposium on Code<br />
Generation and Optimization, CGO, pages 165–175, 2005.<br />
[SHG08] Shubhabrata Sengupta, Mark Harris, and Michael Garland. Efficient parallel<br />
scan algorithms <strong>for</strong> GPUs. Technical report, NVIDIA, 2008.<br />
[Sin98] Satnam Singh. Accelerating Adobe Photoshop with reconfigurable logic. In Symposium on FPGA Custom Computing Machines, pages 18–26. IEEE Computer
Society Press, 1998.<br />
[SQ03] Markus Schordan and Daniel J. Quinlan. A source-to-source architecture for user-defined optimizations. In László Böszörményi and Peter Schojer, editors, Modular Programming Languages, volume 2789 of LNCS, pages 214–223, 2003.
[SSM08] Jay Smith, Howard Jay Siegel, and Anthony A. Maciejewski. A stochastic
model <strong>for</strong> robust resource allocation in heterogeneous parallel and distributed<br />
computing systems. In International Parallel & Distributed Processing Symposium,<br />
IPDPS, pages 1–5. IEEE Computer Society Press, 2008.<br />
[Ste97] Robert Stephens. A survey of stream processing. Acta Informatica, 34:491–
541, 1997.<br />
[S<strong>to</strong>08] O. Olaf S<strong>to</strong>raasli. Accelerating genome sequencing 100–1000x with FPGAs.<br />
Many-core and Reconfigurable Supercomputing Conference, April 2008.<br />
[TCE + 10] Konrad Trifunovic, Albert Cohen, David Edelsohn, Feng Li, Tobias Grosser,<br />
Harsha Jagasia, Razya Ladelsky, Sebastian Pop, Jan Sjödin, and Ramakrishna<br />
Upadrasta. GRAPHITE two years after: First lessons learned from<br />
real-world polyhedral compilation. In GCC Research Opportunities Workshop,<br />
GROW, Pisa, Italy, 2010.
[TM08] Donald Thomas and Philip Moorby. The Verilog Hardware Description Language. Springer Publishing Company, Incorporated, 5th edition, 2008.
[TNC + 09] Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen.<br />
Polyhedral-model guided loop-nest auto-vectorization. International Conference
on Parallel Architectures and Compilation Techniques, pages 327–337,<br />
2009.<br />
[vECGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik<br />
Schauser. Active messages: a mechanism for integrated communication and
computation. SIGARCH Computer Architecture News, 20:256–266, 1992.<br />
[War99] Martin P. Ward. Assembler to C migration using the FermaT transformation
system. In International Conference on Software Maintenance, ICSM, pages<br />
67–76, 1999.<br />
[WC03] Ge Wang and Perry R. Cook. ChucK: a concurrent, on-the-fly audio programming
language. In International Computer Music Conference, ICMC, pages<br />
219–226, 2003.<br />
[WFW+94] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN
Notices, 29:31–37, 1994.<br />
[WGN + 02] Oliver Wahlen, Tilman Glökler, Achim Nohl, Andreas Hoffmann, Rainer Leupers,<br />
and Heinrich Meyr. Application specific compiler/architecture codesign:<br />
a case study. SIGPLAN Notices, 37:185–193, 2002.<br />
[Whe01] David A. Wheeler. SLOCCount. http://www.dwheeler.com/sloccount,<br />
January 2001.<br />
[Whi09] Tom White. Hadoop: The Definitive Guide. O’Reilly, June 2009.<br />
[Wik09] Wikibooks, editor. GNU C Compiler Internals. http://en.wikibooks.org/
wiki/GNU_C_Compiler_Internals, 2006-2009.<br />
[WL91a] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In<br />
SIGPLAN conference on Programming Language Design and Implementation,<br />
PLDI, pages 30–44, New York, NY, USA, 1991. ACM.<br />
[WL91b] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. Transactions on Parallel and Distributed
Systems, 2(4):452–471, 1991.<br />
[WM95] Wm. A. Wulf and Sally A. Mckee. Hitting the memory wall: Implications of<br />
the obvious. Computer Architecture News, 23:20–24, March 1995.<br />
[Wol94] Wayne H. Wolf. Hardware-software co-design of embedded systems. In Proceedings
of the IEEE, pages 967–989, 1994.<br />
[Wol96] Michael Wolfe. High performance compilers for parallel computing. Addison-
Wesley, 1996.
[Wol10] Michael Wolfe. Implementing the PGI accelerator model. In Proceedings of
the 3rd Workshop on General-Purpose Computation on Graphics Processing<br />
Units, GPGPU, pages 43–50, New York, NY, USA, 2010. ACM.<br />
[Wol11] Michael Wolfe. Compilers and more: Programming at exascale. HPC Wire,
March 2011. http://www.hpcwire.com/hpcwire/2011-03-08/compilers_<br />
and_more_programming_at_exascale.html.<br />
[WW94] David W. Walker. The design of a standard message passing interface for distributed memory concurrent computers. Parallel Computing,
20:657–673, 1994.<br />
[Yi11] Qing Yi. Automated programmable control and parameterization of compiler
optimizations. In CGO [CGO11].<br />
[YRR + 10] Tomofumi Yuki, Lakshminarayanan Renganarayanan, Sanjay V. Rajopadhye,<br />
Charles Anderson, Alexandre E. Eichenberger, and Kevin O'Brien. Automatic creation of tile size selection models. In International Symposium on Code Generation and Optimization, CGO, pages 190–199, New York, NY,
USA, April 2010. ACM.<br />
[ZC91] Hans Zima and Barbara Chapman. Supercompilers for parallel and vector
computers. ACM, New York, NY, USA, 1991.<br />
[ZC98] Julien Zory and Fabien Coelho. Using algebraic transformations to optimize expression evaluation in scientific codes. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, PACT, pages
376–384, 1998.<br />
[Zim97] Re<strong>to</strong> Zimmermann. Binary adder architectures <strong>for</strong> cell-based VLSI and their<br />
synthesis. PhD thesis, Swiss Federal Institute of Technology (ETH), Zürich ,<br />
Switzerland, 1997.<br />
[ZO01] Matthias Zenger and Martin Odersky. Implementing extensible compilers. In<br />
ECOOP workshop on multiparadigm programming with object-oriented languages,<br />
pages 61–80, 2001.<br />
[ZPS + 96] V. Zivojnovic, S. Pees, C. Schlager, M. Willems, R. Schoenen, and H. Meyr.<br />
DSP processor/compiler co-design: a quantitative approach. In Proceedings<br />
of the 9th international symposium on System synthesis, ISSS, pages 108–,<br />
Washing<strong>to</strong>n, DC, USA, 1996. IEEE Computer Society Press.<br />
[ZWZD93] Songnian Zhou, Jingwen Wang, Xiaohu Zheng, and Pierre Delisle. Utopia: a load sharing facility for large, heterogeneous distributed computer systems. Software – Practice and Experience, 23:1305–1336, December 1993.
Personal Bibliography<br />
[AAC+11] Mehdi Amini, Corinne Ancourt, Fabien Coelho, Béatrice Creusillet, Serge Guelton, François Irigoin, Pierre Jouvelot, Ronan Keryell, and Pierre Villalon.
PIPS Is not (only) Polyhedral Software. In First International Workshop on<br />
Polyhedral Compilation Techniques, IMPACT, Chamonix, France, April 2011.<br />
[ACSGK11] Corinne Ancourt, Frédérique Chaussumier-Silber, Serge Guelton, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at International
Symposium on Code Generation and Optimization, April 2011.<br />
Chamonix, France.<br />
[CSGIK10] Frédérique Chaussumier-Silber, Serge Guelton, François Irigoin, and Ronan Keryell. PIPS: An interprocedural, extensible, source-to-source compiler infrastructure for code transformations and instrumentations. Tutorial at Principles
and Practice of Parallel Programming, January 2010. Bangalore, India.<br />
[DGG+07] Vincent Danjean, Roland Gillard, Serge Guelton, Jean-Louis Roch, and Thomas Roche. Adaptive loops with KAAPI on multicore and grid: applications in symmetric cryptography. In Parallel Symbolic Computation, PASCO,
pages 33–42, 2007.<br />
[GAKC11] Serge Guelton, Mehdi Amini, Ronan Keryell, and Béatrice Creusillet. PyPS, a programmable pass manager. Poster at International Workshop on Languages and Compilers for Parallel Computing, September 2011. Fort Collins,
Colorado, USA.<br />
[GGK11] Serge Guelton, Adrien Guinet, and Ronan Keryell. Building retargetable and efficient compilers for multimedia instruction sets. Poster at Parallel Architectures and Compilation Techniques, October 2011. Galveston, Texas,
USA.<br />
[GGPV09] Serge Guelton, Thierry Gautier, Jean-Louis Pazat, and Sébastien Varrette. Dynamic Adaptation Applied to Sabotage Tolerance. In Proceedings of
the 17th Euromicro International Conference on Parallel, Distributed and<br />
Network-Based Processing, PDP, pages 237–244, Weimar, Germany, February<br />
2009.<br />
[GIK10] Serge Guelton, François Irigoin, and Ronan Keryell. Automatic and source-to-source code generation for vector hardware accelerators. Poster at Colloque
National du GDR SOC-SIP, June 2010. Cergy, France.<br />
[GKI11] Serge Guelton, Ronan Keryell, and François Irigoin. Compilation pour cibles hétérogènes: automatisation des analyses, transformations et décisions nécessaires.
In 20ème Rencontres Françaises du Parallélisme, Renpar, Saint Malo,<br />
France, May 2011.<br />
[Gue09] Serge Guelton. A genetic and source-to-source approach to iterative compilation. Poster at ACM Student Research Competition Posters, Parallel Architectures
and Compilation Techniques, September 2009. Raleigh, North<br />
Carolina, USA.<br />
[Gue10] Serge Guelton. Automatic source-to-source code generation for vector hardware accelerators. Poster at International Workshop on Languages and Compilers for Parallel Computing, October 2010. Houston, Texas, USA.
[Gue11] Serge Guelton. Building Source-to-Source Compilers for Heterogeneous Targets. PhD thesis, Télécom Bretagne, 2011.
[GV09] Serge Guelton and Sébastien Varrette. Une approche génétique et source
à source de l’optimisation de code. In 19ème Rencontres francophones du<br />
parallélisme, Renpar, Toulouse, France, September 2009.<br />
[PVG + 08a] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Sandro Bimonte,<br />
Hélène Mathian, Serge Guelton, Joël Boulier, and Jérôme Gensel. Hypersmooth: A system for interactive spatial analysis via potential maps. In The 8th International Symposium on Web and Wireless Geographical Information
Systems, W2GIS, pages 4–16, 2008.<br />
[PVG + 08b] Christine Plumejeaud, Jean-Marc Vincent, Claude Grasland, Jérôme Gensel,<br />
Hélène Mathian, Serge Guelton, and Joël Boulier. Hypersmooth : calcul et
visualisation de cartes de potentiel interactives. CoRR, abs/0802.4191, 2008.<br />
[TVa+12] Massimo Torquati, Marco Vanneschi, Mehdi Amini, Serge Guelton, Ronan
Keryell, Vincent Lanore, François-Xavier Pasquier, Michel Barreteau, Rémi<br />
Barrère, Claudia-Teodora Petrisor, Éric Lenormand, Claudia Cantini, and<br />
Filippo De Stefani. An innovative compilation tool-chain for embedded multicore
architectures. In Embedded World Conference, February 2012.
Index<br />
terapix, 16, 17, 19, 73, 77, 78, 123, 133,<br />
141, 144, 146–151, 153–155, 160, 163<br />
array linearization, 144, 150<br />
C99, 34<br />
common subexpression elimination, 85, 167<br />
compilation flow, 39, 42, 45, 67, 134, 152
constant propagation, 51<br />
convex array regions, 87, 120, 168
data transfers, 29, 104, 114, 120, 123, 141<br />
dead code elimination, 120, 167<br />
directive generation, 49, 134<br />
distributed memory, 29, 114<br />
flatten code, 150<br />
<strong>for</strong>ward substitution, 46, 84, 167<br />
fuzz testing, 63<br />
go<strong>to</strong> elimination, 84<br />
header substitution, 59, 99, 108, 152<br />
inlining, 46, 53, 83, 84<br />
instruction selection, 79<br />
invariant code motion, 167<br />
iteration clamping, 144<br />
loop fusion, 49, 52, 87, 134, 167<br />
loop interchange, 102, 167<br />
loop normalization, 144<br />
loop rerolling, 108<br />
loop tiling, 52, 102, 167<br />
loop unrolling, 46, 50, 102<br />
loop unswitching, 104<br />
memory footprint reduction, 121, 163<br />
n-address code generation, 150
outlining, 18, 84, 87, 159, 162<br />
parallelism detection, 134<br />
parallelism extraction, 49<br />
pass manager, 46, 47, 134<br />
privatization, 134<br />
reduction detection, 49, 134<br />
redundant load-s<strong>to</strong>re elimination, 113, 124,<br />
163<br />
scalar renaming, 100<br />
split update opera<strong>to</strong>r, 150<br />
statement isolation, 114, 120, 141, 159, 163<br />
strength reduction, 150<br />
symbolic tiling, 122, 159<br />
variable length array, 34, 141