
Lecture Notes in Computer Science 4917


Aggressive Function Inlining: Preventing Loop Blockings in the Instruction Cache

Yosi Ben Asher, Omer Boehm, Daniel Citron, Gadi Haber, Moshe Klausner, Roy Levin, and Yousef Shajrawi

IBM Research Lab in Haifa, Israel

Computer Science Department, Haifa University, Haifa, Israel

{omerb,citron,haber,klausner}@il.ibm.com

Abstract. Aggressive function inlining can lead to significant improvements in execution time. This potential is reduced by extensive instruction cache (Icache) misses caused by the subsequent code expansion. It is very difficult to predict which inlinings cause Icache conflicts, as the exact location of code in the executable depends on completing the inlining first. In this work we propose a new method for selective inlining called “Icache Loop Blockings” (ILB). In ILB we only allow inlinings that do not create multiple inlined copies of the same function in hot execution cycles. This prevents any increase in the Icache footprint. This method is significantly more aggressive than previous ones, and experiments show that it also performs better.

Results on a server-level processor and on an embedded CPU, running SPEC CINT2000, show that the ILB scheme improves execution time by 10% in comparison to other inlining methods. This was achieved without bloating the size of the hot code executed at any single point of execution, which is crucial for the embedded processor domain.

We have also considered the synergy between code reordering and inlining, focusing on how inlining can help code reordering. This aspect of inlining has not been studied in previous works.

1 Introduction

Function inlining [1] is a known optimization where the compiler or post-link tool replaces a call to a function by its body, directly substituting the values passed as parameters. Function inlining can improve instruction scheduling as it increases the size of basic blocks. Other optimizations such as global scheduling, dead code elimination, constant propagation, and register allocation may also benefit from function inlining. In order to optimize the code that was generated by the inlining operation, inlining must be executed before most of the backend optimizations.
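As a small illustration of the transformation described above, the following C sketch writes out by hand what an inliner would do to a single call site. The functions `scale_add`, `sum_with_call`, `sum_inlined`, and `check_inlining_equivalence` are hypothetical names chosen for this example and do not come from the paper; an optimizing compiler may well perform this inlining on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callee: small enough to be a typical inlining candidate. */
static int scale_add(int x, int factor, int offset)
{
    return x * factor + offset;
}

/* Before inlining: every loop iteration performs a call and a return. */
int sum_with_call(const int *v, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += scale_add(v[i], 2, 1);
    return sum;
}

/* After inlining, written out by hand: the call is replaced by the body
 * of scale_add, with the argument values (factor = 2, offset = 1)
 * substituted directly. The loop body no longer contains a call. */
int sum_inlined(const int *v, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i] * 2 + 1;
    return sum;
}

/* Sanity check that the hand-inlined version preserves behaviour. */
void check_inlining_equivalence(void)
{
    const int v[] = {1, 2, 3, 4};
    assert(sum_with_call(v, 4) == 24);  /* 3 + 5 + 7 + 9 */
    assert(sum_inlined(v, 4) == 24);
}
```

Once the body is in place, later passes see the constants 2 and 1 directly in the loop, which is the kind of opportunity that constant propagation and instruction scheduling can exploit. The cost is that each inlined call site carries its own copy of the callee's code.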

There is a special relation between inlining and embedded systems. Embedded CPUs have relatively small branch history tables compared to servers. Aggressive inlining can improve the branch prediction in embedded systems, compensating for their relatively small number of entries. The reason is that return instructions are implemented with branch-via-register instructions which are typically

P. Stenström et al. (Eds.): HiPEAC 2008, LNCS 4917, pp. 384–397, 2008.

© Springer-Verlag Berlin Heidelberg 2008
