
Lecture Notes in Computer Science 4917


Aggressive Function Inlining: Preventing Loop Blockings 385

responsible for most of the branch mis-predictions. Inlining eliminates many return instructions, thus "freeing" a significant number of entries in the branch history tables.¹ Finally, eliminating the code of the function's prologue and epilogue may further reduce execution time.

Despite this potential, inlining is not used aggressively and is usually applied only in restricted cases. The reason is that aggressive function inlining can cause code bloat and, consequently, instruction cache (Icache) conflicts, thus degrading performance. In particular, different copies of the same function may compete for space in the Icache. If a function is frequently called from different call sites, duplicating it can cause more cache misses due to frequent references to these multiple copies or to code segments that are activated through them.

The main problem with inlining is that the final instruction locations (addresses in the final executable) are determined only after inlining is completed. Consequently, it is not possible to determine in advance which inlinings will eventually lead to Icache conflicts. Many optimizations that modify the code and change its location are applied after inlining (such as scheduling and constant propagation).

An important optimization that dramatically alters code locations and is applied after inlining is code reordering [7]. Code reordering groups sequences of hot (frequently executed) basic blocks into ordered "chains" that are mapped to consecutive addresses in the Icache. Thus, code reordering can reduce or repair part of the damage caused by aggressive inlining decisions. However, it does not eliminate the need to carefully select which function calls to inline.

Even sophisticated versions of code reordering [2] that use cache coloring techniques cannot repair the damage caused by excessive inlining. This inherent circular dependency leads compiler architects to use heuristics for function inlining. Current inlining methods use a combination of the following basic methods and differ only in the criterion/threshold for when to apply a given method:

– Inline only very small functions, essentially preserving the original code size.
– Inline a function if it has only a single caller, thus preserving or even reducing the original code size. This is extended to inlining only dominant calls, i.e., call sites that account for the majority of calls to a given function.
– Inline only hot calls, not necessarily the dominant ones.
– Inline only if the resulting consecutive chain of basic blocks does not exceed the size of the L1 Icache.
– Inline calls with constant parameters whose substitution, followed by constant propagation and dead code elimination, will improve performance. This criterion was proposed and used in [5] for the related optimization of function cloning, i.e., creating a specialized clone of a function for a specific call site.²

We have experimented with these rules in IBM's FDPR-Pro tool, a post-link optimizer [7], and discovered that they are too restrictive and can prevent many possible inlinings that could lead to significant code improvement.

¹ This is important in processors with deep pipelines or minor branch prediction support. The latter attribute is prevalent in leading high-end embedded processors.
² Cloning is less efficient than inlining since it does not increase the size of consecutive chains of hot basic blocks.
