

4.1 Instruction Fetch

The execution of an instruction typically occupies one pipeline stage for only one cycle, and issuing multiple instructions from one thread is only reasonable if there are enough instructions available. Hence the number of instructions fetched per cycle must be equal to or greater than the number of instructions that can be issued concurrently. As the number of concurrent instructions is equal to the number of pipelines, this number must be multiplied by the maximum instruction length to obtain the required fetch bandwidth.

Assuming a zero-cycle memory latency, it takes two cycles from the decision that a new fetch must be initiated (in the issue stage) until the arrival of the data at the instruction window (again in the issue stage). Therefore the instruction window (IW) must be large enough to hold at least two times the fetch width. Each additional cycle of memory latency increases the required size of the IW by a further fetch width. If the instruction width varies and the instructions are not aligned to the borders of the fetch words, the size must be increased further.
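The sizing rule above can be written down as a short calculation; the following is a minimal sketch with invented function and parameter names, not code from the paper:

```python
def fetch_width_bits(num_pipelines: int, max_instr_bits: int) -> int:
    # One instruction per pipeline per cycle, each up to the maximum length.
    return num_pipelines * max_instr_bits

def min_iw_bits(fetch_bits: int, extra_mem_latency_cycles: int = 0) -> int:
    # Two cycles elapse between deciding to fetch and the data arriving in
    # the issue stage, so the IW must buffer at least two fetch widths;
    # every additional cycle of memory latency adds one more fetch width.
    return (2 + extra_mem_latency_cycles) * fetch_bits

fw = fetch_width_bits(2, 32)   # CarCore-like: two pipelines, 32-bit max -> 64 bits
print(min_iw_bits(fw))         # 128 bits, before any alignment slack
```

The concrete example below adds the slack that variable-length, 16-bit-aligned instructions require.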

A concrete example with the CarCore architecture: there are two pipelines and an instruction can be 16 or 32 bits wide. Accordingly, fetching 64 bits should provide at least enough instructions for one cycle. If in cycle t + 0 the RTI issues two instructions to the pipelines, it removes at most 64 bits from the IW, recognises that the IW should be refilled, and initiates a fetch. During cycle t + 1 the memory is accessed and the RTI must take the next 64 bits from the IW. In cycle t + 2 the fetched data arrives at the RTI, so it can be issued directly to the pipelines. But 128 bits are still not enough: TriCore instructions need only be aligned to 16-bit boundaries, so four instructions can cover three 64-bit words, and the minimum IW size is therefore 192 bits.
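The 192-bit figure follows from a worst-case alignment scenario, which can be checked with a small calculation (the helper names are ours, for illustration only):

```python
import math

def fetch_words_spanned(start_offset_bits: int, total_instr_bits: int,
                        fetch_word_bits: int = 64) -> int:
    # Number of 64-bit fetch words touched by a run of instructions that
    # starts at the given bit offset within its first fetch word.
    end = start_offset_bits + total_instr_bits
    return math.ceil(end / fetch_word_bits) - start_offset_bits // fetch_word_bits

# Worst case: the first of four 32-bit instructions starts in the last
# 16-bit slot of a fetch word (offset 48), so the run occupies bits 48..175.
words = fetch_words_spanned(48, 4 * 32)   # -> 3 fetch words
print(words * 64)                         # -> 192-bit minimum IW size
```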

With the proposed fetch width and instruction window size, optimal execution of the highest priority thread (HPT) can be guaranteed. But what about the other threads? The HPT fully occupies the fetch stage only if there is code that uses every pipeline in every cycle and these instructions are of maximum length. As the evaluation shows, this is almost never the case. Whenever the IW of a thread is full, the fetch logic tries to fetch for the thread with the next highest priority. Again, the evaluation shows that empty IWs are only a minor reason for not executing a lower priority thread.
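A fetch arbitration of this kind might look like the following sketch; the names and data structure are assumptions for illustration, not the actual CarCore logic:

```python
from dataclasses import dataclass
from typing import List, Optional

FETCH_WIDTH_BITS = 64  # assumed CarCore fetch width

@dataclass
class Thread:
    priority: int       # lower value = higher priority (assumption)
    iw_free_bits: int   # free space left in this thread's instruction window

def select_fetch_thread(threads: List[Thread]) -> Optional[Thread]:
    # Fetch for the highest-priority thread whose IW still has room for a
    # full fetch word; if the HPT's window is full, fall through to the
    # thread with the next highest priority.
    for t in sorted(threads, key=lambda th: th.priority):
        if t.iw_free_bits >= FETCH_WIDTH_BITS:
            return t
    return None  # all instruction windows are full
```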

There are two possibilities to optimise the fetching. The first one, called ENOUGH, counts exactly how many instructions are in the IW and how they are mapped to pipelines. If there are enough instructions to cover two cycles, further fetches for this thread are delayed, no matter whether the IW is already full or not. The AHEAD logic stops fetching when it recognises a branch somewhere within the IW. This optimisation is only applicable if there is no branch prediction and if there are at least two pipeline stages between fetch and branch decision.
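Expressed in the same style as the sketch above, ENOUGH and AHEAD act as additional gating conditions on the per-thread fetch decision (again with invented names; the paper gives no implementation):

```python
FETCH_WIDTH_BITS = 64  # assumed CarCore fetch width

def may_fetch(iw_free_bits: int,
              covers_two_issue_cycles: bool,
              branch_in_iw: bool) -> bool:
    # ENOUGH: suppress further fetches while the instructions already in the
    # IW (mapped to their pipelines) cover the next two issue cycles, even
    # if the window itself is not yet full.
    if covers_two_issue_cycles:
        return False
    # AHEAD: suppress fetches while an unresolved branch sits in the IW,
    # because instructions fetched behind it may be discarded anyway.
    if branch_in_iw:
        return False
    # Otherwise fetch as soon as a full fetch word fits into the window.
    return iw_free_bits >= FETCH_WIDTH_BITS
```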

An example for the CarCore architecture with three stages between fetch and branch decision (RTI, decode and execute): if in cycle t + 0 only the branch is in the IW, the RTI issues it and removes the instruction from the IW. In cycle t + 1 the AHEAD logic recognises that there is no longer a branch in the IW and permits fetching the next instruction. In cycle t + 2 the next instruction
