29.01.2015 Views

Embedded Software for SoC - Grupo de Mecatrônica EESC/USP


Chapter 27

identifier and a mapper (described in the previous section) that identifies the RB to be used by instructions operating on that packet. Static mapping of ports to RBs is a good, though not always optimal, scheme that can be used to reduce hardware complexity. The Flow identifier table and the mapper are accessible to all microengines of the NPU and to the memory controller, which is responsible for filling an entry on the arrival of a new packet.
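The two structures described above can be pictured as a pair of lookup tables. The following is only a minimal software sketch, not the hardware design: the class and function names, the default ids of 0, and the choice of one RB per output port are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

DEFAULT_FLOW_ID = 0  # assumed default flow id
DEFAULT_RB = 0       # assumed id of the default reuse buffer


@dataclass
class Mapper:
    """Static flow id -> RB id mapping (hypothetical: one RB per output port)."""
    flow_to_rb: dict = field(default_factory=dict)

    def rb_for(self, flow_id):
        # Flows without a mapping fall back to the default RB.
        return self.flow_to_rb.get(flow_id, DEFAULT_RB)


@dataclass
class FlowIdentifierTable:
    """Shared table: packet id -> flow id."""
    entries: dict = field(default_factory=dict)

    def fill(self, packet_id):
        # Performed by the memory controller when a new packet arrives.
        self.entries[packet_id] = DEFAULT_FLOW_ID

    def set_flow(self, packet_id, flow_id):
        # Performed by a thread once the actual flow id is known.
        self.entries[packet_id] = flow_id

    def flow_of(self, packet_id):
        return self.entries.get(packet_id, DEFAULT_FLOW_ID)


def rb_to_query(table, mapper, packet_id):
    """RB that an instruction operating on this packet should query."""
    return mapper.rb_for(table.flow_of(packet_id))


# Hypothetical configuration: output ports 1..3 statically mapped to RBs 1..3.
mapper = Mapper({port: port for port in (1, 2, 3)})
table = FlowIdentifierTable()

table.fill(packet_id=17)                    # packet arrives
rb_before = rb_to_query(table, mapper, 17)  # default RB until the flow is known
table.set_flow(17, flow_id=2)               # flow id taken to be the output port
rb_after = rb_to_query(table, mapper, 17)
```

Because both tables are plain lookups, a static mapping keeps the per-instruction work to two reads, which is the sense in which it reduces complexity in this sketch.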

The flow id field is initialized to a default value (say 0), which maps to the default reuse buffer. This field is updated once the actual flow id is computed by any of the schemes mentioned previously. One scheme for implementing flow-based IR is to tag packets as they arrive at the router or NPU. This tag serves as a packet identifier which, after mapping, becomes the RB identifier and is used by instructions operating on a packet to query the appropriate RB. In another scheme, the packet id of the packet currently being processed by a thread is obtained by tracking the memory accessed by read instructions (we assume that a packet is stored in contiguous memory). Each thread stores the packet id of the packet currently being processed, the thread id, and the flow id to which the packet belongs.

The selection of the RB based on the flow id is made possible by augmenting the Reorder Buffer (RoB) with the thread id and the RB id (the mapper gives the flow id to RB id mapping). Instructions belonging to different threads (even those in other processors) access the Flow identifier table, which indicates the RB to be queried by each instruction. Initially, the flow id field indicates the default RB. After a certain amount of processing, the thread that determines the output port (for the output-port-based scheme) updates the flow id entry in the Flow identifier table for the packet being processed. This information becomes known to all other threads operating on the same packet through the centralized Flow identifier table. Once the flow id is known, the mapper gives the exact RB id (the RB to be used), which is stored in thread registers as well as in the RoB.

When the processing of the packet is complete, it is removed from memory, i.e., it is forwarded to the next router or sent to the host processor for local consumption. This action is again initiated by some thread and actually carried out by the memory controller.
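The address-tracking scheme mentioned above relies on each packet occupying one contiguous region of memory, so the address of a read is enough to recover the packet id. A rough sketch of that lookup follows; the `PacketMemoryMap` and `ThreadContext` names, the addresses, and the use of -1 for "no packet yet" are illustrative assumptions, and real hardware would use a range-matching structure rather than a sorted Python list.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class ThreadContext:
    """Per-thread state: thread id, current packet id, and its flow id."""
    thread_id: int
    packet_id: int = -1  # -1 (assumed sentinel): no packet identified yet
    flow_id: int = 0     # assumed default flow id


class PacketMemoryMap:
    """Maps an address to the packet stored there (contiguous buffers)."""

    def __init__(self):
        self._starts = []  # sorted buffer start addresses
        self._bufs = []    # (start, length, packet_id), in the same order

    def add(self, start, length, packet_id):
        i = bisect_right(self._starts, start)
        self._starts.insert(i, start)
        self._bufs.insert(i, (start, length, packet_id))

    def packet_at(self, addr):
        # Find the last buffer starting at or before addr, then bounds-check.
        i = bisect_right(self._starts, addr) - 1
        if i < 0:
            return None
        start, length, packet_id = self._bufs[i]
        return packet_id if addr < start + length else None


def on_read(ctx, memory_map, addr):
    """Update the thread's packet id when a read hits a packet buffer."""
    pid = memory_map.packet_at(addr)
    if pid is not None:
        ctx.packet_id = pid
    return ctx.packet_id


# Hypothetical layout: two packets in separate contiguous buffers.
memory_map = PacketMemoryMap()
memory_map.add(start=0x1000, length=64, packet_id=7)
memory_map.add(start=0x2000, length=128, packet_id=8)

ctx = ThreadContext(thread_id=3)
on_read(ctx, memory_map, 0x2010)  # read falls inside the second buffer
```

Reads that miss every packet buffer (for example, accesses to lookup tables) leave the stored packet id unchanged in this sketch, so the thread keeps querying the RB chosen for its current packet.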

Once the packet has been removed from memory, the flow id field in the Flow identifier table for the packet is reset to the default value. In summary, IR is always exploited, with instructions querying either the default RB or an RB specified by the flow id. However, the centralized Flow identifier table could be a potential bottleneck in the system. It is also possible to use a hashing function that sends all packets of the same flow to the same processing engine, in which case the hardware complexity of determining the RB to be queried is reduced significantly. The drawback of this scheme is that the processing engines could be non-uniformly loaded, resulting in performance degradation.
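The hashing alternative and its drawback can be illustrated with a small sketch. Everything here is assumed for the example: the number of engines, the byte-string flow keys, the CRC32 hash, and the made-up traffic mix; the text does not specify any of them.

```python
import zlib
from collections import Counter

NUM_ENGINES = 4  # hypothetical number of processing engines


def engine_for_flow(flow_key: bytes) -> int:
    """Send every packet of a flow to the same engine."""
    # CRC32 is used because it is deterministic across runs, unlike
    # Python's built-in hash() for bytes, which is randomized per process.
    return zlib.crc32(flow_key) % NUM_ENGINES


def engine_load(packets):
    """Count how many packets each engine receives."""
    load = Counter()
    for flow_key in packets:
        load[engine_for_flow(flow_key)] += 1
    return load


# Made-up traffic: one heavy flow and three light ones.
packets = [b"flow-A"] * 90 + [b"flow-B"] * 5 + [b"flow-C"] * 3 + [b"flow-D"] * 2
load = engine_load(packets)
```

Since all 90 packets of the heavy flow hash to one engine, that engine receives at least 90 of the 100 packets regardless of how the other flows are distributed, which is the non-uniform loading the text warns about.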
