
The above analysis demonstrates that in all three cases, new architecture-level and circuit-level innovations are required to develop a network processor that meets the OC-768 performance goal.

Two basic approaches to speeding up network packet processing inside a network processor are pipelining and parallelization. A deeper pipeline reduces the cycle time of the pipeline and, therefore, improves system throughput. However, because packet classification, packet forwarding, and packet scheduling each exhibit complicated internal dependencies and sometimes iterative structures, it is difficult to pipeline these functions effectively. Parallelization can be applied at different granularities. Finer-granularity parallelism is more difficult to exploit but potentially yields a higher performance gain. In the case of network packet processing, because the processing of one packet is independent of the processing of another, packet-level parallelism appears to be the right granularity, striking a good balance between performance gain and implementation complexity. Typically, a thread is dedicated to the processing of one packet, and different threads can run in parallel on distinct hardware engines, as sketched below. To reap further performance improvement by exploiting instruction-level parallelism, researchers and companies have proposed running concurrent packet-processing threads on a simultaneous multithreading processor [5,11] to mask as many pipeline stalls as possible. Such multithreading processors require support for multiple hardware contexts and fast context switching.
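As a rough software analogy to this dispatch model, the following sketch uses POSIX threads in place of dedicated hardware engines; the stage functions are hypothetical stand-ins, not from the text. Each arriving packet gets its own thread, so the stages of one packet stay ordered while distinct packets proceed in parallel:

```c
#include <pthread.h>
#include <stdlib.h>

struct packet { unsigned char data[1518]; unsigned len; };

/* Hypothetical per-packet stages; real implementations would do
 * rule matching, route lookup, and queueing, respectively. */
static void classify(struct packet *p) { (void)p; }
static void forward(struct packet *p)  { (void)p; }
static void schedule(struct packet *p) { (void)p; }

/* One thread per packet: the stages of a single packet run in
 * order, but distinct packets run concurrently, mirroring one
 * hardware context per packet on a multithreaded packet engine. */
static void *process_packet(void *arg)
{
    struct packet *p = arg;
    classify(p);
    forward(p);
    schedule(p);
    free(p);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) {        /* stand-in for an RX loop */
        struct packet *p = calloc(1, sizeof *p);
        pthread_create(&t[i], NULL, process_packet, p);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

On a real network processor the "threads" are hardware contexts with near-single-cycle switching, which is what makes using them to hide pipeline stalls cheap.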

Compared with generic CPU workloads, network packet processing requires much more frequent bit-level manipulation, such as header field extraction and header checksum computation. In a standard RISC processor, extracting an arbitrary range of bits from a 32-bit word requires at least three instructions, and performing a byte-wide summing of the four bytes within a word takes at least 13 instructions. Therefore, commercial network processors [6] include special bit-level manipulation and one's-complement instructions to speed up header field extraction and replacement, as well as packet checksum computation.
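To make these instruction counts concrete, here is a C sketch of the operations as a generic RISC ISA sees them; the function names and field boundaries are illustrative, not from the text:

```c
#include <stdint.h>

/* Extract bits [lo, lo+len) of a 32-bit word (assumes len >= 1 and
 * lo + len <= 32). With run-time field positions a plain RISC needs
 * roughly three instructions: either computing both shift amounts and
 * shifting twice, or a shift plus mask construction plus an AND. A
 * network processor's bit-extract instruction does this in one. */
static inline uint32_t extract_bits(uint32_t w, unsigned lo, unsigned len)
{
    return (w << (32u - lo - len)) >> (32u - len);
}

/* Byte-wide sum of the four bytes in a word: four extractions plus
 * three adds on a plain RISC, on the order of the 13 instructions
 * cited above once the masks and shifts are counted. */
static inline uint32_t byte_sum(uint32_t w)
{
    return (w & 0xff) + ((w >> 8) & 0xff) + ((w >> 16) & 0xff) + (w >> 24);
}

/* One's-complement sum over 16-bit words with end-around-carry
 * folding, the core of the IP header checksum (byte-order handling
 * omitted for brevity). */
static uint16_t ones_complement_checksum(const uint16_t *p, int nwords)
{
    uint32_t s = 0;
    while (nwords--) s += *p++;
    while (s >> 16) s = (s & 0xffffu) + (s >> 16);
    return (uint16_t)~s;
}
```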

Caching is arguably the most effective and most often used technique in modern computer system design. One place in network processor design to which caching can be effectively applied is packet classification. Since many packets travel over a network connection during its lifetime, in theory each intermediate network device only needs to perform packet classification once, for the first packet, and can reuse the resulting classification decision for all subsequent packets. This corresponds to temporal locality if one treats the set of all possible values of the header fields used in packet classification as an address space. Empirical studies [7,8] show that network packet streams indeed exhibit substantial temporal locality but very little spatial locality. In addition, unlike in a CPU cache, the classification results for neighboring points in this address space tend to be identical. Therefore, a network processor cache can be designed to cache address ranges rather than individual address-space points, as a standard cache does. Chiueh and Pradhan [8] showed that caching ranges of the classification address space can increase the effective coverage of a network processor cache by several orders of magnitude compared to conventional caches that cache individual addresses.
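A minimal sketch of the idea, assuming the classifier-relevant header fields have been flattened into a single 64-bit key; the entry layout and the fully associative linear search are illustrative only:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative range-cache entry: one cached decision covers an
 * entire inclusive range [lo, hi] of the flattened header key,
 * rather than a single address as in a conventional cache. */
struct range_entry {
    uint64_t lo, hi;
    int      decision;
    bool     valid;
};

#define NENTRIES 64
static struct range_entry cache[NENTRIES];

/* Returns the cached decision, or -1 on a miss; on a miss the full
 * classifier runs and installs the covering range it computed. */
int range_cache_lookup(uint64_t key)
{
    for (int i = 0; i < NENTRIES; i++)
        if (cache[i].valid && key >= cache[i].lo && key <= cache[i].hi)
            return cache[i].decision;
    return -1;
}
```

Because a single entry covers an entire range of the header address space, far fewer entries are needed for the same hit rate, which is the effective-coverage gain reported in [8].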

Another alternative for speeding up packet classification is special content-addressable memory (CAM) [13]. Commercial CAMs support ternary comparison logic (0, 1, and X, or don't-care). Classification rules are pre-stored in the CAM. Given an input packet, the selected portion of its packet header is compared against all the stored classification patterns in parallel, and a priority decoder picks the highest-priority rule when several rules match. Although a CAM can identify relevant packet classification rules at wire speed, two problems arise in applying CAMs to the packet classification problem. First, to support a range match, e.g., source port number 130–202, one has to break the range rule into multiple rules, each covering an aligned range whose size is a power of 2, because CAMs support only don't-care matching, not arbitrary arithmetic comparison. For example, the range 130–202 needs to be broken down into eight sub-ranges: 130–131, 132–135, 136–143, 144–159, 160–191, 192–199, 200–201, and 202–202. For classification rules with multiple range fields, the need for range decomposition can significantly increase the number of CAM entries required. Second, because CAMs are hardwired memories with a built-in width, they cannot easily support matching of variable-length fields such as URLs, or accommodate changes to the packet classification rules after network devices are put into field use.
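The decomposition itself is mechanical: repeatedly peel off the largest power-of-2-aligned block that begins at the low end of the remaining range. A small C sketch (a hypothetical helper, not from the text) that reproduces the eight sub-ranges above:

```c
#include <stdio.h>
#include <stdint.h>

/* Range-to-prefix decomposition of the kind a ternary CAM forces
 * on range rules: each emitted block starts on a multiple of its
 * size, so it can be expressed as a prefix-plus-don't-care pattern. */
void decompose(uint32_t lo, uint32_t hi)
{
    while (lo <= hi) {
        /* Largest block size permitted by lo's alignment
         * (lowest set bit of lo; the full space if lo == 0). */
        uint32_t size = lo ? (lo & -lo) : 0x80000000u;
        /* Shrink until the block also fits below hi. */
        while (size > hi - lo + 1)
            size >>= 1;
        printf("%u-%u\n", lo, lo + size - 1);
        if (lo + size - 1 == hi)
            break;
        lo += size;
    }
}

int main(void)
{
    decompose(130, 202);  /* prints the eight sub-ranges above */
    return 0;
}
```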

