Soft-Core Processor Design - CiteSeer

• QsortInt: uses the qsort algorithm to sort 100 integers. It is equivalent to the Qsort benchmark from the MiBench set. The only difference is the size of the dataset.

The UT Nios benchmark set also includes some test benchmarks. Test benchmarks are sequences of assembler instructions that test the performance of specific architectural features. By analyzing how the performance of test benchmarks depends on various parameters, it is possible to get better insight into the performance of real applications. The test benchmarks are:

• Loops: runs a tenfold nested loop with two addition operations inside the innermost loop, representing applications with many branches.
• Memory: performs 20 consecutive addition/subtraction operations on elements of a one-dimensional array residing in the data memory, representing memory intensive applications.
• Pipeline: performs a sequence of ADD, ADDI, MOV, and SUB instructions, where each instruction uses the result of the previous instruction. The benchmark tests the implementation of the data forwarding logic in the pipeline.
• Pipeline-Memory: performs a sequence of pairs of LD and ADD instructions, where the ADD instruction uses the result of the LD. The benchmark tests how data forwarding of the operands coming from the memory is implemented. Two variants of this benchmark are used, performing the load operations from the program and data memory. Load instructions typically access the program memory to access the compiler generated data.

The toy and test benchmarks are run multiple times in a loop to produce reasonable run times that can be measured. All benchmarks were compiled using gcc for Nios (version 2.9-nios-010801-20030227) included in the GNUPro toolkit, with the compiler optimization level 3. The time was measured using a timer peripheral provided in the SOPC component library.
The timer peripheral provides a C library routine nr_timer_milliseconds(), which returns the number of milliseconds since the first call to this procedure. We measure the performance in milliseconds for all benchmarks, except for the Bitcount benchmark, whose performance is reported in bits/ms. The Bitcount benchmark randomly generates its input dataset, so its run time varies slightly depending on the system configuration.

5.1.2. Development Tools

The following set of cores and tools was used to obtain the results presented in this chapter. The Altera Nios 3.0 implementation of the Nios architecture was used to compare the performance of the UT Nios and Altera Nios. The Altera Nios 3.0 comes with SOPC Builder 3.0 and the GNUPro Toolkit. Quartus II, version 3.0, Web Edition was used to synthesize all of the designs. Version 3.0 includes the Design Space Explorer (DSE) version 1.0. DSE provides an interface for automated exploration of various design compilation parameters. Among others, there is a seed parameter that influences the initial placement used by the placement algorithm. Our experiments show that the variation in compilation results from seed to seed is significant. We use the DSE to sweep through a number of seeds to obtain better compilation results. The influence of a seed on the compilation results will be discussed in more detail in the next chapter.

The Nios Development Kit, Stratix Edition [50] contains a development board with the EP1S10F780C6ES Stratix FPGA, which has 10,570 LEs and 920 Kbits of on-chip memory. In our experiments, we use both the on-chip memory and a 1 MB off-chip SRAM memory. Since the on-chip memory is synchronous, it has a latency of one cycle. The off-chip SRAM is a zero-latency memory, but it shares the connection to the Stratix device with other components on the board. Hence, it has to be connected to the Avalon bus by using a tri-state bridge. Since both the inputs and the outputs of the bridge are registered, the Nios processor sees a memory with a latency of two cycles.

The board is connected to a PC through a ByteBlasterMV cable [58] for downloading the FPGA configuration into the Stratix device on the board. There is also a serial connection for communication with the board using a terminal program provided in the GNUPro toolkit. The terminal program communicates with the GERMS monitor running on the Nios processor. The monitor program is used to download the compiled programs into the memory, run the programs, and communicate with a running program.
All system configurations were run using a 50 MHz clock generated on the development board. The results were then prorated to the maximum frequency (Fmax) at which each system can run. The Fmax was determined by using the DSE seed sweep function to obtain the best Fmax over 10 predefined seeds. The systems were not run at the maximum frequency because every change in the design requires finding a new seed value that produces the best Fmax. This is the case even if the best Fmax did not change significantly. We have verified that the systems run correctly at the Fmax obtained. We have also run several applications on a system running at the Fmax obtained, and verified that the difference between the prorated results and real results is negligible.

Quartus II was configured to use the Stratix device available on the development board, the appropriate pins were assigned, as described in [59], and unused pins were reserved as tri-stated inputs. Other Quartus options were left at their default values, unless otherwise stated.