27.06.2013 Views

EE 675 Advanced Microprocessors ARM – A little history


<strong>EE</strong> <strong>675</strong><br />

<strong>Advanced</strong> <strong>Microprocessors</strong><br />

<strong>ARM</strong> Organization and Implementation<br />

Dr. Khurram Waheed<br />

<strong>EE</strong> <strong>675</strong> @ SDSU 1<br />

<strong>ARM</strong> <strong>–</strong> A <strong>little</strong> <strong>history</strong><br />

• First <strong>ARM</strong> processor developed on 3 micron technology in ‘83-’85<br />

• This course is mainly based on the <strong>ARM</strong>6/7 architecture<br />

developed between ‘90-’95.<br />

• Digital Equipment Corporation (then Compaq, now HP) developed the Strong<strong>ARM</strong> processor, which delivers very high performance.<br />

• More recent developments are <strong>ARM</strong>8 and <strong>ARM</strong>9E (1999), and<br />

• an <strong>ARM</strong> processor without a clock: the asynchronous AMULET from U. of Manchester (Steve Furber’s group)<br />



<strong>ARM</strong> organization<br />

• Two main blocks: datapath and<br />

decoder<br />

• Register bank<br />

<strong>–</strong> r0 to r15<br />

<strong>–</strong> 2 read ports,<br />

<strong>–</strong> 1 write port<br />

<strong>–</strong> 1 read, 1 write port reserved for r15 (pc)<br />

• Barrel shifter <strong>–</strong> shift or rotate one<br />

operand for any number of bits<br />

• ALU <strong>–</strong> performs the arithmetic and<br />

logic functions required<br />

• Memory address register +<br />

incrementer<br />

• Memory data registers<br />

• Instruction decoder and associated<br />

control logic<br />

[Figure: <strong>ARM</strong> organization block diagram: address register, incrementer, register bank, multiply register, barrel shifter, ALU, data out register, instruction decode & control; external buses A[31:0] and D[31:0]; internal ALU bus]<br />

<strong>ARM</strong> <strong>–</strong> Internal Organization<br />

3-stage Pipeline<br />

• Data register holds read/write data<br />

from/to memory<br />

• Instruction decoder decodes<br />

machine code instructions to<br />

produce control signals to datapath<br />

• Data processing instructions take a single cycle: data values are read on the A-bus & B-bus, and the result from the ALU is written back into the register bank<br />

[Figure: 3-stage pipeline <strong>ARM</strong> organization block diagram: address register, incrementer, register bank, multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control; buses A[31:0], D[31:0], A bus, B bus, ALU bus and PC]<br />


Pipelining<br />

• The maximum processing rate is determined by the propagation<br />

delay of the computational logic <strong>–</strong> in abstract:<br />

[Figure: Function 1 (time T1) feeding directly into Function 2 (time T2)]<br />

• Above we can process one input every T1+T2<br />

[Figure: Function 1 (time T1) and Function 2 (time T2) separated by a register]<br />

• Above we can process one input every max{T1,T2}, but each<br />

input still takes T1+T2 to be completely processed<br />
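The two cases above can be put into numbers with a minimal sketch. The function names and the values of T1 and T2 are hypothetical and chosen only for illustration; the formulas are the ones stated on the slide.

```python
def input_interval(t1, t2, pipelined):
    """Time between successive inputs the two-function circuit can accept."""
    if pipelined:
        # With a register between the functions, the slower one sets the rate.
        return max(t1, t2)
    # Without the register, both functions must finish before the next input.
    return t1 + t2


def processing_time(t1, t2):
    """Time for one input to pass through both functions (per the slide)."""
    return t1 + t2


# Hypothetical delays in nanoseconds, not taken from the course material.
T1_NS, T2_NS = 3.0, 5.0
print("unpipelined: one input every", input_interval(T1_NS, T2_NS, False), "ns")
print("pipelined:   one input every", input_interval(T1_NS, T2_NS, True), "ns")
print("each input still takes", processing_time(T1_NS, T2_NS), "ns")
```

With these assumed delays the pipelined version accepts inputs more often, while the time to process any single input does not improve.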


Three-stage pipeline<br />

• <strong>ARM</strong> uses a 3-stage instruction pipeline<br />

<strong>–</strong> Fetch: fetch instruction code from memory into the instruction pipeline<br />

<strong>–</strong> Decode: instruction decoded to obtain control signals for the datapath ready<br />

for the next stage<br />

<strong>–</strong> Execute: instruction “owns” the datapath - register read; shifting; ALU results<br />

generated and write-back<br />

• Results for each stage stored in registers<br />

• The consequence is that the clock period is much shorter than without<br />

pipelining<br />



Pipeline: how it works<br />

• All instructions occupy the datapath for one or more<br />

adjacent cycles<br />

• For each cycle that an instruction occupies the datapath,<br />

it occupies the decode logic in the immediately<br />

preceding cycle<br />

• During the first datapath cycle each instruction issues<br />

a fetch for the next instruction but one<br />

• Branch instructions flush and refill the instruction pipeline<br />

<strong>ARM</strong> single-cycle instruction<br />

pipeline<br />

[Figure: instructions 1, 2 and 3 each pass through fetch, decode and execute, each starting one cycle after the previous one; horizontal axis: time]<br />

• At any time, up to 3 different instructions may occupy the 3 stages of the pipeline, one per stage<br />

• A single-cycle instruction takes three cycles from fetch to completion. It is said to have a three-cycle latency<br />

• Once a pipeline fills, the processor completes a single-cycle instruction<br />

every clock cycle. Therefore the throughput is one instruction per<br />

cycle.<br />
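The latency and throughput statements above can be checked with a small sketch that lays out single-cycle instructions on an ideal 3-stage pipeline. The helper names `pipeline_schedule` and `total_cycles` are hypothetical, and the model assumes no stalls and no multi-cycle instructions.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(instructions):
    """Map each clock cycle (starting at 1) to the instruction in each stage."""
    schedule = {}
    for index, instruction in enumerate(instructions):
        for stage_number, stage in enumerate(STAGES):
            # Instruction `index` enters fetch in cycle index + 1 and moves
            # one stage further in every following cycle.
            cycle = index + stage_number + 1
            schedule.setdefault(cycle, {})[stage] = instruction
    return schedule


def total_cycles(n_instructions):
    """Cycles needed to complete n single-cycle instructions."""
    if n_instructions == 0:
        return 0
    # Two cycles to fill the pipeline, then one instruction completes per cycle.
    return n_instructions + len(STAGES) - 1


program = ["add r0,r1,#5", "sub r2,r3,r6", "cmp r2,#3"]
for cycle, stages in sorted(pipeline_schedule(program).items()):
    print(cycle, stages)
```

Under these assumptions three instructions need five cycles, and from the third cycle onwards one instruction completes per cycle.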



<strong>ARM</strong> single-cycle instruction<br />

pipeline<br />

[Figure: pipeline occupancy for three single-cycle instructions; horizontal axis: time]<br />

1. add r0,r1,#5: fetch, decode, execute add<br />

2. sub r2,r3,r6: fetch, decode, execute sub (one cycle behind instruction 1)<br />

3. cmp r2,#3: fetch, decode, execute cmp (one cycle behind instruction 2)<br />

<strong>ARM</strong> multi-cycle instruction pipeline<br />

[Figure: pipeline occupancy for five instructions, the second of which is a two-cycle STR; horizontal axis: time]<br />

1. fetch ADD, decode, execute<br />

2. fetch STR, decode, calc. addr., data xfer<br />

3. fetch ADD, decode, execute<br />

4. fetch ADD, decode, execute<br />

5. fetch ADD, decode, execute<br />

• Decode logic is always generating the control signals for the datapath to use in the next cycle<br />


PC behavior <strong>–</strong> Pipeline<br />

• As a consequence of the pipeline, the PC (r15) needs to run ahead of the current instruction<br />

• Each instruction fetches the next instruction but one during its first cycle, i.e., the PC points 8 bytes ahead of the current instruction.<br />

• So a program that uses the PC must account for the pipeline effects<br />

• The situation is more complex in cycles later than the first cycle.<br />
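A minimal sketch of the 8-byte rule follows. It assumes 32-bit (4-byte) ARM instructions and a PC read during the first execute cycle; the function names are hypothetical and exist only for this illustration.

```python
MASK32 = 0xFFFFFFFF
PIPELINE_PC_OFFSET = 8  # two 4-byte instructions ahead of the current one


def pc_as_read(instruction_address):
    """Value seen when the instruction at instruction_address reads r15."""
    return (instruction_address + PIPELINE_PC_OFFSET) & MASK32


def pc_relative_offset(instruction_address, target_address):
    """Offset to add to the PC, as read by the instruction, to reach target."""
    return target_address - pc_as_read(instruction_address)


# An instruction at 0x8000 that wants to address a word at 0x8010 must use
# an offset of 8, not 16, because r15 already reads as 0x8008.
print(hex(pc_as_read(0x8000)), pc_relative_offset(0x8000, 0x8010))
```

This is why hand-computed PC-relative offsets are always 8 smaller than the plain address difference.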

<strong>ARM</strong> multi-cycle LDMIA (load<br />

multiple) instruction<br />

[Figure: pipeline occupancy for a load multiple followed by two single-cycle instructions; horizontal axis: time]<br />

1. ldmia r0,{r2,r3}: fetch, decode, ex ld r2, ex ld r3<br />

2. sub r2,r3,r6: fetch, decode (delayed), ex sub<br />

3. cmp r2,#3: fetch, decode, ex cmp<br />

• Instruction delayed: sub is fetched at the normal time but not decoded until LDMIA is finishing<br />

• Decode stage occupied, since ldmia must continue to remember the decoded instruction<br />



Control stalls: due to branches<br />

• Branches often introduce stalls (branch penalty)<br />

<strong>–</strong> Stall time may depend on whether branch is taken<br />

• May have to squash instructions that already<br />

started executing<br />

• Don’t know what to fetch until condition is<br />

evaluated<br />


<strong>ARM</strong> pipelined branch<br />

bne foo<br />

sub r2,r3,r6<br />

foo add r0,r1,r2<br />

[Figure: pipeline occupancy for the sequence above; horizontal axis: time]<br />

1. bne foo: fetch, decode, ex bne, ex bne, ex bne<br />

2. sub r2,r3,r6: fetch, decode<br />

3. add r0,r1,r2: fetch, decode, ex add<br />

• Decision not made until the third clock cycle<br />

• Two cycles of work thrown away if bne is taken<br />


Expanding the pipeline<br />

• The 3-stage pipeline used up to <strong>ARM</strong>7 is very cost-effective<br />

• Better pipeline architectures are required for better performance<br />

$T = \dfrac{N_{inst} \times CPI}{f_{clk}}$<br />

• $N_{inst}$ is constant<br />

• So, only two options<br />

<strong>–</strong> Increase the clock rate, $f_{clk}$: requires more pipeline stages and simpler logic per stage<br />

<strong>–</strong> Reduce the average number of clock cycles per instruction, $CPI$: requires instructions to occupy fewer pipeline slots and fewer stalls in the pipeline (memory bandwidth bottlenecks)<br />
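The effect of the two options can be tried out with a short sketch of the formula T = N_inst × CPI / f_clk. All numbers below (instruction count, CPI values, clock rates) are hypothetical and are not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """T = N_inst * CPI / f_clk, in seconds."""
    return n_inst * cpi / f_clk_hz


# Hypothetical workload of one million instructions.
N_INST = 1_000_000

baseline = execution_time(N_INST, cpi=2.0, f_clk_hz=20e6)      # assumed baseline
faster_clock = execution_time(N_INST, cpi=2.0, f_clk_hz=40e6)  # option 1: raise f_clk
lower_cpi = execution_time(N_INST, cpi=1.5, f_clk_hz=20e6)     # option 2: reduce CPI

print(f"baseline     {baseline * 1e3:.1f} ms")
print(f"faster clock {faster_clock * 1e3:.1f} ms")
print(f"lower CPI    {lower_cpi * 1e3:.1f} ms")
```

Since N_inst is fixed for a given program, only the CPI and f_clk arguments change between the three calls.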

5-stage Pipeline<br />

• Measures to take care of memory bottlenecks<br />

<strong>–</strong> Use separate code and data memories<br />

<strong>–</strong> Increasing the number of pipeline stages reduces the work the processor does in each clock cycle<br />

• The above steps allow a RISC processor to work at a higher clock rate.<br />

• Use of separate instruction and data caches connected to a single DRAM greatly reduces the core’s CPI<br />


<strong>ARM</strong>9TDMI: 5-stage pipeline<br />

• Fetch<br />

• Decode<br />

<strong>–</strong> instruction is decoded<br />

<strong>–</strong> register operands read<br />

(3 read ports)<br />

• Execute<br />

<strong>–</strong> an operand is shifted and the<br />

ALU result generated, or<br />

<strong>–</strong> address is computed<br />

• Buffer/data<br />

<strong>–</strong> data memory is accessed (load,<br />

store)<br />

• Write-back<br />

<strong>–</strong> write to register file<br />

[Figure: <strong>ARM</strong>9TDMI 5-stage pipeline organization: fetch (I-cache, next pc, pc + 4), decode (I decode, register read), execute (ALU, load/store address with pre-index and post-index, LDM/STM +4 incrementer), and write-back (register write); pc + 8 is supplied as r15, with separate paths for B, BL, MOV pc, SUBS pc and LDR pc]<br />

<strong>ARM</strong>9TDMI: Data Forwarding<br />

ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1<br />

ADD r5, r5, r3, LSL r2 ; r5 := r5 + 2^r2 x r3<br />

ADD r3, r2, r1, LSL #3 ; r3 := r2 + 8 x r1<br />

ADD r8, r9, r10 ; r8 := r9 + r10<br />

ADD r5, r5, r3, LSL r2 ; r5 := r5 + 2^r2 x r3<br />

LDR r3, [r2] ; r3 := mem[r2]<br />

ADD r1, r2, r3 ; r1 := r2 + r3<br />

Each sequence reads r3 shortly after writing it: Data Forwarding or Stall?<br />
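The question raised by these sequences can be explored with a rough sketch that finds read-after-write dependences and labels them. The classification rule is an assumption about a generic 5-stage pipeline (a load followed immediately by its consumer stalls, any other dependence within two instructions is forwarded); it is not taken from the slides, and the `Instr` type and `classify_hazards` function are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    sources: Tuple[str, ...]
    is_load: bool = False


def classify_hazards(seq: List[Instr]):
    """Return (producer, consumer, register, action) for each dependence."""
    results = []
    for j, consumer in enumerate(seq):
        for reg in consumer.sources:
            # Search backwards for the most recent earlier writer of reg.
            for i in range(j - 1, -1, -1):
                if seq[i].dest == reg:
                    distance = j - i
                    if seq[i].is_load and distance == 1:
                        action = "stall"    # assumed: loaded data arrives too late
                    elif distance <= 2:
                        action = "forward"  # assumed: result bypasses the register file
                    else:
                        action = "none"
                    results.append((seq[i].text, consumer.text, reg, action))
                    break
    return results


alu_pair = [
    Instr("ADD r3, r2, r1, LSL #3", "r3", ("r2", "r1")),
    Instr("ADD r5, r5, r3, LSL r2", "r5", ("r5", "r3", "r2")),
]
load_use = [
    Instr("LDR r3, [r2]", "r3", ("r2",), is_load=True),
    Instr("ADD r1, r2, r3", "r1", ("r2", "r3")),
]
print(classify_hazards(alu_pair))
print(classify_hazards(load_use))
```

Under the assumed rule the back-to-back ADD pair is handled by forwarding, while the load followed by an immediate use is reported as a stall.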

[Figure: <strong>ARM</strong>9TDMI 5-stage pipeline with forwarding paths: fetch (I-cache, next pc, pc + 4), decode (instruction decode, register read, immediate fields), execute (mul, shift, ALU, forwarding paths), buffer/data (D-cache, byte repl., rot/sgn ex, load/store address, LDM/STM, pre-index and post-index), write-back (register write)]<br />


<strong>ARM</strong>9TDMI: PC generation<br />

• 3-stage pipeline<br />

<strong>–</strong> PC behavior: operands are read in the execute stage, so r15 = PC + 8<br />

• 5-stage pipeline<br />

<strong>–</strong> operands are read in the decode stage, so r15 = PC + 4?<br />

<strong>–</strong> incompatibilities between 3-stage and 5-stage implementations => unacceptable<br />

<strong>–</strong> to avoid this, 5-stage pipeline <strong>ARM</strong>s emulate the behavior of the older 3-stage designs<br />

[Figure: <strong>ARM</strong>9TDMI PC generation: next pc and pc + 4 at the I-cache in the fetch stage, pc + 8 supplied as r15 at register read, with paths for B, BL, MOV pc, SUBS pc and LDR pc, the load/store address mux (pre-index, post-index, LDM/STM +4), the ALU, and register write in write-back]<br />

Data processing instruction<br />

datapath activity (Ex)<br />

• Reg-Reg<br />

<strong>–</strong> Rd = Rn op Rm<br />

<strong>–</strong> r15 = AR + 4<br />

AR = AR + 4<br />

• Reg-Imm<br />

<strong>–</strong> Rd = Rn op Imm<br />

<strong>–</strong> r15 = AR + 4<br />

AR = AR + 4<br />

[Figure: data processing instruction datapath activity, showing the address register, incrementer, registers (Rd, Rn, Rm, PC), multiplier, shifter, ALU, data out, data in and instruction pipe: (a) register <strong>–</strong> register operations; (b) register <strong>–</strong> immediate operations, with the immediate taken from instruction bits [7:0]]<br />


STR (store register) datapath<br />

activity (Ex1, Ex2)<br />

• Compute address (Ex1)<br />

<strong>–</strong> AR = Rn op Disp<br />

<strong>–</strong> r15 = AR + 4<br />

• Store data (Ex2)<br />

<strong>–</strong> AR = PC<br />

<strong>–</strong> mem[AR] = Rd<br />

<strong>–</strong> If autoindexing => Rn = Rn +/- 4<br />

[Figure: STR datapath activity: (a) 1st cycle <strong>–</strong> compute address (offset from instruction bits [11:0], lsl #0, ALU function A / A + B / A - B); (b) 2nd cycle <strong>–</strong> store data & auto-index (Rd driven to data out, ALU function A + B / A - B)]<br />

The first two (of three) cycles of a<br />

branch instruction<br />

• Compute target address<br />

<strong>–</strong> AR = PC + Disp, lsl #2<br />

• Save return address (if required)<br />

<strong>–</strong> r14 = PC<br />

<strong>–</strong> AR = AR + 4<br />
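The target computation above can be sketched as follows. This assumes the usual 32-bit ARM branch format with a signed 24-bit word offset in instruction bits [23:0] and a PC that reads 8 bytes ahead of the branch; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_24(value):
    """Interpret the low 24 bits of value as a two's complement number."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def branch_target(branch_address, offset24):
    """AR = PC + Disp, lsl #2, with PC reading as branch_address + 8."""
    pc = branch_address + 8
    return (pc + (sign_extend_24(offset24) << 2)) & MASK32


def return_address(branch_address):
    """Link value after correction: the instruction following the branch."""
    return (branch_address + 4) & MASK32


# A branch at 0x8000 with offset field 0xFFFFFE (-2 words) targets itself.
print(hex(branch_target(0x8000, 0xFFFFFE)), hex(return_address(0x8000)))
```

The return address is 4 bytes past the branch rather than 8, which is why the link register value needs a correction after the PC is first copied into r14.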

[Figure: the first two (of three) cycles of a branch instruction: (a) 1st cycle <strong>–</strong> compute branch target (offset from instruction bits [23:0], lsl #2, ALU function A + B); (b) 2nd cycle <strong>–</strong> save return address in R14 (ALU function A)]<br />

• Third cycle: apply a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch<br />


<strong>ARM</strong> Implementation<br />

• Datapath<br />

<strong>–</strong> RTL (Register Transfer Level) description<br />

• Control unit<br />

<strong>–</strong> FSM (Finite State Machine) description<br />

2-phase non-overlapping clock<br />

scheme<br />

• Most <strong>ARM</strong>s do not use edge-sensitive registers<br />

• Instead the design is based around 2-phase non-overlapping clocks, which are generated internally from a single clock signal<br />

• Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2<br />

[Figure: timing diagram of 1 clock cycle showing the non-overlapping phase 1 and phase 2 pulses]<br />


<strong>ARM</strong> datapath timing<br />

• Register read<br />

<strong>–</strong> Register read buses -- dynamic, precharged during phase 2<br />

<strong>–</strong> During phase 1 selected registers discharge the read buses<br />

which become valid early in phase 1<br />

• Shift operation<br />

<strong>–</strong> second operand passes through barrel shifter<br />

• ALU operation<br />

<strong>–</strong> ALU has input latches which are open in phase 1,<br />

allowing the operands to begin combining in ALU<br />

as soon as they are valid, but they close at the end of phase 1<br />

so that the phase 2 precharge does not get through to the ALU<br />

<strong>–</strong> ALU processes the operands during phase 2, producing the valid output towards the end of the phase<br />

<strong>–</strong> the result is latched in the destination register at the end of phase 2<br />

<strong>ARM</strong> datapath timing (cont’d)<br />

[Figure: datapath timing across phase 1 and phase 2: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time, and the phase 2 precharge that invalidates the buses]<br />

Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time<br />


The original <strong>ARM</strong>1 ripple-carry<br />

adder<br />

• Carry logic: use CMOS AOI (And-Or-Invert) gate<br />

• Even bits use the circuit shown below<br />

• Odd bits use the dual circuit with inverted inputs and<br />

outputs and AND and OR gates swapped around<br />

• Worst case path: 32 gates long<br />
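A behavioural sketch of a 32-bit ripple-carry adder is given below. It models only the logic function (sum and carry per bit), not the AOI gate implementation or the alternating true/inverted carry polarity described above; the function names are hypothetical.

```python
def full_adder(a, b, cin):
    """One bit position: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout


def ripple_carry_add(a, b, cin=0, width=32):
    """Add two width-bit numbers by rippling the carry from bit 0 upwards."""
    result = 0
    carry = cin
    for i in range(width):
        # Each bit must wait for the carry from the bit below it, which is
        # why the worst-case path grows linearly with the word width.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(0xFFFFFFFF, 1))  # carry ripples through all 32 bits
```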

[Figure: AOI carry circuit with inputs A, B and Cin and output Cout]<br />

<strong>ARM</strong>2 4-bit carry look-ahead<br />

scheme<br />

• Carry Generate (G), Carry Propagate (P)<br />

• Cout[3] = Cin[0].P + G<br />

• Use AOI and alternate AND/OR gates<br />

• Worst case: 8 gates long<br />
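The group generate and propagate signals can be illustrated with a behavioural sketch. It assumes the textbook definitions g = a AND b and p = a OR b per bit and chains eight 4-bit groups to form a 32-bit adder; the function names are hypothetical and the gate-level AOI structure is not modelled.

```python
def group_generate_propagate(a4, b4):
    """Return (G, P) for one 4-bit group."""
    g = [((a4 >> i) & (b4 >> i)) & 1 for i in range(4)]  # bit generates a carry
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]  # bit propagates a carry
    group_g = (g[3]
               | (p[3] & g[2])
               | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    return group_g, group_p


def lookahead_add32(a, b, cin=0):
    """32-bit add built from eight 4-bit groups; returns (sum, carry out)."""
    result, carry = 0, cin
    for low in range(0, 32, 4):
        a4 = (a >> low) & 0xF
        b4 = (b >> low) & 0xF
        group_g, group_p = group_generate_propagate(a4, b4)
        result |= ((a4 + b4 + carry) & 0xF) << low  # 4-bit adder logic
        carry = group_g | (group_p & carry)         # Cout[3] = Cin[0].P + G
    return result, carry


print(lookahead_add32(0xFFFFFFFF, 1))
```

The carry now crosses each 4-bit group in one step, so it passes through 8 groups instead of 32 bit positions.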

[Figure: 4-bit adder logic block with inputs A[3:0], B[3:0] and Cin[0], outputs sum[3:0], G and P, and carry logic producing Cout[3]]<br />


The <strong>ARM</strong>2 ALU logic for one<br />

result bit<br />

• ALU functions<br />

<strong>–</strong> data operations (add, sub, ...)<br />

<strong>–</strong> address computations for memory accesses<br />

<strong>–</strong> branch target computations<br />

<strong>–</strong> bit-wise logical operations<br />

<strong>–</strong> ...<br />

[Figure: ALU logic for one result bit: NA bus and NB bus inputs, function-select lines fs0 to fs5, carry logic with G and P signals, result driven onto the ALU bus]<br />

<strong>ARM</strong>2 ALU function codes<br />

fs5 fs4 fs3 fs2 fs1 fs0 ALU output<br />

0 0 0 1 0 0 A and B<br />

0 0 1 0 0 0 A and not B<br />

0 0 1 0 0 1 A xor B<br />

0 1 1 0 0 1 A plus not B plus carry<br />

0 1 0 1 1 0 A plus B plus carry<br />

1 1 0 1 1 0 not A plus B plus carry<br />

0 0 0 0 0 0 A<br />

0 0 0 0 0 1 A or B<br />

0 0 0 1 0 1 B<br />

0 0 1 0 1 0 not B<br />

0 0 1 1 0 0 zero<br />



The <strong>ARM</strong>6 carry-select adder<br />

scheme<br />

• Compute sums of various fields of the word for carry-in of zero and carry-in of one<br />

• Final result is selected by using the correct carry-in value to control a multiplexer<br />

• Worst case: $O(\log_2[\text{word width}])$ gates long<br />

[Figure: carry-select adder: fields from a,b[3:0] up to a,b[31:28] are summed for both carry-in values (+ and +1, giving s and s+1), and multiplexers controlled by the carry c select sum[3:0], sum[7:4], sum[15:8] and sum[31:16]]<br />

Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.<br />
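The selection idea can be shown with a behavioural sketch. The field boundaries follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] outputs named in the figure; the function name and the flat (non-hierarchical) selection loop are assumptions made for brevity, so the sketch reproduces the result but not the logarithmic gate depth.

```python
# (low bit, width) of each field, matching the sum outputs in the figure.
FIELDS = ((0, 4), (4, 4), (8, 8), (16, 16))


def carry_select_add(a, b, cin=0, fields=FIELDS):
    """Return (sum, carry out), choosing each field's precomputed result."""
    result = 0
    carry = cin
    for low, width in fields:
        mask = (1 << width) - 1
        field_a = (a >> low) & mask
        field_b = (b >> low) & mask
        # In hardware both candidates are computed in parallel, before the
        # real carry-in of the field is known.
        sum_if_carry0 = field_a + field_b
        sum_if_carry1 = field_a + field_b + 1
        chosen = sum_if_carry1 if carry else sum_if_carry0  # the multiplexer
        result |= (chosen & mask) << low
        carry = chosen >> width
    return result, carry


print(carry_select_add(0x0000FFFF, 0x00000001))
```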

The <strong>ARM</strong>6 ALU organization<br />

• Not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexer selects the output<br />

[Figure: <strong>ARM</strong>6 ALU organization: A and B operand latches, XOR gates controlled by invert A and invert B, adder with C in, logic functions unit selected by function, result mux controlled by logic/arithmetic, zero detect; outputs are the result and the N, Z, C and V flags]<br />


<strong>ARM</strong>9 carry arbitration encoding<br />

• Carry arbitration adder<br />

ai bi Ci vi, wi<br />

0 0 0 0, 0<br />

1 1 1 1, 1<br />

1 0 u 1, 0<br />

0 1 u 1, 0<br />

$v_i = a_i + b_i$, $w_i = a_i \cdot b_i$<br />

ai bi ai-1 bi-1 Ci vi, wi<br />

0 0 - - 0 0, 0<br />

1 1 - - 1 1, 1<br />

0(1) 1(0) 0 0 0 0, 0<br />

0(1) 1(0) 1 1 1 1, 1<br />

0(1) 1(0) 0(1) 1(0) u 1, 0<br />

The cross-bar switch barrel shifter<br />

• Shifter delay is critical since it contributes directly to the datapath cycle time<br />

• Cross-bar switch matrix (32 x 32)<br />

• Principle for 4x4 matrix<br />

[Figure: 4 x 4 cross-bar switch matrix with inputs in[3] to in[0], outputs out[0] to out[3], and switch diagonals labelled right 3, right 2, right 1, no shift, left 1, left 2 and left 3]<br />


The cross-bar switch barrel shifter<br />

(cont’d)<br />

• Precharged logic is used => each switch is a single NMOS<br />

transistor<br />

• Precharging sets all outputs to logic 0, so those which are not<br />

connected to any input during switching remain at 0 giving the zero<br />

filling required by the shift semantics<br />

• For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)<br />

• Arithmetic shift right: uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately<br />
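The diagonal-enable principle can be modelled with a short behavioural sketch. It treats the matrix as a set of enabled diagonals, starts every output at logic 0 to mimic the precharge, and handles the arithmetic shift fill as a separate step; all function names are hypothetical and no transistor-level behaviour is modelled.

```python
def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def crossbar(bits, diagonals):
    """diagonals: enabled shift amounts, positive = left, negative = right."""
    width = len(bits)
    out = [0] * width  # precharged outputs stay 0 unless connected to a 1
    for j in range(width):
        for d in diagonals:
            i = j - d  # the switch on diagonal d connects in[i] to out[j]
            if 0 <= i < width and bits[i]:
                out[j] = 1
    return out


def lsl(value, amount, width=32):
    return from_bits(crossbar(to_bits(value, width), {amount}))


def lsr(value, amount, width=32):
    return from_bits(crossbar(to_bits(value, width), {-amount}))


def ror(value, amount, width=32):
    # Right diagonal plus the complementary left diagonal.
    return from_bits(crossbar(to_bits(value, width), {-amount, width - amount}))


def asr(value, amount, width=32):
    bits = to_bits(value, width)
    out = crossbar(bits, {-amount})
    if bits[width - 1]:
        # Separate sign-extension logic fills the vacated top outputs.
        for j in range(width - amount, width):
            out[j] = 1
    return from_bits(out)


print(bin(ror(0b1001, 1, width=4)))  # 'right 1' + 'left 3' on a 4 x 4 matrix
```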

Multiplier design<br />

• All <strong>ARM</strong>s apart from the first prototype have included support for integer multiplication<br />

<strong>–</strong> older <strong>ARM</strong> cores include low-cost multiplication hardware<br />

that supports only the 32-bit result multiply and<br />

multiply-accumulate<br />

<strong>–</strong> recent <strong>ARM</strong> cores have high-performance multiplication<br />

hardware and support 64-bit result multiply and<br />

multiply-accumulate<br />

• Low cost implementation<br />

<strong>–</strong> Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle<br />

<strong>–</strong> Use early termination to stop the iterations when there are no more ones in the multiply register<br />


The 2-bit multiplication<br />

algorithm, Nth cycle<br />

• Control settings for the Nth cycle of the multiplication<br />

• Use existing shifter and ALU + additional hardware<br />

<strong>–</strong> dedicated two-bits-per-cycle shift register for the multiplier and<br />

a few gates for the Booth’s algorithm control logic<br />

(overhead is a few per cent on the area of <strong>ARM</strong> core)<br />

Carry-in Multiplier Shift ALU Carry-out<br />

0 x0 LSL #2N A + 0 0<br />

0 x1 LSL #2N A + B 0<br />

0 x2 LSL #(2N+1) A – B 1<br />

0 x3 LSL #2N A – B 1<br />

1 x0 LSL #2N A + B 0<br />

1 x1 LSL #(2N+1) A + B 0<br />

1 x2 LSL #2N A – B 1<br />

1 x3 LSL #2N A + 0 1<br />
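The control table can be exercised with a behavioural sketch of the 2-bits-per-cycle multiply. The table entries are transcribed from the slide; the loop structure, the 32-bit truncation, the early-termination test (no ones left and no pending carry) and the name `booth_multiply` are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF

# (carry-in, two multiplier bits) -> (extra shift, ALU sign, carry-out)
# sign +1 means A + B, -1 means A - B, 0 means A + 0.
BOOTH_TABLE = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}


def booth_multiply(multiplicand, multiplier, accumulate=0):
    """Return (32-bit product plus accumulate, number of cycles used)."""
    a = accumulate & MASK32
    b = multiplicand & MASK32
    rs = multiplier & MASK32  # shifted right two bits per cycle
    carry = 0
    n = 0
    while n < 16 and (rs != 0 or carry != 0):
        extra_shift, sign, carry = BOOTH_TABLE[(carry, rs & 3)]
        operand = (b << (2 * n + extra_shift)) & MASK32  # LSL #2N or #(2N+1)
        a = (a + sign * operand) & MASK32
        rs >>= 2
        n += 1
    return a, n


print(booth_multiply(7, 6))  # small multipliers terminate early
```

A carry left over after the sixteenth cycle has weight 2^32 and therefore does not affect the 32-bit result.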

High speed multiplication<br />

• Where multiplication performance is very important,<br />

more hardware resources must be dedicated<br />

<strong>–</strong> in some embedded systems the <strong>ARM</strong> core is used to perform<br />

real-time digital signal processing (DSP) <strong>–</strong><br />

DSP programs are typically multiplication intensive<br />

• Use intermediate results which include<br />

partial sums and partial carries<br />

<strong>–</strong> Carry-save adders are used for this<br />

• These two binary results are added together at the end of<br />

multiplication<br />

<strong>–</strong> The main ALU is used for this<br />



Carry-propagate (a) and carry-save (b) adder structures<br />

• Carry propagate adder takes two conventional (irredundant) binary<br />

numbers as inputs and produces a binary sum<br />

• Carry save adder takes one binary and one redundant (partial sum<br />

and partial carry) input and produces a sum in redundant binary<br />

representation (sum and carry)<br />
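The redundant representation can be demonstrated with a short sketch: one carry-save step combines a binary input with a partial sum and partial carry without propagating carries between bit positions, and a single carry-propagate addition at the end resolves the result. The function name and the 32-bit width are assumptions.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(x, partial_sum, partial_carry):
    """One carry-save step: returns the new (partial sum, partial carry)."""
    # Every bit position is an independent full adder; no carry ripples.
    new_sum = x ^ partial_sum ^ partial_carry
    new_carry = ((x & partial_sum) | (x & partial_carry)
                 | (partial_sum & partial_carry)) << 1
    return new_sum & MASK32, new_carry & MASK32


# Accumulate several values in redundant form, then resolve once at the end.
values = [0x0F0F0F0F, 0x00FF00FF, 0x33333333, 0x12345678]
ps, pc = 0, 0
for v in values:
    ps, pc = carry_save_add(v, ps, pc)
final = (ps + pc) & MASK32  # the single carry-propagate addition
print(hex(final), hex(sum(values) & MASK32))
```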

[Figure: (a) carry-propagate adder: a row of full adders (inputs A, B, Cin; outputs Cout, S) with each Cout chained to the next Cin; (b) carry-save adder: a row of independent full adders whose S and Cout outputs are kept as the partial sum and partial carry]<br />

<strong>ARM</strong> high-speed multiplier<br />

organization<br />

• The CSA has 4 layers of adders, each handling 2 multiplier bits => multiply 8 bits per clock cycle<br />

• Partial sum and carry are cleared at the beginning or initialized to accumulate a value<br />

• Multiplier is shifted right 8 bits per cycle in the ‘Rs’ register<br />

• Partial sum and carry are rotated right 8 bits per cycle<br />

• Performance: up to 4 clock cycles (early termination is possible)<br />

• Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)<br />


<strong>ARM</strong> high-speed multiplier<br />

organization<br />

[Figure: high-speed multiplier organization: Rs shifted right 8 bits/cycle, Rm, carry-save adders, partial sum and partial carry registers (with initialization for MLA, rotated 8 bits/cycle), and the ALU adding the partials]<br />

<strong>ARM</strong>2 register cell circuit<br />

• Asymmetric cross-coupled pair of MOS inverters<br />

• Feedback inverter is weak to minimize resistance to a new value being written to the register<br />

• Newer cores use complementary MOS technology<br />

[Figure: register cell structure up to <strong>ARM</strong>6: cross-coupled inverters written from the ALU bus under the write enable and read onto the A bus and B bus under the read A and read B enables]<br />


<strong>ARM</strong> register bank floorplan<br />

• Enable lines run vertically and data buses run horizontally<br />

• Decoders are more complex than the register cells, but their horizontal pitch is matched to the register cells<br />

[Figure: register bank floorplan: A bus read decoders, B bus read decoders and write decoders alongside the register cells and PC; Vdd and Vss rails; ALU bus, PC bus, INC bus, A bus and B bus]<br />

<strong>ARM</strong> core datapath buses<br />

• Datapath pitch is chosen as a compromise between the complex<br />

functions (ALU) and simpler functions (barrel shifter)<br />

• Space is also allocated for the passage of passenger buses<br />

[Figure: <strong>ARM</strong> core datapath buses (Ad, PC, inc, shift out, W, instruction, Din, A, B) routed across the datapath blocks: address register, incrementer, register bank, multiplier, ALU, shifter, data in, instruction pipe, data out]<br />


<strong>ARM</strong> control logic structure<br />

• Three structural components<br />

<strong>–</strong> Instruction decoder PLA: uses some instruction bits<br />

and an internal cycle counter to define the class of<br />

operation during the next cycle<br />

<strong>–</strong> Distributed Secondary Control: selects other<br />

instruction bits and/or processor state information to<br />

control the datapath<br />

<strong>–</strong> Decentralized Control Units: for specific instructions<br />

that take a variable number of cycles to complete<br />

(load/store, multiply, coprocessor etc.)<br />

[Figure: <strong>ARM</strong> control logic structure: the instruction and cycle count feed the decode PLA, which drives the address control, register control, ALU control and shifter control blocks, together with the multiply control, load/store multiple and coprocessor units]<br />
