EE 675 Advanced Microprocessors ARM – A little history

EE 675 

Advanced Microprocessors 

ARM Organization and Implementation 

Dr. Khurram Waheed 

EE 675 @ SDSU

ARM – A little history 

• The first ARM processor was developed on 3 micron technology in ’83-’85
• This course is mainly based on the ARM6/7 architecture, developed between ’90-’95.
• Digital Equipment Corporation (then Compaq, now HP) developed the StrongARM processor, which has very high performance.
• More recent developments are ARM8 and ARM9E (1999), and
• an ARM processor without a clock: the asynchronous AMULET from the U. of Manchester (Steve Furber’s group)

ARM organization 

• Two main blocks: datapath and decoder
• Register bank
– r0 to r15
– 2 read ports
– 1 write port
– 1 additional read port and 1 additional write port reserved for r15 (pc)
• Barrel shifter – shifts or rotates one operand by any number of bits
• ALU – performs the arithmetic and logic functions required
• Memory address register + incrementer
• Memory data registers
• Instruction decoder and associated control logic

[Figure: ARM datapath block diagram. Blocks: address register, incrementer, register bank, multiply register, barrel shifter, data out register, instruction decode & control. Buses: A[31:0], D[31:0], ALU bus.]

ARM – Internal Organization 

3-stage Pipeline 

• Data register holds read/write data from/to memory
• Instruction decoder decodes machine code instructions to produce control signals for the datapath
• Data processing instructions take a single cycle: data values are read on the A-bus & B-bus, and the result from the ALU is written back into the register bank

[Figure: ARM 3-stage pipeline datapath. Blocks: register bank, incrementer, multiply register, barrel shifter, ALU, data in register, data out register, instruction decode & control. Buses: A[31:0], D[31:0], A bus, B bus, ALU bus, PC bus.]

Pipelining 

• The maximum processing rate is determined by the propagation delay of the computational logic. In the abstract:

[Figure: Function 1 (time T1) feeding directly into Function 2 (time T2)]

• Above, we can process one input every T1 + T2

[Figure: Function 1 (time T1) and Function 2 (time T2) separated by a register]

• Above, we can process one input every max{T1, T2}, but each input still takes at least T1 + T2 to be completely processed
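The two cases above can be compared numerically. The following is a minimal sketch, assuming ideal registers (no set-up or clock-to-output delay) and a clock period equal to the slower of the two functions; the function names and the example delays are hypothetical and only illustrate the arithmetic.

```python
def unpipelined_total(t1, t2, n_inputs):
    """Total time when Function 1 and Function 2 form one combinational block."""
    return n_inputs * (t1 + t2)


def pipelined_timing(t1, t2, n_inputs):
    """Return (clock period, latency of one input, total time) with a register
    between the two functions. Assumes an ideal register with zero overhead."""
    period = max(t1, t2)          # the slower function sets the clock period
    latency = 2 * period          # each input passes through two clocked stages
    total = latency + (n_inputs - 1) * period
    return period, latency, total


if __name__ == "__main__":
    # Hypothetical delays in nanoseconds.
    print(unpipelined_total(3, 5, 10))
    print(pipelined_timing(3, 5, 10))
```

With unequal delays the latency (two periods of max{T1, T2}) is slightly longer than T1 + T2, which is why balanced stages are preferred.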


Three-stage pipeline 

• ARM uses a 3-stage instruction pipeline
– Fetch: fetch instruction code from memory into the instruction pipeline
– Decode: instruction decoded to obtain control signals for the datapath, ready for the next stage
– Execute: instruction “owns” the datapath: register read, shifting, ALU results generated and write-back
• Results of each stage are stored in registers
• The consequence is that the clock period is much shorter than without pipelining

Pipeline: how it works 

• All instructions occupy the datapath for one or more adjacent cycles
• For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
• During the first datapath cycle each instruction issues a fetch for the next instruction but one
• Branch instructions flush and refill the instruction pipeline


ARM single-cycle instruction pipeline

[Figure: three instructions (1, 2, 3) passing through fetch, decode and execute in successive cycles, plotted against time]

• At any time, three different instructions may occupy the three stages of the pipeline, one in each stage
• It may take three cycles to complete a single-cycle instruction. This is said to be a three-cycle latency
• Once the pipeline fills, the processor completes one single-cycle instruction every clock cycle. Therefore the throughput is one instruction per cycle.

ARM single-cycle instruction pipeline

instruction         cycle 1   cycle 2   cycle 3       cycle 4       cycle 5
1 add r0,r1,#5      fetch     decode    execute add
2 sub r2,r3,r6                fetch     decode        execute sub
3 cmp r2,#3                             fetch         decode        execute cmp
                                                                    time →

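ARM multi-cycle instruction pipeline

[Figure: pipeline timing for instructions 1 to 5 against time. The ADD instructions each pass through fetch, decode and execute; the STR passes through fetch, decode, calc. addr. and data xfer, so it occupies the datapath for two cycles.]

• Decode logic is always generating the control signals for the datapath to use in the next cycle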

PC behavior – Pipeline 

• As a consequence of the pipeline, the PC (r15) needs to run ahead of the current instruction
• Instructions fetch the next instruction but one during their first cycle, i.e., the PC points 8 bytes ahead of the current instruction.
• So a user using the PC in a program must account for the pipeline effects
• The situation is more complex in cycles later than the first cycle.
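As a small illustration of the 8-byte offset, the sketch below computes the value an instruction sees when it reads r15 in its first cycle, and the offset a programmer would need when using the PC as a base register. It assumes 32-bit ARM instructions (4 bytes each); the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def pc_as_read(instruction_address):
    """Value seen when the instruction at this address reads r15 in its first
    cycle: two 4-byte instructions (8 bytes) ahead of the instruction itself."""
    return (instruction_address + 8) & MASK32


def pc_relative_offset(instruction_address, target_address):
    """Offset to add to r15 (as read by the instruction) to reach the target."""
    return target_address - pc_as_read(instruction_address)


if __name__ == "__main__":
    print(hex(pc_as_read(0x8000)))
    print(pc_relative_offset(0x8000, 0x8010))
```

Note that reaching the instruction's own address needs an offset of -8, not 0.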


ARM multi-cycle LDMIA (load multiple) instruction

instruction          cycle 1   cycle 2   cycle 3     cycle 4     cycle 5   cycle 6
ldmia r0,{r2,r3}     fetch     decode    ex ld r2    ex ld r3
sub r2,r3,r6                   fetch     (waits)     decode      ex sub
cmp r2,#3                                (delayed)   fetch       decode    ex cmp
                                                                           time →

• Decode stage is occupied, since ldmia must continue to remember the decoded instruction
• sub is fetched at the normal time but not decoded until LDMIA is finishing
• The instruction after sub is delayed
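The single-cycle and LDMIA timing examples can be reproduced with a very small scheduling model. This is a sketch under stated assumptions, not a description of the real control logic: each instruction is decoded in the cycle immediately before its first execute cycle, instructions execute in order, and an instruction is fetched in the cycle in which its predecessor is decoded. Branches and memory-port conflicts are not modelled, and the function name `schedule` is hypothetical.

```python
def schedule(program):
    """Return (text, fetch, decode, execute_start, execute_end) per instruction.

    program is a list of (text, execute_cycles) pairs; cycles are numbered from 1.
    """
    rows = []
    for text, execute_cycles in program:
        if not rows:
            fetch = 1
            execute_start = 3
        else:
            _, _, prev_decode, _, prev_end = rows[-1]
            fetch = prev_decode                      # fetched while predecessor decodes
            execute_start = max(fetch + 2, prev_end + 1)  # datapath must be free
        decode = execute_start - 1                   # decode immediately precedes execute
        execute_end = execute_start + execute_cycles - 1
        rows.append((text, fetch, decode, execute_start, execute_end))
    return rows


if __name__ == "__main__":
    for row in schedule([("ldmia r0,{r2,r3}", 2), ("sub r2,r3,r6", 1), ("cmp r2,#3", 1)]):
        print(row)
```

Under this model three single-cycle instructions finish in 5 cycles, while replacing the first with a two-register LDMIA pushes completion to cycle 6.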

Control stalls: due to branches 

• Branches often introduce stalls (branch penalty)
– Stall time may depend on whether the branch is taken
• May have to squash instructions that have already started executing
• Don’t know what to fetch until the condition is evaluated


ARM pipelined branch

instruction          cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6
bne foo              fetch     decode    ex bne    ex bne    ex bne
sub r2,r3,r6                   fetch     decode
foo add r0,r1,r2                                   fetch     decode    ex add
                                                                       time →

• Decision not made until the third clock cycle
• Two cycles of work are thrown away if the bne branch is taken

Expanding the pipeline 

• The 3-stage pipeline used up to ARM7 is very cost-effective
• Better pipeline architectures are required for better performance

$$T = \frac{N_{inst} \times CPI}{f_{clk}}$$

(T is the time needed to execute a given program.)

• $N_{inst}$ is constant
• So, only two options
– Increase the clock rate, $f_{clk}$: requires more pipeline stages and simpler logic per stage
– Reduce the average number of clock cycles per instruction, CPI: requires instructions to occupy fewer pipeline slots and fewer stalls in the pipeline

5-stage Pipeline – Memory Bandwidth Bottlenecks

• Measures to take care of memory bottlenecks
– Use of separate code and data memories
– Increasing the number of stages in the pipeline reduces the processor load per clock cycle
• The above steps allow a RISC processor to work at a higher clock rate.
• Use of separate instruction and data caches connected to a single DRAM greatly reduces the core’s CPI
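The execution-time relation from the "Expanding the pipeline" slide (time = instruction count × CPI / clock frequency) can be evaluated directly. The sketch below is only an illustration; the instruction count, CPI values and clock rates are hypothetical numbers chosen to show the two levers, not measurements of any ARM core.

```python
def execution_time(n_inst, cpi, f_clk_hz):
    """Program execution time in seconds: N_inst * CPI / f_clk."""
    return n_inst * cpi / f_clk_hz


if __name__ == "__main__":
    n_inst = 1_000_000                                   # hypothetical program size
    baseline = execution_time(n_inst, 2.0, 50e6)          # hypothetical 3-stage core
    faster_clock = execution_time(n_inst, 2.0, 100e6)     # lever 1: raise f_clk
    lower_cpi = execution_time(n_inst, 1.5, 50e6)         # lever 2: reduce CPI
    print(baseline, faster_clock, lower_cpi)
```

Doubling the clock rate halves the time only if the CPI does not rise, which is why the deeper pipeline is paired with measures against memory stalls.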

ARM9TDMI: 5-stage pipeline 

• Fetch
• Decode
– instruction is decoded
– register operands read (3 read ports)
• Execute
– an operand is shifted and the ALU result generated, or
– address is computed
• Buffer/data
– data memory is accessed (load, store)
• Write-back
– write to register file

[Figure: ARM9TDMI 5-stage pipeline organization. Labels: I-cache, next pc, pc + 4, I decode, register read (r15 = pc + 8), ALU, load/store address mux (LDM/STM, +4, post-index, pre-index), register write in write-back; next pc sources B, BL, MOV pc, SUBS pc, LDR pc.]

ARM9TDMI: Data Forwarding

Data forwarding:

ADD r3, r2, r1, LSL #3     ; r3 := r2 + 8 x r1
ADD r5, r5, r3, LSL r2     ; r5 := r5 + 2^r2 x r3

ADD r3, r2, r1, LSL #3     ; r3 := r2 + 8 x r1
ADD r8, r9, r10            ; r8 := r9 + r10
ADD r5, r5, r3, LSL r2     ; r5 := r5 + 2^r2 x r3

Stall?

LDR r3, [r2]               ; r3 := mem[r2]
ADD r1, r2, r3             ; r1 := r2 + r3
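The three instruction sequences above can be classified with a simple dependency check. This is a sketch under an assumed model, not the ARM9TDMI interlock logic itself: a result produced one or two instructions earlier has not yet been written to the register file and must be forwarded, and a loaded value is only available after the buffer/data stage, so an instruction that uses it immediately after the load stalls. The `Instr` class and `classify_hazards` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    dest: str
    sources: tuple
    is_load: bool = False


def classify_hazards(program):
    """Return (text, action) per instruction, where action is 'none',
    'forward' or 'stall' under the assumed 5-stage model."""
    results = []
    for index, current in enumerate(program):
        action = "none"
        for distance in (1, 2):            # nearest producer first
            if index - distance < 0:
                break
            producer = program[index - distance]
            if producer.dest in current.sources:
                load_use = producer.is_load and distance == 1
                action = "stall" if load_use else "forward"
                break
        results.append((current.text, action))
    return results


if __name__ == "__main__":
    load_use = [
        Instr("LDR r3, [r2]", "r3", ("r2",), is_load=True),
        Instr("ADD r1, r2, r3", "r1", ("r2", "r3")),
    ]
    print(classify_hazards(load_use))
```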

[Figure: ARM9TDMI pipeline organization with forwarding paths. Stages: fetch (I-cache, next pc, pc + 4), instruction decode (I decode, register read, immediate fields, r15 = pc + 8), execute (shift, mul, ALU, forwarding paths, load/store address mux), buffer/data (D-cache, byte repl., rot/sgn ex).]

ARM9TDMI: PC generation 

• 3-stage pipeline
– PC behavior: operands are read in the execute stage, so r15 = PC + 8
• 5-stage pipeline
– operands are read in the decode stage, so r15 = PC + 4?
– incompatibilities between 3-stage and 5-stage implementations => unacceptable
– to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs

[Figure: ARM9TDMI pipeline organization, showing PC generation: next pc selected from pc + 4, B/BL, MOV pc, SUBS pc and LDR pc; pc + 8 supplied as r15 at register read.]

Data processing instruction datapath activity (Ex)

• Reg-Reg
– Rd = Rn op Rm
– r15 = AR + 4; AR = AR + 4
• Reg-Imm
– Rd = Rn op Imm
– r15 = AR + 4; AR = AR + 4

[Figure: datapath activity for (a) register – register operations and (b) register – immediate operations; in (b) the immediate comes from instruction bits [7:0].]

STR (store register) datapath activity (Ex1, Ex2)

• Compute address (Ex1)
– AR = Rn op Disp
• Store data (Ex2)
– mem[AR] = Rd (AR still holds the address computed in Ex1)
– AR = PC (ready for the next instruction fetch)
– If autoindexing => Rn = Rn +/- 4

[Figure: STR datapath activity. (a) 1st cycle – compute address, using the displacement from instruction bits [11:0]; (b) 2nd cycle – store data & auto-index.]

The first two (of three) cycles of a branch instruction

• Compute target address
– AR = PC + Disp, lsl #2
• Save return address (if required)
– r14 = PC
– AR = AR + 4
• Third cycle: make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

[Figure: branch datapath activity. (a) 1st cycle – compute branch target from the displacement in instruction bits [23:0], shifted lsl #2; (b) 2nd cycle – save return address in r14.]
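The branch target arithmetic (AR = PC + Disp, lsl #2) can be checked with a few lines of code. The sketch below assumes the standard 32-bit ARM branch encoding, where the 24-bit displacement is sign-extended, shifted left by two, and added to the PC value of branch address + 8; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_24(value):
    """Interpret the low 24 bits of value as a two's-complement number."""
    value &= 0xFFFFFF
    return value - (1 << 24) if value & 0x800000 else value


def branch_target(branch_address, disp24):
    """AR = PC + (Disp lsl #2), with PC = branch address + 8."""
    pc = branch_address + 8
    return (pc + (sign_extend_24(disp24) << 2)) & MASK32


def encode_displacement(branch_address, target_address):
    """Inverse of branch_target: the 24-bit field for a given target."""
    return ((target_address - (branch_address + 8)) >> 2) & 0xFFFFFF


def link_value(branch_address):
    """Corrected r14 after the third cycle: the instruction after the branch."""
    return (branch_address + 4) & MASK32


if __name__ == "__main__":
    field = encode_displacement(0x1000, 0x1000)   # a branch to itself
    print(hex(field), hex(branch_target(0x1000, field)), hex(link_value(0x1000)))
```

A branch to itself needs a displacement of -2 words, which is why the field is 0xFFFFFE rather than zero.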

ARM Implementation

• Datapath
– RTL (Register Transfer Level) description
• Control unit
– FSM (Finite State Machine) description


2-phase non-overlapping clock scheme

• Most ARMs do not operate on edge-sensitive registers
• Instead the design is based around 2-phase non-overlapping clocks, which are generated internally from a single clock signal
• Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

[Figure: phase 1 and phase 2 clock waveforms over 1 clock cycle]

ARM datapath timing 

• Register read
– Register read buses are dynamic and precharged during phase 2
– During phase 1 the selected registers discharge the read buses, which become valid early in phase 1
• Shift operation
– the second operand passes through the barrel shifter
• ALU operation
– the ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
– the ALU processes the operands during phase 2, producing the valid output towards the end of the phase
– the result is latched in the destination register at the end of phase 2


ARM datapath timing (cont’d)

[Figure: datapath timing over one clock cycle. Phase 1: register read time (read bus valid), shift time (shift out valid), ALU operands latched. Phase 2: ALU time (ALU out), register write time; the precharge invalidates the buses.]

Minimum datapath delay = register read time + shifter delay + ALU delay + register write set-up time + phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder

• Carry logic: uses a CMOS AOI (And-Or-Invert) gate
• Even bits use the circuit shown below
• Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
• Worst case path: 32 gates long

[Figure: carry circuit for one bit, with inputs A, B and Cin and output Cout]
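The ripple-carry behaviour can be modelled bit by bit. The sketch below is a functional model only: it uses the non-inverted AND-OR form of the carry, whereas the slide's AOI gate produces the inverted carry and alternates polarity between even and odd bits. The function name is hypothetical.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit numbers one bit at a time; return (sum, carry_out)."""
    result = 0
    carry = carry_in
    for bit in range(width):
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        result |= (a_bit ^ b_bit ^ carry) << bit
        # Carry out = A.B + Cin.(A + B); each bit waits for the previous carry,
        # which gives the 32-gate worst-case path.
        carry = (a_bit & b_bit) | (carry & (a_bit | b_bit))
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(0xFFFFFFFF, 1))   # carry ripples through all 32 bits
```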

ARM2 4-bit carry look-ahead scheme

• Carry Generate (G), Carry Propagate (P)
• Cout[3] = Cin[0].P + G
• Use AOI and alternate AND/OR gates
• Worst case: 8 gates long

[Figure: 4-bit adder logic with inputs A[3:0], B[3:0] and Cin[0], producing sum[3:0], G, P and Cout[3]]
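The group relation Cout[3] = Cin[0].P + G can be sketched as follows. The code assumes the usual definitions of per-bit generate (a AND b) and propagate (a OR b), and uses Python's built-in addition for the 4-bit sum itself so that only the carry path is modelled explicitly; function names are hypothetical.

```python
def group_generate_propagate(a4, b4):
    """Return (G, P) for one 4-bit group."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]     # bit generates a carry
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]   # bit propagates a carry
    group_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    return group_g, group_p


def lookahead_add(a, b, carry_in=0, width=32):
    """Add using 4-bit groups; the carry hops over each group in one step."""
    result = 0
    carry = carry_in
    for base in range(0, width, 4):
        a4 = (a >> base) & 0xF
        b4 = (b >> base) & 0xF
        group_g, group_p = group_generate_propagate(a4, b4)
        result |= ((a4 + b4 + carry) & 0xF) << base
        carry = group_g | (group_p & carry)               # Cout = G + P.Cin
    return result, carry


if __name__ == "__main__":
    print(lookahead_add(0xFFFFFFFF, 1))
```

With 8 groups of 4 bits the carry passes through 8 group stages instead of 32 bit stages, matching the worst case quoted above.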

The ARM2 ALU logic for one result bit

• ALU functions
– data operations (add, sub, ...)
– address computations for memory accesses
– branch target computations
– bit-wise logical operations
– ...

[Figure: ARM2 ALU logic for one result bit. Inputs: NA bus, NB bus and function select lines fs0 to fs5; carry logic (G); output to the ALU bus.]

ARM2 ALU function codes

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
0    0    0    1    0    0     A and B
0    0    1    0    0    0     A and not B
0    0    1    0    0    1     A xor B
0    1    1    0    0    1     A plus not B plus carry
0    1    0    1    1    0     A plus B plus carry
1    1    0    1    1    0     not A plus B plus carry
0    0    0    0    0    0     A
0    0    0    0    0    1     A or B
0    0    0    1    0    1     B
0    0    1    0    1    0     not B
0    0    1    1    0    0     zero

The ARM6 carry-select adder scheme

• Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
• Final result is selected by using the correct carry-in value to control a multiplexer
• Worst case: O(log2[word width]) gates long

[Figure: carry-select adder. Adders on a,b[3:0] up to a,b[31:28] produce s and s+1 for each field; multiplexers controlled by the carry (c) select sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]

Note: Be careful! Fan-out on some of these gates is high, so a direct comparison with the previous schemes is not applicable.
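The selection principle can be sketched in software. The field boundaries below (bits 3:0, 7:4, 15:8 and 31:16) are taken from the sum labels in the figure; everything else, including the function name and the use of Python addition inside each field, is an assumption for illustration. In hardware both candidate sums of every field are computed in parallel; the loop here only mimics the selection.

```python
# (lowest bit, width) of each field, following the figure's sum labels.
FIELDS = ((0, 4), (4, 4), (8, 8), (16, 16))


def carry_select_add(a, b, carry_in=0):
    """32-bit add that precomputes each field for carry-in 0 and carry-in 1,
    then selects with the real carry. Returns (sum, carry_out)."""
    result = 0
    carry = carry_in
    for low, width in FIELDS:
        mask = (1 << width) - 1
        field_a = (a >> low) & mask
        field_b = (b >> low) & mask
        sum_if_zero = field_a + field_b        # computed without waiting for carry
        sum_if_one = field_a + field_b + 1
        chosen = sum_if_one if carry else sum_if_zero   # the multiplexer
        result |= (chosen & mask) << low
        carry = chosen >> width
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0x0000FFFF, 0x00000001))
```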

The ARM6 ALU organization

• It is not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexer selects the output

[Figure: ARM6 ALU organization. The A and B operand latches feed XOR gates (controlled by invert A and invert B), then the adder (with C in) and the logic functions unit in parallel; the result mux (logic/arithmetic select) drives the result and the zero detect; flags C, V, N and Z are produced.]

ARM9 carry arbitration encoding

• Carry arbitration adder

a_i   b_i   C_i   v_i, w_i
0     0     0     0, 0
1     1     1     1, 1
1     0     u     1, 0
0     1     u     1, 0

$v_i = a_i + b_i$, $w_i = a_i \cdot b_i$ (u = carry not yet known)

a_i    b_i    a_i-1   b_i-1   C_i   v_i, w_i
0      0      -       -       0     0, 0
1      1      -       -       1     1, 1
0(1)   1(0)   0       0       0     0, 0
0(1)   1(0)   1       1       1     1, 1
0(1)   1(0)   0(1)    1(0)    u     1, 0
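The encoding tables can be exercised with a short sketch. It assumes the interpretation that v is the carry-out if the incoming carry is 1 and w is the carry-out if the incoming carry is 0, which is consistent with v = a OR b and w = a AND b; the function names are hypothetical.

```python
def encode(a_bit, b_bit):
    """(v, w) for one bit position: v = a OR b, w = a AND b."""
    return (a_bit | b_bit, a_bit & b_bit)


def arbitrate(upper, lower):
    """Combine the code of bit i (upper) with that of bit i-1 (lower).

    If the upper bit already knows its carry (0,0 or 1,1) that value wins;
    if it is unknown (1,0) the lower position decides.
    """
    v_hi, w_hi = upper
    v_lo, w_lo = lower
    return (w_hi | (v_hi & v_lo), w_hi | (v_hi & w_lo))


def resolve(code, carry_in):
    """Final carry once the real carry-in is known."""
    v, w = code
    return v if carry_in else w


if __name__ == "__main__":
    print(arbitrate(encode(1, 0), encode(1, 1)))   # unknown above a generate
```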

The cross-bar switch barrel shifter

• Shifter delay is critical since it contributes directly to the datapath cycle time
• Cross-bar switch matrix (32 x 32)
• Principle for 4x4 matrix

[Figure: 4 x 4 cross-bar switch matrix connecting in[3:0] to out[3:0]; the diagonals are labelled right 3, right 2, right 1, no shift, left 1, left 2 and left 3.]

The cross-bar switch barrel shifter (cont’d)

• Precharged logic is used => each switch is a single NMOS transistor
• Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
• For rotate right, the right shift diagonal is enabled together with the complementary shift left diagonal (e.g., ‘right 1’ + ‘left 3’)
• Arithmetic shift right: uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
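The 4 x 4 cross-bar principle can be modelled as a set of enabled diagonals. This is a behavioural sketch only: outputs start at 0 (the precharged state described above), a diagonal d connects in[i] to out[i + d], and rotate right enables two complementary diagonals. The names `crossbar`, `to_bits` and `from_bits` are hypothetical.

```python
WIDTH = 4


def to_bits(value):
    """Integer to list of bits, least significant first."""
    return [(value >> i) & 1 for i in range(WIDTH)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def crossbar(inputs, enabled_diagonals):
    """Diagonal d connects in[i] to out[i + d]; d > 0 shifts left, d < 0 right."""
    outputs = [0] * WIDTH                  # precharged: unconnected outputs stay 0
    for d in enabled_diagonals:
        for i in range(WIDTH):
            j = i + d
            if 0 <= j < WIDTH:
                outputs[j] |= inputs[i]
    return outputs


def lsl(value, n):
    return from_bits(crossbar(to_bits(value), [n]))


def lsr(value, n):
    return from_bits(crossbar(to_bits(value), [-n]))


def ror(value, n):
    n %= WIDTH
    diagonals = [0] if n == 0 else [-n, WIDTH - n]   # e.g. 'right 1' + 'left 3'
    return from_bits(crossbar(to_bits(value), diagonals))


if __name__ == "__main__":
    print(bin(lsl(0b1011, 1)), bin(lsr(0b1011, 1)), bin(ror(0b1011, 1)))
```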


Multiplier design

• All ARMs apart from the first prototype have included support for integer multiplication
– older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
– recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
• Low-cost implementation
– Use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
– Use early termination to stop the iterations when there are no more ones in the multiply register

The 2-bit multiplication algorithm, Nth cycle

• Control settings for the Nth cycle of the multiplication
• Use existing shifter and ALU + additional hardware
– dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic (overhead is a few per cent on the area of the ARM core)

Carry-in   Multiplier   Shift          ALU     Carry-out
0          x0           LSL #2N        A + 0   0
0          x1           LSL #2N        A + B   0
0          x2           LSL #(2N+1)    A – B   1
0          x3           LSL #2N        A – B   1
1          x0           LSL #2N        A + B   0
1          x1           LSL #(2N+1)    A + B   0
1          x2           LSL #2N        A – B   1
1          x3           LSL #2N        A + 0   1
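The control table can be turned into a small behavioural model to check that it really computes a product. This is a sketch under stated assumptions: A is a 32-bit accumulator starting at zero, B is the multiplicand passed through the shifter, arithmetic wraps modulo 2^32, and early termination stops when the remaining multiplier bits are zero and no carry is pending. The names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# (carry-in, 2-bit multiplier digit) -> (extra shift, ALU sign, carry-out),
# transcribed from the table: sign +1 is A + B, -1 is A - B, 0 is A + 0.
CONTROL = {
    (0, 0): (0, 0, 0),
    (0, 1): (0, +1, 0),
    (0, 2): (1, -1, 1),   # LSL #(2N+1), A - B
    (0, 3): (0, -1, 1),
    (1, 0): (0, +1, 0),
    (1, 1): (1, +1, 0),   # LSL #(2N+1), A + B
    (1, 2): (0, -1, 1),
    (1, 3): (0, 0, 1),
}


def multiply_2bit(multiplicand, multiplier):
    """Return (32-bit product, number of cycles used)."""
    acc = 0
    carry = 0
    n = 0
    while n < 16 and ((multiplier >> (2 * n)) != 0 or carry):
        digit = (multiplier >> (2 * n)) & 3
        extra_shift, sign, carry = CONTROL[(carry, digit)]
        shifted = (multiplicand << (2 * n + extra_shift)) & MASK32
        acc = (acc + sign * shifted) & MASK32
        n += 1
    return acc, n


if __name__ == "__main__":
    print(multiply_2bit(7, 6))
```

Small multipliers finish in few cycles, which is the effect of early termination; a multiplier with ones in its top bits needs all 16 cycles.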


High speed multiplication 

• Where multiplication performance is very important, more hardware resources must be dedicated
– in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP); DSP programs are typically multiplication intensive
• Use intermediate results which include partial sums and partial carries
– Carry-save adders are used for this
• These two binary results are added together at the end of the multiplication
– The main ALU is used for this

Carry-propagate (a) and carry-save (b) adder structures

• A carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
• A carry-save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)

[Figure: (a) a row of full adders (inputs A, B, Cin; outputs Cout, S) with each Cout chained to the next Cin; (b) a row of full adders whose carries are not chained but kept as a separate partial carry word.]
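A carry-save step can be written with bitwise operations on whole words. The sketch below assumes non-negative Python integers of unbounded width, so no masking is shown, and it adds one partial product per step rather than the several layers a hardware multiplier would use; the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """Reduce three numbers to a (partial sum, partial carry) pair with no
    carry propagation between bit positions. Invariant: a + b + c == s + k."""
    partial_sum = a ^ b ^ c
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, partial_carry


def multiply_with_carry_save(x, y):
    """Accumulate partial products in redundant form, then do a single
    carry-propagate addition at the end (the job of the main ALU)."""
    partial_sum, partial_carry = 0, 0
    shift = 0
    while y >> shift:
        if (y >> shift) & 1:
            partial_sum, partial_carry = carry_save_add(
                partial_sum, partial_carry, x << shift)
        shift += 1
    return partial_sum + partial_carry


if __name__ == "__main__":
    print(carry_save_add(5, 6, 7))
    print(multiply_with_carry_save(1234, 5678))
```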


ARM high-speed multiplier organization

• CSA has 4 layers of adders, each handling 2 multiplier bits => multiplies 8 bits per clock cycle
• Partial sum and carry are cleared at the beginning, or initialized to accumulate a value
• Multiplier is shifted right 8 bits per cycle in the ‘Rs’ register
• Partial sum and carry are rotated right 8 bits per cycle
• Performance: up to 4 clock cycles (early termination is possible)
• Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of the area of simpler cores)

ARM high-speed multiplier organization

[Figure: high-speed multiplier organization. Rs is shifted right 8 bits/cycle and, with Rm, feeds the carry-save adders; the partial sum and partial carry registers are initialized for MLA and rotated 8 bits/cycle; the ALU adds the partials.]

ARM2 register cell circuit

• Asymmetric cross-coupled pair of MOS inverters
• Feedback inverter is weak, to minimize its resistance to a new value being written to the register
• Newer cores use the complementary MOS technology

[Figure: register cell structure up to ARM6, with write from the ALU bus and read A / read B onto the A bus and B bus]

ARM register bank floorplan

• Enable lines run vertically and data buses run horizontally
• Decoders are more complex than the register cells, but their horizontal pitch is matched to the register cells

[Figure: register bank floorplan. A bus read decoders, B bus read decoders, write decoders and register cells (including the PC); Vdd and Vss rails; ALU bus, PC bus, INC bus, A bus and B bus.]


ARM core datapath buses

• Datapath pitch is chosen as a compromise between the complex functions (ALU) and simpler functions (barrel shifter)
• Space is also allocated for the passage of passenger buses

[Figure: datapath buses across the incrementer, register bank, multiplier, shifter, ALU, data in, instruction pipe and data out blocks; buses labelled Ad, PC, inc, A, B, shift out, W, instruction and Din.]

ARM control logic structure 

• Three structural components
– Instruction decoder PLA: uses some instruction bits and an internal cycle counter to define the class of operation during the next cycle
– Distributed Secondary Control: selects other instruction bits and/or processor state information to control the datapath
– Decentralized Control Units: for specific instructions that take a variable number of cycles to complete (load/store, multiply, coprocessor etc.)


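ARM control logic structure

[Figure: control logic structure. The instruction and cycle count feed the decode PLA, which drives address control, register control, ALU control and shifter control, together with the multiply control, load/store multiple and coprocessor units.]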
