Technion Presentation - OCP-IP

Technion Presentation 

Drew Wingard 

Founder, CTO 

June 2012

Industry Leader in System IP 

■ Answering the SoC challenge

• Help designers integrate entire systems onto one piece of silicon

• Any IP on any chip at any time

■ Proven technology for 15 years

• >1 Billion units shipped

• >150 designs taped out

■ Key Designs with leaders

• 7 of top 10 SoC semi companies

• 4 of top 10 systems companies

■ Pioneering Technology Leader

• Pioneer and World’s #1 supplier of on-chip networks for advanced SoCs

• Highest efficiency memory subsystems

6/24/2012 

Technion Presentation 2

Enabling Billions of Connected Devices


1+ Billion Chips Shipped with Sonics Technology 

• High-growth technology Leader 

• Headquartered in Silicon Valley 

• Pioneer of on-chip network IP 

• Advanced memory subsystems 


Advanced On-Chip Network Solutions 

Complete product portfolio for every SoC

■ SonicsGN

• High speed routed network

■ SonicsSX™

• Low-Latency and High-Speed

• Advanced Features: 2D, IMT

■ SonicsLX™

• High-speed cost effective

■ SNAP™

• High efficiency AMBA buses

■ SonicsExpress™

• Clock & Power domain crossing

■ MemMax™, MemMax AMP

• Increases DRAM efficiency

• QoS scheduling of DRAM access

■ Sonics3220

• Low power and low area

• Non-blocking Peripheral Interconnect

Complex SoC Using Sonics Products 


Our Communications Architecture 

■ Separation

■ Abstraction

■ Optimization

■ Independence

[Figure: IP cores (CPU, DMA, DRAM controller), each divided into a core function and a communication part, attach through master and slave sockets to agents on the internal fabric of the SMART Interconnect; bus adapters connect master and slave buses. Socket widths of 16 and 128 are labeled.]

The Intelligence is in the Agents 

■ Agents provide… 

• Protocol conversion 

o Agent adapts to IP core 

• Decoupling of IP cores from fabric 

o Provide local, isolated environment 

• Layered services 

■ Agent services 

• Power management 

• Security management 

• Error management 

• QoS 

• Burst, width, and command conversion

[Figure: initiator sockets (I) connect through Initiator Agents (IA) to the fabric; Target Agents (TA) connect the fabric to target sockets (T).]

Sonics View on NoC 

■ NoC adapts key abstractions from networking

• Layering (orthogonalization of concerns)

o Socket-based design

o Optimized fabric protocols

o Higher level services (security, QoS, error mgmt.)

• Distributed design & implementation

■ We call this decoupling

To 1 GHz and Beyond…

Why So Fast? 

■ We’re fully converged! 

• Computing 

• Graphics 

• Video/Audio 

■ Everything runs user applications

■ Apps need Giga’s

• 1-2 GHz multicore CPUs

• 100+ GFLOP multicore GPUs

• 15-50 GB/sec DRAM

■ At consumer pricing

■ … and something to integrate it all!

Consumer Electronics: “Wish List” 2011

Product                   Rank (ages 6-12)   Rank (ages 13+)   User apps
iPad                      1                  1                 Yes
Computer                  4                  2                 Yes
iPhone                    3                  7                 Yes
Tablet (non-iPad)         5                  5                 Yes
TV                        9                  4                 Yes
iPod Touch                2                  12                Yes
Kinect for Xbox 360       7                  9                 No
E-Reader                  13                 3                 Yes
Smartphone (non-iPhone)   10                 8                 Yes
Blu-Ray Player            12                 6                 Yes
Nintendo 3DS              6                  16                Yes
PlayStation 3             11                 11                Yes
Nintendo DS*              8                  15                Yes
Nintendo Wii              16                 10                Yes
Xbox 360                  14                 13                Yes
PlayStation Move          17                 14                No
Other Mobile Phone        15                 17                Yes
PlayStation Portable      18                 18                Yes

Source: Nielsen, November 2011

Are We Ready? 

■ TSMC is ready… 

• 28nm HPM, with full ecosystem enablement 

■ ARM is ready… 

• Cortex-A15 CPU: 1-2.5 GHz, 1-4 cores/cluster 

• Mali-T658 GPU: 350 GFLOPs, 1-8 cores/cluster 

■ DRAM vendors are ready… 

• DDR3/4: 1600-3200 Mb/sec/pin → 6-50 GB/sec, 1-4 channels

• LPDDR2/3: 800-1600 Mb/sec/pin → 3-25 GB/sec, 1-4 channels

• Wide IO: 200-266 Mb/sec/pin → 13-17 GB/sec, 4 channels

■ But what about the middle?

[Figure: a tablet SoC with Cortex-A15, Mali-T658, video, audio, camera, display and USB blocks on top, multiple DDR channels underneath, and a question mark where the interconnect between them belongs.]

Integration: Giga-SoC Performance Needs 

■ GHz on-chip networks 

• Peak processor-to-multichannel DRAM bandwidth 

• Enable cache-to-cache transfers (e.g. AMBA ® ACE) 

• Example: SonicsGN 

■ Scalable multichannel DRAM support 

• 4 channels and above 

• With effective load balancing 

• Example: Sonics Interleaved Multichannel Technology 

■ High efficiency DRAM scheduling 

• ~85% sustained throughput 

• With CPU latency priority and guaranteed bandwidths 

• Example: Sonics MemMax ® 


Example Tablet Application Processor 

[Figure: block diagram of an example tablet application processor. Blocks shown: a Cortex-A15 quad-core CPU cluster and a Cortex-A7 quad-core CPU cluster, each with an L2 cache; a Mali-T658 quad-core GPU; DMA; a coherency fabric; the SonicsGN on-chip network; two Sonics MemMax memory schedulers, each with a DRAM controller; ROM, secure ROM and security SRAM; HDMI, Ethernet, PCIe, LCD controller, audio and video codec; and the Sonics3220 peripheral network with USB, SATA, two cameras and APB peripherals. Blocks are labeled with clock rates from 133 MHz to 1333 MHz and grouped into power domains.]

But What About Power? 

■ Convergence drives 

massive SoC integration 

• Thin is in! 

■ All these Giga’s cost 

power 

• But most devices run from 

batteries 

■ Result: cannot afford to 

power entire SoC at once 

• “Dark silicon” 

• Power only those subsystems 

needed for current apps 

• And only as long as needed 

Consumer Electronics: “Wish List” 2011

Product                   Rank (ages 6-12)   Rank (ages 13+)   Battery powered
iPad                      1                  1                 Yes
Computer                  4                  2                 Yes
iPhone                    3                  7                 Yes
Tablet (non-iPad)         5                  5                 Yes
TV                        9                  4                 No
iPod Touch                2                  12                Yes
Kinect for Xbox 360       7                  9                 No
E-Reader                  13                 3                 Yes
Smartphone (non-iPhone)   10                 8                 Yes
Blu-Ray Player            12                 6                 No
Nintendo 3DS              6                  16                Yes
PlayStation 3             11                 11                No
Nintendo DS*              8                  15                Yes
Nintendo Wii              16                 10                No
Xbox 360                  14                 13                No
PlayStation Move          17                 14                No
Other Mobile Phone        15                 17                Yes
PlayStation Portable      18                 18                Yes

Source: Nielsen, November 2011


Managing Dark Silicon…

■ General techniques

• Stop/start subsystem clocks

• Dynamic clock frequency

• On/off voltage domains

• Dynamic voltage/frequency domains (DVFS)

■ IP-specific techniques

• ARM big.LITTLE (use optimum IP for loading)

■ Power managers implement the techniques

• Software: flexible, but slow

• Hardware: very responsive, but less flexible

[Figure label: an arrow marked "difficulty" runs alongside the list.]


Example Tablet Application Processor 

[Figure: the same block diagram as the earlier Example Tablet Application Processor slide (Cortex-A15 and Cortex-A7 quad-core clusters, Mali-T658 GPU, coherency fabric, SonicsGN on-chip network, two Sonics MemMax schedulers with DRAM controllers, Sonics3220 peripheral network and peripherals), shown again with its power domains.]

Managing Power with SonicsGN 

■ Flexible power domain support

• Asynchronous/mesochronous

• Isolation/level shifters

■ HW-controlled safe shutdown

■ Automatic wakeup

■ Benefits:

• More domains

• Quicker shutdown

• Faster wakeup

■ Keep more dark, more of the time

50% SoC Power Reduction!


Summary 

■ GHz, GFLOPs and GB/sec are consumer design points

• And your next SoC will need them!

■ SoC integration must exploit that performance

• GHz on-chip networks: SonicsGN

• Multichannel DRAM optimization: Sonics IMT

• High efficiency DRAM scheduling: Sonics MemMax

■ … while improving battery life

• Automatic hardware power management, with software policies

■ Integration requirements

• Twice the frequency

• One half the SoC power


SonicsGN Introduction

SonicsGN (SGN) Benefits 

■ Only Network-on-chip available at 2X the speed of competing solutions

■ Lowest System Power

■ Highly Optimized area

■ Ideal for 28nm process node and below

■ Advanced tooling: simplified network capture environment along with powerful analysis capability

■ Exceeds the fabric performance requirements for processors at speeds of 1-3 GHz

On-Chip Network IP for Complex SoCs Design

SGN Topology Choices 

■ A routed network topology IP

■ Complements other Sonics IP components

[Figure label: Shallow Network]


On-Chip Network – Under the Hood 

• Routed Network 

• Virtual channels 

• Clock Crossing 

• Multiple power domains

• Bit conversion

Dolphin vs. Piranha 

Fabric Protocol | Dolphin | Piranha | SGN Benefits
First product | SonicsMX | SonicsGN |
Introduction | 2004 | 2011 |
System architecture | Fabric + decoupled agents | Fabric + decoupled agents | > 150 tape-outs, > 1 Billion chips
Switching element | Crossbar, shared link | Router | Higher frequency
Flow control | Single cycle | Credit-based | Higher frequency/better clock crossing
Maximum switch depth | 4 | Unlimited | Scalability
Separate request/response networks? | Yes | Yes |
Source routed? | Yes | Yes |
Header/payload separation? | Yes | Yes |
Header/payload transmission | Parallel | Serialized | Routing congestion and scalability
Blocking ID-based concurrency? | Yes | Yes |
Non-blocking concurrency | Threads | Virtual channels | Mapping flexibility and lower area
Internal buffering | Per thread | Shareable | Lower area
Per link data widths with integrated width conversion? | Yes | Yes |
Clock domain crossing | Divided synch. | Divided synch., Mesochronous, Asynchronous | Easier timing closure and partitioning
Power domain crossing | External | Dynamic voltage, switched voltage | Lower power
Power management support | Single domain | Many domain | Per domain control for lower power


Design Capture 

[Figure: design capture tool views. The NoC Block Diagram View shows a master core socket interface, the request network, the response network and a slave core socket interface. The Power Partitioning View shows the power domains.]

NoC Floor Plan Interactions 

[Figure: two placements of masters M1-M6 and targets T1-T6, labeled "Matching Floor Plan" and "Mismatched Floor Plan".]

■ NoC benefits greatly impacted by mismatched floor plan

■ Wiring congestion worse than crossbar!

■ Timing convergence challenges

■ Need technology to protect “logical” topology after floor plan determined

• Architect analyzes logical topology

• SoC team determines physical topology

• Virtual channels: key to independence!

Power Partitioning 


SGN Power Management Interface 

■ Power management bundle

• 2 signal handshake

• 2 signal wakeup control (optional)

• Activity status signal (optional)

■ Distributed interconnect power manager (IPM)

• Each IA determines path for incoming requests

• Identifies required PM domains

• Determines if request

o can proceed

o must terminate immediately

o must request wake-up

[Figure: the System Power Manager connected to the Interconnect Power Manager.]
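
The three-way decision an initiator agent makes can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the domain state names, the `decide` function and the rule that a non-wakeable domain forces termination are hypothetical, not Sonics' implementation.

```python
from enum import Enum


class DomainState(Enum):
    ON = "on"                      # powered, requests can pass
    OFF_WAKEABLE = "off_wakeable"  # powered down, automatic wakeup allowed (assumed state)
    OFF_LOCKED = "off_locked"      # powered down, wakeup not allowed (assumed state)


class Decision(Enum):
    PROCEED = "can proceed"
    TERMINATE = "must terminate immediately"
    WAKE_UP = "must request wake-up"


def decide(path_domains, domain_state):
    """Pick one of the three outcomes for a request crossing path_domains."""
    states = [domain_state[d] for d in path_domains]
    if any(s is DomainState.OFF_LOCKED for s in states):
        return Decision.TERMINATE  # some required domain cannot be woken
    if any(s is DomainState.OFF_WAKEABLE for s in states):
        return Decision.WAKE_UP    # wake the sleeping domains first
    return Decision.PROCEED        # every required domain is already on


state = {
    "cpu": DomainState.ON,
    "gpu": DomainState.OFF_WAKEABLE,
    "secure": DomainState.OFF_LOCKED,
}
print(decide(["cpu", "gpu"], state).value)
```

The point of the sketch is only that the decision is local: each agent needs the list of domains on a request's path and their current states, which is what lets the power manager be distributed.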

High Performance Consumer SoC Memory Subsystems

Outline 

■ Multicore consumer SoC background 

■ DRAM subsystem challenges 

■ Solution aspects 

■ Putting it all together 


Consumer SoC Examples 

■ What are some key applications for consumer SoCs? 

■ Key characteristic: relentless push for higher quality 

user experiences – at minimum system cost! 


Concurrency in Consumer SoCs 

Consumer MPSoCs process data in parallel, but communicate…

[Figure: two parallel H.264 decode pipelines (bitstream, entropy decoding, iScan/iQuant, iTrans, reconstruction, loop filter, decoded frames, with intra prediction and MC prediction) alongside transport demux, audio decode and video out. A second build of the same figure adds a DRAM block.]

■ Assertion: video SoC applications have >> 50% of system traffic to/from external DRAM

■ Consumer volumes and price points demand cheapest DRAM configurations that support required performance

■ Implications:

• SoC architecture is mostly a fan-in tree to external DRAM

• Maximizing delivered DRAM throughput and utilization are key

o Fewer DRAMs

o Lower speed grades


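Outline

■ Multicore consumer SoC background

■ DRAM subsystem challenges

• Massive connectivity to DRAM

• Achieving high DRAM utilization

• QoS

• Widely varying DRAM access styles

• Increasing access granularity

■ Solution aspects

■ Putting it all together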

SoC Architecture Trends 

■ Massive feature integration 

• Driven largely by Moore’s Law (supply) and 

convergence (demand) 

■ Continued movement of complexity to software 

■ Distributed architectures 

• Higher scalability (and independence?) 

■ Multiple processors 

• (Multicore) CPU 

• DSP 

• Special purpose (MPEG, GFX, …) 

■ Distributed DMA 

• Removes centralized DMA bottleneck 

• Simplifies driver software integration 

[Figure: a system on chip in which CPU, MPEG, DSP, 3D GFX, MAC, video I/O and comm I/O blocks all connect to one DRAM controller.]

Massive Connectivity to DRAM

• 3 traditional processors

• ~90 DRAM connections

• Already 7 years ago!


Utilization Rate 

■ Utilization Rate: the percentage of DRAM data cycles that transfer data that is useful to the system

■ Example: at 85% utilization, 2 DDR3-1600 parts in a x32 configuration (e.g. 2 x16 DDR3 DRAMs) deliver:

• (85%) x (1600 Mbits/sec/pin) x (32 pins) / (8 bits/Byte) = 5.44 GBytes/sec

■ Things which reduce DRAM utilization:

• Refresh cycles

• RD/WR data bus turnaround

• Page misses

• Partial bursts (lengths < BL)

• Unaligned bursts

• Command bus conflicts (across banks)

• QoS optimizations!

■ DRAM schedulers arbitrate among a set of system transactions to optimize DRAM utilization rates

• Perhaps QoS, too!
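
The delivered-bandwidth arithmetic from the example above can be written as a short Python helper. The function name and parameter names are illustrative only; the numbers are the ones on the slide.

```python
def delivered_gbytes_per_sec(utilization, mbits_per_sec_per_pin, data_pins):
    """Delivered DRAM bandwidth in GBytes/sec (1 GByte = 1000 MBytes here)."""
    peak_mbytes_per_sec = mbits_per_sec_per_pin * data_pins / 8  # 8 bits per byte
    return utilization * peak_mbytes_per_sec / 1000


# 2 x16 DDR3-1600 parts as one x32 interface at 85% utilization
print(round(delivered_gbytes_per_sec(0.85, 1600, 32), 2))
```

Each percentage point of utilization on this x32 DDR3-1600 interface is worth 64 MBytes/sec, which is why the scheduler's efficiency decides how many DRAMs and which speed grade a design needs.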


DRAM Utilization is Traffic-Dependent 

■ System transactions targeting DRAM limit the peak utilization per initiator, but exploiting the parallelism between initiators allows an intelligent scheduler to optimize utilization

■ Utilization-related traffic characteristics:

• Burst lengths, address sequences and address alignment

• Address relationships across transactions (e.g. 2D)

• Number of outstanding transactions

• Read/write mix

• Ordering constraints

• Time-domain behavior (i.e. isochronous vs. bursty/asynchronous)


How Traffic Impacts DRAM Utilization 

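[Table: rows are traffic characteristics, columns are utilization impacts, and an X marks each impact a characteristic affects. The column positions of the X marks did not survive; only the number of marks per row is recoverable.]

Utilization impacts (columns): Refresh cycles, RD/WR turnaround, Page misses, Partial bursts, Unaligned bursts, Command conflicts, QoS optimizations

Traffic characteristics (rows), with number of impacts marked:

• Burst lengths/sequences/alignment: 5

• Addressing across transactions: 4

• # of outstanding transactions: 4

• Read/write mix: 1

• Ordering constraints: 4

• Time-domain behavior: 4

Assumes refresh is independent from traffic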

High DRAM Utilization vs. QoS 

■ High DRAM utilization (throughput) and Quality of Service are in conflict

• Utilization prefers long DRAM bursts

o DRAM operates most efficiently

• QoS demands short DRAM bursts

o Provide low latency service for CPUs

o Control buffering requirements for real-time users

■ Consumer SoCs use the DRAM scheduler to tune the trade-off between utilization and QoS


Locality Challenges: DRAM Access Styles 

■ Exploiting spatial locality is key for high utilization

• CPUs tend to stay with an O/S page (multiple DRAM pages)

• Much processing and I/O DMA uses long incrementing bursts

• Image processing is tougher

■ Two-dimensional bursts

• 2D transaction using a single read or write command

• Popular for HD video and graphics

[Figure: a 2-D data object in memory inside a video frame buffer of 2048 Bytes/row and 1080 rows/frame. The object covers five rows, at addresses 0x08F0-0x0908, 0x10F0-0x1108, 0x18F0-0x1908, 0x20F0-0x2108 and 0x28F0-0x2908.]

Locality Challenges: DRAM Access Styles 

■ Address tiling

• Rearrange DRAM address organization to exploit locality

• Avoids page misses

• One size doesn’t fit all!

[Figure: the same 2-D data object (rows at 0x08F0, 0x10F0, 0x18F0, 0x20F0, 0x28F0) with address tiling applied. The 1080-row frame buffer is divided into DRAM pages (Page 0, Page 1, … Page N, Page N+1) labeled 256B/page and 16 rows/page, with neighboring pages on different DRAM banks (bank 0, bank 1, bank 2).]
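
How tiling reduces the number of DRAM pages a 2D access touches can be sketched in Python. This is a minimal model under stated assumptions: the 2048-byte row pitch comes from the figure, while reading "256B/page, 16 rows/page" as a tile 256 bytes wide by 16 rows tall (a 4096-byte DRAM page) is an interpretation, and the function names are hypothetical.

```python
ROW_PITCH = 2048               # bytes per frame-buffer row (from the figure)
TILE_W = 256                   # assumed tile width in bytes
TILE_H = 16                    # assumed tile height in rows
PAGE_BYTES = TILE_W * TILE_H   # assumed DRAM page size: 4096 bytes


def linear_page(x, y):
    """DRAM page of byte (x, y) when rows are laid out one after another."""
    return (y * ROW_PITCH + x) // PAGE_BYTES


def tiled_page(x, y):
    """DRAM page of byte (x, y) when each page holds one TILE_W x TILE_H tile."""
    tiles_per_row = ROW_PITCH // TILE_W
    return (y // TILE_H) * tiles_per_row + (x // TILE_W)


def pages_touched(page_of, x0, y0, width, height):
    """Number of distinct DRAM pages covered by a width x height byte block."""
    return len({page_of(x, y)
                for y in range(y0, y0 + height)
                for x in range(x0, x0 + width)})


# A 16-byte-wide, 16-row block starting at the frame origin
print(pages_touched(linear_page, 0, 0, 16, 16), pages_touched(tiled_page, 0, 0, 16, 16))
```

Under these assumptions the 16-row block spans 8 pages in the linear layout but only 1 when tiled. A long single-row burst shows the opposite trend, since one row crosses a new tile every 256 bytes, which is the sense in which one size doesn't fit all.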

Access Granularity: 

DRAM Burst Sizes Growing Too Large 

[Figure: chart covering 2003-2009. Left axis (0-10): DRAM words per burst (BL) for DDR, DDR2 and DDR3, and DDR width in bytes. Right axis (0-70): minimum DRAM burst in bytes, growing from 8 Bytes to 64 Bytes across the DDR, DDR2 and DDR3 generations. A callout marks the size range of many SoC data objects.]

Data Transfers Shorter Than Burst Size Lose Efficiency

Example: Analytic Traffic Characterization 

DDR configurations compared: BL = 8 in all cases; data widths of 2, 4 and 8 Bytes give DDR bursts of 16, 32 and 64 Bytes.

Best-case transfer efficiency, by DDR burst size

Traffic flow    Burst length (dword)   Height   Aligned?   16 B    32 B    64 B
Vid decode Wr   2                      8        N          53%     36%     22%
Vid decode Rd   10                     3        N          85%     74%     59%
Vid back end    32                     1        Y          100%    100%    100%

Page misses per DDR burst, by DDR burst size (not tiled / tiled)

Traffic flow    16 B         32 B          64 B
Vid decode Wr   53% / 7%     73% / 9%      89% / 11%
Vid decode Rd   17% / 6%     30% / 10%     47% / 16%
Vid back end    6% / 6%      13% / 13%     25% / 25%

Source: Customer (HDTV) System Dataflow

■ As DRAM burst size increases, efficiency drops substantially for short and/or unaligned traffic

• MPEG macro-block fetch is an easy example

• Long burst traffic stays efficient

• CPU traffic loses efficiency if DDR burst size > cache line size

■ 2D traffic generates many page misses for row/bank/column DRAM address organization

• Address tiling reduces 2D page misses substantially

• Long burst traffic is (again) tolerant

• But, traditional CPU traffic prefers row/bank/column!
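
Why short or unaligned transfers lose efficiency as the DDR burst grows can be shown with a small Python sketch. It is a simplified model, not the method behind the customer table: it assumes every DDR burst that a transfer touches is fetched in full, and it does not attempt to reproduce the table's percentages, which depend on traffic details the slides do not give. The function names are illustrative.

```python
def bursts_spanned(start, length, burst_bytes):
    """Number of burst-aligned DDR bursts covered by bytes [start, start + length)."""
    first = start // burst_bytes
    last = (start + length - 1) // burst_bytes
    return last - first + 1


def transfer_efficiency(start, length, burst_bytes):
    """Useful bytes divided by bytes actually moved on the DRAM data bus."""
    return length / (bursts_spanned(start, length, burst_bytes) * burst_bytes)


# An 8-byte transfer against 16-, 32- and 64-byte DDR bursts
for burst in (16, 32, 64):
    print(burst, transfer_efficiency(0, 8, burst), transfer_efficiency(burst - 4, 8, burst))
```

In this model a long aligned transfer stays at 100% for every burst size, while the 8-byte transfer falls from 50% to 12.5% when aligned and to half of that again when it straddles a burst boundary.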


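Outline

■ Multicore consumer SoC background

■ DRAM subsystem challenges

■ Solution aspects

• High concurrency interconnect networks

• Single ported DRAM controller protocols

• High utilization scheduling with QoS

• Scalable multichannel

• Flexible address tiling

■ Putting it all together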

Star Topology Memory Subsystems

Traditional Approach

[Figure: CPU, MME, DSP and GFX initiators, each with data FIFOs, connect on-chip by dedicated links to a multi-port DRAM subsystem, which drives the off-chip DRAMs over address/command and data buses. The on-chip links are marked "Congestion Problem".]

■ Initiators present requests in parallel to multi-port scheduler

■ FIFO’s at initiators provide

• Rate decoupling

• Service jitter tolerance

■ DRAM subsystem needs no FIFO, only pipelining

■ System performance limited only by traffic & scheduler

But:

■ LOTS of wires/congestion

■ Lots of small/inefficient FIFO’s

■ Large part of system must be BW matched to DRAM

Imagine Dedicated Links From > 100 DRAM-connected Cores!

Single-ported Memory Subsystems 

Shared Interconnect Approach 

[Figure: CPU, MME, DSP and GFX initiators connect on-chip through a shared interconnect to a single-port DRAM subsystem, which drives the off-chip DRAMs over address/command and data buses.]

■ Interconnect presents requests in series to single-port scheduler

■ Saves wires/congestion

But:

■ Interconnect arbitration impacts scheduling

• Risks lower utilization

• May not meet deadlines

■ Where do FIFO’s live?

■ How much of system is BW-matched to DRAM?

■ System performance also limited by communication system


Single-port DRAM Protocols 

■ Interconnect and subsystem must support multiple outstanding requests (cover DRAM pipeline depth)

[Figure: three interface options between the network and the DRAM scheduler and controller: In-Order, O-O-O/Blocking FC, and O-O-O/Non-Blocking FC (per-thread flow control).]

In-order Protocol

■ Interface protocol supports multiple outstanding burst requests, but all service matches request order

■ Simplest scheme (lowest hardware cost)

■ Service order determined by interconnect arbitration

■ Scheduler can only optimize pipeline (looking ahead for page misses to other banks)

■ High efficiency requires long bursts, leads to high latency (poor QoS)

Example: VSIA BVCI (AMBA AHB needs multi-port)

Out-of-order Protocol with Blocking FC 

■ Interface protocol provides ordering tags to allow scheduler to reorder some requests, but flow control is shared across all tags

■ Interconnect presents requests in order

■ Scheduler queues requests & chooses order to optimize throughput and QoS

But

■ Bursty flows can fill queues, hurting latency & BW for others

■ Full queues block into network

■ Frequency scales poorly with queue depth

Example: AMBA AXI

[Figure: the network feeds a single shared queue in front of the scheduler and DRAM controller; head-of-line blocking is called out.]

Out-of-order Protocol with Non-blocking FC 

■ Interface protocol provides per-thread ID’s and flow control, enabling re-ordering while preventing blocking

■ Interconnect maps initiator threads into target threads

■ Scheduler queues requests & chooses order to optimize throughput and QoS on per-thread basis

■ Non-blocking (per-thread) flow control minimizes inter-thread interactions

■ Per-thread queues inherently ordered, implemented as compiled SRAM

■ Result: lower latency, BW guarantees, higher guaranteed throughput

Example: OCP Threads

[Figure: the network feeds per-thread queues in front of the scheduler and DRAM controller.]
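
The difference between shared (blocking) and per-thread (non-blocking) flow control can be illustrated with a toy Python model. The assumptions are loud ones: requests arrive in a fixed order, nothing drains from the queues during the window being modeled, and the function names and queue depths are made up for the example.

```python
def accepted_shared(arrivals, depth):
    """Shared queue, shared flow control: once full, the interface stalls for everyone."""
    queue = []
    for thread in arrivals:
        if len(queue) == depth:
            break                  # every request behind this point is blocked
        queue.append(thread)
    return queue


def accepted_per_thread(arrivals, depth_per_thread):
    """Per-thread queues and flow control: a full queue stalls only its own thread."""
    queues = {}
    stalled = set()
    for thread in arrivals:
        queue = queues.setdefault(thread, [])
        if thread in stalled or len(queue) == depth_per_thread:
            stalled.add(thread)    # keep order within the stalled thread
            continue
        queue.append(thread)
    return queues


arrivals = ["dma"] * 5 + ["cpu"]   # a bursty DMA flow arrives just ahead of one CPU request
print("cpu" in accepted_shared(arrivals, depth=4))
print("cpu" in accepted_per_thread(arrivals, depth_per_thread=2))
```

With the same total buffering (4 entries), the shared queue fills with DMA requests and the CPU request cannot enter, while the per-thread version accepts it. That is the head-of-line blocking the slides attribute to shared flow control.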

Single-port DRAM Subsystem Protocols 

Ordering/flow control | In-order/blocking | Out-of-order/blocking | Out-of-order/non-blocking
Peak BW limited by | DRAM | DRAM | DRAM
Ordering flexibility | None | High | High
Queuing | None | Shared | Per-thread
Compiled RAM-friendly | No | No | Yes
Init. BW==DRAM BW | Yes | Yes | No
DRAM efficiency | Medium | High | High
Max. CPU latency | High | Medium | Low
Data interleaving | None | Minor | High


Optimizing for High Utilization 

Goal | Approach
Minimize RD/WR turnarounds | Group reads and writes
Hide page misses | Bank scheduling
Avoid page misses to banks with conflicting transfers in flight | Bank state tracking
Maximize page/bank scheduling opportunities at minimum area | Expose intrinsic traffic concurrency to scheduler
Ensure write data bursts at DDR rate | Buffer write data in DDR clock domain
Ensure read data absorbed at DDR rate | Buffer read data in DDR clock domain
Isolate SoC architecture from DDR clock | Asynchronous FIFO
Prevent low-bandwidth initiators from stalling DDR | Decoupling FIFO
Minimize CPU latency | Make CPU highest priority, interleave bursts & ensure paths cannot block
Eliminate page/bank thrashing | Protect groups of DDR bursts against higher priority traffic
Protect against best-effort traffic starvation | Demote QoS traffic overusing bandwidth & fair best-effort arbitration
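
The first row of the table, grouping reads and writes to cut data bus turnarounds, can be sketched in a few lines of Python. This is a hypothetical illustration, not the MemMax algorithm: it reorders within a fixed look-ahead window and assumes the reordered requests have no address dependencies on each other.

```python
def turnarounds(ops):
    """Count read/write direction changes in a sequence of 'R' and 'W' operations."""
    return sum(1 for a, b in zip(ops, ops[1:]) if a != b)


def group_reads_writes(ops, window):
    """Within each look-ahead window, issue operations in the current direction first."""
    out = []
    direction = ops[0] if ops else "R"
    for i in range(0, len(ops), window):
        chunk = ops[i:i + window]
        same = [op for op in chunk if op == direction]
        other = [op for op in chunk if op != direction]
        out += same + other
        if other:
            direction = other[0]   # the bus is now pointing the other way
    return out


ops = list("RWRWRWRW")
grouped = group_reads_writes(ops, window=4)
print("".join(grouped), turnarounds(ops), turnarounds(grouped))
```

A wider window removes more turnarounds but holds requests longer, which is one concrete form of the utilization-versus-QoS tension described earlier.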


QoS-based Arbitration 

■ Initiator data flow threads mapped to DRAM threads by interconnect

• e.g. 40 data flows sharing 8 DRAM threads in a digital video system

■ Independent threads assigned to QoS level

■ Non-blocking, multi-threaded fabric and DRAM interfaces allow:

• Higher priority requests to interleave with & respond before others

• Guaranteed BW threads to minimize buffering & receive latency guarantees

• High DRAM utilization

Thread QoS Level | Bandwidth Allocation? | QoS Model
Priority | Yes | Low latency while within BW allocation, best-effort otherwise
Bandwidth | Yes | Guaranteed BW while within BW allocation, best-effort otherwise
Best-effort | No | N/A
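
The three QoS levels and the "best-effort otherwise" demotion rule can be sketched as a tiny arbiter in Python. The sketch makes several assumptions that are not in the slides: allocations are counted in requests per accounting window, the window never resets, and ties are broken by list order instead of the fair best-effort arbitration a real scheduler would use. `Thread` and `arbitrate` are hypothetical names.

```python
from dataclasses import dataclass

RANK = {"priority": 0, "bandwidth": 1, "best-effort": 2}  # lower rank wins


@dataclass
class Thread:
    name: str
    qos: str             # "priority", "bandwidth" or "best-effort"
    allocation: int = 0  # requests allowed per window (assumed unit)
    used: int = 0        # requests served so far in this window
    pending: int = 0     # requests waiting

    def effective_level(self):
        """Threads that exhaust their allocation are demoted to best-effort."""
        if self.qos != "best-effort" and self.used >= self.allocation:
            return "best-effort"
        return self.qos


def arbitrate(threads):
    """Serve one request from the highest-ranked ready thread; return its name."""
    ready = [t for t in threads if t.pending > 0]
    if not ready:
        return None
    winner = min(ready, key=lambda t: RANK[t.effective_level()])
    winner.pending -= 1
    winner.used += 1
    return winner.name


threads = [
    Thread("dma", "best-effort", pending=3),
    Thread("video", "bandwidth", allocation=1, pending=2),
    Thread("cpu", "priority", allocation=2, pending=3),
]
print([arbitrate(threads) for _ in range(4)])
```

In this run the CPU is served first until its allocation is used up, the video thread then gets its guaranteed slot, and after that every thread competes as best-effort, so an over-using priority thread cannot starve the DMA flow.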


Multichannel Solves Access Granularity 

From Single to Multichannel

[Figure: CPU, DSP and MME initiators on a Sonics SMART Interconnect (SonicsMX, SonicsSX). The single-channel system has one MemMax driving four DDR3 devices (64Mx16x4, width label 64); the multichannel system has two MemMax channels, each driving two DDR3 devices (64Mx16x2, width label 32).]

                 DDR2    DDR3    DDR3
Channels         1       1       2
Data Width (B)   4       4       2
Effective BW     100%    84%     100%

Source: Customer (HDTV) System Dataflow

Constant Frequency/Ideal Load Balancing

Multichannel DRAM System Challenges 

[Figure: the application view of the address space (Region 1, Hole 1, Region 2, Region 3, Hole 2) beside three physical organizations: 2 channels with no interleave (whole regions assigned to Ch. 1 or Ch. 2), 2 channels interleaved (regions striped 1, 2, 1, 2), and 4 channels interleaved (regions striped 1, 2, 3, 4).]

Key Problems:

■ Load balancing

• Must balance memory traffic evenly among channels

■ Maintaining throughput

• Multiple channels cause throughput & ordering issues for pipelined memories

Software and IP cores must manage multiple channels explicitly


Interleaved Multichannel Technology (IMT*): Seamless Transition to Multichannel

*patents pending

■ Interleaving support requires splitting traffic for delivery to proper channel

• Splitting in memory controller creates performance and routing congestion bottleneck

■ Predictably high performance

• Automatically spreads traffic across channels to ensure load balancing

• Keeps DRAMs operating at full throughput, without costly reorder buffers

■ Scalable architecture

• Up to 8 interleaved channels within the same address region

• Fully distributed to avoid bottlenecks & placement restrictions

■ Application flexibility

• Transparent to software and initiator hardware

• Supports full or partial memory configurations – at run time

Multichannel Interleaving in the Interconnect

Higher Performance, Lower Area, More Scalable
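
The idea of splitting a burst at interleave boundaries and steering each piece to its channel can be sketched in Python. This shows generic address interleaving only: the slides do not describe how IMT selects channels, so the modulo mapping, the 64-byte interleave size in the example and the function names are all assumptions.

```python
def channel_of(addr, num_channels, interleave_bytes):
    """Channel that owns a byte address under simple round-robin interleaving."""
    return (addr // interleave_bytes) % num_channels


def split_burst(addr, length, num_channels, interleave_bytes):
    """Split a burst into (channel, start address, byte count) pieces at interleave boundaries."""
    pieces = []
    end = addr + length
    while addr < end:
        boundary = (addr // interleave_bytes + 1) * interleave_bytes
        piece_end = min(end, boundary)
        pieces.append((channel_of(addr, num_channels, interleave_bytes),
                       addr, piece_end - addr))
        addr = piece_end
    return pieces


# A 128-byte burst starting at 0x3C0, 4 channels, 64-byte interleave
print(split_burst(0x3C0, 128, 4, 64))
```

Doing this split where the request enters the network, as the slide argues, means each piece travels directly to its own channel instead of converging on one memory controller first.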


Transparent Multichannel Interleaving 

with Access-Optimized Boundaries 

Application View 

Physical Organization 


Independent Interleaving Size Support

2D Bursts, Address Tiling & Multichannel 

■ Two-dimensional block bursts

• 2D transaction using a single read/write command

• Popular for HD video and graphics

■ Address tiling

• Rearrange DRAM address organization to exploit 2D locality

• Avoids page misses

• >1 tiling schemes active at once

■ Channels divide buffer into columns

• Network splits 2D bursts that cross channel edges

[Figure: the tiled frame buffer from the address tiling slide (2-D data object with rows at 0x08F0, 0x10F0, 0x18F0, 0x20F0, 0x28F0; DRAM pages labeled 256B/page and 16 rows/page; 1080 rows/frame), with adjacent page columns on different DRAM banks.]

Outline 






DRAM-limited Consumer SoCs: Solution Requirements

■ DRAM subsystems optimized for high utilization and good Quality of Service (QoS) characteristics

■ On chip interconnection networks that manage the large and increasing numbers of DRAM consumers

• And protect the IP cores from DRAM evolution

■ Solutions to inefficiencies due to different access patterns and access granularities

■ Analysis tooling to enable SoC architecture exploration and performance validation


MemMax Memory Scheduler 

[Figure: MemMax block diagram with three callouts: "Multi-threaded & multi-tagged with non-blocking flow control", "In-order with early pre-charge/activate", and "Compiled (SRAM) data buffers decouple rates & cross clock domain".]

■ Request grouping

■ 2D burst support

■ Guaranteed bandwidth QoS with demotion

■ Per-thread buffer sizing

■ Run time re-programmable

Futures

Futures 

■ 3D integration 

■ Power management 

■ Heterogeneous cache coherence 


Wide I/O: TSV-enabled Mobile DRAM 

■ High bandwidth, even at low capacity 

• Like 4 (x32) channels of LPDDR2-800 

• 12.8 GBytes/sec peak bandwidth 

■ Lowest power 

• No PHY (simple drivers/receivers, no PLL/DLL) 

• Low loading (capacitance and inductance) 

• Modest frequency (200 MHz) 

■ Smallest form factor 

• 3D stacked based on thin (50µm) TSV-based die 

■ Minimal change to DRAM design (ex-TSV) 

■ Risks are all TSV-related 

■ Smartphone market will drive volumes 

■ SoC problem: how to spread traffic across channels? 


Example Wide I/O System Solution 

[Figure: an SoC in which a transport engine, host CPU, video decoder, video post processor, video pre processor, graphics engine, audio processor, streaming processor and debug interface each connect through an initiator agent (IA) to the Sonics Interconnect with IMT. Target agents (TA) connect the interconnect to SRAM, storage, GPIO, USB and FLASH, and to MemMax 0 through MemMax 3, each paired with a Wide I/O controller (0 through 3) in front of the Wide I/O DRAM.]

■ Concurrency mgmt., channel splitting, load balancing

■ Scheduling for high efficiency & QoS

■ Wide I/O interface

THANK YOU!
