
Technion Presentation

Drew Wingard
Founder, CTO
June 2012


Industry Leader in System IP

■ Answering the SoC challenge
  • Help designers integrate entire systems onto one piece of silicon
  • Any IP on any chip at any time
■ Proven technology for 15 years
  • >1 billion units shipped
  • >150 designs taped out
■ Key designs with leaders
  • 7 of top 10 SoC semiconductor companies
  • 4 of top 10 systems companies
■ Pioneering technology leader
  • Pioneer and world's #1 supplier of on-chip networks for advanced SoCs
  • Highest-efficiency memory subsystems


Enabling Billions of Connected Devices

1+ billion chips shipped with Sonics technology
• High-growth technology leader
• Headquartered in Silicon Valley
• Pioneer of on-chip network IP
• Advanced memory subsystems


Advanced On-Chip Network Solutions

Complete product portfolio for every SoC
■ SonicsGN
  • High-speed routed network
■ SonicsSX™
  • Low latency and high speed
  • Advanced features: 2D, IMT
■ SonicsLX™
  • High-speed, cost-effective
■ SNAP™
  • High-efficiency AMBA buses
■ SonicsExpress™
  • Clock and power domain crossing
■ MemMax™, MemMax AMP
  • Increases DRAM efficiency
  • QoS scheduling of DRAM access
■ Sonics3220
  • Low power and low area
  • Non-blocking peripheral interconnect

[Figure: complex SoC built using Sonics products]


Our Communications Architecture

■ Separation
■ Abstraction
■ Optimization
■ Independence

[Figure: cores (CPU, DMA, DRAM controller) attach through master/slave sockets to agents and adapters on the SMART Interconnect internal fabric, separating each core's function from its communication; socket widths vary, e.g. 16-bit and 128-bit]


The Intelligence is in the Agents

■ Agents provide…
  • Protocol conversion
    o Agent adapts to IP core
  • Decoupling of IP cores from fabric
    o Provide local, isolated environment
  • Layered services
■ Agent services
  • Power management
  • Security management
  • Error management
  • QoS
  • Burst, width, and command conversion

[Figure: initiator sockets feed Initiator Agents (IA), which connect through the fabric to Target Agents (TA) and on to target sockets]


Sonics View on NoC

■ NoC adapts key abstractions from networking
  • Layering (orthogonalization of concerns)
    o Socket-based design
    o Optimized fabric protocols
    o Higher-level services (security, QoS, error management)
  • Distributed design & implementation
■ We call this decoupling


To 1 GHz and Beyond…


Why So Fast?

■ We're fully converged!
  • Computing
  • Graphics
  • Video/Audio
■ Everything runs user applications
■ Apps need Giga's
  • 1-2 GHz multicore CPUs
  • 100+ GFLOP multicore GPUs
  • 15-50 GB/sec DRAM
■ At consumer pricing
■ … and something to integrate it all!

Consumer Electronics: "Wish List" 2011
Product                   Rank (Ages 6-12)  Rank (Ages 13+)  Runs User Apps
iPad                      1                 1                ✓
Computer                  4                 2                ✓
iPhone                    3                 7                ✓
Tablet (non-iPad)         5                 5                ✓
TV                        9                 4                ✓
iPod Touch                2                 12               ✓
Kinect for Xbox 360       7                 9
E-Reader                  13                3                ✓
Smartphone (non-iPhone)   10                8                ✓
Blu-Ray Player            12                6                ✓
Nintendo 3DS              6                 16               ✓
PlayStation 3             11                11               ✓
Nintendo DS*              8                 15               ✓
Nintendo Wii              16                10               ✓
Xbox 360                  14                13               ✓
PlayStation Move          17                14
Other Mobile Phone        15                17               ✓
PlayStation Portable      18                18               ✓
Source: Nielsen, November 2011


Are We Ready?

■ TSMC is ready…
  • 28nm HPM, with full ecosystem enablement
■ ARM is ready…
  • Cortex-A15 CPU: 1-2.5 GHz, 1-4 cores/cluster
  • Mali-T658 GPU: 350 GFLOPs, 1-8 cores/cluster
■ DRAM vendors are ready…
  • DDR3/4: 1600-3200 Mb/sec/pin → 6-50 GB/sec, 1-4 channels
  • LPDDR2/3: 800-1600 Mb/sec/pin → 3-25 GB/sec, 1-4 channels
  • Wide IO: 200-266 Mb/sec/pin → 13-17 GB/sec, 4 channels
■ But what about the middle?

[Figure: tablet SoC with Cortex-A15, Mali-T658, video, audio, camera, display and USB cores connected through an unknown "?" fabric to multiple DDR channels]


Integration: Giga-SoC Performance Needs

■ GHz on-chip networks
  • Peak processor-to-multichannel DRAM bandwidth
  • Enable cache-to-cache transfers (e.g. AMBA® ACE)
  • Example: SonicsGN
■ Scalable multichannel DRAM support
  • 4 channels and above
  • With effective load balancing
  • Example: Sonics Interleaved Multichannel Technology
■ High-efficiency DRAM scheduling
  • ~85% sustained throughput
  • With CPU latency priority and guaranteed bandwidths
  • Example: Sonics MemMax®


Example Tablet Application Processor

[Figure: block diagram of an example tablet SoC. A Cortex-A15 quad-core cluster and a Cortex-A7 quad-core cluster (each with L2 cache) plus a Mali-T658 quad-core GPU connect through a coherency fabric to a SonicsGN on-chip network; clock domains range from 133 MHz peripherals up to 1333 MHz. Two Sonics MemMax memory schedulers feed two DRAM controllers at 1066 MHz. A Sonics3220 peripheral network attaches HDMI, Ethernet, PCIe, LCD controller, audio, video codec, USB, SATA, two cameras, ROM, secure ROM, security SRAM, DMA and APB peripherals. Power domains are marked throughout.]


But What About Power?

■ Convergence drives massive SoC integration
  • Thin is in!
■ All these Giga's cost power
  • But most devices run from batteries
■ Result: cannot afford to power entire SoC at once
  • "Dark silicon"
  • Power only those subsystems needed for current apps
  • And only as long as needed

Consumer Electronics: "Wish List" 2011
Product                   Rank (Ages 6-12)  Rank (Ages 13+)  Battery Powered
iPad                      1                 1                ✓
Computer                  4                 2                ✓
iPhone                    3                 7                ✓
Tablet (non-iPad)         5                 5                ✓
TV                        9                 4
iPod Touch                2                 12               ✓
Kinect for Xbox 360       7                 9
E-Reader                  13                3                ✓
Smartphone (non-iPhone)   10                8                ✓
Blu-Ray Player            12                6
Nintendo 3DS              6                 16               ✓
PlayStation 3             11                11
Nintendo DS*              8                 15               ✓
Nintendo Wii              16                10
Xbox 360                  14                13
PlayStation Move          17                14
Other Mobile Phone        15                17               ✓
PlayStation Portable      18                18               ✓
Source: Nielsen, November 2011


Managing Dark Silicon…

■ General techniques (in increasing order of difficulty; a toy sketch follows below)
  • Stop/start subsystem clocks
  • Dynamic clock frequency
  • On/off voltage domains
  • Dynamic voltage/frequency scaling (DVFS)
■ IP-specific techniques
  • ARM big.LITTLE (use the optimum IP for the loading)
■ Power managers implement the techniques
  • Software: flexible, but slow
  • Hardware: very responsive, but less flexible
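As a rough illustration of the techniques above (not a Sonics product feature), here is a toy Python sketch of a hardware power manager that applies the cheapest adequate technique per subsystem, using idle thresholds set as software policy. The state names and thresholds are invented for the example.

    def choose_state(active, idle_cycles, policy):
        """Pick a power state from software-set idle thresholds (in cycles)."""
        if active:
            return "run"                                  # full clock and voltage
        if idle_cycles >= policy["power_gate_after"]:
            return "power_off"                            # switch the voltage domain off
        if idle_cycles >= policy["clock_gate_after"]:
            return "clock_gated"                          # stop the subsystem clock
        return "dvfs_low"                                 # drop to a low voltage/frequency point

    policy = {"clock_gate_after": 100, "power_gate_after": 10_000}
    for idle in (0, 500, 20_000):
        print(idle, "->", choose_state(active=False, idle_cycles=idle, policy=policy))

The split matches the slide: software sets the policy (thresholds), responsive hardware applies it cycle by cycle.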


Example Tablet Application Processor

[Figure: the same example tablet application processor block diagram as before, repeated here to highlight its power domains]


Managing Power with SonicsGN

■ Flexible power domain support
  • Asynch/mesochronous crossings
  • Isolation/level shifters
■ HW-controlled safe shutdown
■ Automatic wakeup
■ Benefits:
  • More domains
  • Quicker shutdown
  • Faster wakeup
■ Keep more dark, more of the time

50% SoC power reduction!


Summary

■ GHz, GFLOPs and GB/sec are consumer design points
  • And your next SoC will need them!
■ SoC integration must exploit that performance
  • GHz on-chip networks: SonicsGN
  • Multichannel DRAM optimization: Sonics IMT
  • High-efficiency DRAM scheduling: Sonics MemMax
■ … while improving battery life
  • Automatic hardware power management, with software policies
■ Integration requirements
  • Twice the frequency
  • One half the SoC power


SonicsGN Introduction


SonicsGN (SGN) Benefits

■ Only network-on-chip available at 2X the speed of competing solutions
■ Lowest system power
■ Highly optimized area
■ Ideal for the 28nm process node and below
■ Advanced tooling – simplified network capture environment along with powerful analysis capability
■ Exceeds the fabric performance requirements for processors at speeds up to 1-3 GHz

On-Chip Network IP for Complex SoC Design


SGN Topology Choices

■ A routed network topology IP
■ Complements other Sonics IP components

[Figure: example shallow network topology]


On-Chip Network – Under the Hood

• Routed network
• Virtual channels
• Clock crossing
• Multiple power domains
• Bit conversion


Dolphin vs. Piranha

Fabric protocol                            Dolphin                    Piranha                                SGN benefits
First product                              SonicsMX                   SonicsGN
Introduction                               2004                       2011
System architecture                        Fabric + decoupled agents  Fabric + decoupled agents              >150 tape-outs, >1 billion chips
Switching element                          Crossbar, shared link      Router                                 Higher frequency
Flow control                               Single cycle               Credit-based                           Higher frequency / better clock crossing
Maximum switch depth                       4                          Unlimited                              Scalability
Separate request/response networks?        Yes                        Yes
Source routed?                             Yes                        Yes
Header/payload separation?                 Yes                        Yes
Header/payload transmission                Parallel                   Serialized                             Routing congestion and scalability
Blocking ID-based concurrency?             Yes                        Yes
Non-blocking concurrency                   Threads                    Virtual channels                       Mapping flexibility and lower area
Internal buffering                         Per thread                 Shareable                              Lower area
Per-link data widths w/ width conversion?  Yes                        Yes
Clock domain crossing                      Divided synchronous        Divided synch., mesochronous, async.   Easier timing closure and partitioning
Power domain crossing                      External                   Dynamic voltage, switched voltage      Lower power
Power management support                   Single domain              Many domains                           Per-domain control for lower power


Design Capture

[Figure: two tool views of the same design – a NoC block diagram view showing master-core and slave-core socket interfaces with separate request and response networks, and a power partitioning view showing the power domains]


NoC Floor Plan Interactions

[Figure: the same logical topology (initiators M1-M6, targets T1-T6) placed with a matching floor plan and with a mismatched floor plan]

■ NoC benefits are greatly impacted by a mismatched floor plan
■ Wiring congestion worse than a crossbar!
■ Timing convergence challenges
■ Need technology to protect the "logical" topology after the floor plan is determined
  • Architect analyzes logical topology
  • SoC team determines physical topology
  • Virtual channels: key to independence!


Power Partitioning


SGN Power Management Interface

■ Power management bundle
  • 2-signal handshake
  • 2-signal wakeup control (optional)
  • Activity status signal (optional)
■ Distributed interconnect power manager (IPM) – see the sketch below
  • Each IA determines the path for incoming requests
  • Identifies required PM domains
  • Determines if a request
    o can proceed
    o must terminate immediately
    o must request wake-up

[Figure: system power manager and interconnect power manager controlling the network's power domains]
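A hypothetical sketch of the IPM decision described above (the actual SGN handshake and signal names are not spelled out here): an initiator agent looks up the power domains on the request's path and decides whether the request proceeds, triggers a wake-up, or terminates with an error.

    def ipm_decision(path_domains, domain_state, wakeup_enabled=True):
        """domain_state maps domain name -> 'on'/'off'; returns the request's fate."""
        powered_off = [d for d in path_domains if domain_state.get(d) == "off"]
        if not powered_off:
            return "proceed"
        if wakeup_enabled:
            return "wakeup"        # raise a wake-up request toward the system power manager
        return "terminate"         # respond with an error immediately

    state = {"noc_dom": "on", "video_dom": "off"}
    print(ipm_decision(["noc_dom", "video_dom"], state))         # -> 'wakeup'
    print(ipm_decision(["noc_dom", "video_dom"], state, False))  # -> 'terminate'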


High Performance Consumer SoC Memory Subsystems


Outline

■ Multicore consumer SoC background
■ DRAM subsystem challenges
■ Solution aspects
■ Putting it all together


Consumer SoC Examples

■ What are some key applications for consumer SoCs?
■ Key characteristic: relentless push for higher-quality user experiences – at minimum system cost!


Concurrency in Consumer SoCs

Consumer MPSoCs process data in parallel, but communicate…

[Figure: H.264 decode pipeline – transport demux feeds the bitstream into entropy decoding, iScan, iQuant, iTrans, reconstruction, loop filter and decoded frames, with intra-prediction and MC-prediction loops, alongside audio decode and video out]


Concurrency in Consumer SoCs

[Figure: the same H.264 decode pipeline, with the communication between blocks routed through external DRAM]


Concurrency in Consumer SoCs

■ Assertion: video SoC applications have >> 50% of system traffic to/from external DRAM
■ Consumer volumes and price points demand the cheapest DRAM configurations that support the required performance
■ Implications:
  • SoC architecture is mostly a fan-in tree to external DRAM
  • Maximizing delivered DRAM throughput and utilization is key
    o Fewer DRAMs
    o Lower speed grades


Outline

■ Multicore consumer SoC background
■ DRAM subsystem challenges
  • Massive connectivity to DRAM
  • Achieving high DRAM utilization
  • QoS
  • Widely varying DRAM access styles
  • Increasing access granularity
■ Solution aspects
■ Putting it all together


SoC Architecture Trends

■ Massive feature integration
  • Driven largely by Moore's Law (supply) and convergence (demand)
■ Continued movement of complexity to software
■ Distributed architectures
  • Higher scalability (and independence?)
■ Multiple processors
  • (Multicore) CPU
  • DSP
  • Special purpose (MPEG, GFX, …)
■ Distributed DMA
  • Removes the centralized DMA bottleneck
  • Simplifies driver software integration

[Figure: system-on-chip with CPU, MPEG, DSP, 3D GFX, MAC, video I/O and comm I/O blocks all converging on the DRAM controller – massive connectivity to DRAM]




• 3 traditional processors
• ~90 DRAM connections
• Already 7 years ago!


Utilization Rate

■ Utilization rate: the percentage of DRAM data cycles that transfer data that is useful to the system
■ Example – at 85% utilization, 2 DDR3-1600 parts in a x32 configuration (e.g. 2 x16 DDR3 DRAMs) deliver:
  • (85%) x (1600 Mbits/sec/pin) x (32 pins) / (8 bits/Byte) = 5.44 GBytes/sec (see the sketch below)
■ Things which reduce DRAM utilization:
  • Refresh cycles
  • RD/WR data bus turnaround
  • Page misses
  • Partial bursts (lengths < BL)
  • Unaligned bursts
  • Command bus conflicts (across banks)
  • QoS optimizations!
■ DRAM schedulers arbitrate among a set of system transactions to optimize DRAM utilization rates
  • Perhaps QoS, too!
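The slide's arithmetic, generalized into a one-line helper (illustrative only; the function name and units are chosen for this example):

    def delivered_gb_per_s(utilization, mbit_per_pin, pins):
        """Delivered bandwidth = utilization x pin rate x pins, in GB/s."""
        return utilization * mbit_per_pin * 1e6 * pins / 8 / 1e9

    # 85% utilization on a x32 DDR3-1600 interface (2 x16 parts), as in the example above:
    print(f"{delivered_gb_per_s(0.85, 1600, 32):.2f} GB/s")   # 5.44 GB/s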


DRAM Utilization is Traffic-Dependent

■ System transactions targeting DRAM limit the peak utilization per initiator, but exploiting the parallelism between initiators allows an intelligent scheduler to optimize utilization
■ Utilization-related traffic characteristics:
  • Burst lengths, address sequences and address alignment
  • Address relationships across transactions (e.g. 2D)
  • Number of outstanding transactions
  • Read/write mix
  • Ordering constraints
  • Time-domain behavior (i.e. isochronous vs. bursty/asynchronous)


How Traffic Impacts DRAM Utilization

[Table: matrix mapping traffic characteristics (rows: burst lengths/sequences/alignment; addressing across transactions; number of outstanding transactions; read/write mix; ordering constraints; time-domain behavior) to the utilization impacts they influence (columns: refresh cycles, RD/WR turnaround, page misses, partial bursts, unaligned bursts, command conflicts, QoS optimizations). Refresh is assumed independent from traffic; burst lengths/sequences/alignment touch the most impacts, read/write mix the fewest.]


High DRAM Utilization vs. QoS

■ High DRAM utilization (throughput) and Quality of Service are in conflict
  • Utilization prefers long DRAM bursts
    o DRAM operates most efficiently
  • QoS demands short DRAM bursts
    o Provide low-latency service for CPUs
    o Control buffering requirements for real-time users
■ Consumer SoCs use the DRAM scheduler to tune the trade-off between utilization and QoS


Locality Challenges: DRAM Access Styles

■ Exploiting spatial locality is key for high utilization
  • CPUs tend to stay within an O/S page (multiple DRAM pages)
  • Much processing and I/O DMA uses long incrementing bursts
  • Image processing is tougher
■ Two-dimensional bursts
  • 2D transaction using a single read or write command
  • Popular for HD video and graphics

[Figure: a 2-D data object in a video frame buffer (2048 bytes/row, 1080 rows/frame) spans a range of DRAM addresses on every row it touches]


Locality Challenges: DRAM Access Styles

■ Exploiting spatial locality is key for high utilization
  • CPUs tend to stay within an O/S page (multiple DRAM pages)
  • Much processing and I/O DMA uses long incrementing bursts
  • Image processing is tougher
■ Two-dimensional bursts
  • 2D transaction using a single read or write command
  • Popular for HD video and graphics
■ Address tiling (a toy version is sketched below)
  • Rearrange the DRAM address organization to exploit locality
  • Avoids page misses
  • One size doesn't fit all!

[Figure: the same video frame buffer (2048 bytes/row, 1080 rows/frame) reorganized into 256 B x 16-row tiles, so the 2-D data object falls into a few DRAM pages spread across banks 0-3 instead of many pages on one bank]
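A toy Python sketch (not Sonics' actual tiling scheme) of why tiling helps 2-D accesses, using the slide's numbers: 2048-byte frame-buffer pitch, 2 KB DRAM pages, 4 banks, and tiles of 256 bytes x 16 rows per DRAM page. The access pattern and bank count are assumptions for the example.

    PITCH, PAGE, BANKS = 2048, 2048, 4     # bytes per frame row, bytes per DRAM page, banks

    def linear_page(x, y):
        """Row/bank/column style: consecutive frame rows land in different DRAM pages."""
        page = (y * PITCH + x) // PAGE
        return (page % BANKS, page // BANKS)          # (bank, page within bank)

    def tiled_page(x, y, tile_w=256, tile_h=16):
        """Tiled style: a 256 B x 16-row rectangle of the frame shares one DRAM page."""
        tile = (y // tile_h) * (PITCH // tile_w) + (x // tile_w)
        return (tile % BANKS, tile // BANKS)

    def page_misses(mapping, x0, y0, w, h):
        """Count page activations for a 2-D access, one open page per bank."""
        open_page, misses = {}, 0
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w, 8):            # 8-byte DRAM words
                bank, page = mapping(x, y)
                if open_page.get(bank) != page:
                    open_page[bank] = page
                    misses += 1
        return misses

    block = (0x0F0, 17, 64, 8)                        # a 64 x 8 byte block, like a macroblock fetch
    print("linear:", page_misses(linear_page, *block), "page misses")   # 8
    print("tiled: ", page_misses(tiled_page,  *block), "page misses")   # 2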


Access Granularity: DRAM Burst Sizes Growing Too Large

[Figure: chart of DRAM burst length (BL) and DDR interface width (bytes) from 2003 to 2009 for DDR, DDR2 and DDR3; the resulting minimum DRAM burst grows from 8 bytes to 64 bytes]

Data transfers shorter than the minimum burst size lose efficiency – and many SoC data objects are short.


Example: Analytic Traffic Characterization

                 Burst            Aligned?  Best-case transfer efficiency   Page misses per DDR burst (untiled / tiled)
Traffic flow     (dword)  Height            16 B     32 B     64 B          16 B burst     32 B burst     64 B burst
Vid decode Wr       2        8       N       53%      36%      22%          53% / 7%       73% / 9%       89% / 11%
Vid decode Rd      10        3       N       85%      74%      59%          17% / 6%       30% / 10%      47% / 16%
Vid back end       32        1       Y      100%     100%     100%           6% / 6%       13% / 13%      25% / 25%
(Minimum DDR burst = BL 8 x data width of 2, 4 or 8 bytes = 16, 32 or 64 bytes)
Source: Customer (HDTV) System Dataflow

■ As DRAM burst size increases, efficiency drops substantially for short and/or unaligned traffic (a simplified model is sketched below)
  • MPEG macro-block fetch is an easy example
  • Long-burst traffic stays efficient
  • CPU traffic loses efficiency if the DDR burst size > cache line size
■ 2D traffic generates many page misses for a row/bank/column DRAM address organization
  • Address tiling reduces 2D page misses substantially
  • Long-burst traffic is (again) tolerant
  • But traditional CPU traffic prefers row/bank/column!
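A simplified model of transfer efficiency versus minimum DRAM burst size (it will not reproduce the customer numbers above exactly, since those include 2-D burst shapes, but it shows the trend; the transfer lengths used are assumptions):

    import math

    def fetched_bytes(length, min_burst, aligned=True):
        """Bytes the DRAM actually moves for one transfer of `length` useful bytes."""
        if aligned:
            return math.ceil(length / min_burst) * min_burst
        # worst-case alignment: the transfer may straddle one extra minimum burst
        return math.ceil((length + min_burst - 1) / min_burst) * min_burst

    def efficiency(length, min_burst, aligned=True):
        return length / fetched_bytes(length, min_burst, aligned)

    for min_burst in (16, 32, 64):        # DDR-, DDR2-, DDR3-era minimum bursts (BL8 x width)
        print(f"{min_burst:2d} B min burst:",
              f"64 B aligned line {efficiency(64, min_burst):4.0%},",
              f"32 B aligned line {efficiency(32, min_burst):4.0%},",
              f"8 B unaligned fetch {efficiency(8, min_burst, aligned=False):4.0%}")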


Outline

■ Multicore consumer SoC background
■ DRAM subsystem challenges
■ Solution aspects
  • High-concurrency interconnect networks
  • Single-ported DRAM controller protocols
  • High-utilization scheduling with QoS
  • Scalable multichannel
  • Flexible address tiling
■ Putting it all together


Star Topology Memory Subsystems: Traditional Approach

[Figure: CPU, MME, DSP and GFX each connect through their own data FIFOs directly to a multi-port DRAM subsystem driving several DRAMs over shared address/command and data buses – an on-chip wiring congestion problem at the boundary to the off-chip DRAMs]

■ Initiators present requests in parallel to the multi-port scheduler
■ FIFO's at initiators provide
  • Rate decoupling
  • Service jitter tolerance
■ DRAM subsystem needs no FIFO, only pipelining
■ System performance limited only by traffic & scheduler
But:
■ LOTS of wires/congestion
■ Lots of small/inefficient FIFO's
■ Large part of the system must be BW-matched to DRAM

Imagine dedicated links from >100 DRAM-connected cores!


Single-ported Memory Subsystems: Shared Interconnect Approach

[Figure: CPU, MME, DSP and GFX share an on-chip interconnect that feeds a single-ported DRAM subsystem driving the off-chip DRAMs]

■ Interconnect presents requests in series to a single-port scheduler
■ Saves wires/congestion
But:
■ Interconnect arbitration impacts scheduling
  • Risks lower utilization
  • May not meet deadlines
■ Where do the FIFO's live?
■ How much of the system is BW-matched to DRAM?
■ System performance also limited by the communication system


Single-port DRAM Protocols

■ The interconnect and subsystem must support multiple outstanding requests (to cover the DRAM pipeline depth)


In-order Protocol

■ Interface protocol supports multiple outstanding burst requests, but all service matches request order
■ Example: VSIA BVCI (AMBA AHB needs multi-port)
■ Simplest scheme (lowest hardware cost)
■ Service order determined by interconnect arbitration
■ Scheduler can only optimize the pipeline (looking ahead for page misses to other banks)
■ High efficiency requires long bursts, which leads to high latency (poor QoS)

[Figure: the three single-port options – in-order, out-of-order with blocking flow control, and out-of-order with non-blocking (per-thread) flow control – each shown as a network feeding a DRAM scheduler, controller and DRAMs; the in-order option is highlighted]


Out-of-order Protocol with Blocking FC

■ Interface protocol provides ordering tags to allow the scheduler to reorder some requests, but flow control is shared across all tags
■ Example: AMBA AXI
■ Interconnect presents requests in order
■ Scheduler queues requests & chooses the order to optimize throughput and QoS
But
■ Bursty flows can fill the queues, hurting latency & BW for others (head-of-line blocking)
■ Full queues block back into the network
■ Frequency scales poorly with queue depth

[Figure: the same three-option diagram; the out-of-order, blocking flow control option is highlighted]


Out-of-order Protocol with Non-blocking FC

■ Interface protocol provides per-thread ID's and flow control, enabling re-ordering while preventing blocking
■ Example: OCP threads
■ Interconnect maps initiator threads into target threads
■ Scheduler queues requests & chooses the order to optimize throughput and QoS on a per-thread basis
■ Non-blocking (per-thread) flow control minimizes inter-thread interactions (a toy model is sketched below)
■ Per-thread queues are inherently ordered, implemented as compiled SRAM
■ Result: lower latency, BW guarantees, higher guaranteed throughput

[Figure: the same three-option diagram; the out-of-order, non-blocking (per-thread) flow control option is highlighted]
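A toy Python model of per-thread, credit-based flow control (illustrative only, not the OCP protocol itself): each thread gets its own queue and credit count at the target, so a bursty thread that exhausts its credits stalls only itself.

    from collections import deque

    class ThreadedTarget:
        def __init__(self, threads, credits_per_thread):
            self.queues  = {t: deque() for t in range(threads)}
            self.credits = {t: credits_per_thread for t in range(threads)}

        def can_accept(self, thread):
            return self.credits[thread] > 0

        def send(self, thread, request):
            assert self.can_accept(thread), "initiator must hold a credit to send"
            self.credits[thread] -= 1
            self.queues[thread].append(request)

        def schedule_one(self):
            """Scheduler picks any non-empty thread queue (here: lowest thread id)."""
            for t, q in self.queues.items():
                if q:
                    self.credits[t] += 1          # credit returned to the initiator
                    return t, q.popleft()
            return None

    target = ThreadedTarget(threads=2, credits_per_thread=2)
    for i in range(2):                            # thread 0 is bursty and uses all its credits
        target.send(0, f"t0-req{i}")
    print("thread 0 blocked:", not target.can_accept(0))
    print("thread 1 still open:", target.can_accept(1))   # no head-of-line blocking
    target.send(1, "t1-req0")
    print("scheduled:", target.schedule_one())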


Single-port DRAM Subsystem Protocols

Ordering / flow control    In-order/blocking   Out-of-order/blocking   Out-of-order/non-blocking
Peak BW limited by         DRAM                DRAM                    DRAM
Ordering flexibility       None                High                    High
Queuing                    None                Shared                  Per-thread
Compiled RAM-friendly      No                  No                      Yes
Init. BW == DRAM BW        Yes                 Yes                     No
DRAM efficiency            Medium              High                    High
Max. CPU latency           High                Medium                  Low
Data interleaving          None                Minor                   High


Optimizing for High Utilization

Goal                                                            Approach
Minimize RD/WR turnarounds                                      Group reads and writes
Hide page misses                                                Bank scheduling
Avoid page misses to banks with conflicting transfers in flight  Bank state tracking
Maximize page/bank scheduling opportunities at minimum area     Expose intrinsic traffic concurrency to the scheduler
Ensure write data bursts at DDR rate                            Buffer write data in the DDR clock domain
Ensure read data is absorbed at DDR rate                        Buffer read data in the DDR clock domain
Isolate SoC architecture from the DDR clock                     Asynchronous FIFO
Prevent low-bandwidth initiators from stalling DDR              Decoupling FIFO
Minimize CPU latency                                            Make CPU highest priority, interleave bursts & ensure paths cannot block
Eliminate page/bank thrashing                                   Protect groups of DDR bursts against higher-priority traffic
Protect against best-effort traffic starvation                  Demote QoS traffic overusing bandwidth & fair best-effort arbitration

(A toy sketch of the first two goals follows below.)
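A toy scheduler sketch in Python (not the MemMax algorithm) illustrating two of the goals above: group reads with reads and writes with writes to avoid bus turnarounds, and prefer requests that hit an already-open DRAM page. The request format and scoring order are assumptions for the example.

    def pick_next(pending, last_dir, open_pages):
        """pending: list of dicts with 'dir' ('R'/'W'), 'bank', 'page'."""
        def score(req):
            page_hit = open_pages.get(req["bank"]) == req["page"]   # avoids a page miss
            same_dir = req["dir"] == last_dir                       # avoids RD/WR turnaround
            return (page_hit, same_dir)                             # page hits first, then direction
        return max(pending, key=score) if pending else None

    open_pages = {0: 7, 1: 3}
    pending = [
        {"dir": "W", "bank": 0, "page": 9},   # page miss and turnaround
        {"dir": "R", "bank": 1, "page": 3},   # page hit, same direction -> chosen
        {"dir": "R", "bank": 0, "page": 8},   # page miss, same direction
    ]
    print(pick_next(pending, last_dir="R", open_pages=open_pages))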


QoS-based Arbitration

■ Initiator data-flow threads mapped to DRAM threads by the interconnect
  • e.g. 40 data flows sharing 8 DRAM threads in a digital video system
■ Independent threads assigned to QoS levels (sketched below)
■ Non-blocking, multi-threaded fabric and DRAM interfaces allow:
  • Higher-priority requests to interleave with & respond before others
  • Guaranteed-BW threads to minimize buffering & receive latency guarantees
  • High DRAM utilization

Thread QoS level   Bandwidth allocation?   QoS model
Priority           Yes                     Low latency while within the BW allocation, best-effort otherwise
Bandwidth          Yes                     Guaranteed BW while within the BW allocation, best-effort otherwise
Best-effort        No                      N/A
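An illustrative Python sketch of the demotion behavior in the table (the window accounting and level names are simplified assumptions, not the MemMax arbiter): a priority thread is served first while within its bandwidth allocation and is treated as best-effort once it overuses it.

    class QosThread:
        def __init__(self, name, level, allocation):
            self.name, self.level = name, level          # "priority", "bandwidth", or "best_effort"
            self.allocation, self.used = allocation, 0   # words allowed / used in this window

        def effective_level(self):
            if self.level == "best_effort" or self.used < self.allocation:
                return self.level
            return "best_effort"                         # demoted after overusing its allocation

    RANK = {"priority": 0, "bandwidth": 1, "best_effort": 2}

    def arbitrate(threads):
        winner = min(threads, key=lambda t: RANK[t.effective_level()])
        winner.used += 1
        return winner.name

    cpu   = QosThread("cpu",   "priority",  allocation=2)
    video = QosThread("video", "bandwidth", allocation=8)
    print([arbitrate([cpu, video]) for _ in range(5)])
    # cpu wins while within its allocation, then is demoted and video is served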


Multichannel Solves Access Granularity

[Figure: the same HDTV dataflow (CPU, DSP, MME) mapped onto a single-channel DDR2 subsystem, a single-channel DDR3 subsystem behind MemMax, and a two-channel DDR3 subsystem behind MemMax, using SonicsMX/SonicsSX SMART Interconnects – from single-channel to multichannel]

                 DDR2    DDR3    DDR3
Channels         1       1       2
Data width (B)   4       4       2
Effective BW     100%    84%     100%

Constant frequency / ideal load balancing
Source: Customer (HDTV) System Dataflow


Multichannel DRAM System Challenges

[Figure: an application view of the address space (Regions 1-3 separated by holes) and its physical organization across 2 channels with no interleaving, 2 channels interleaved, and 4 channels interleaved]

Key problems:
■ Load balancing
  • Must balance memory traffic evenly among channels
■ Maintaining throughput
  • Multiple channels cause throughput & ordering issues for pipelined memories

Software and IP cores must manage multiple channels explicitly.


Interleaved Multichannel Technology (IMT*): Seamless Transition to Multichannel
(*patents pending)

■ Interleaving support requires splitting traffic for delivery to the proper channel (sketched below)
  • Splitting in the memory controller creates a performance and routing congestion bottleneck
■ Predictably high performance
  • Automatically spreads traffic across channels to ensure load balancing
  • Keeps DRAMs operating at full throughput, without costly reorder buffers
■ Scalable architecture
  • Up to 8 interleaved channels within the same address region
  • Fully distributed to avoid bottlenecks & placement restrictions
■ Application flexibility
  • Transparent to software and initiator hardware
  • Supports full or partial memory configurations – at run time

Multichannel interleaving in the interconnect: higher performance, lower area, more scalable
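A sketch in Python of the basic address interleaving idea (assumed for illustration, not Sonics' IMT implementation): map each address to a channel at a fixed interleave size, and split any burst that crosses an interleave boundary so each piece goes to the right channel with a channel-local address.

    INTERLEAVE = 4096          # bytes mapped to one channel before switching (assumed)
    CHANNELS   = 4

    def channel_of(addr):
        return (addr // INTERLEAVE) % CHANNELS

    def channel_local(addr):
        """Address within the selected channel's own linear space."""
        block = addr // (INTERLEAVE * CHANNELS)
        return block * INTERLEAVE + (addr % INTERLEAVE)

    def split_burst(addr, length):
        """Split a burst at interleave boundaries -> list of (channel, local addr, bytes)."""
        pieces = []
        while length > 0:
            chunk = min(length, INTERLEAVE - addr % INTERLEAVE)
            pieces.append((channel_of(addr), channel_local(addr), chunk))
            addr, length = addr + chunk, length - chunk
        return pieces

    print(split_burst(0x0F00, 512))   # crosses from channel 0 into channel 1

Doing this split in the distributed network agents, rather than in one memory controller, is what avoids the central bottleneck the slide mentions.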


Transparent Multichannel Interleaving with Access-Optimized Boundaries

[Figure: application view vs. physical organization, with independent interleaving-size support per address region]
Independent Interleaving Size Support


2D Bursts, Address Tiling & Multichannel

■ Two-dimensional block bursts
  • 2D transaction using a single read/write command
  • Popular for HD video and graphics
■ Address tiling
  • Rearrange the DRAM address organization to exploit 2D locality
  • Avoids page misses
  • >1 tiling schemes active at once
■ Channels divide the buffer into columns
  • The network splits 2D bursts that cross channel edges

[Figure: the tiled video frame buffer again (2048 bytes/row, 256 B/page, 16 rows/page, 1080 rows/frame), with a 2-D data object spanning DRAM pages spread across several banks]


Outline

■ Multicore consumer SoC background
■ DRAM subsystem challenges
■ Solution aspects
■ Putting it all together


DRAM-limited Consumer SoCs: Solution Requirements

■ DRAM subsystems optimized for high utilization and good Quality of Service (QoS) characteristics
■ On-chip interconnection networks that manage the large and increasing numbers of DRAM consumers
  • And protect the IP cores from DRAM evolution
■ Solutions to inefficiencies due to different access patterns and access granularities
■ Analysis tooling to enable SoC architecture exploration and performance validation


MemMax Memory Scheduler

■ Multi-threaded & multi-tagged with non-blocking flow control
■ In-order with early pre-charge/activate
■ Compiled (SRAM) data buffers decouple rates & cross clock domains
■ Address tiling
■ Request grouping
■ 2D burst support
■ Guaranteed-bandwidth QoS with demotion
■ Per-thread buffer sizing
■ Run-time re-programmable


Futures


Futures

■ 3D integration
■ Power management
■ Heterogeneous cache coherence


Wide I/O: TSV-enabled Mobile DRAM

■ High bandwidth, even at low capacity
  • Like 4 (x32) channels of LPDDR2-800
  • 12.8 GBytes/sec peak bandwidth (see the arithmetic below)
■ Lowest power
  • No PHY (simple drivers/receivers, no PLL/DLL)
  • Low loading (capacitance and inductance)
  • Modest frequency (200 MHz)
■ Smallest form factor
  • 3D stacked, based on thin (50µm) TSV-based die
■ Minimal change to the DRAM design (ex-TSV)
■ Risks are all TSV-related
■ Smartphone market will drive volumes
■ SoC problem: how to spread traffic across the channels?
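A back-of-envelope check of the 12.8 GB/s figure (assuming 4 channels at 128 data pins each, single data rate, matching the earlier 200 Mb/sec/pin slide):

    channels, pins_per_channel, mbit_per_pin = 4, 128, 200
    peak_gb_s = channels * pins_per_channel * mbit_per_pin * 1e6 / 8 / 1e9
    print(f"{peak_gb_s:.1f} GB/s")   # 12.8 GB/s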


Example Wide I/O System Solution

[Figure: SoC with host CPU, transport engine, video decoder, video pre- and post-processors, graphics engine, audio processor, streaming processor and debug interface, each behind an initiator agent (IA) on a Sonics interconnect with IMT; target agents (TA) feed four MemMax schedulers and four Wide I/O controllers driving the stacked Wide I/O DRAM, plus SRAM, GPIO, USB and FLASH storage]

■ Concurrency management, channel splitting, load balancing
■ Scheduling for high efficiency & QoS
■ Wide I/O interface


THANK YOU!
