Technion Presentation - OCP-IP
<strong>Technion</strong> <strong>Presentation</strong><br />
Drew Wingard<br />
Founder, CTO<br />
June 2012
Industry Leader in System <strong>IP</strong><br />
■ Answering the SoC challenge<br />
• Help designers integrate entire<br />
systems onto one piece of silicon<br />
• Any <strong>IP</strong> on any chip at any time<br />
■ Proven technology for 15<br />
years<br />
• >1 Billion units shipped<br />
• >150 designs taped out<br />
■ Key Designs with leaders<br />
• 7 of top 10 SoC semi companies<br />
• 4 of top 10 systems companies<br />
■ Pioneering Technology Leader<br />
• Pioneer and World’s #1 supplier of<br />
on-chip networks for advanced<br />
SoCs<br />
• Highest efficiency memory<br />
subsystems<br />
6/24/2012<br />
<strong>Technion</strong> <strong>Presentation</strong> 2
Enabling Billions of Connected Devices<br />
1+ Billion Chips Shipped with Sonics Technology<br />
• High-growth technology Leader<br />
• Headquartered in Silicon Valley<br />
• Pioneer of on-chip network <strong>IP</strong><br />
• Advanced memory subsystems<br />
Advanced On-Chip Network Solutions<br />
Complete product<br />
portfolio for every SoC<br />
■ SonicsGN<br />
• High speed routed network<br />
■ SonicsSX TM<br />
• Low-Latency and High-Speed<br />
• Advanced Features: 2D, IMT<br />
■ SonicsLX TM<br />
• High-speed cost effective<br />
■ SNAP TM<br />
• High efficiency AMBA buses<br />
■ SonicsExpress TM<br />
• Clock & Power domain crossing<br />
■ MemMax TM, MemMax AMP<br />
• Increases DRAM efficiency<br />
• QOS scheduling of DRAM access<br />
■ Sonics3220<br />
• Low power and low area<br />
• Non-blocking Peripheral Interconnect<br />
Complex SoC Using Sonics Products<br />
Our Communications Architecture<br />
■Separation<br />
■Abstraction<br />
■Optimization<br />
■Independence<br />
[Figure: CPU, DMA, and DRAM controller cores, each split into a core function and a communication function, attach through master and slave sockets (labeled 16 and 128) to agents and bus adapters on the internal fabric of the SMART Interconnect]
The Intelligence is in the Agents<br />
■ Agents provide…<br />
• Protocol conversion<br />
o Agent adapts to <strong>IP</strong> core<br />
• Decoupling of <strong>IP</strong> cores from fabric<br />
o Provide local, isolated environment<br />
• Layered services<br />
■ Agent services<br />
• Power management<br />
• Security management<br />
• Error management<br />
• QoS<br />
• Burst, width, and command<br />
conversion<br />
[Figure: initiator sockets attach to Initiator Agents (IA) and target sockets attach to Target Agents (TA), with the fabric between the agents]
Sonics View on NoC<br />
■NoC adapts key abstractions from<br />
networking<br />
• Layering (orthogonalization of concerns)<br />
oSocket-based design<br />
oOptimized fabric protocols<br />
oHigher level services (security, QoS, error<br />
mgmt.)<br />
• Distributed design & implementation<br />
■We call this decoupling<br />
To 1 GHz and Beyond…
Why So Fast?<br />
■ We’re fully converged!<br />
• Computing<br />
• Graphics<br />
• Video/Audio<br />
■ Everything runs user<br />
applications<br />
■ Apps need Giga’s<br />
• 1-2 GHz multicore CPUs<br />
• 100+ GFLOP multicore GPUs<br />
• 15-50 GB/sec DRAM<br />
■ At consumer pricing<br />
■ … and something to<br />
integrate it all!<br />
Consumer Electronics:<br />
“Wish List” 2011<br />
Product | Rank, ages 6-12 | Rank, ages 13+ | User apps<br />
iPad | 1 | 1 | ✓<br />
Computer | 4 | 2 | ✓<br />
iPhone | 3 | 7 | ✓<br />
Tablet (non-iPad) | 5 | 5 | ✓<br />
TV | 9 | 4 | ✓<br />
iPod Touch | 2 | 12 | ✓<br />
Kinect for Xbox 360 | 7 | 9 |<br />
E-Reader | 13 | 3 | ✓<br />
Smartphone (non-iPhone) | 10 | 8 | ✓<br />
Blu-Ray Player | 12 | 6 | ✓<br />
Nintendo 3DS | 6 | 16 | ✓<br />
PlayStation 3 | 11 | 11 | ✓<br />
Nintendo DS* | 8 | 15 | ✓<br />
Nintendo Wii | 16 | 10 | ✓<br />
Xbox 360 | 14 | 13 | ✓<br />
PlayStation Move | 17 | 14 |<br />
Other Mobile Phone | 15 | 17 | ✓<br />
PlayStation Portable | 18 | 18 | ✓<br />
Source: Nielsen, November 2011
Are We Ready?<br />
■ TSMC is ready…<br />
• 28nm HPM, with full ecosystem enablement<br />
■ ARM is ready…<br />
• Cortex-A15 CPU: 1-2.5 GHz, 1-4 cores/cluster<br />
• Mali-T658 GPU: 350 GFLOPs, 1-8 cores/cluster<br />
■ DRAM vendors are ready…<br />
• DDR3/4: 1600-3200 Mb/sec/pin → 6-50 GB/sec, 1-4 channels<br />
• LPDDR2/3: 800-1600 Mb/sec/pin → 3-25 GB/sec, 1-4 channels<br />
• Wide IO: 200-266 Mb/sec/pin → 13-17 GB/sec, 4 channels<br />
■ But what about the middle?<br />
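The bandwidth ranges quoted for each DRAM family can be reproduced from the per-pin rates with simple arithmetic. The sketch below is only an illustration: the function name is made up for this example, and the channel widths (32 data pins for DDR3/4 and LPDDR2/3, 128 for Wide IO) are assumptions that the slide does not state.

```python
def peak_gbytes_per_sec(mbits_per_pin, data_pins, channels=1):
    """Peak bandwidth in GB/s: per-pin rate x data pins x channels / 8 bits per byte."""
    return mbits_per_pin * data_pins * channels / 8 / 1000


# Assumed channel widths: 32 data pins (DDR3/4, LPDDR2/3), 128 data pins (Wide IO).
ddr_low = peak_gbytes_per_sec(1600, 32, channels=1)      # slowest part, 1 channel
ddr_high = peak_gbytes_per_sec(3200, 32, channels=4)     # fastest part, 4 channels
lpddr_low = peak_gbytes_per_sec(800, 32, channels=1)
lpddr_high = peak_gbytes_per_sec(1600, 32, channels=4)
wideio_low = peak_gbytes_per_sec(200, 128, channels=4)
wideio_high = peak_gbytes_per_sec(266, 128, channels=4)

for name, low, high in [
    ("DDR3/4", ddr_low, ddr_high),
    ("LPDDR2/3", lpddr_low, lpddr_high),
    ("Wide IO", wideio_low, wideio_high),
]:
    print(f"{name}: {low:.1f} to {high:.1f} GB/s peak")
```

Under those width assumptions the results land on the quoted 6-50, 3-25, and 13-17 GB/sec ranges, which are peak figures before any utilization loss.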
[Figure: tablet SoC with Cortex-A15, Mali-T658, video, audio, camera, display, and USB cores; the interconnect between the cores and the DDR channels is marked "?"]
Integration: Giga-SoC Performance Needs<br />
■ GHz on-chip networks<br />
• Peak processor-to-multichannel DRAM bandwidth<br />
• Enable cache-to-cache transfers (e.g. AMBA ® ACE)<br />
• Example: SonicsGN<br />
■ Scalable multichannel DRAM support<br />
• 4 channels and above<br />
• With effective load balancing<br />
• Example: Sonics Interleaved Multichannel Technology<br />
■ High efficiency DRAM scheduling<br />
• ~85% sustained throughput<br />
• With CPU latency priority and guaranteed bandwidths<br />
• Example: Sonics MemMax ®<br />
Example Tablet Application Processor<br />
[Figure: block diagram of an example tablet application processor. Blocks shown: a Cortex-A15 quad-core cluster and a Cortex-A7 quad-core cluster, each with L2 cache; a Mali-T658 quad-core GPU; a coherency fabric; the SonicsGN on-chip network; two Sonics MemMax memory schedulers, each with a DRAM controller; ROM, secure ROM, security SRAM, and DMA; HDMI, Ethernet, PCIe, LCD controller, audio, and video codec; and a Sonics3220 peripheral network with USB, Cam 1, Cam 2, SATA, and APB peripherals. Interface clocks range from 133 MHz to 1333 MHz, and the blocks are grouped into power domains.]
But What About Power?<br />
■ Convergence drives<br />
massive SoC integration<br />
• Thin is in!<br />
■ All these Giga’s cost<br />
power<br />
• But most devices run from<br />
batteries<br />
■ Result: cannot afford to<br />
power entire SoC at once<br />
• “Dark silicon”<br />
• Power only those subsystems<br />
needed for current apps<br />
• And only as long as needed<br />
Consumer Electronics:<br />
“Wish List” 2011<br />
Product | Rank, ages 6-12 | Rank, ages 13+ | Battery powered<br />
iPad | 1 | 1 | ✓<br />
Computer | 4 | 2 | ✓<br />
iPhone | 3 | 7 | ✓<br />
Tablet (non-iPad) | 5 | 5 | ✓<br />
TV | 9 | 4 |<br />
iPod Touch | 2 | 12 | ✓<br />
Kinect for Xbox 360 | 7 | 9 |<br />
E-Reader | 13 | 3 | ✓<br />
Smartphone (non-iPhone) | 10 | 8 | ✓<br />
Blu-Ray Player | 12 | 6 |<br />
Nintendo 3DS | 6 | 16 | ✓<br />
PlayStation 3 | 11 | 11 |<br />
Nintendo DS* | 8 | 15 | ✓<br />
Nintendo Wii | 16 | 10 |<br />
Xbox 360 | 14 | 13 |<br />
PlayStation Move | 17 | 14 |<br />
Other Mobile Phone | 15 | 17 | ✓<br />
PlayStation Portable | 18 | 18 | ✓<br />
Source: Nielsen, November 2011
Managing Dark Silicon…<br />
■ General techniques<br />
• Stop/start subsystem clocks<br />
• Dynamic clock frequency<br />
• On/off voltage domains<br />
• Dynamic voltage/frequency domains (DVFS)<br />
■ <strong>IP</strong>-specific techniques<br />
• ARM big.LITTLE (use optimum <strong>IP</strong> for loading)<br />
■ Power managers implement the techniques<br />
• Software: flexible, but slow<br />
• Hardware: very responsive, but less flexible<br />
[Figure label: difficulty]
Example Tablet Application Processor<br />
[Figure: the tablet application processor block diagram shown earlier, repeated with its power domains: Cortex-A15 and Cortex-A7 quad-core clusters, Mali-T658 quad-core GPU, coherency fabric, SonicsGN on-chip network, two Sonics MemMax memory schedulers with DRAM controllers, Sonics3220 peripheral network, and the attached memories, display, camera, audio, video, and I/O blocks, clocked from 133 MHz to 1333 MHz]
Managing Power with SonicsGN<br />
■ Flexible power<br />
domain support<br />
• Asynch/mesochronous<br />
• Isolation/level shifters<br />
■ HW-controlled safe<br />
shutdown<br />
■ Automatic wakeup<br />
■ Benefits:<br />
• More domains<br />
• Quicker shutdown<br />
• Faster wakeup<br />
■ Keep more dark,<br />
more of the time<br />
50% SoC Power Reduction!<br />
Summary<br />
■ GHz, GFLOPs and GB/sec are consumer design<br />
points<br />
• And your next SoC will need them!<br />
■ SoC integration must exploit that performance<br />
• GHz on-chip networks: SonicsGN<br />
• Multichannel DRAM optimization: Sonics IMT<br />
• High efficiency DRAM scheduling: Sonics MemMax<br />
■ … while improving battery life<br />
• Automatic hardware power management, with software<br />
policies<br />
■ Integration requirements<br />
• Twice the frequency<br />
• One half the SoC power<br />
SonicsGN Introduction
SonicsGN (SGN) Benefits<br />
■ Only Network-on-chip available at 2X<br />
the speed of competing solutions<br />
■ Lowest System Power<br />
■ Highly Optimized area<br />
■ Ideal for 28nm process node and<br />
below<br />
■ Advanced tooling – simplified network<br />
capture environment along with<br />
powerful analysis capability<br />
■ Exceeds the fabric performance<br />
requirements for processors at<br />
speeds up to 1-3GHz<br />
SGN Topology Choices<br />
Shallow Network<br />
■ A routed network topology <strong>IP</strong><br />
■ Complements other Sonics <strong>IP</strong> components
On-Chip Network – Under the Hood<br />
• Routed Network<br />
• Virtual channels<br />
• Clock Crossing<br />
• Multiple power<br />
domains<br />
• Bit conversion<br />
Dolphin vs. Piranha<br />
Fabric Protocol | Dolphin | Piranha | SGN Benefits<br />
First product | SonicsMX | SonicsGN |<br />
Introduction | 2004 | 2011 |<br />
System architecture | Fabric + decoupled agents | Fabric + decoupled agents | > 150 tape-outs, > 1 Billion chips<br />
Switching element | Crossbar, shared link | Router | Higher frequency<br />
Flow control | Single cycle | Credit-based | Higher frequency/better clock crossing<br />
Maximum switch depth | 4 | Unlimited | Scalability<br />
Separate request/response networks? | Yes | Yes |<br />
Source routed? | Yes | Yes |<br />
Header/payload separation? | Yes | Yes |<br />
Header/payload transmission | Parallel | Serialized | Routing congestion and scalability<br />
Blocking ID-based concurrency? | Yes | Yes |<br />
Non-blocking concurrency | Threads | Virtual channels | Mapping flexibility and lower area<br />
Internal buffering | Per thread | Shareable | Lower area<br />
Per link data widths with integrated width conversion? | Yes | Yes |<br />
Clock domain crossing | Divided synch. | Divided synch., Mesochronous, Asynchronous | Easier timing closure and partitioning<br />
Power domain crossing | External | Dynamic voltage, switched voltage | Lower power<br />
Power management support | Single domain | Many domain | Per domain control for lower power
Design Capture<br />
[Figure: two design-capture views. NoC Block Diagram View: master core socket interfaces, request network, response network, and slave core socket interfaces. Power Partitioning View: the same design grouped into power domains.]
NoC Floor Plan Interactions<br />
[Figure: Matching Floor Plan vs. Mismatched Floor Plan, each placing initiators M1-M6 and targets T1-T6]<br />
■ NoC benefits greatly impacted by mismatched floor plan<br />
■ Wiring congestion worse than crossbar!<br />
■ Timing convergence challenges<br />
■ Need technology to protect “logical” topology after floor plan<br />
determined<br />
• Architect analyzes logical topology<br />
• SoC team determines physical topology<br />
• Virtual channels: key to independence!<br />
Power Partitioning<br />
SGN Power Management Interface<br />
■ Power management bundle<br />
• 2 signal handshake<br />
• 2 signal wakeup control<br />
(optional)<br />
• Activity status signal (optional)<br />
■ Distributed interconnect<br />
power manager (<strong>IP</strong>M)<br />
• Each IA determines path for<br />
incoming requests<br />
• Identifies required PM domains<br />
• Determines if request<br />
o can proceed<br />
o must terminate immediately<br />
o must request wake-up<br />
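The per-request decision listed above can be sketched in a few lines. This is an illustrative model only, not the SonicsGN implementation: the names `Decision`, `required_domains`, `decide`, and the `wake_allowed` policy input are all hypothetical, and the rule for choosing between termination and wake-up is an assumption, since the slide does not say how that choice is made.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    TERMINATE = "terminate"
    WAKE_UP = "wake-up"


def required_domains(route, domain_of):
    """Power domains that the fabric elements on a request's path belong to."""
    return {domain_of[element] for element in route}


def decide(route, domain_of, powered, wake_allowed):
    """Classify one incoming request at an initiator agent (IA).

    powered:      set of power domains that are currently on
    wake_allowed: set of domains the agent may ask to wake (assumed policy input)
    """
    dark = required_domains(route, domain_of) - powered
    if not dark:
        return Decision.PROCEED      # every domain on the path is already up
    if dark <= wake_allowed:
        return Decision.WAKE_UP      # request a wake-up for the dark domains
    return Decision.TERMINATE        # path unavailable: fail the request now


# Hypothetical topology: element name -> power domain.
domain_of = {"ia_cpu": "cpu", "router0": "always_on",
             "ta_dram": "mem", "ta_video": "video"}
powered = {"cpu", "always_on"}
wake_allowed = {"mem"}

to_dram = decide(["ia_cpu", "router0", "ta_dram"], domain_of, powered, wake_allowed)
to_video = decide(["ia_cpu", "router0", "ta_video"], domain_of, powered, wake_allowed)
local = decide(["ia_cpu", "router0"], domain_of, powered, wake_allowed)
```

Keeping the decision at each initiator agent means no central manager sits on the request path, which matches the "distributed" description above.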
[Figure: System Power Manager and Interconnect Power Manager]
High Performance Consumer<br />
SoC Memory Subsystems
Outline<br />
■ Multicore consumer SoC background<br />
■ DRAM subsystem challenges<br />
■ Solution aspects<br />
■ Putting it all together<br />
Consumer SoC Examples<br />
■ What are some key applications for consumer SoCs?<br />
■ Key characteristic: relentless push for higher quality<br />
user experiences – at minimum system cost!<br />
Concurrency in Consumer SoCs<br />
Consumer MPSoCs process data in parallel, but<br />
communicate…<br />
[Figure: concurrent processing blocks in a consumer SoC. H.264 decode (bitstream in, decoded frames out) built from entropy decoding, iScan, iQuant, iTrans, reconstruction, loop filter, intra prediction, and MC prediction stages, shown as two parallel instances, alongside transport demux, audio decode, and video out]
Concurrency in Consumer SoCs<br />
[Figure: the H.264 decode stages (entropy decoding, iScan, iQuant, iTrans, reconstruction, loop filter, intra prediction, MC prediction), transport demux, audio decode, and video out blocks, now drawn together with DRAM]
Concurrency in Consumer SoCs<br />
■ Assertion: video SoC applications have >><br />
50% of system traffic to/from external DRAM<br />
■ Consumer volumes and price points demand<br />
cheapest DRAM configurations that<br />
support required performance<br />
■ Implications:<br />
• SoC architecture is mostly a fan-in tree to external<br />
DRAM<br />
• Maximizing delivered DRAM throughput and<br />
utilization are key<br />
o Fewer DRAMs<br />
o Lower speed grades<br />
Outline<br />
■ Multicore consumer SoC background<br />
■ DRAM subsystem challenges<br />
• Massive connectivity to DRAM<br />
• Achieving high DRAM utilization<br />
• QoS<br />
• Widely varying DRAM access styles<br />
• Increasing access granularity<br />
■ Solution aspects<br />
■ Putting it all together<br />
SoC Architecture Trends<br />
■ Massive feature integration<br />
• Driven largely by Moore’s Law (supply) and<br />
convergence (demand)<br />
■ Continued movement of complexity to software<br />
■ Distributed architectures<br />
• Higher scalability (and independence?)<br />
■ Multiple processors<br />
• (Multicore) CPU<br />
• DSP<br />
• Special purpose (MPEG, GFX, …)<br />
■ Distributed DMA<br />
• Removes centralized DMA bottleneck<br />
• Simplifies driver software integration<br />
[Figure: system on chip with CPU, MPEG, DSP, 3D GFX, MAC, video I/O, and comm I/O blocks around the DRAM controller. Caption: Massive Connectivity to DRAM]
• 3 traditional processors<br />
• ~90 DRAM connections<br />
• Already 7 years ago!<br />
Utilization Rate<br />
■ Utilization Rate: the percentage of DRAM data cycles<br />
that transfer data that is useful to the system<br />
■ Example – at 85% utilization, 2 DDR3-1600 parts in a<br />
x32 configuration (e.g. 2 x16 DDR3 DRAMs) deliver:<br />
• (85%) x (1600 Mbits/sec/pin) x (32 pins) / (8 bits/Byte) = 5.44<br />
GBytes/sec<br />
■ Things which reduce DRAM utilization:<br />
• Refresh cycles<br />
• RD/WR data bus turnaround<br />
• Page misses<br />
• Partial bursts (lengths < BL)<br />
• Unaligned bursts<br />
• Command bus conflicts (across banks)<br />
• QoS optimizations!<br />
■ DRAM schedulers arbitrate among a set of system<br />
transactions to optimize DRAM utilization rates<br />
• Perhaps QoS, too!<br />
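The worked example on this slide is a one-line formula, sketched here so other configurations can be tried. The function names are illustrative; the only inputs taken from the slide are the 85% utilization, the 1600 Mbits/sec/pin rate, and the 32 data pins.

```python
def delivered_gbytes_per_sec(utilization, mbits_per_pin, data_pins):
    """Useful bandwidth in GB/s: utilization x per-pin rate x data pins / 8 bits per byte."""
    return utilization * mbits_per_pin * data_pins / 8 / 1000


def required_utilization(target_gbytes_per_sec, mbits_per_pin, data_pins):
    """Utilization needed to deliver a target bandwidth from a given DRAM interface."""
    peak = delivered_gbytes_per_sec(1.0, mbits_per_pin, data_pins)
    return target_gbytes_per_sec / peak


# Slide example: two x16 DDR3-1600 parts forming a x32 interface at 85% utilization.
ddr3_x32 = delivered_gbytes_per_sec(0.85, 1600, 32)

# The inverse question: what utilization does a 4 GB/s workload need on that interface?
needed = required_utilization(4.0, 1600, 32)

print(f"delivered: {ddr3_x32:.2f} GB/s, utilization needed for 4 GB/s: {needed:.1%}")
```

The first result matches the 5.44 GBytes/sec on the slide. The inverse form is useful when deciding whether a cheaper or slower DRAM configuration can still carry the required traffic.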
DRAM Utilization is Traffic-Dependent<br />
■ System transactions targeting DRAM limit the<br />
peak utilization per initiator, but exploiting the<br />
parallelism between initiators allows an<br />
intelligent scheduler to optimize utilization<br />
■ Utilization-related traffic characteristics:<br />
• Burst lengths, address sequences and address<br />
alignment<br />
• Address relationships across transactions (e.g. 2D)<br />
• Number of outstanding transactions<br />
• Read/write mix<br />
• Ordering constraints<br />
• Time-domain behavior (i.e. isochronous vs.<br />
bursty/asynchronous)<br />
How Traffic Impacts DRAM Utilization<br />
Utilization Impact (columns): Refresh cycles, RD/WR turnaround, Page misses, Partial bursts, Unaligned bursts, Command conflicts, QoS optimizations<br />
Traffic Characteristic (rows, with the number of impact columns marked):<br />
Burst lengths/sequences/alignment X X X X X<br />
Addressing across transactions X X X X<br />
# of outstanding transactions X X X X<br />
Read/write mix X<br />
Ordering constraints X X X X<br />
Time-domain behavior X X X X<br />
Assumes refresh is independent from traffic
High DRAM Utilization vs. QoS<br />
■ High DRAM utilization (throughput) and<br />
Quality of Service are in conflict<br />
• Utilization prefers long DRAM bursts<br />
o DRAM operates most efficiently<br />
• QoS demands short DRAM bursts<br />
o Provide low latency service for CPUs<br />
o Control buffering requirements for real-time users<br />
■ Consumer SoCs use the DRAM scheduler to<br />
tune the trade-off between utilization and QoS<br />
Locality Challenges: DRAM Access Styles<br />
■ Exploiting spatial locality is key for high utilization<br />
• CPUs tend to stay with an O/S page (multiple DRAM pages)<br />
• Much processing and I/O DMA uses long incrementing bursts<br />
• Image processing is tougher<br />
■ Two-dimensional bursts<br />
• 2D transaction using<br />
a single read or write<br />
command<br />
• Popular for HD video<br />
and graphics<br />
[Figure: a 2-D data object in a video frame buffer (2048 Bytes/row, 1080 rows/frame). The object covers addresses 0x08F0-0x0908 in its first row and 0x28F0-0x2908 in its fifth, with successive rows 0x800 bytes apart.]
■ Address tiling<br />
• Rearrange DRAM<br />
address organization<br />
to exploit locality<br />
• Avoids page misses<br />
• One size doesn’t<br />
fit all!<br />
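A small sketch can show why tiling helps the 2-D object from the figure. It is an illustration under stated assumptions rather than the mapping any Sonics product uses: the tile is taken to be 256 bytes wide and 16 rows tall (read from the figure labels "256B/page" and "16 rows/page"), which implies a 4096-byte DRAM page, and the function names are made up for this example. Bank assignment is not modeled.

```python
ROW_PITCH = 2048                         # bytes per frame-buffer row (from the figure)
TILE_WIDTH = 256                         # bytes per tile row (assumed from "256B/page")
TILE_HEIGHT = 16                         # rows per tile (from "16 rows/page")
PAGE_BYTES = TILE_WIDTH * TILE_HEIGHT    # 4096-byte DRAM page (assumed)


def linear_page(addr):
    """Page index when consecutive addresses simply fill one page after another."""
    return addr // PAGE_BYTES


def tiled_page(addr):
    """Page index when each page holds a TILE_WIDTH x TILE_HEIGHT block of the frame."""
    y, x = divmod(addr, ROW_PITCH)
    tiles_per_row = ROW_PITCH // TILE_WIDTH
    return (y // TILE_HEIGHT) * tiles_per_row + (x // TILE_WIDTH)


def object_addresses(base, width_bytes, rows, word_bytes=8):
    """Word addresses of a 2-D object stored in the frame buffer."""
    for row in range(rows):
        for offset in range(0, width_bytes, word_bytes):
            yield base + row * ROW_PITCH + offset


# The object from the figure: 5 rows of 32 bytes starting at 0x08F0.
addrs = list(object_addresses(0x08F0, 32, 5))
linear_pages = {linear_page(a) for a in addrs}
tiled_pages = {tiled_page(a) for a in addrs}

print(f"linear mapping touches {len(linear_pages)} pages, tiled touches {len(tiled_pages)}")
```

For this small object the linear layout touches three pages and the tiled layout two. The gap widens for taller objects, since every 2048-byte row step in the linear layout moves half a page further on. A long incrementing burst, by contrast, would touch more pages under tiling, which is one reading of "one size doesn't fit all".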
[Figure: the 2-D data object (addresses 0x08F0-0x0908 through 0x28F0-0x2908) in a video frame buffer of 2048 Bytes/row and 1080 rows/frame, with address tiling at 256B/page and 16 rows/page. The object falls in DRAM page 0 on bank 0 and DRAM page 0 on bank 1; neighboring tiles are DRAM page 0 on banks 2 and 3, DRAM page 1 on banks 0 and 2, DRAM page N on banks 0 and 1, and DRAM page N+1 on bank 0.]
Access Granularity:<br />
DRAM Burst Sizes Growing Too Large<br />
[Chart, 2003-2009: DRAM burst length in words (BL) and DDR width in bytes on the left axis (0-10), minimum DRAM burst in bytes on the right axis (0-70), plotted for DDR, DDR2, and DDR3. Callouts: 8 Bytes, 64 Bytes, Many SoC Data Objects.]<br />
Data Transfers Shorter Than Burst Size Lose Efficiency
Example: Analytic Traffic Characterization<br />
Best-case transfer efficiency, by DDR burst size (BL 8; data width 2, 4, or 8 bytes):<br />
Traffic Flow | Burst Length (dword) | Height | Aligned? | 16 B | 32 B | 64 B<br />
Vid decode Wr | 2 | 8 | N | 53% | 36% | 22%<br />
Vid decode Rd | 10 | 3 | N | 85% | 74% | 59%<br />
Vid back end | 32 | 1 | Y | 100% | 100% | 100%<br />
Page misses per DDR burst, by DDR burst size and address tiling:<br />
Traffic Flow | 16 B untiled | 16 B tiled | 32 B untiled | 32 B tiled | 64 B untiled | 64 B tiled<br />
Vid decode Wr | 53% | 7% | 73% | 9% | 89% | 11%<br />
Vid decode Rd | 17% | 6% | 30% | 10% | 47% | 16%<br />
Vid back end | 6% | 6% | 13% | 13% | 25% | 25%<br />
Source: Customer (HDTV) System Dataflow<br />
■ As DRAM burst size increases, efficiency drops<br />
substantially for short and/or unaligned traffic<br />
• MPEG macro-block fetch is an easy example<br />
• Long burst traffic stays efficient<br />
• CPU traffic loses efficiency if DDR burst size > cache line size<br />
■ 2D traffic generates many page misses for<br />
row/bank/column DRAM address organization<br />
• Address tiling reduces 2D page misses substantially<br />
• Long burst traffic is (again) tolerant<br />
• But, traditional CPU traffic prefers row/bank/column!<br />
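The first point, that short or unaligned transfers waste part of each DRAM burst, can be illustrated with a simple byte-counting model. This is a sketch with made-up function names, and it is not the analysis behind the customer table above, which also depends on alignment statistics the slide does not give.

```python
def bursts_touched(start, length, burst_bytes):
    """Number of burst-aligned DRAM bursts needed to cover [start, start + length)."""
    first = start // burst_bytes
    last = (start + length - 1) // burst_bytes
    return last - first + 1


def transfer_efficiency(start, length, burst_bytes):
    """Useful bytes divided by bytes actually moved on the DRAM data bus."""
    return length / (bursts_touched(start, length, burst_bytes) * burst_bytes)


# A 32-byte cache line on a 64-byte minimum burst: half of each burst is wasted.
cache_line = transfer_efficiency(0, 32, 64)

# An aligned 128-byte burst stays fully efficient.
long_burst = transfer_efficiency(0, 128, 64)

# A 16-byte transfer that straddles two 16-byte bursts because it is unaligned.
unaligned = transfer_efficiency(8, 16, 16)

print(cache_line, long_burst, unaligned)
```

The cache-line case corresponds to the remark that CPU traffic loses efficiency once the DDR burst size exceeds the cache line size.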
Outline<br />
■ Multicore consumer SoC background<br />
■ DRAM subsystem challenges<br />
■ Solution aspects<br />
• High concurrency interconnect networks<br />
• Single ported DRAM controller protocols<br />
• High utilization scheduling with QoS<br />
• Scalable multichannel<br />
• Flexible address tiling<br />
■ Putting it all together<br />
Star Topology Memory Subsystems<br />
Traditional Approach<br />
[Figure: CPU, MME, DSP, and GFX initiators, each with data FIFO's, have dedicated on-chip links into a multi-port DRAM subsystem, which drives the off-chip DRAMs over Addr/Cmd and Data buses. Callout: Congestion Problem.]<br />
■ Initiators present requests in<br />
parallel to multi-port scheduler<br />
■ FIFO’s at initiators provide<br />
• Rate decoupling<br />
• Service jitter tolerance<br />
■ DRAM subsystem needs no<br />
FIFO, only pipelining<br />
■ System performance limited<br />
only by traffic & scheduler<br />
But:<br />
■ LOTS of wires/congestion<br />
■ Lots of small/inefficient FIFO’s<br />
■ Large part of system must be<br />
BW matched to DRAM<br />
Imagine Dedicated Links From > 100<br />
DRAM-connected Cores!<br />
Single-ported Memory Subsystems<br />
Shared Interconnect Approach<br />
[Figure: CPU, MME, DSP, and GFX initiators share one on-chip interconnect into a single-port DRAM subsystem, which drives the off-chip DRAMs over Addr/Cmd and Data buses]<br />
■ Interconnect presents requests in<br />
series to single-port scheduler<br />
■ Saves wires/congestion<br />
But:<br />
■ Interconnect arbitration impacts<br />
scheduling<br />
• Risks lower utilization<br />
• May not meet deadlines<br />
■ Where do FIFO’s live?<br />
■ How much of system is BW-matched<br />
to DRAM?<br />
■ System performance also limited<br />
by communication system<br />
Single-port DRAM Protocols<br />
■ Interconnect and subsystem must support multiple<br />
outstanding requests (cover DRAM pipeline depth)<br />
In-order Protocol<br />
■ Interface protocol supports multiple outstanding burst<br />
requests, but all service matches request order<br />
Example: VSIA BVCI (AMBA AHB needs multi-port)<br />
■ Simplest scheme (lowest hardware cost)<br />
■ Service order determined by interconnect arbitration<br />
■ Scheduler can only optimize pipeline (looking ahead for page misses to other banks)<br />
■ High efficiency requires long bursts, leads to high latency (poor QoS)<br />
[Figure: three single-port variants side by side (In-Order, O-O-O/Blocking FC, and O-O-O/Non-Blocking FC with per-thread flow control), each showing the network, scheduler, DRAM controller, DRAMs, and response path]
Out-of-order Protocol with Blocking FC<br />
■ Interface protocol provides ordering tags to allow scheduler to<br />
reorder some requests, but flow control is shared across all tags<br />
Example: AMBA AXI<br />
■ Interconnect presents requests in order<br />
■ Scheduler queues requests & chooses order to optimize throughput and QoS<br />
But<br />
■ Bursty flows can fill queues, hurting latency & BW for others<br />
■ Full queues block into network<br />
■ Frequency scales poorly with queue depth<br />
[Figure: the In-Order, O-O-O/Blocking FC, and O-O-O/Non-Blocking FC (per-thread flow control) variants, each showing the network, scheduler, DRAM controller, DRAMs, and response path. Callout on the blocking case: Head of line blocking.]
Out-of-order Protocol with Non-blocking FC<br />
■ Interface protocol provides per-thread ID’s and flow<br />
control, enabling re-ordering while preventing blocking<br />
Example: <strong>OCP</strong> Threads<br />
■ Interconnect maps initiator threads into target threads<br />
■ Scheduler queues requests & chooses order to optimize throughput and QoS on per-thread basis<br />
■ Non-blocking (per-thread) flow control minimizes inter-thread interactions<br />
■ Per-thread queues inherently ordered, implemented as compiled SRAM<br />
■ Result: lower latency, BW guarantees, higher guaranteed throughput<br />
[Figure: the In-Order, O-O-O/Blocking FC, and O-O-O/Non-Blocking FC (per-thread flow control) variants, each showing the network, scheduler, DRAM controller, DRAMs, and response path]
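The difference between shared and per-thread flow control can be seen in a toy admission model. It is a deliberately simplified sketch, not a model of AXI or OCP signalling: requests are just thread names presented in interconnect order, queues never drain during the experiment, and the function names are invented for this example.

```python
from collections import Counter


def admit_shared(requests, depth):
    """One shared queue: once it fills, the blocked request stalls everything behind it."""
    admitted = []
    for thread in requests:
        if len(admitted) == depth:
            break                       # head-of-line blocking
        admitted.append(thread)
    return admitted


def admit_per_thread(requests, depth_per_thread):
    """One queue per thread: a full queue stalls only its own thread."""
    admitted = []
    occupancy = Counter()
    for thread in requests:
        if occupancy[thread] < depth_per_thread:
            occupancy[thread] += 1
            admitted.append(thread)
        # otherwise this thread waits, but later requests from other threads may pass
    return admitted


# A bursty DMA flow arrives just ahead of a single latency-sensitive CPU request.
requests = ["dma"] * 6 + ["cpu"]

shared = admit_shared(requests, depth=4)                        # 4 entries in total
per_thread = admit_per_thread(requests, depth_per_thread=2)     # 2 threads x 2 entries

print("shared:    ", shared)
print("per-thread:", per_thread)
```

With the same total storage, the shared queue fills with DMA requests and the CPU request never gets in, while per-thread flow control admits it immediately.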
Single-port DRAM Subsystem Protocols<br />
Ordering/flow control | In-order/blocking | Out-of-order/blocking | Out-of-order/non-blocking<br />
Peak BW limited by | DRAM | DRAM | DRAM<br />
Ordering flexibility | None | High | High<br />
Queuing | None | Shared | Per-thread<br />
Compiled RAM-friendly | No | No | Yes<br />
Init. BW==DRAM BW | Yes | Yes | No<br />
DRAM efficiency | Medium | High | High<br />
Max. CPU latency | High | Medium | Low<br />
Data interleaving | None | Minor | High
Optimizing for High Utilization<br />
Goal | Approach<br />
Minimize RD/WR turnarounds | Group reads and writes<br />
Hide page misses | Bank scheduling<br />
Avoid page misses to banks with conflicting transfers in flight | Bank state tracking<br />
Maximize page/bank scheduling opportunities at minimum area | Expose intrinsic traffic concurrency to scheduler<br />
Ensure write data bursts at DDR rate | Buffer write data in DDR clock domain<br />
Ensure read data absorbed at DDR rate | Buffer read data in DDR clock domain<br />
Isolate SoC architecture from DDR clock | Asynchronous FIFO<br />
Prevent low-bandwidth initiators from stalling DDR | Decoupling FIFO<br />
Minimize CPU latency | Make CPU highest priority, interleave bursts & ensure paths cannot block<br />
Eliminate page/bank thrashing | Protect groups of DDR bursts against higher priority traffic<br />
Protect against best-effort traffic starvation | Demote QoS traffic overusing bandwidth & fair best-effort arbitration<br />
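As a rough illustration of the first goal in the table (grouping reads and writes to reduce RD/WR turnarounds), the following Python sketch counts bus direction changes before and after grouping. The `window` parameter and the request stream are hypothetical, and the sketch assumes the reordered requests have no address dependencies, which a real scheduler would have to check.

```python
def count_turnarounds(ops):
    """Count read/write direction changes between consecutive operations."""
    return sum(1 for a, b in zip(ops, ops[1:]) if a != b)

def group_reads_writes(ops, window):
    """Within each window of pending requests, issue all reads, then all writes.

    Assumption: requests in a window are independent, so reordering is safe.
    """
    grouped = []
    for start in range(0, len(ops), window):
        chunk = ops[start:start + window]
        grouped.extend(sorted(chunk))  # "R" sorts before "W"
    return grouped

# Hypothetical arrival order with reads and writes strictly alternating.
arrival = ["R", "W", "R", "W", "R", "W", "R", "W"]
grouped = group_reads_writes(arrival, window=4)
print(count_turnarounds(arrival), count_turnarounds(grouped))
```

A larger window gives the scheduler more grouping opportunities at the cost of more queue storage and potentially higher latency for the delayed requests.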
QoS-based Arbitration<br />
■ Initiator data flow threads mapped to DRAM threads by interconnect<br />
• e.g. 40 data flows sharing 8 DRAM threads in a digital video system<br />
■ Independent threads assigned to QoS level<br />
■ Non-blocking, multi-threaded fabric and DRAM interfaces allow:<br />
• Higher priority requests to interleave with & respond before others<br />
• Guaranteed BW threads to minimize buffering & receive latency<br />
guarantees<br />
• High DRAM utilization<br />
Thread QoS Level | Bandwidth Allocation? | QoS Model<br />
Priority | Yes | Low latency while within BW allocation, best-effort otherwise<br />
Bandwidth | Yes | Guaranteed BW while within BW allocation, best-effort otherwise<br />
Best-effort | No | N/A<br />
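The three QoS levels and the "best-effort otherwise" demotion rule can be sketched as a small arbiter. This is a simplified model under stated assumptions, not the product's arbitration logic: the `Thread` and `Arbiter` classes, the request-count allocation unit, and the least-recently-granted tie-break (standing in for fair best-effort arbitration) are all hypothetical.

```python
from dataclasses import dataclass

RANK = {"priority": 0, "bandwidth": 1, "best-effort": 2}

@dataclass
class Thread:
    name: str
    qos: str             # "priority", "bandwidth" or "best-effort"
    allocation: int = 0  # hypothetical unit: requests per accounting window
    pending: int = 0
    used: int = 0
    last_grant: int = 0

    def effective_level(self):
        # A thread that has used up its allocation is demoted to best-effort.
        if self.qos != "best-effort" and self.used >= self.allocation:
            return "best-effort"
        return self.qos

class Arbiter:
    def __init__(self, threads):
        self.threads = threads
        self.tick = 0

    def grant(self):
        candidates = [t for t in self.threads if t.pending > 0]
        if not candidates:
            return None
        # Best effective QoS level wins; ties go to the least recently granted.
        winner = min(candidates,
                     key=lambda t: (RANK[t.effective_level()], t.last_grant))
        self.tick += 1
        winner.last_grant = self.tick
        winner.pending -= 1
        winner.used += 1
        return winner.name

arbiter = Arbiter([
    Thread("cpu", "priority", allocation=2, pending=3),
    Thread("video", "bandwidth", allocation=2, pending=2),
    Thread("dma", "best-effort", pending=2),
])
order = [arbiter.grant() for _ in range(7)]
print(order)
```

In this run the CPU thread is served first while it stays within its allocation, then competes fairly with the DMA thread once demoted, so the best-effort thread is not starved.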
Multichannel Solves Access Granularity<br />
[Figure: From Single to Multichannel. Block diagrams of CPU, DSP and MME on SonicsMX / SonicsSX SMART Interconnects with MemMax in front of DDR3 devices: one interface labeled 64 to four devices (64Mx16x4), versus interfaces labeled 32 to two devices each (64Mx16x2).]<br />
DRAM type | DDR2 | DDR3 | DDR3<br />
Channels | 1 | 1 | 2<br />
Data Width (B) | 4 | 4 | 2<br />
Effective BW | 100% | 84% | 100%<br />
Source: Customer (HDTV) System Dataflow, Constant Frequency/Ideal Load Balancing<br />
Multichannel DRAM System Challenges<br />
[Figure: Application View versus Physical Organization. The application address space (Region 1, Hole 1, Region 2, Region 3, Hole 2) is shown for three physical organizations: 2 channels with no interleave (Ch. 1 and Ch. 2 marked against whole regions), 2 channels interleaved (regions striped 1, 2, 1, 2), and 4 channels interleaved (regions striped 1, 2, 3, 4).]<br />
Key Problems:<br />
■ Load balancing<br />
• Must balance memory traffic<br />
evenly among channels<br />
■ Maintaining throughput<br />
• Multiple channels cause<br />
throughput & ordering issues<br />
for pipelined memories<br />
Software and <strong>IP</strong> cores<br />
must manage multiple<br />
channels explicitly<br />
Interleaved Multichannel Technology (IMT*):<br />
Seamless Transition to Multichannel<br />
*patents pending<br />
■ Interleaving support requires splitting traffic for<br />
delivery to proper channel<br />
• Splitting in memory controller creates performance and routing<br />
congestion bottleneck<br />
■ Predictably high performance<br />
• Automatically spreads traffic across channels to ensure load<br />
balancing<br />
• Keeps DRAMs operating at full throughput, without costly<br />
reorder buffers<br />
■ Scalable architecture<br />
• Up to 8 interleaved channels within the same address region<br />
• Fully distributed to avoid bottlenecks & placement restrictions<br />
■ Application flexibility<br />
• Transparent to software and initiator hardware<br />
• Supports full or partial memory configurations – at run time<br />
Multichannel Interleaving in the Interconnect<br />
Higher Performance, Lower Area, More Scalable<br />
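The traffic splitting that interleaving requires can be sketched in a few lines of Python. This is a generic address-interleaving model, not the patented IMT mechanism: the interleave size of 1024 bytes, the modulo channel mapping, and the function names are assumptions, and translation to channel-local addresses is omitted.

```python
def channel_of(addr, interleave_bytes, num_channels):
    """Map a system address to a channel by striping fixed-size blocks."""
    return (addr // interleave_bytes) % num_channels

def split_burst(addr, length, interleave_bytes, num_channels):
    """Split the byte range [addr, addr + length) at interleave boundaries.

    Returns a list of (channel, start_address, byte_count) pieces.
    """
    pieces = []
    end = addr + length
    while addr < end:
        boundary = (addr // interleave_bytes + 1) * interleave_bytes
        chunk_end = min(boundary, end)
        channel = channel_of(addr, interleave_bytes, num_channels)
        pieces.append((channel, addr, chunk_end - addr))
        addr = chunk_end
    return pieces

# Hypothetical 256-byte burst that straddles a 1 KB interleave boundary.
pieces = split_burst(addr=0x0F80, length=256,
                     interleave_bytes=1024, num_channels=4)
for channel, start, count in pieces:
    print(f"channel {channel}: {count} bytes at 0x{start:04X}")
```

Because the mapping depends only on the address, the split can be performed wherever the request enters the network, which is consistent with the slide's point about distributing the function rather than centralizing it in the memory controller.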
Transparent Multichannel Interleaving<br />
with Access-Optimized Boundaries<br />
Application View<br />
Physical Organization<br />
Independent Interleaving Size Support
2D Bursts, Address Tiling & Multichannel<br />
■ Two-dimensional block bursts<br />
• 2D transaction using a single read/write command<br />
• Popular for HD video and graphics<br />
■ Address tiling<br />
• Rearrange DRAM<br />
address organization<br />
to exploit 2D locality<br />
• Avoids page misses<br />
• More than one tiling scheme<br />
can be active at once<br />
■ Channels divide<br />
buffer into columns<br />
• Network splits 2D<br />
bursts that cross<br />
channel edges<br />
[Figure: Video Frame Buffer with a 2-D data object in memory. Frame geometry: 2048 Bytes/row, 1080 rows/frame. Tiling: 256B/page, 16 rows/page. Tiles map to DRAM pages (Page 0, Page 1, ... Page N, Page N+1) spread across DRAM banks 0 to 3. The 2-D object straddles a tile edge: its rows run from 0x08F0 0x08F8 | 0x0900 0x0908 to 0x28F0 0x28F8 | 0x2900 0x2908, stepping by 0x800 per row.]
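A minimal Python sketch of address tiling, using the geometry shown in the frame buffer figure (2048 bytes per row, 256 bytes of each row per page, 16 rows per page). The interpretation that a page holds a 256-byte by 16-row tile (4096 bytes) is an assumption drawn from the figure labels, and the function names are hypothetical; bank assignment is left out.

```python
ROW_BYTES = 2048    # bytes per frame-buffer row (from the figure)
TILE_WIDTH = 256    # bytes of one row held in a page (from the figure)
TILE_ROWS = 16      # rows per page (from the figure)
PAGE_BYTES = TILE_WIDTH * TILE_ROWS      # assumed page size: 4096 bytes
TILES_PER_ROW = ROW_BYTES // TILE_WIDTH  # 8 tiles across the frame

def linear_address(x, y):
    """Conventional raster layout: rows follow each other in memory."""
    return y * ROW_BYTES + x

def tiled_address(x, y):
    """Tiled layout: each page holds a TILE_WIDTH x TILE_ROWS rectangle."""
    tile = (y // TILE_ROWS) * TILES_PER_ROW + (x // TILE_WIDTH)
    offset = (y % TILE_ROWS) * TILE_WIDTH + (x % TILE_WIDTH)
    return tile * PAGE_BYTES + offset

def pages_touched(address_fn, x0, y0, width, height):
    """Number of distinct pages a width x height byte block touches."""
    return len({address_fn(x, y) // PAGE_BYTES
                for y in range(y0, y0 + height)
                for x in range(x0, x0 + width)})

# A 16-byte by 16-row block at the frame origin.
linear_pages = pages_touched(linear_address, 0, 0, 16, 16)
tiled_pages = pages_touched(tiled_address, 0, 0, 16, 16)
print(linear_pages, tiled_pages)
```

Under these assumptions the same 2-D block touches eight pages in the raster layout but only one in the tiled layout, which is the page-miss saving that tiling targets.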
Outline<br />
■ Multicore consumer SoC background<br />
■ DRAM subsystem challenges<br />
■ Solution aspects<br />
■ Putting it all together<br />
DRAM-limited Consumer SoCs: Solution<br />
Requirements<br />
■ DRAM subsystems optimized for high<br />
utilization and good Quality of Service (QoS)<br />
characteristics<br />
■ On-chip interconnection networks that<br />
manage the large and increasing numbers of<br />
DRAM consumers<br />
• And protect the <strong>IP</strong> cores from DRAM evolution<br />
■ Solutions to inefficiencies due to different<br />
access patterns and access granularities<br />
■ Analysis tooling to enable SoC architecture<br />
exploration and performance validation<br />
MemMax Memory Scheduler<br />
Multi-threaded &<br />
multi-tagged with<br />
non-blocking flow control<br />
In-order with early<br />
pre-charge/activate<br />
Compiled (SRAM) data<br />
buffers decouple rates<br />
& cross clock domain<br />
■ Address tiling<br />
■ Request grouping<br />
■ 2D burst support<br />
■ Guaranteed<br />
bandwidth QoS<br />
with demotion<br />
■ Per-thread buffer<br />
sizing<br />
■ Run time<br />
re-programmable<br />
Futures
Futures<br />
■ 3D integration<br />
■ Power management<br />
■ Heterogeneous cache coherence<br />
Wide I/O: TSV-enabled Mobile DRAM<br />
■ High bandwidth, even at low capacity<br />
• Like 4 (x32) channels of LPDDR2-800<br />
• 12.8 GBytes/sec peak bandwidth<br />
■ Lowest power<br />
• No PHY (simple drivers/receivers, no PLL/DLL)<br />
• Low loading (capacitance and inductance)<br />
• Modest frequency (200 MHz)<br />
■ Smallest form factor<br />
• 3D stacked based on thin (50µm) TSV-based die<br />
■ Minimal change to DRAM design (ex-TSV)<br />
■ Risks are all TSV-related<br />
■ Smartphone market will drive volumes<br />
■ SoC problem: how to spread traffic across channels?<br />
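The peak bandwidth figure above can be checked with simple arithmetic. The sketch below reproduces the slide's comparison (4 channels, x32, 800 MT/s); the second calculation assumes a Wide I/O organization of 4 channels of 128 bits at single data rate, which is not stated on the slide and is labeled as an assumption in the code.

```python
def peak_bandwidth_gbytes(channels, width_bits, mtransfers_per_sec):
    """Peak bandwidth in GBytes/sec for a set of identical channels."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * mtransfers_per_sec / 1000

# From the slide: like 4 (x32) channels of LPDDR2-800.
lpddr2_equivalent = peak_bandwidth_gbytes(4, 32, 800)

# Assumption (not on the slide): 4 channels x 128 bits, one transfer per
# clock at the 200 MHz frequency the slide mentions.
wide_io = peak_bandwidth_gbytes(4, 128, 200)

print(lpddr2_equivalent, wide_io)
```

Both calculations give the same 12.8 GBytes/sec: the wide interface reaches it with a much lower clock rate, which is where the power advantage listed above comes from.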
Example Wide I/O System Solution<br />
[Figure: SoC block diagram. Transport Engine, Host CPU, Video Decoder, Video Post Processor, Video Pre Processor, Graphics Engine, Audio Processor, Streaming Processor and Debug Interface attach through agents labeled IA to the Sonics Interconnect with IMT. Agents labeled TA connect SRAM, Storage, GPIO, USB, FLASH, and MemMax 0 through MemMax 3, each MemMax feeding a Wide I/O Controller (0 through 3) into the Wide I/O DRAM.]<br />
■ Concurrency mgmt., channel splitting, load balancing<br />
■ Scheduling for high efficiency & QoS<br />
■ Wide I/O interface
THANK YOU!