Intel Ivy Bridge Architecture

Matt Kelly & Kevin Hartman

Outline
● Introduction
● Implementation Technology
● Overall Architecture
● Core Architecture
● Memory
● On-Die Graphics
● Benchmarks
● Security

What is Ivy Bridge?
● Successor to Sandy Bridge
  ○ Most features carried over
  ○ Tick-Tock Model
    ■ 32 nm -> 22 nm
  ○ But, more than just a die shrink
    ■ Tick+
● First microarchitecture built with Tri-Gate transistors

Main Goals
● Low power computing advancements
● Increased graphics performance
● Enhanced security capability

Die Comparison
[Figure: Sandy Bridge die vs. Ivy Bridge die]

Previous Transistor Design
● "Flat" source / drain
● Less control over current

Tri-Gate Transistor
● Source / drain fin extends up into the gate, which wraps it on three sides
● Better control over current

Performance Comparison

Ring Interconnect
● Connects all key components:
  ○ System Agent (SA), LLCs, Cores, GPU
● Actually 4 rings:
  ○ 32-byte data ring
    ■ Two packets for e.g. a 64-byte cache line
  ○ Request ring
  ○ Acknowledge ring
  ○ Snoop ring

Ring Benefits
● Bandwidth: 96 GB/s @ 3 GHz (per core)
  ○ Aggregated bandwidth = 96 * 4 cores = 384 GB/s
● Distributed arbitration
  ○ Communicate via packets
  ○ "Train" stops -> see there's no cargo -> hop on!
    ■ Decision to get on is entirely local
● Extremely scalable, modular, maintainable
  ○ Microarchitecture doesn't care if components are added or removed
  ○ Expected to be around for a long time
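The bandwidth figures follow from the ring width and clock. A minimal sketch of the arithmetic, assuming one 32-byte data-ring transfer per clock at each ring stop and a 3 GHz ring clock (the constant and function names are illustrative only):

```python
# Back-of-the-envelope check of the ring bandwidth figures.
# Assumption: each ring stop moves one 32-byte packet per clock.

DATA_RING_BYTES = 32   # width of the data ring
CLOCK_HZ = 3e9         # assumed 3 GHz ring clock
NUM_CORES = 4          # ring stops counted in the aggregate figure
CACHE_LINE_BYTES = 64  # x86 cache line size


def per_stop_bandwidth_gb_s(width_bytes: int, clock_hz: float) -> float:
    """Peak bytes per second at one ring stop, in GB/s (10^9 bytes)."""
    return width_bytes * clock_hz / 1e9


per_core = per_stop_bandwidth_gb_s(DATA_RING_BYTES, CLOCK_HZ)
aggregate = per_core * NUM_CORES
packets_per_line = CACHE_LINE_BYTES // DATA_RING_BYTES

print(f"per core:  {per_core:.0f} GB/s")
print(f"aggregate: {aggregate:.0f} GB/s")
print(f"packets per cache line: {packets_per_line}")
```

These are peak numbers; sustained bandwidth depends on traffic patterns that this sketch does not model.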

Core Architecture
● Intel "Dynamic Execution" for Core series:
  ○ In-order issue
  ○ Speculative, super-scalar, out-of-order execution
  ○ 14-stage pipeline
1. Front-End Pipeline (FEP)
  ■ Fetch and decode, buffer micro-ops
2. Execution Engine (EE)
  ■ Dynamically schedule micro-ops when operands are ready
3. Retirement Unit (RU)
  ■ Physical Register File introduced in Sandy Bridge

Vector Capabilities - SSE
● Streaming SIMD Extensions (SSE) introduced in 1999 with Pentium III
  ○ Maximum vector size = 128 bits
  ○ Maximum number of register operands per instruction = 2 (e.g. A = A + B)
  ○ Support for only one data type: four packed 32-bit FP numbers
● SSE2 - SSE4
  ○ More data types and instructions
  ○ Still only 128-bit, two register
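The two-register form means the destination is also a source, so one input is overwritten. A small software model of that behavior, not real SIMD code; the register names A and B and the helper `sse_style_add` are hypothetical:

```python
# Toy model of a two-operand (destructive) 128-bit vector add.
# Assumption: registers are modeled as Python lists of 32-bit-style floats.

SSE_VECTOR_BITS = 128
FLOAT_BITS = 32
LANES = SSE_VECTOR_BITS // FLOAT_BITS  # number of FP values per register


def sse_style_add(regs: dict, dst: str, src: str) -> None:
    """Compute regs[dst] = regs[dst] + regs[src]; the old dst value is lost."""
    regs[dst] = [a + b for a, b in zip(regs[dst], regs[src])]


regs = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
}
sse_style_add(regs, "A", "B")  # A = A + B
print(LANES, regs["A"], regs["B"])
```

If the original contents of A are still needed, they must be copied to another register first, which costs an extra instruction.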

Advanced Vector Extensions (AVX)
● Introduced with Sandy Bridge
  ○ Double size to 256-bit vectors
  ○ Support for 3 or 4 registers per instruction
    ■ e.g. C = A + B
    ■ Allows non-destructive operations
● Improved in Ivy Bridge
  ○ Conversion between compressed 16-bit FP format and 32-bit single precision format
    ■ Latter is used for AVX, SSE
  ○ Enables higher precision calculations
  ○ Future support for FP vectors up to 1024 bits
  ○ More operations per instruction -> less power usage
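Both AVX points can be imitated in software. This sketch assumes the compressed 16-bit format is IEEE 754 half precision and uses Python's `struct` format code `e` as a stand-in for the hardware conversion; the helper names are hypothetical:

```python
import struct

AVX_VECTOR_BITS = 256
SINGLE_BITS = 32
LANES_SINGLE = AVX_VECTOR_BITS // SINGLE_BITS  # FP32 values per 256-bit vector


def float_to_half_bits(x: float) -> int:
    """Round x to half precision and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits: int) -> float:
    """Widen a raw 16-bit half-precision pattern back to a float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def avx_style_add(a: list, b: list) -> list:
    """Three-operand form: returns a new vector C = A + B; A and B are untouched."""
    return [x + y for x, y in zip(a, b)]


one_bits = float_to_half_bits(1.0)
tenth_roundtrip = half_bits_to_float(float_to_half_bits(0.1))

a = [1.0] * LANES_SINGLE
b = [2.0] * LANES_SINGLE
c = avx_style_add(a, b)

print(hex(one_bits), tenth_roundtrip, c)
```

Values such as 1.0 survive the round trip exactly, while 0.1 comes back slightly different, which is why the 16-bit form suits storage and the arithmetic is done in 32-bit.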

Physical Register File
● Micro-ops carry no data, just pointers
  ○ Greatly reduce out-of-order execution overhead
● AVX makes PRF a necessity
  ○ Don't want to carry 256-bit operands
  ○ Save die area
    ■ Can add more buffers:
                          Nehalem   Ivy Bridge
    Load Buffers          48        64
    Store Buffers         32        36
    Scheduler Entries     36        54
    PRF Integer           N/A       160
    PRF Floating Point    N/A       144
    ROB Entries           128       168
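A rough illustration of why pointers are cheaper than data: the sketch below compares the size of an index into the PRF with a full 256-bit AVX operand, using the 160-entry integer and 144-entry floating-point file sizes listed above. It assumes a plain binary index; the real encoding is not described in the slides.

```python
# Size of a PRF pointer versus a full AVX operand.
# Assumption: a pointer is a plain binary index into the register file.

AVX_OPERAND_BITS = 256
PRF_INTEGER_ENTRIES = 160
PRF_FP_ENTRIES = 144


def pointer_bits(entries: int) -> int:
    """Bits needed to address any one of `entries` registers."""
    return (entries - 1).bit_length()


int_ptr = pointer_bits(PRF_INTEGER_ENTRIES)
fp_ptr = pointer_bits(PRF_FP_ENTRIES)
savings_factor = AVX_OPERAND_BITS // fp_ptr

print(int_ptr, fp_ptr, savings_factor)
```

Under this assumption each micro-op carries a few bits per operand instead of the operand itself.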

Branching
● Previously two bits
  ○ Whether it was taken or not
  ○ Strong or weak confidence whether it will be taken again
● Intel noticed almost all are strong confidence
● Sandy Bridge -> one confidence bit for multiple branches
● Ivy Bridge also introduced variable-size branch target structures
  ○ Most branches are close -> save room
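The two-bit scheme can be illustrated with the classic textbook saturating counter. This is a generic sketch, not Intel's actual predictor; the class name and state encoding are assumptions:

```python
# Textbook two-bit saturating counter.
# States: 0 = strong not-taken, 1 = weak not-taken,
#         2 = weak taken,       3 = strong taken.


class TwoBitCounter:
    def __init__(self, state: int = 1) -> None:
        self.state = state

    def predict(self) -> bool:
        """Predict taken when the counter is in one of the two taken states."""
        return self.state >= 2

    def confidence(self) -> str:
        return "strong" if self.state in (0, 3) else "weak"

    def update(self, taken: bool) -> None:
        """Move one step toward the observed outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes: list) -> tuple:
    """Run the predictor over a branch history; return (hits, final counter)."""
    counter = TwoBitCounter()
    hits = 0
    for taken in outcomes:
        if counter.predict() == taken:
            hits += 1
        counter.update(taken)
    return hits, counter


# A loop branch: taken nine times, then falls through once.
hits, final = count_correct([True] * 9 + [False])
print(hits, final.state, final.confidence())
```

The counter spends most of the loop in a strong state, which matches the observation that nearly all branches are strong confidence.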

Memory
● Shared L3 cache (LLC)
  ○ 4 slices - one per core
  ○ Any core can access any slice
  ○ GPU also has equal access
  ○ Shared via ring
    ■ Ring runs directly over LLC -> no area impact

Graphics
● Ivy Bridge brings:
  ○ Improved programmability
    ■ Enhancements made to parallelize execution of instructions.
  ○ Increased performance
    ■ Architectural changes to cache and shader cores

GPU General Architecture
● GPU now split into 5 partitions
  ○ Global, slice common, slice, media, display
● Enhanced scalability
  ○ Sets (eight in a set) of Shader cores can be sliced off during hardware implementation.
    ■ Allows Intel to produce different budget integrated graphics
    ■ They can reduce hardware costs and decrease the size of the chip
    ■ Previously, low budget versions were implemented by scaling back operating frequency

Shader Core Front End
● Previously
  ○ Each core had its own 4KB cache.
  ○ Instructions would be cached redundantly
    ■ Shaders often run on multiple cores simultaneously
● Now
  ○ Shader cores within the same slice (of 8) share a single 32KB cache
  ○ No duplicates, so more instructions can be cached
  ○ Bonus
    ■ Threads/core bumped up from 5 to 8
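The benefit of the shared cache is easiest to see with numbers. A minimal sketch, assuming the worst case for private caches where every core in a slice runs the same shader (the function name is hypothetical):

```python
# Unique instruction capacity per slice: private 4KB caches vs one shared 32KB cache.
# Assumption: cores running the same shader hold identical copies in private caches.

CORES_PER_SLICE = 8
PRIVATE_CACHE_KB = 4
SHARED_CACHE_KB = 32


def unique_kb_private(distinct_shaders: int) -> int:
    """Unique instruction data the private caches can hold across the slice."""
    return min(distinct_shaders, CORES_PER_SLICE) * PRIVATE_CACHE_KB


same_shader_everywhere = unique_kb_private(1)
gain = SHARED_CACHE_KB // same_shader_everywhere

threads_before = CORES_PER_SLICE * 5
threads_after = CORES_PER_SLICE * 8

print(same_shader_everywhere, gain, threads_before, threads_after)
```

Total storage is unchanged (8 x 4KB = 32KB), so the gain comes entirely from removing duplicates.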

Front End Comparison

Shader Core Back End
● Increased parallelization of instruction execution
● Previously
  ○ Possible 2 instructions issued/cycle
  ○ Second pipeline was advanced-math specific
    ■ Small chance of co-issue: 10%
● Now
  ○ Still 2 instructions issued/cycle
  ○ Second pipeline expanded to support more common instructions
    ■ Floating point multiplication
    ■ Register moves
    ■ Higher chance of co-issue: 60-70%
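A very simple model gives a feel for what the co-issue rates mean. It assumes one instruction always issues per cycle and a second issues with the stated probability; stalls, dependencies, and memory effects are ignored, so treat the result as an upper-bound estimate:

```python
# Expected instructions per cycle under a simple co-issue model.
# Assumption: first pipeline always issues; second issues with probability p.


def expected_issue_rate(co_issue_probability: float) -> float:
    """Average instructions issued per cycle: 1 + p."""
    return 1.0 + co_issue_probability


before = expected_issue_rate(0.10)
after_low = expected_issue_rate(0.60)
after_high = expected_issue_rate(0.70)

speedup_low = after_low / before
speedup_high = after_high / before

print(round(before, 2), round(after_low, 2), round(after_high, 2))
print(round(speedup_low, 2), round(speedup_high, 2))
```

Under these assumptions the wider second pipeline is worth roughly 1.45x to 1.55x in issue throughput.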

Back End Comparison

Sampling Pipeline - Textures
● Each slice now has its own sampling pipeline
● 2 slices on Ivy Bridge
  ○ 2 sampling pipelines
  ○ ~double the bandwidth requirements
● Integrated GPU = shared memory w/ CPU
  ○ not a ton of liberty with bandwidth
● L3 graphics cache added to curb this
  ○ GPU also shares LLC with CPU, so in total it has 4 levels of cache!

Graphics - Programmability
● Added double precision floating point units
  ○ Compatibility
  ○ General purpose code
● Tessellation support
  ○ Hardware supports tessellation for DirectX 11.0 and OpenCL 1.1

Benchmarks - Average Power

Benchmarks - Energy Used

Benchmarks - Game Performance

Random Number Generator
● Codename "Bull Mountain"
● Completely digital
  ○ Researched / under development for over a decade
● Refines random bits into "adequately random" secure 256-bit sequences

Random Bit Generator
● To get a random bit:
  ○ Force both input and output of inverters to logic '1'
  ○ Creates instability
  ○ Random due to thermal noise
● Yields one random bit per cycle
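The flow from raw bits to a 256-bit sequence can be sketched in software. Both stand-ins here are assumptions: a seeded `random.Random` replaces the thermal-noise bit source (so the output is reproducible and not secure), and SHA-256 replaces the hardware conditioning step, which the slides do not name.

```python
import hashlib
import random

# Software imitation of "one raw bit per cycle, refined into 256-bit sequences".
# Assumptions: random.Random stands in for thermal noise; SHA-256 stands in
# for the hardware conditioner. Illustration only, not for real key material.


def raw_bits(count: int, noise: random.Random) -> list:
    """Collect `count` raw bits, one per simulated cycle."""
    return [noise.getrandbits(1) for _ in range(count)]


def pack_bits(bits: list) -> bytes:
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value.to_bytes((len(bits) + 7) // 8, "big")


def conditioned_256(noise: random.Random, raw_count: int = 512) -> bytes:
    """Compress `raw_count` raw bits into one 256-bit (32-byte) output."""
    return hashlib.sha256(pack_bits(raw_bits(raw_count, noise))).digest()


sample = conditioned_256(random.Random(2012))
print(len(sample) * 8, sample.hex())
```

Feeding the conditioner more raw bits than it emits is what lets slightly biased input still yield a usable output.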

Questions?
