12.07.2015 Views

Intel Ivy Bridge Architecture


Intel Ivy Bridge Architecture
Matt Kelly & Kevin Hartman


Outline
● Introduction
● Implementation Technology
● Overall Architecture
● Core Architecture
● Memory
● On-Die Graphics
● Benchmarks
● Security


What is Ivy Bridge?
● Successor to Sandy Bridge
  ○ Most features carried over
  ○ Tick-Tock Model
    ■ 32 nm -> 22 nm
  ○ But, more than just a die shrink
    ■ Tick+
● First microarchitecture built with Tri-Gate transistors


Main Goals
● Low power computing advancements
● Increased graphics performance
● Enhanced security capability


Die Comparison: Sandy Bridge vs. Ivy Bridge


Previous Transistor Design
● "Flat" source / drain
● Less control over current


Tri-Gate Transistor
● Source / drain channel rises as a fin into the gate, which wraps it on three sides
● Better control over current


Performance Comparison


Ring Interconnect
● Connects all key components:
  ○ SA, LLCs, Cores, GPU
● Actually 4 rings:
  ○ 32-byte data ring
    ■ A 64-byte cache line therefore takes two packets
  ○ Request ring
  ○ Acknowledge ring
  ○ Snoop ring


Ring Benefits
● Bandwidth: 96 GB/s @ 3 GHz (per core)
  ○ 32 bytes per cycle * 3 GHz = 96 GB/s
  ○ Aggregated bandwidth = 96 * 4 cores = 384 GB/s
● Distributed arbitration
  ○ Communicate via packets
  ○ "Train" stops -> see there's no cargo -> hop on!
    ■ Decision to get on is entirely local
● Extremely scalable, modular, maintainable
  ○ Microarchitecture doesn't care if components are added or removed
  ○ Expected to be around for a long time
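The bandwidth figures on this slide follow from the ring width and the clock. A minimal sketch of the arithmetic, assuming a 3 GHz ring clock (the 96 GB/s figure only works out at gigahertz) and the 32-byte data ring and four cores named in the slides; the function and constant names are illustrative:

```python
# Slide figures: 32-byte data ring, four cores.
# Assumption: the ring is clocked at 3 GHz.
RING_WIDTH_BYTES = 32
CLOCK_HZ = 3_000_000_000
CORES = 4


def ring_bandwidth_gb_per_s(width_bytes: int, clock_hz: int) -> float:
    """Peak bandwidth of one ring stop in GB/s (1 GB = 10**9 bytes)."""
    return width_bytes * clock_hz / 1e9


per_core = ring_bandwidth_gb_per_s(RING_WIDTH_BYTES, CLOCK_HZ)
aggregate = per_core * CORES  # every core has its own ring stop

print(f"per core:  {per_core:.0f} GB/s")
print(f"aggregate: {aggregate:.0f} GB/s")
```

These are peak numbers: they assume every stop moves a full 32 bytes on every cycle.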


Core Architecture
● Intel "Dynamic Execution" for Core series:
  ○ In-order issue
  ○ Speculative, super-scalar, out-of-order execution
  ○ 14 stage pipeline
1. Front-End Pipeline (FEP)
    ■ Fetch and decode, buffer micro-ops
2. Execution Engine (EE)
    ■ Dynamically schedule micro-ops when operands are ready
3. Retirement Unit (RU)
    ■ Physical Register File introduced in Sandy Bridge


Vector Capabilities - SSE
● Streaming SIMD Extensions (SSE) introduced in 1999 with Pentium III
  ○ Maximum vector size = 128 bits
  ○ Maximum number of register operands per instruction = 2 (e.g. A = A + B)
  ○ Support for only one data type: 4 32-bit FP numbers
● SSE2 - SSE4
  ○ More data types and instructions
  ○ Still only 128-bit, two register operands
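The two-operand form (A = A + B) can be sketched in plain Python. This is only a model of the lane-wise behaviour, assuming four 32-bit lanes per 128-bit register as the slide states; the function name borrows the SSE mnemonic ADDPS, and Python lists stand in for registers:

```python
LANES = 128 // 32  # four single-precision lanes in one 128-bit register


def addps(dst: list, src: list) -> None:
    """Two-operand, destructive add: dst = dst + src, lane by lane."""
    if len(dst) != LANES or len(src) != LANES:
        raise ValueError("each register model must hold exactly 4 lanes")
    for i in range(LANES):
        dst[i] = dst[i] + src[i]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
addps(a, b)  # a is overwritten; its previous contents are lost
print(a, b)
```

Because the destination doubles as a source, code that still needs the old value of `a` must copy it first, which costs an extra register move.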


Advanced Vector Extensions (AVX)
● Sandy Bridge introduction
  ○ Double size to 256-bit vectors
  ○ Support for 3 or 4 registers per instruction
    ■ e.g. C = A + B
    ■ Allows non-destructive operations
● Improved in Ivy Bridge
  ○ Conversion between compressed 16-bit FP format and 32-bit single precision format
    ■ Latter is used for AVX, SSE
  ○ Enables higher precision calculations
  ○ Future support for FP vectors of up to 1024 bits
  ○ More operations per instruction -> less power usage
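The 16-bit to 32-bit conversion can be tried out in software. A minimal sketch using Python's `struct` module, whose `e` format code is IEEE 754 half precision; it illustrates the number format only, not the hardware instructions, and it starts from Python's double-precision floats rather than single precision. The helper names are illustrative:

```python
import struct


def float_to_half_bits(x: float) -> int:
    """Round x to half precision and return the raw 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_bits_to_float(bits: int) -> float:
    """Expand a raw 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for value in (1.0, 1.5, 0.1):
    bits = float_to_half_bits(value)
    restored = half_bits_to_float(bits)
    print(f"{value} -> 0x{bits:04X} -> {restored}")
```

Values such as 1.5 survive the round trip exactly, while 0.1 comes back slightly changed. That loss is the trade-off of the compressed format: half the storage and memory traffic, with arithmetic still done at 32-bit precision after expansion.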


Physical Register File
● Micro-ops carry no data, just pointers
  ○ Greatly reduce out-of-order execution overhead
● AVX makes PRF a necessity
  ○ Don't want to carry 256-bit operands
  ○ Save die area
    ■ Can add more buffers:

                      Nehalem   Ivy Bridge
Load Buffers          48        64
Store Buffers         32        36
Scheduler Entries     36        54
PRF Integer           N/A       160
PRF Floating Point    N/A       144
ROB Entries           128       168
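The idea that micro-ops carry pointers rather than data can be sketched as follows. This is a toy model, not Intel's design: the class names, the single `add` opcode, and the free-list allocation order are all assumptions made for illustration. Only the 160-entry size comes from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """A micro-op that holds only register indices, never operand values."""
    opcode: str
    dst: int   # index of the physical register that receives the result
    src1: int  # index of the first source register
    src2: int  # index of the second source register


class PhysicalRegisterFile:
    """Toy register file: values live here, micro-ops only point into it."""

    def __init__(self, size: int):
        self.values = [0] * size
        self.free = list(range(size))  # indices not yet handed out

    def allocate(self, value=0) -> int:
        idx = self.free.pop(0)
        self.values[idx] = value
        return idx


def execute(uop: MicroOp, prf: PhysicalRegisterFile) -> None:
    """Read operands through the pointers, then write the result back."""
    if uop.opcode != "add":
        raise ValueError(f"unsupported opcode: {uop.opcode}")
    prf.values[uop.dst] = prf.values[uop.src1] + prf.values[uop.src2]


prf = PhysicalRegisterFile(160)  # integer PRF size from the table
a = prf.allocate(40)
b = prf.allocate(2)
c = prf.allocate()
execute(MicroOp("add", dst=c, src1=a, src2=b), prf)
print(prf.values[c])
```

The `MicroOp` stays the same small size whether the registers it points to hold 64-bit integers or 256-bit vectors, which is the saving the slide refers to.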


Branching
● Previously two bits
  ○ Whether it was taken or not
  ○ Strong or weak confidence whether it will be taken again
● Intel noticed almost all are strong confidence
● Sandy Bridge -> one confidence bit for multiple branches
● Ivy Bridge also introduced variable-size branch target structures
  ○ Most branches are close -> save room
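The "previously two bits" scheme is the textbook two-bit saturating counter. A minimal sketch of that classic predictor, not of Intel's actual implementation; the state encoding (0 to 3), the initial state, and the sample branch pattern are assumptions:

```python
class TwoBitPredictor:
    """States: 0 strong not-taken, 1 weak not-taken, 2 weak taken, 3 strong taken."""

    def __init__(self):
        self.state = 2  # arbitrary starting point: weakly taken

    def predict(self) -> bool:
        return self.state >= 2  # upper bit gives the predicted direction

    def is_strong(self) -> bool:
        return self.state in (0, 3)  # saturated states mean strong confidence

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes) -> float:
    """Fraction of branches predicted correctly over a sequence of outcomes."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch))
```

On a pattern like this the counter sits in a strong state almost all the time, which matches the observation on the slide and suggests why a confidence bit can be shared between branches at little cost.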


Memory
● Shared L3 cache (LLC)
  ○ 4 slices - one per core
  ○ Any core can access any slice
  ○ GPU also has equal access
  ○ Shared via ring
    ■ Ring runs directly over LLC -> no area impact
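One way to picture "any core can access any slice" is that the slice is chosen by the address, not by the requesting core. The mapping below is purely hypothetical: the slides do not describe how addresses are assigned to slices, so a simple modulo over 64-byte cache lines is used as a stand-in:

```python
LINE_BYTES = 64   # cache line size
NUM_SLICES = 4    # one LLC slice per core, per the slide


def slice_for_address(addr: int) -> int:
    """Hypothetical mapping: consecutive cache lines rotate across slices."""
    return (addr // LINE_BYTES) % NUM_SLICES


# Whichever core issues the request, the same address lands in the same slice.
for addr in (0x0000, 0x0040, 0x0080, 0x00C0, 0x0100):
    print(f"address 0x{addr:04X} -> slice {slice_for_address(addr)}")
```

Spreading lines across slices this way means a single core streaming through memory loads all four ring stops rather than just its own.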


Graphics
● Ivy Bridge brings:
  ○ Improved programmability
    ■ Enhancements made to parallelize execution of instructions
  ○ Increased performance
    ■ Architectural changes to cache and shader cores


GPU General Architecture
● GPU now split into 5 partitions
  ○ Global, slice common, slice, media, display
● Enhanced scalability
  ○ Sets (eight in a set) of shader cores can be sliced off during hardware implementation
    ■ Allows Intel to produce different budget integrated graphics
    ■ They can reduce hardware costs and decrease the size of the chip
    ■ Previously, low budget versions were implemented by scaling back operating frequency


Shader Core Front End
● Previously
  ○ Each core had its own 4KB cache
  ○ Instructions would be cached redundantly
    ■ Shaders often run on multiple cores simultaneously
● Now
  ○ Shader cores within the same slice (of 8) share a single 32KB cache
  ○ No duplicates, so more instructions can be cached
  ○ Bonus
    ■ Threads/core bumped up from 5 to 8
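The benefit of removing duplicates can be put in numbers. A rough sketch, assuming the 4KB private and 32KB shared sizes from the slide and the simplification that each distinct shader fills a whole private cache; the function names are illustrative:

```python
CORES_PER_SLICE = 8
PRIVATE_KB = 4   # old design: one cache per core
SHARED_KB = 32   # new design: one cache per slice of 8 cores


def unique_kb_private(distinct_shaders: int) -> int:
    """Unique instruction storage when cores running the same shader
    each hold an identical copy in their own private cache."""
    return PRIVATE_KB * min(distinct_shaders, CORES_PER_SLICE)


def unique_kb_shared() -> int:
    """With one shared cache nothing is duplicated."""
    return SHARED_KB


for shaders in (1, 2, 8):
    print(f"{shaders} distinct shader(s): "
          f"{unique_kb_private(shaders)} KB private vs {unique_kb_shared()} KB shared")
```

Total capacity is the same in both designs (8 * 4KB = 32KB); the gain appears only when several cores run the same shader, which the slide notes is the common case.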


Front End Comparison


Shader Core Back End
● Increased parallelization of instruction execution
● Previously
  ○ Possible 2 instructions issued/cycle
  ○ Second pipeline was advanced-math specific
    ■ Small chance of co-issue: 10%
● Now
  ○ Still 2 instructions issued/cycle
  ○ Second pipeline expanded to support more common instructions
    ■ Floating point multiplication
    ■ Register moves
    ■ Higher chance of co-issue: 60-70%
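The co-issue percentages translate into an average issue rate under a simple model. This sketch assumes the first pipeline issues one instruction every cycle and that the quoted "chance of co-issue" is the per-cycle probability that the second pipeline also issues; both are interpretations, not statements from the slides:

```python
def expected_issue_rate(co_issue_probability: float) -> float:
    """Average instructions issued per cycle: one from the first pipeline,
    plus one more from the second pipeline with the given probability."""
    if not 0.0 <= co_issue_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1.0 + co_issue_probability


before = expected_issue_rate(0.10)
after_low = expected_issue_rate(0.60)
after_high = expected_issue_rate(0.70)

print(f"previously: {before:.2f} instructions/cycle")
print(f"now:        {after_low:.2f} to {after_high:.2f} instructions/cycle")
print(f"gain:       {after_low / before:.2f}x to {after_high / before:.2f}x")
```

The peak of two instructions per cycle is unchanged; what improves is how often the second slot is actually filled.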


Back End Comparison


Sampling Pipeline - Textures
● Each slice now has its own sampling pipeline
● 2 slices on Ivy Bridge
  ○ 2 sampling pipelines
  ○ ~double the bandwidth requirements
● Integrated GPU = shared memory w/ CPU
  ○ Not a ton of liberty with bandwidth
● L3 graphics cache added to curb this
  ○ GPU also shares LLC with CPU, so in total it has 4 levels of cache!


Graphics - Programmability
● Added double precision floating point units
  ○ Compatibility
  ○ General purpose code
● Tessellation support
  ○ Hardware supports tessellation for DirectX 11.0 and OpenCL 1.1


Benchmarks - Average Power


Benchmarks - Energy Used


Benchmarks - Game Performance


Random Number Generator
● Codename "Bull Mountain"
● Completely digital
  ○ Researched / under development for over a decade
● Refines random bits into "adequately random" secure 256-bit sequences


Random Bit Generator
● To get a random bit:
  ○ Force both input and output of inverters to logic '1'
  ○ Creates instability
  ○ Random due to thermal noise
● Yields one random bit per cycle
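The two stages described on these slides, a raw bit source followed by refinement into 256-bit values, can be sketched in software. Everything here is a stand-in: `os.urandom` replaces the thermal-noise circuit, and SHA-256 is used as the conditioning step purely for convenience, not because the hardware uses it. The function names and the 512-bit input size are assumptions:

```python
import hashlib
import os
from typing import List


def raw_entropy_bits(n_bits: int) -> List[int]:
    """Stand-in for the hardware bit source (os.urandom, not thermal noise)."""
    raw = os.urandom((n_bits + 7) // 8)
    bits = [(byte >> i) & 1 for byte in raw for i in range(8)]
    return bits[:n_bits]


def condition(bits: List[int]) -> bytes:
    """Compress many raw bits into one 256-bit value.
    SHA-256 is an assumed stand-in for the hardware conditioner."""
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return hashlib.sha256(bytes(packed)).digest()


sample = condition(raw_entropy_bits(512))
print(sample.hex())
```

Feeding in more raw bits than come out is the point of the refinement step: individual raw bits may be slightly biased, and compressing them spreads that bias thin.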


Questions?
