13.07.2015 Views

Cell Processor Architecture


Ramazan Cav


Cell Processor
• 3.2 GHz PowerPC core, the managing processor
• 512 KB of L2 cache
• 8 SPEs, each with 256 KB of SRAM; each one is a 128-bit SIMD vector processor
• Optimized for single-precision floating-point computations
Source: http://www.blachford.info/computer/Cell/Cell1_v2.html


Cell Processor Architecture
Source: http://www.blachford.info/computer/Cell/Cell1_v2.html


Synergistic Processor Elements (SPE)
• RISC processor with a 128-bit SIMD organization for single- and double-precision instructions
• Self-contained vector processor that acts as an independent processor
• Performs multiple operations simultaneously with one instruction
• 256 KB of embedded SRAM for instructions and data: the Local Store
• 128 registers, each 128 bits wide
• At 3.2 GHz, each SPE is capable of 25.6 GFLOPS of single-precision performance
• Registers can hold scalar data types ranging from 8 to 128 bits in size
• No cache; each SPE uses a Local Store instead, which is harder to program but reduces hardware complexity and increases performance
• The Local Store acts like a second-level register file
• SPEs operate only on registers, which are read from or written to the Local Store
• Local Stores can read/write main memory in blocks of 1 KB to 16 KB
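The 25.6 GFLOPS figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, which the slide does not state, that each 32-bit lane completes one fused multiply-add per cycle and that this counts as two floating-point operations. All function and constant names are illustrative.

```python
# Back-of-the-envelope check of the per-SPE peak quoted on the slide.
# Assumption (not from the slides): one fused multiply-add per lane per
# cycle, counted as 2 floating-point operations.

REGISTER_BITS = 128            # width of one SPE register
SINGLE_PRECISION_BITS = 32     # size of one single-precision float
FLOPS_PER_LANE_PER_CYCLE = 2   # assumed: multiply + add
CLOCK_MHZ = 3200               # 3.2 GHz
REGISTER_COUNT = 128
SPE_COUNT = 8


def simd_lanes(register_bits=REGISTER_BITS, element_bits=SINGLE_PRECISION_BITS):
    """Number of elements one register holds, i.e. operations per instruction."""
    return register_bits // element_bits


def peak_gflops_per_spe(clock_mhz=CLOCK_MHZ):
    """Peak single-precision GFLOPS for one SPE under the FMA assumption."""
    mflops = simd_lanes() * FLOPS_PER_LANE_PER_CYCLE * clock_mhz
    return mflops / 1000


def register_file_bytes():
    """Total size of the register file: 128 registers of 128 bits each."""
    return REGISTER_COUNT * REGISTER_BITS // 8


if __name__ == "__main__":
    print("lanes per register:", simd_lanes())
    print("peak GFLOPS per SPE:", peak_gflops_per_spe())
    print("peak GFLOPS, all SPEs:", SPE_COUNT * peak_gflops_per_spe())
    print("register file size in bytes:", register_file_bytes())
```

Under these assumptions, four lanes times two operations times 3.2 GHz matches the quoted per-SPE peak, and the register file comes to 2 KB, which is why the 256 KB Local Store is described as a second-level register file.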


SPEs
• Not using a cache means less circuitry and a less complex design, which leads to a faster system
• The Local Store can move 16 bytes (128 bits) per cycle continuously for over 10,000 cycles without going to RAM
• Common problem: contention
• Solution: external data transfers access the local memory 1024 bits at a time, in one cycle
• Moving a lot of data at once keeps contention to a minimum
• The 1985 Cray-2 supercomputer used a Local Store type of setup
Source: http://www.blachford.info/computer/Cell/Cell1_v2.html
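The Local Store figures above fit together arithmetically. The sketch below is one plausible reading, not something the slides spell out: it treats "over 10,000 cycles" as the time needed to stream the whole 256 KB Local Store at 16 bytes per cycle, and assumes 1 KB = 1024 bytes. The names are illustrative.

```python
# Local Store arithmetic using the figures quoted on the slides.
# Assumptions: 1 KB = 1024 bytes; 1 MB/s = 10**6 bytes per second.

LOCAL_STORE_BYTES = 256 * 1024   # 256 KB Local Store
SPE_BYTES_PER_CYCLE = 16         # 128 bits per cycle on the SPE side
DMA_BITS_PER_ACCESS = 1024       # width of one external transfer access
CLOCK_MHZ = 3200                 # 3.2 GHz


def cycles_to_stream_local_store():
    """Cycles needed to read the whole Local Store at 16 bytes per cycle."""
    return LOCAL_STORE_BYTES // SPE_BYTES_PER_CYCLE


def dma_bytes_per_access():
    """Bytes moved by one 1024-bit external access."""
    return DMA_BITS_PER_ACCESS // 8


def spe_cycles_per_dma_access():
    """SPE-side cycles of data delivered by one single-cycle external access."""
    return dma_bytes_per_access() // SPE_BYTES_PER_CYCLE


def local_store_bandwidth_mb_per_s():
    """SPE-side Local Store bandwidth in MB/s."""
    return SPE_BYTES_PER_CYCLE * CLOCK_MHZ


if __name__ == "__main__":
    print("cycles to stream Local Store:", cycles_to_stream_local_store())
    print("bytes per external access:", dma_bytes_per_access())
    print("SPE cycles covered per external access:", spe_cycles_per_dma_access())
    print("Local Store bandwidth (MB/s):", local_store_bandwidth_mb_per_s())
```

On this reading, one wide external access occupies the Local Store for a single cycle while delivering eight cycles' worth of SPE-side data, which is how the wide transfers keep contention low.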


Rambus XDR Memory & Controller
• 2-channel high-speed XDR RAM with a memory bandwidth of 25.6 GB per second
• Memory interface runs at 3.2 Gb per second per pin
• ECC protected
• XDR is designed to scale to 6.4 Gb/s
• Total system memory is variable; the XDR interface is configurable
• XDR DRAM has 8 interleaved banks
• Capable of sustained data transfers of 8000/6400/4800 MB/s
Source: http://www.rambus.com/us/technology/solutions/xdr/index.html
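The two headline numbers, 25.6 GB/s total and 3.2 Gb/s per pin, imply a data-pin count that the slide does not state. The sketch below derives it and, purely as a hypothetical, shows what the same pin count would deliver at the 6.4 Gb/s rate XDR is designed to scale to. It assumes 1 GB/s = 1000 MB/s and ignores any pins used for ECC or control; the names are illustrative.

```python
# Data pins implied by the quoted XDR figures.
# Assumptions: 1 GB/s = 1000 MB/s; ECC and control pins are ignored.

TOTAL_BANDWIDTH_MB_S = 25600   # 25.6 GB/s
PER_PIN_MBIT_S = 3200          # 3.2 Gb/s per pin
CHANNELS = 2


def implied_data_pins():
    """Data pins needed to reach the total bandwidth at the per-pin rate."""
    return TOTAL_BANDWIDTH_MB_S * 8 // PER_PIN_MBIT_S


def implied_pins_per_channel():
    """Data pins per channel, assuming both channels are equally wide."""
    return implied_data_pins() // CHANNELS


def bandwidth_mb_s_at(per_pin_mbit_s):
    """Hypothetical bandwidth if the same pins ran at another per-pin rate."""
    return implied_data_pins() * per_pin_mbit_s // 8


if __name__ == "__main__":
    print("implied data pins:", implied_data_pins())
    print("implied data pins per channel:", implied_pins_per_channel())
    print("hypothetical MB/s at 6.4 Gb/s per pin:", bandwidth_mb_s_at(6400))
```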


Programming Models
• Job queue
  • The PPE maintains a job queue, schedules jobs on the SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.
• Self-multitasking of SPEs
  • The kernel and scheduling are distributed across the SPEs. Tasks are synchronized using mutexes or semaphores, as in a conventional operating system. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
• Stream processing
  • Each SPE runs a distinct program. Data comes from an input stream and is sent to the SPEs. When an SPE has finished processing, the output data is sent to an output stream.
  • This provides a flexible and powerful architecture for stream processing, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks, but are limited by the kernel loaded.
• Software-managed cache
  • The SDK’s software implementation of a cache simplifies the programmer’s job by presenting a traditional cache that can be used to program the SPEs
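The job-queue model can be sketched on an ordinary machine with threads standing in for SPEs. This is an illustration of the scheduling pattern only, not Cell SDK code: `spe_mini_kernel` and `ppe_run` are hypothetical names, and Python's standard `queue` and `threading` modules replace the real PPE/SPE communication mechanisms.

```python
import queue
import threading

NUM_SPES = 8   # the slides describe 8 SPEs
STOP = None    # sentinel telling a worker to shut down


def spe_mini_kernel(spe_id, jobs, results):
    """Stand-in for an SPE mini kernel: fetch a job, run it, report back."""
    while True:
        job = jobs.get()
        if job is STOP:
            break
        job_id, func, arg = job
        results.put((job_id, spe_id, func(arg)))


def ppe_run(work_items, func, num_spes=NUM_SPES):
    """Stand-in for the PPE: fill the job queue, then collect the results."""
    items = list(work_items)
    jobs = queue.Queue()
    results = queue.Queue()

    workers = [
        threading.Thread(target=spe_mini_kernel, args=(i, jobs, results))
        for i in range(num_spes)
    ]
    for worker in workers:
        worker.start()

    # Schedule the jobs, then one stop sentinel per worker.
    for job_id, arg in enumerate(items):
        jobs.put((job_id, func, arg))
    for _ in workers:
        jobs.put(STOP)

    # Wait for every worker to finish before reading the results.
    for worker in workers:
        worker.join()

    ordered = [None] * len(items)
    while not results.empty():
        job_id, _spe_id, value = results.get()
        ordered[job_id] = value
    return ordered


if __name__ == "__main__":
    print(ppe_run(range(10), lambda x: x * x))
```

Tagging each job with an ID lets the dispatcher restore the original order even though workers finish in an unpredictable order, which mirrors the PPE's role of monitoring progress.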


PlayStation 3 Overview
• Processor: Cell Processor
• GPU: the RSX (Reality Synthesizer), created by NVIDIA
  • 500 MHz, 300 million transistors
  • Built on the traditional independent vertex/pixel shader architecture
• 1 HDMI output that supports 480i, 480p, 720p, 1080i, and 1080p
• Blu-ray drive
• 20, 60, 120, and 250 GB hard drive sizes
• Built-in Wi-Fi 802.11 b/g
• Audio: Dolby Digital 5.1, DTS, LPCM, DSP
• Current Cell CPU at 65 nm, GPU at 45 nm
• 256 MB XDR RAM at 3.2 GHz
• 256 MB GDDR3 VRAM at 700 MHz


RSX GPU
• 300+ million transistors on an 8-layer 90 nm process
• Newer systems have a 40 nm GPU
• Connected to the Cell by a 35 GB per second link
  • 20 GB/s write, 15 GB/s read
• Same architecture as the GeForce 7800 GTX


Cell’s Future
• Possible advancements:
  • Smaller VLSI technologies could allow more processors to be integrated into the chip
  • CMP style, with 2 independent Cell processors in 1 package
  • Expanding the Cell to have another PPE as well as more SPEs
• Actuality
  • IBM stopped development of the Cell processor
  • Some reasons may include its large existing foundation and the availability of better technologies to adapt and start fresh with
  • Incompatible with new and upcoming technologies that are more suitable for use in a Cell-like processor
• Good news
  • IBM does believe that heterogeneous multiprocessors, of which the Cell was the first mass-market example, are here to stay (David Turek of IBM)
  • Heterogeneous multiprocessors contain at least 2 different types of cores
  • IBM was right: new CPUs have GPUs integrated in them


Summary
• Emphasizes efficiency per watt
• Prioritizes bandwidth over latency
• Favors peak computational throughput over simplicity of program code
• Regarded as a challenging environment for software development
• Shows promise for a wide range of possible uses (supercomputing, etc.)
• Essentially 9 CPUs in one package
• RISC approach: moving functionality into software makes compilers more important and more complicated
