
Building Blocks for 64-bit AMD Opteron™ Processor-Based Clusters

Rich Brunner

AMD Fellow, SW Architecture

ClusterWorld June 24, 2003


Review: Scale-Up

One system, [mid-range is typically SMP configuration]:

– Multiple processors (up to 16P)

– Large amounts of shared memory

– Shared I/O resources

– High bandwidth and low latency system interconnect

– OS must handle the balancing of incremental resources, causing the complexity of the OS to scale with processors and memory

• Critical that the system design supports balanced scaling of processors,

memory, and I/O devices

• Fits a number of legacy commercial application scenarios where close

sharing of memory by multiple worker threads is critical.

• 2P system that incrementally scales to 16P can cost 5x that of a 2P

system that scales to 4P.

– You pay now for the ability to scale-up later.

– Systems that scale-up beyond 4P are more costly due to the

complexity of the system, its interconnect, and RAS features.



Review: Scale-Out

Many small, simple systems:

– Each system has 2 to 4 processors

– Each system has its own memory

– Systems are interconnected by a

common network

• Each individual system does not need to be expensive, complex, and

high-tolerance design

• Small size allows modular and dense packing in racks – very flexible

• Many RAS requirements can be satisfied by simple redundancy and ease of replacing/upgrading individual systems

• Network parallelism much easier for standard OS to handle

• However, the network is relatively slow interconnect:

– Good fit for applications/workloads which can be decomposed into

multiple threads/tasks with little data-sharing or communication.

– Fits scientific/technical computing well



Targeting the Benefits of Both

AMD is targeting the mid-range, high-volume server space by

supporting the best of both scale-up and scale-out approaches

[Figure: market segments plotted by Price (horizontal axis) against Performance (vertical axis): 1P Desktop & Mobile; 1P Server & Workstation; 2P/4P Servers and Workstation; 4P+ HPC Cluster. Callouts describe server-class CPUs and components priced with "PC" volumes, leveraging industry investment and expertise in x86, to build balanced, high-performance 1P Scale-Out to 8P Scale-Up systems.]


AMD64 Technology

Building a Bridge from the 32- to the 64-bit World

• Leverages the initial success

of AMD Athlon TM MP processor

• Adds 64-bit capabilities to the

world’s highest performing

32-bit core for 2P and 4P

servers

• Current 32-bit applications

will work on both 32-bit and

64-bit operating systems

• Doesn’t require special

hardware or investment in a

proprietary infrastructure

• Developing a solid ecosystem

of motherboards, operating

systems, development tools,

and device drivers

[Figure: a 32-bit operating system runs 32-bit applications; a 64-bit operating system runs both 32-bit and 64-bit applications.]


AMD64 Architecture


AMD64 Computing Strategy (1)

AMD took the x86 architecture and extended it to 64 bits to make the AMD64 architecture

Extensions are so simple and compatible that the processor can support both x86-32 and x86-64 at full speed & performance

• Offers compatibility and

performance for 32-bit

applications (legacy mode)

• Can move to 64-bit addressing

and data types without giving up

32-bit compatibility (long mode)

– Leverages the key PC

infrastructure rather than needing

to re-invent it.

[Figure: AMD64 operating modes: Legacy Mode, Compatibility Mode, and 64-bit Mode.]


AMD64 Computing Strategy (2)

AMD64 Architecture:

64-bit integer registers

64-bit Virtual Address

– 52-bit Physical Address

– Sixteen 64-bit integer registers

– Sixteen 128-bit SSE registers

– SSE2 Instruction Set

– Double precision scalar and vector operations

– 16x8, 8x16 way vector packed integer operations

– SSE1 already added with AMD Athlon™ MP Processor

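[Figure: AMD64 register set, marking what is "In x86" versus "Added by AMD64". General-purpose registers (GPR): the 32-bit x86 registers EAX through EDI, with their 16-bit and 8-bit sub-registers (for example AH and AL), are widened to 64 bits (RAX, bits 63..0), and R8 through R15 are added. SSE registers (bits 127..0): XMM0 through XMM7 exist in x86; XMM8 through XMM15 are added. Also shown: the x87 registers (bits 79..0) and EIP.]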


64-Bit Mode Operation (1)

• Default data size is 32 bits

– Override to 64 bits using new REX prefix

– Override to 16 bits using legacy operation size prefix (66h)

• Default address size is 64 bits

– Pointers are 64 bits

• 2 New instructions added, Some redundant encodings reclaimed

– MOVSXD: Move sign extended double to quad

– SWAPGS: Allows quick swap of GS in ISRs

• New override (REX) allows naming 16 GP and 16 SSE registers

– Only 1 override byte per instruction is needed for extended registers, regardless of how many are used by the instruction

Prefix Type    Operand Size
Default        32
REX            64
66h            16
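The operand-size rules on this slide can be sketched as a small lookup. This is an illustrative Python model, not an instruction decoder; the function names `effective_operand_size` and `movsxd` are invented for the example. It also relies on one detail the slide does not spell out: when both REX (with its W bit set) and 66h are present, the REX override wins.

```python
def effective_operand_size(rex_w: bool = False, prefix_66: bool = False) -> int:
    """Operand size in bits for an instruction executing in 64-bit mode."""
    if rex_w:        # REX prefix with W=1 selects 64-bit operands
        return 64
    if prefix_66:    # legacy operand-size prefix (66h) selects 16-bit operands
        return 16
    return 32        # default operand size in 64-bit mode


def movsxd(value32: int) -> int:
    """Model MOVSXD: sign-extend a 32-bit value to a 64-bit bit pattern."""
    value32 &= 0xFFFF_FFFF
    if value32 & 0x8000_0000:
        # Sign bit set: fill the upper 32 bits with ones.
        return value32 | 0xFFFF_FFFF_0000_0000
    return value32


if __name__ == "__main__":
    for rex_w, p66 in [(False, False), (True, False), (False, True)]:
        print(rex_w, p66, effective_operand_size(rex_w, p66))
    print(hex(movsxd(0xFFFF_FFFE)))
```

The 32-bit default keeps most existing code compact; only instructions that genuinely need 64-bit data pay for the extra prefix byte.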


64-Bit Mode Operation (2)

• Paging extended to 4-levels to provide 64-bit addressing

– Page Table entries simple extension of x86 PAE-formatted entries

AMD64 supports 64-bit Virtual Address & 52-bit Physical Address

AMD Opteron™ Processor implements 48-bit Virtual Address and 40-bit Physical Address

• Interrupts and exceptions create 64-bit state

64-bit mode uses flat, unsegmented virtual address space

– The legacy x86 segmentation scheme is disabled in 64-bit mode

• Code Segments still exist in Long Mode to specify default mode

(16-, 32- or 64-bit) and execution privilege level (CPL)

– So existing privilege level and checking mechanisms are retained

• Switch between 64-bit mode & Compatibility Mode

accomplished via normal Far Transfer instructions:

– CALLF, RETF, JMPF, IRET, INT
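As a rough illustration of the 4-level paging described on this slide, the sketch below splits a 48-bit virtual address into four table indices and a page offset. It assumes 4 KB pages and 9-bit indices (512 eight-byte entries per table); the level names `pml4`, `pdp`, `pd` and `pt` follow AMD64 architecture documentation rather than this deck, and the function names are hypothetical.

```python
PAGE_OFFSET_BITS = 12                    # 4 KB pages (assumption)
INDEX_BITS = 9                           # 512 entries per table level
LEVELS = ("pml4", "pdp", "pd", "pt")     # highest level first
VA_BITS = PAGE_OFFSET_BITS + INDEX_BITS * len(LEVELS)   # 48 implemented bits


def is_canonical(va: int) -> bool:
    """True if bits 63..47 of the 64-bit address are all copies of bit 47."""
    upper = (va & 0xFFFF_FFFF_FFFF_FFFF) >> (VA_BITS - 1)
    all_ones = (1 << (64 - VA_BITS + 1)) - 1
    return upper == 0 or upper == all_ones


def split_virtual_address(va: int) -> dict:
    """Return the page offset and the index used at each paging level."""
    if not is_canonical(va):
        raise ValueError(f"non-canonical address: {va:#x}")
    fields = {"offset": va & ((1 << PAGE_OFFSET_BITS) - 1)}
    shift = PAGE_OFFSET_BITS
    for name in reversed(LEVELS):        # walk from the lowest level upward
        fields[name] = (va >> shift) & ((1 << INDEX_BITS) - 1)
        shift += INDEX_BITS
    return fields


if __name__ == "__main__":
    va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
    print(split_virtual_address(va))
    print("virtual space:", 2 ** VA_BITS // 2 ** 40, "TB")
    print("physical space (40-bit):", 2 ** 40 // 2 ** 30, "GB")
```

The same arithmetic gives the address-space sizes quoted for the AMD Opteron processor: 48 virtual bits cover 256 TB and 40 physical bits cover 1 TB.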



REX Prefix Byte

•Optional REX prefix specifies 64-bit operation size override

and 3 additional register encoding bits

– Extra registers encoded without altering existing instruction format

– REX is actually a family of 16 prefixes (40-4F)

In 64-bit mode, average instruction length increases by 0.4 bytes

[Figure: instruction format: instruction prefixes, optional REX prefix byte, opcode, ModRM, SIB, displacement, immediate.]

REX prefix byte layout (bit 7 to bit 0): 0 1 0 0 W R X B

W – Operand Size: 0 = 32-bit, 1 = 64-bit

R – MODRM reg field extension

X – SIB index field extension

B – MODRM r/m, SIB base, or opcode reg field extension
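The REX layout `0100WRXB` can be packed and unpacked with a few bit operations. The sketch below is a minimal Python model; `Rex`, `encode_rex`, `decode_rex` and `extend_register` are names chosen for this example, not part of any assembler or library.

```python
from typing import NamedTuple


class Rex(NamedTuple):
    w: bool   # 64-bit operand size
    r: bool   # extends the ModRM reg field
    x: bool   # extends the SIB index field
    b: bool   # extends ModRM r/m, SIB base, or the opcode reg field


def encode_rex(w: bool = False, r: bool = False,
               x: bool = False, b: bool = False) -> int:
    """Build a REX prefix byte: fixed high nibble 0100, then W, R, X, B."""
    return 0x40 | (int(w) << 3) | (int(r) << 2) | (int(x) << 1) | int(b)


def decode_rex(byte: int) -> Rex:
    """Split a REX prefix byte (40h-4Fh) into its four flag bits."""
    if (byte & 0xF0) != 0x40:
        raise ValueError(f"not a REX prefix: {byte:#04x}")
    return Rex(bool(byte & 8), bool(byte & 4), bool(byte & 2), bool(byte & 1))


def extend_register(field3: int, ext_bit: bool) -> int:
    """Combine a 3-bit register field with its REX bit to name registers 0-15."""
    return (int(ext_bit) << 3) | (field3 & 0b111)


if __name__ == "__main__":
    print(hex(encode_rex(w=True)))            # 64-bit operand size only
    print(hex(encode_rex(w=True, b=True)))    # 64-bit operand, extended r/m
    print(decode_rex(0x4C))
    print(extend_register(0b000, True))       # register number 8, i.e. R8
```

Because each of the three register fields gets its own extension bit in the same byte, a single prefix is enough no matter how many extended registers an instruction names.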


Industry Transition to AMD64

AMD Opteron TM and AMD Athlon TM 64 processors include

AMD64 technology

• Transition to 64-bit computing will occur at the pace of demand

for its benefits

• Transition from 286 to 386 is the perfect analogy

– 386 was an initiative to create 32-bit capable processors

– Initial users enjoyed highest performance 16-bit execution

– Operating system and application development took time

– Operating system support allowed 16-bit and 32-bit processes to coexist

and interoperate

– 32-bit software is now the norm

– Although the 386 was introduced in 1985, 16-bit compatibility was

important for years

• Great compatibility combined with great performance is the only

practical approach to introducing new capabilities



AMD64 Building Blocks


AMD64 Building Blocks

• Scalable systems must be built around efficient components

– Power, cooling, board space and cost are crucial in building

these.

AMD provides the key building blocks

for scalable AMD64 platforms:

– Glueless multiprocessing through integrated memory controller

and North bridge on the AMD Opteron TM processor die

– HyperTransport TM technology interconnect and devices (PCI-X,

AGP-8x, etc)

– Reference platform designs to provide concrete examples to

our OEM partners

AMD64 thermal/mechanical solutions are designed to meet the

demanding requirements of PC and 1U form factors

• An OEM, working with AMD, designs retail platforms

customized to the OEM’s needs and markets.



AMD Opteron Processor Overview

‣ Performance

➼ High-bandwidth integrated memory controller scales with processor frequency and number of processors

➼ 1MB L2 cache

‣ Compatibility

➼ Approximately 10,000 legacy applications at time of launch

‣ Scalability

➼ Reduced costs for high-end systems

➼ Removes I/O bottlenecks

➼ Easy multiprocessor scaling

➼ 16-bit HyperTransport technology links are at 1600MT/s; provides 6.4GB/s peak aggregate bandwidth

[Figure: AMD Opteron processor architecture: AMD64 core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, and three 16-bit HyperTransport technology links.]


Integrated Memory Controller

[Figure: AMD64 processor block diagram: processor core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, HyperTransport technology links.]

• Run memory controller at processor speeds rather than FSB speeds

– Today’s AMD Athlon XP processor north bridge memory controllers run at 133 MHz

• Dramatically decrease latency

– QuantiSpeed architecture achieves ~160 ns best case latency

– AMD64 architecture designed to achieve ~80 ns best case latency

– Latency generally decreases further as the core frequency increases

• Add intelligence without decreasing performance

• Supports variety of DDR memories

– 200, 266 and 333 MHz

– Registered and unbuffered DIMMs

– Future processor cores planned to support DDR-II, etc.


AMD Opteron Processor Architecture

Typical System

– Memory access delayed by passing through northbridge

– I/O & memory compete for CPU’s FSB B/W

– More chips needed for basic server

– B/W bottlenecks: link B/W < I/O device B/W

[Figure: typical server block diagram: server processor, north bridge with DDR memory, PCI-X bridge (PCI-X), south bridge (IDE, FDC, USB, etc.; PCI).]


AMD Opteron Processor System Architecture

– HyperTransport technology buses for glueless I/O or CPU expansion

– Separate memory and I/O paths eliminate most bus contention

– Fewer chips needed for basic server

– HyperTransport bus has ample bandwidth for I/O devices

[Figure: AMD Opteron processor block diagram: processor with directly attached DDR memory, AMD-8131 PCI-X bridge (PCI-X), AMD-8111 I/O hub (IDE, FDC, USB, etc.; PCI).]


Integrated Memory Controller Latency

(Local memory access, registered memory, CAS2.5)

[Figure: Read Latency Accessing Local Memory, PC2700. Vertical axis: Latency (ns), 0 to 200. Horizontal axis: Frequency (MHz), 800 to 4000. Series: PageHit 0 Hop; PageMiss 0 Hop; Prb Miss 1 Hop (2 node case); Prb Miss 2 Hop (4 node case).]


AMD64 Processors: HyperTransport

[Figure: AMD64 processor block diagram: processor core, L1 instruction cache, L1 data cache, L2 cache, DDR memory controller, HyperTransport technology links. Link width W = 2, 4, 8, 16, or 32 bits each way.]

AMD developed HyperTransport

technology

– High-speed, low pin-count, asynchronous, chip-to-chip board level interconnect

– Proven, industry-standard technology in

production today – ( Xbox )

• HyperTransport technology is not …

– A replacement for PCI, PCI-X, PCI-Express

– A networking fabric

• HyperTransport technology physical

interface

– Point-to-point, differential, low-voltage swing

– HyperTransport 1.0 -> Up to 1600MT/s

– HyperTransport 2.0 -> Beyond 4000MT/s

• HyperTransport technology logical interface

– 100% PCI-compliant API

– OS I/O (PCI) enumeration code untouched for

AMD64 processor-based systems

AMD made HyperTransport an OPEN

STANDARD – High-profile, best-in-class

partners

– Broadcom, Cisco, NVIDIA, Sun, many more

– www.hypertransport.org
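The peak figures quoted in this deck for HyperTransport links (for example 6.4GB/s aggregate for a 16-bit link at 1600MT/s) are consistent with multiplying link width in bytes by the transfer rate and by two directions. The helper below is a back-of-the-envelope sketch under that assumption; `ht_bandwidth_gbs` is a name invented here, and real links carry protocol overhead that this ignores.

```python
def ht_bandwidth_gbs(width_bits: int, mega_transfers: int,
                     directions: int = 2) -> float:
    """Peak link bandwidth in GB/s.

    width_bits      link width in each direction (2, 4, 8, 16 or 32)
    mega_transfers  transfer rate in MT/s (up to 1600 for HyperTransport 1.0)
    directions      2 for aggregate bandwidth, 1 for one direction only
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * mega_transfers * directions / 1000


if __name__ == "__main__":
    for width in (2, 4, 8, 16, 32):
        aggregate = ht_bandwidth_gbs(width, 1600)
        one_way = ht_bandwidth_gbs(width, 1600, directions=1)
        print(f"{width:2d}-bit @ 1600MT/s: {one_way:.1f} GB/s each way, "
              f"{aggregate:.1f} GB/s aggregate")
```

Under this formula a 16-bit link at 1600MT/s moves 3.2 GB/s in each direction, and an 8-bit link at the same rate gives 3.2 GB/s aggregate, matching the device figures listed for the HyperTransport tunnels.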


HyperTransport™ Technology

AMD-8111™ HyperTransport I/O Hub

[Figure: AMD-8111 ports: ATA-133, 10/100, USB, AC’97, LPC, IOAPIC, PCI 2.2, FLASH, SIO.]

• HyperTransport™ technology I/O Hub

• 8-bit host interface (800MB/s)

• Integrated I/O features

• 10/100 Ethernet controller

• EIDE controller (supporting up to ATA-133)

• PCI 2.2, LPC, USB, SMBus, IOAPIC, etc.

AMD-8131™ HyperTransport PCI-X 1.0 Tunnel (two 133MHz PCI-X 1.0 ports)

• HyperTransport technology tunnel

• 16-bit host interface (6.4GB/s)

• 8-bit next device interface (3.2GB/s)

• 2 independent PCI-X 1.0 ports

• Supporting 133MHz, 100MHz, 66MHz, and legacy-PCI modes

• I/O APIC

AMD-8132™ HyperTransport PCI-X 2.0 Tunnel (two 533/266MHz PCI-X 2.0 ports)

• HyperTransport technology tunnel

• 16-bit host interface (8.0GB/s)

• 16-bit next device interface (8.0GB/s)

• 2 independent PCI-X 2.0 ports

• Supporting 533MHz, 266MHz, 133MHz, 100MHz, 66MHz, and legacy-PCI modes

• I/O APIC


HyperTransport™ Technology

AMD-8151™ HyperTransport AGP 3.0 Tunnel (AGP-8X, 32 bits @ 533MHz)

• HyperTransport™ technology tunnel

• 16-bit host interface (6.4GB/s)

• 8-bit next device interface (3.2GB/s)

• AGP 3.0 Interface

• AGP 3.0 (AGP-8X) & AGP 2.0 compliant interface.

AMD-8xxx HyperTransport PCI-Express Tunnel

• HyperTransport technology tunnel

• 16-bit host interface

• 16-bit next device interface

• PCI-Express ports

• Several configurable primary ports

• Ports are subdividable into several smaller ports


Typical Multiprocessing System

• System scalability limited by northbridge

– Max of 4 processors

– Processors compete for FSB bandwidth

– Memory size and bandwidth are limited

– Max of 3 PCI-X bridges

– Many more chips required

[Figure: typical MP system block diagram: four processors, north bridge, two memory expanders with DDR memory, three PCI-X bridges (PCI-X), south bridge (IDE, FDC, USB, etc.; PCI).]


800-Series 4P AMD Opteron Processor-based Server

[Figure: four AMD Opteron processors, each with local DDR memory (200-333MHz, 144-bit registered), connected by cHT links [1].]

•Idle Latencies to First Data

•1P System:


Physical Memory Map Layout

• Assume

– 2-CPU WS, 3GB DRAM per CPU

– Non-node interleaved DRAM map

• BIOS maps these below 4GB so legacy OS & PCI devices can access:

– AGP Aperture, PCI I/O-Map region, PCI Mem-Map region, APIC

• Simple mapping places 2nd CPU’s memory above 4GB:

– Other optimizations possible.

Region                     Start          End
Processor Node 0 (3 GB)    00,0000,0000   00,BFFF,FFFF
AGP (256MB)                00,D000,0000   00,DFFF,FFFF
PCIIO, MMIO, APIC          00,E000,0000   00,FFFF,FFFF
Processor Node 1 (3 GB)    01,0000,0000   01,BFFF,FFFF
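To make the layout concrete, the following sketch encodes the address ranges from this slide and looks up which region a physical address falls in. It is only an illustration of the example map; the region labels and the `lookup` and `size_mb` helpers are invented here, and a real BIOS programs the processor's address-map registers rather than a Python list.

```python
# (label, first address, last address), taken from the slide's example map
REGIONS = [
    ("Processor Node 0 DRAM", 0x00_0000_0000, 0x00_BFFF_FFFF),
    ("AGP aperture",          0x00_D000_0000, 0x00_DFFF_FFFF),
    ("PCI I/O, MMIO, APIC",   0x00_E000_0000, 0x00_FFFF_FFFF),
    ("Processor Node 1 DRAM", 0x01_0000_0000, 0x01_BFFF_FFFF),
]


def lookup(addr: int):
    """Return the label of the region containing addr, or None if unlisted."""
    for label, first, last in REGIONS:
        if first <= addr <= last:
            return label
    return None


def size_mb(first: int, last: int) -> int:
    """Size of an inclusive address range in MB (2**20 bytes)."""
    return (last - first + 1) // 2 ** 20


if __name__ == "__main__":
    for label, first, last in REGIONS:
        print(f"{label:24s} {first:#013x}-{last:#013x} {size_mb(first, last):5d} MB")
    print(lookup(0x00_8000_0000))   # inside node 0's DRAM
    print(lookup(0x01_2000_0000))   # above 4GB, served by node 1
    print(lookup(0x00_C800_0000))   # range the slide does not list
```

Keeping node 1's DRAM in one block above 4GB means a 64-bit OS sees two contiguous 3 GB ranges, while 32-bit-addressable devices still find the AGP aperture and PCI regions below 4GB.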


Simple SW Model for NUMA

• NUMA brings dramatic scalability advantages

– But software management is hard to get right

• SMP systems bring dramatically simplified software model

– But memory system doesn’t scale up as you add processors

AMD Opteron processor provides benefits of both:

SMP view for Software

– Physical address space is flat and fully coherent

– Latency difference between local and remote memory in an 8P system is

comparable to the difference between a DRAM page hit and page miss

– DRAM can be contiguous or interleaved

• MP support designed into processor & system from the beginning

– Lower overall system chip count increases reliability and lowers cost

– All MP system functions use CPU technology and frequency

• Latency shrinks quickly by increasing CPU and HyperTransport

technology link speed

– Additional processor nodes bring increased memory bandwidth and great

overall system throughput



AMD64 Systems


AMD Opteron Processor Platforms

Platform: Khepri (Newisys)
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each
Physicals: 1U
I/O: PCIX, 2xGig Ether, SCSI
Management: Proprietary Service Processor

Platform: Quartet
Processor: 4P AMD Opteron
Memory: DDR-333 x4 each
Physicals: 4U N+1 PSU
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Motherboard: Rhapsody (RDK)
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each (typical)
Physicals: 2U or pedestal
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Company: TYAN
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each (typical)
Physicals: 2U or pedestal
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Company: GIGABYTE
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each (typical)
Physicals: 2U or pedestal
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Company: ARIMA
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each (typical)
Physicals: 2U or pedestal
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Company: MSI
Processor: 2P AMD Opteron
Memory: DDR-333 x4 each (typical)
Physicals: 2U or pedestal
I/O: PCIX, 2xGig Ether, SCSI
Management: IPMI 1.5/Remote Mgmt LAN

Available from OEMs world-wide 1H 2003


Manageability

AMD committed to industry standard manageability

AMD reference designs and Validated Server Program (VSP)

products provide standards based management

– WBEM/CIM – Web-Based Enterprise Management

• Initiative of the Distributed Management Task Force (DMTF) to provide a standard framework for the management of clients, servers, networks, storage, etc.

– IPMI – Intelligent Platform Management Interface

• Standard for server management, hardware health monitoring and

remote system control (power/reset/etc)

– SMBIOS – System Management BIOS

– PXE – Pre-boot eXecution Environment



IPMI

• OSA is the management solution provider for AMD processor-based server reference platforms and VSP products

– Supports Quartet and Serenade platforms

– Base IPMI 1.5 support

– Multi-platform management and monitoring

– Centralized event and alert handling

– In-band and out-of-band management

– 3-tiered architecture to support internet and multi-console access

– SSL security



HyperTransport Technology Backplane

• Using non-coherent interconnect

[Figure: HyperTransport technology backplane. Components shown: 4P blades of AMD Opteron processors with 8GB DRAM (200-333MHz, 9-byte registered DDR); hot swap connection through a connector; 16x16 HyperTransport @ 1600MT/s; three SI4041 switches; 8x8 HyperTransport @ 1.6GB/sec.; AMD-8111™ I/O hub with EIDE, USB 1.1/2.0, AC97, ACR 1.0, GMII, LPC, FLASH, SIO, 10/100 NIC, and PCI 33/32.]

Switches and AMD-8111™ I/O hub on the backplane

Contact Alliance Semiconductor for further details


Scaling Beyond 8 Processors

[Figure: interconnect fabric of switches (SW0 to SW3) connecting sixteen 4P nodes.]

• Scaling beyond 8P is enabled

• External Coherent HyperTransport technology switch

– Coherent Interconnect

– Snoop filter

– Data caching

• Up to 16 processors within the same 2^40 SMP memory space


AMD64 Software


32-bit OS & Application Support

•32-bit software does not have to be ported since the

AMD Opteron processor is compatible with 32-bit OS,

applications, and drivers

– Natively supports x86 instruction set

– Floating-point model is x87

– SSE, SSE2, MMX, 3DNow!, and in-line ASM are supported

AMD Opteron processor offers a high-performance platform

for 32-bit software

– Takes full advantage of core enhancements offered by

AMD Opteron processor

– Is expected to progressively run faster as systems speed up

AMD Opteron processor-based systems are posting leading 32-bit benchmarks, including SPECint_rate, SPECfp®_rate, SPECWEB99, MMB2, and TPC-C


64-bit OS & Application Interaction

32-bit Compatibility Mode

64-bit OS runs existing 32-bit APPs

with leading edge performance

• No recompile required, 32-bit code

directly executed by CPU

64-bit OS provides 32-bit libraries and

“thunking” translation layer for 32-bit

system calls.

64-bit Mode

• Migrate only where warranted, and at

the user’s pace to fully exploit AMD64

64-bit OS requires all kernel-level

programs & drivers to be ported.

• Any program that is linked or plugged

in to a 64-bit program (ABI-level)

must be ported to 64-bits.

[Figure: USER level: a 32-bit thread of a 32-bit application (4GB expanded address space, reached through a translation layer) next to a 64-bit thread of a 64-bit application (512GB (or 8TB) address space). KERNEL level: 64-bit operating system and 64-bit device drivers.]


Increased Memory for 32-bit Applications

32-bit server, 4 GB DRAM

• OS & App share small 32-bit VM space

• 32-bit OS & applications all share 4GB DRAM

• Leads to small dataset sizes & lots of paging

64-bit server, 12 GB DRAM

• App has exclusive use of 32-bit VM space

• 64-bit OS can allocate each application large dedicated portions of 12GB DRAM

• OS uses VM space way above 32-bits

• Leads to larger dataset sizes & reduced paging

[Figure: virtual memory layouts. 32-bit server: each 32-bit app's virtual memory spans 0 GB to 4 GB with a boundary at 2 GB between the app and the 32-bit OS, all backed by 4GB of shared DRAM. 64-bit server: each 32-bit app has 0 GB to 4 GB to itself, the 64-bit OS sits above 4 GB (up to 256 TB), and the apps' portions of the 12GB DRAM are not shared.]


Current Operating System Support

Operating System                                                Type
SuSE Linux Enterprise Server (SLES) 8                           32 & 64-bit
SuSE Linux 8.2 Personal & Professional                          32-bit
UnitedLinux Version 1.0 code base by UnitedLinux Consortium     32 & 64-bit
Conectiva Linux Enterprise Edition                              32-bit
Linux AMD64 kernel patches                                      64-bit
Mandrake Linux 9.0                                              64-bit
Mandrake Linux 9.1                                              32-bit
Mandrake Linux Corporate Server 2.1                             32-bit
NetBSD                                                          32 & 64-bit
SCO Linux                                                       32-bit
Scyld Beowulf Cluster Operating System                          32-bit
Solaris 9 for x86 (32-bit)                                      32-bit
Turbolinux 8 for AMD64                                          32 & 64-bit
Windows® 2000 Server (32-bit)                                   32-bit
Windows Server 2003 (32-bit)                                    32-bit


Planned Operating Systems Support

Operating System                            Type            Available
Red Hat Enterprise Linux AS 3.0             32- & 64-bit    Beta mid-2003
SuSE Linux 8.3 Personal & Professional      32- & 64-bit    September 2003
Windows® XP Professional for AMD64          64-bit          Beta mid-2003
Windows Server 2003 for AMD64               64-bit          Beta mid-2003

Support for AMD's 64-bit processors is scheduled for Windows Server 2003 Service Pack 1, "hopefully by the end of the year." David Thompson, Microsoft VP

Linux on the AMD Opteron™ Processor is a port to enhanced x86 architecture -- one that delivers 64-bit punch, while maintaining compatibility for classic x86 32-bit applications.

http://www.extremetech.com/article2/0,3973,1061703,00.asp

http://newsforge.com/newsforge/03/04/21/1914216.shtml?tid=7


AMD64 Developer Tools

• GNU compilers

– GCC 3.2.2 - 32-bit and 64-bit

– GCC 3.3 - optimized 64-bit

• PGI Workstation 5.0 beta

– Optimized Fortran 77/90, C/C++ for 32-bit Linux and Windows and 64-bit Linux

– Product release by end of June

– Performance goals are to be on parity with Intel compilers

Optimized Compilers Are Reaching Production Quality

1.8 GHz AMD Opteron Processor – SPECint2000

Compiler            OS                     Base    Peak
Intel C/C++ 7.0     Windows Server 2003    1095    1170
Intel C/C++ 7.0     Linux/x86-64           1081    1108
Intel C/C++ 7.0     Linux (32-bit)         1062    1100
GCC 3.3 (64-bit)    Linux/x86-64           1045
GCC 3.3 (32-bit)    Linux/x86-64            980
GCC 3.3 (32-bit)    Linux (32-bit)          960

http://www.aceshardware.com/

• Visual C, C++ for AMD64 is included with Windows® for AMD64 alpha release and current build reflects initial optimization


FORTRAN and C/C++ Compiler Support

AMD and STMicroelectronics are working together to bring

The Portland Group Compiler Technology to AMD64

– Support will include

• F90 & F77

– Some F95 extensions also included

• Optimized 32-bit and 64-bit code generation

• Both Linux and Windows ® will be supported

• OpenMP support

• Full debugging support

– Compiler is designed to provide performance that can meet or

exceed that of competitive compilers in the market today

•Beta versions freely available today at

www.pgroup.com/AMD64

•Commercial release of Workstation 5.0 is scheduled for

6/30/03 – CDK in July



PGI CDK 5.0 - Additional Features

•Tentative Availability - July, 2003

• Highly Scalable

• MPI-CH - Pre-configured libraries and utilities for Ethernet-based x86 and AMD64/Linux clusters

•PBS – Portable Batch System batch-queuing from NASA Ames

and MRJ Technologies

•ScaLAPACK - Pre-compiled distributed-memory parallel Math

Library

•ACML – The AMD Core Math Library is planned to be included

•Training – Tutorials (OSC), exercises, examples and

benchmarks for MPI, OpenMP and HPF programming



Optimized Numerical Libraries: ACML

AMD and The Numerical Algorithms Group (NAG) are jointly

developing the AMD Core Math Library (ACML)

• For use with mathematical, engineering, scientific and financial

applications as well as general HPC computing

• ACML is comprised of:

• Basic Linear Algebra Subroutines (BLAS) levels 1, 2 and 3

• A wide variety of Fast Fourier Transforms (FFTs)

• Linear Algebra Package (LAPACK)

•ACML will have the following features:

• Fortran and C Interfaces

• Highly optimized routines for the AMD64 Instruction Set

• Ability to address single-, double-, single-complex and double-complex

data types

• Will be available for commercially available OSs

•ACML 1.0 is scheduled to be released on 6/30/03 and will be

freely downloadable from www.developwithamd.com/acml



AMD64 Performance


Fortran Polyhedron Compiler Comparison

[Figure: bar chart of run time in seconds (axis 0 to 160) for the Polyhedron benchmarks SCATTERING, RNFLOW, PROTEIN, MONTECARLO, KEPLER, INDUCTANCE, GASDYNAM, FATIGUE, CHANNEL, and CAPACITA. Series: 32-bit Intel IFC7.0 on 32-bit SuSE Linux 8.1; 32-bit Intel IFC7.0 on 64-bit SuSE SLES8 RC7; 64-bit PGI-5.0-Beta on 64-bit SuSE SLES8 RC7.]

Measured on AMD Opteron™ Model 144 (1.8GHz, 1MB L2, 128-bit Memory Controller, DDR-333, CL 2.5, 512MB)


32-bit App Performance on 64-bit Linux

[Figure: bar chart of % speed-up of (32-bit App on SLES8 for AMD64) relative to (same 32-bit App on Linux32), axis -2.00% to 3.00%. Linpack: Rolled-Single, Unrolled-Single, Rolled-Double, Unrolled-Double. Stream: Copy, Scale, Add, Triad. BYTEmark (tm) ver. 2: NUMERIC_SORT, STRING_SORT, BITFIELD, FP_EMUL, ASSIGN, IDEA, HUFFMAN, NEURALNET, LU_DECOMP.]

Measured on AMD Opteron Model 144 (1.8GHz, 1MB L2, 128-bit Memory Controller, DDR-333, CL 2.5, 512MB)


32-bit vs 64-bit App Performance

[Figure: bar chart of % speed-up for a 32-bit App ported to 64-bit, axis -60.00% to 80.00%. Linpack: Rolled-Single, Unrolled-Single, Rolled-Double, Unrolled-Double. Stream: Copy, Scale, Add, Triad. BYTEmark (tm) ver. 2: NUMERIC_SORT, STRING_SORT, BITFIELD, FP_EMUL, ASSIGN, IDEA, HUFFMAN, LU_DECOMP.]

32-bit App compiled using GCC 3.2 for x86

64-bit App compiled using GCC 3.3 for AMD64

All run-times measured on SLES8 For AMD64 RC7

Measured on AMD Opteron Model 144 (1.8GHz, 1MB L2, 128-bit Memory Controller, DDR-333, CL 2.5, 512MB)


Scalable Memory Bandwidth

[Figure: SiSoft Sandra Standard 2003 memory bandwidth in MB/s (axis 0 to 14000). Systems, in chart order: 4 AMD Opteron Model 846, 8xPC2700 CL2.5; 2 AMD Opteron Model 246, 4xPC2700 CL2.5; 1 P4 3.0GHz, 800FSB, 875, 4xPC3200 CL2; 1 AMD Opteron Model 146, 2xPC2700 CL2.5; 2 Xeon 3.06GHz, GC-LE, 4xPC2100 CL2; 4 Xeon MP 2GHz, GC-HE, 16xPC1600 CL2.]

All benchmarks run on Microsoft® Windows® Server 2003 Enterprise Edition


Low Memory Latency

[Figure: ScienceMark 2.0 Beta memory latency in ns at 512-byte stride (axis 0 to 180), with 1-hop and 2-hop latencies marked for the multiprocessor AMD Opteron systems. Systems, in chart order: AMD Opteron Model 146, 2xPC2700 CL2.5; 2 AMD Opteron Model 246, 4xPC2700 CL2.5; 4 AMD Opteron Model 846, 8xPC2700 CL2.5; 1 P4 3.0GHz, 800 FSB, 875, 4xPC3200 CL2; 2 Xeon 3.06GHz, GC-LE, 4xPC2100 CL2; 4 Xeon MP 2GHz, GC-HE, 16xPC1600 CL2.]

All benchmarks run on Microsoft® Windows® Server 2003 Enterprise Edition


MP Integer Performance

SPECint®_rate2000 Performance (Peak, 2P)

AMD Opteron Model 246*    30.3
AMD Opteron Model 244     26.8
AMD Opteron Model 242     24
Xeon 3.06GHz              22.5
Itanium 2 1 GHz           18.7

SPECint®_rate2000 Performance (Peak, 4P, Windows®)

AMD Opteron Model 846*    56.6
AMD Opteron Model 844     48.5
AMD Opteron Model 842     45.1
Itanium 2 1.0GHz          36.8
Xeon MP 2.0GHz            34.7

* 246 & 846 Results are estimated pending final SPEC submission

SPEC and the benchmark name SPECint are registered trademarks of the Standard Performance Evaluation Corp. Competitive numbers shown reflect results published on www.spec.org as of June 17, 2003. For the latest SPEC results visit .


MP Floating-point Performance

SPECfp®_rate2000 Performance (Peak, 2P)

Itanium 2 1.0GHz          30.7
AMD Opteron Model 246*    29.5
AMD Opteron Model 244     26.7
AMD Opteron Model 242     25.1
Xeon 3.06GHz              17

SPECfp®_rate2000 Performance (Peak, 4P)

Itanium 2 1.0GHz          58.4
AMD Opteron Model 846*    52.4
AMD Opteron Model 844     49.2
AMD Opteron Model 842     45
Xeon MP 2.0GHz            20.2

* 246 & 846 Results are estimated pending final SPEC submission

SPEC and the benchmark name SPECfp are registered trademarks of the Standard Performance Evaluation Corp. Competitive numbers shown reflect results published on www.spec.org as of June 17, 2003. For the latest SPEC results visit .


High Performance Linpack

GOTO Library Results

AMD Opteron system                                  #P   Rmax (GFlops)   Nmax (order)   N1/2 (order)   Rpeak (GFlops)   GFLOP/Proc   Rmax/Rpeak
4P AMD Opteron 1.8GHz, 2GB/proc PC2700, 8GB Total   4    12.06           28000          1008           14.4             3.02         83.8%
2P AMD Opteron 1.8GHz, 2GB/proc PC2700, 4GB Total   2    6.22            20617          672            7.2              3.11         86.4%
1P AMD Opteron 1.8GHz, 2GB PC2700                   1    3.14            15400          336            3.6              3.14         87.1%

High-Performance BLAS by Kazushige Goto

• sgemm/dgemm/cgemm/zgemm available today

• Optimized http://www.cs.utexas.edu/users/flame/goto

GOTO results were with 64-bit SuSE 8.1 Linux Professional Edition with NUMA kernel

and Myrinet MPIch-gm-1.2.5..10 message passing library.
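The derived columns of the Linpack table can be recomputed from the measured Rmax values. The sketch below assumes Rpeak is processors × 1.8 GHz × 2 double-precision FLOPs per cycle, which reproduces the table's Rpeak column; the function names are illustrative. Because the Rmax values on the slide are rounded, a recomputed ratio can differ from the slide's in the last digit.

```python
CLOCK_GHZ = 1.8          # AMD Opteron clock used in all three runs
FLOPS_PER_CYCLE = 2      # assumed: one multiply plus one add per cycle

# (processors, measured Rmax in GFlops) from the table
RUNS = [(4, 12.06), (2, 6.22), (1, 3.14)]


def rpeak(processors: int) -> float:
    """Theoretical peak in GFlops."""
    return processors * CLOCK_GHZ * FLOPS_PER_CYCLE


def efficiency(rmax: float, processors: int) -> float:
    """Fraction of theoretical peak achieved (Rmax / Rpeak)."""
    return rmax / rpeak(processors)


if __name__ == "__main__":
    for procs, rmax in RUNS:
        print(f"{procs}P: Rpeak={rpeak(procs):.1f} GFlops, "
              f"per-proc={rmax / procs:.2f} GFlops, "
              f"efficiency={efficiency(rmax, procs):.1%}")
```

Reading the table this way shows that per-processor throughput drops only slightly from 1P to 4P, which is the scaling point the slide is making.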



AMD, the AMD Arrow logo, AMD Athlon, AMD Opteron, 3DNow! and combinations thereof, AMD-8100, AMD-8111, AMD-8131 and AMD-8151 are trademarks of Advanced Micro Devices, Inc. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Microsoft and Windows are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Pentium and MMX are registered trademarks of Intel Corporation in the U.S. and/or other jurisdictions. SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation in the U.S. and/or other jurisdictions. Other product and company names used in this presentation are for identification purposes only and may be trademarks of their respective companies.


BACKUP


AMD64 Processors And Target Systems

AMD Opteron Processor 200 Series:

• 2-way server and workstation processor

• 144-bit DDR interface per CPU: 200, 266, 333 MHz*

• Three 16-bit HyperTransport links per CPU. Typically, two are used to connect to another CPU and I/O

• 1-MB integrated L2 cache per CPU

AMD Opteron Processor 800 Series:

• Up to 8-way server processor

• 144-bit DDR interface per CPU: 200, 266, 333 MHz*

• Three 16-bit HyperTransport links per CPU. Typically all 3 used to connect to other CPUs & I/O

• 1-MB integrated L2 cache

Upcoming AMD Athlon 64 Processor

• Performance Desktop Processor

• 72-bit DDR interface 200, 266, 333, 400 MHz*

• One 16-bit HyperTransport link

NOTE: The upcoming AMD Athlon 64 and AMD Opteron are AMD64 processors

* = Future memory technology support as it is defined

16-bit HyperTransport Links are at 1600MT/s; provides 6.4GB/s Peak Aggregate Bandwidth
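The memory interface widths listed here translate into peak DRAM bandwidth with simple arithmetic. The sketch assumes the 144-bit and 72-bit interfaces carry 128 and 64 data bits respectively, with the remaining bits used for ECC (the benchmark slides describe a "128-bit Memory Controller"), and it treats the quoted DDR speeds as nominal transfer rates in MT/s. `peak_dram_bandwidth_mbs` is a name invented for this example.

```python
def peak_dram_bandwidth_mbs(data_bits: int, mega_transfers: int) -> int:
    """Peak DRAM bandwidth in MB/s for one memory controller."""
    return data_bits // 8 * mega_transfers


if __name__ == "__main__":
    opteron = peak_dram_bandwidth_mbs(128, 333)   # 144-bit interface, DDR-333
    athlon64 = peak_dram_bandwidth_mbs(64, 400)   # 72-bit interface, DDR-400
    print("AMD Opteron, DDR-333:", opteron, "MB/s per processor")
    print("AMD Athlon 64, DDR-400:", athlon64, "MB/s")
    # Each processor has its own controller, so peak bandwidth adds up per node.
    for nodes in (1, 2, 4):
        print(f"{nodes}P aggregate: {nodes * opteron} MB/s")
```

Since every processor brings its own memory controller, aggregate peak bandwidth grows with the processor count, which is the effect the memory bandwidth measurements in this deck illustrate.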


AMD Opteron Processor Core Facts

16 instruction bytes fetched per cycle

[Block diagram of the processor core. Labeled units: L2 cache; 64KB L1 instruction cache; 64KB L1 data cache; fetch, branch prediction, scan/align, fastpath, and microcode engine; µOPs into the instruction control unit (72 entries); Int decode & rename and FP decode & rename; three reservation stations (Res) feeding three ALUs, three AGUs, and a multiplier (MULT); 36-entry FP scheduler feeding FADD, FMUL, and FMISC; 44-entry load/store queue; system request queue, crossbar, memory controller, HyperTransport™, and bus unit.]

9-way Out-Of-Order execution

• 12-stage Int, 17-stage fast-path pipelines

• Enhanced TLB structures w/flush filter

• Opts for off-loading writes, probes, memory

• 16 SSE & SSE2 128-bit xmm registers

• 8 legacy x87 80-bit registers

FPU Throughput

• 36 entry FPU instruction scheduler

• 64-bit/80-bit FP realized throughput (1 Mul + 1 Add)/cycle: 1.9 FLOPs/cycle

• 32-bit FP realized throughput (2 Mul + 2 Add)/cycle: 3.4+ FLOPs/cycle
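One way to read these figures is as a fraction of the peak issue rate. The sketch below assumes that peak is simply the number of multiplies plus adds per cycle given in each bullet, and it takes 3.4 as the lower bound of "3.4+". The function name is illustrative.

```python
def realized_fraction(realized_flops: float, muls: int, adds: int) -> float:
    """Realized FLOPs per cycle divided by the peak (muls + adds) per cycle."""
    peak = muls + adds
    return realized_flops / peak


# 64-bit/80-bit FP: 1 Mul + 1 Add per cycle at peak, 1.9 realized.
print(f"64/80-bit FP: {realized_fraction(1.9, 1, 1):.0%} of peak")

# 32-bit FP: 2 Mul + 2 Add per cycle at peak, at least 3.4 realized.
print(f"32-bit FP: at least {realized_fraction(3.4, 2, 2):.0%} of peak")
```

On this reading the double-precision path sustains about 95% of its peak and the single-precision path at least 85%.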

55


Compatibility Thunking Layer

•A DLL integral to operating system

– Transparent to end-user

•Resides within a 32-bit process established by the 64-bit OS to run the 32-bit application

•32-bit application is dynamically linked to Thunking Layer

•Thunking layer implements all 32-bit kernel calls

– Translates parameters as necessary

– Calls 64-bit kernel

– Translates results as necessary

•Well understood technology implemented in Microsoft® Windows®:

– Windows on Windows (WOW32, WOW64)
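The translate, call, translate sequence above can be sketched in a few lines. This is only an illustration of the idea, not how any real operating system implements its thunking layer: `kernel64_write`, `thunk_write32`, the argument layout, and the error code are all hypothetical.

```python
import struct


def kernel64_write(fd: int, buf_addr: int, length: int) -> int:
    """Hypothetical stand-in for a 64-bit kernel entry point.

    Returns the number of bytes 'written', or a negative placeholder error code.
    """
    if fd < 0:
        return -1
    return length


def thunk_write32(arg_block: bytes) -> bytes:
    """Thunk for a 32-bit caller, following the three steps on the slide."""
    # 1. Translate parameters: unpack the 32-bit argument block
    #    (signed fd, unsigned 32-bit pointer, unsigned 32-bit length).
    fd, buf_addr32, length32 = struct.unpack("<iII", arg_block)
    buf_addr64 = buf_addr32  # zero-extended: the unsigned unpack never sign-extends
    # 2. Call the 64-bit kernel.
    result64 = kernel64_write(fd, buf_addr64, length32)
    # 3. Translate the result back to a 32-bit signed value for the caller.
    return struct.pack("<i", result64)


# A 32-bit caller passes a pointer in the upper half of its 4GB address space.
args = struct.pack("<iII", 1, 0x8000_0000, 16)
print(struct.unpack("<i", thunk_write32(args))[0])
```

The detail worth noticing is the pointer handling: a 32-bit address such as 0x80000000 has to be zero-extended, not sign-extended, when it is widened to 64 bits.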

56


Scientific Applications: Project “Red Storm”

• Cray Computer plans to build a

40+ teraflop supercomputer

using AMD Opteron processors

for Sandia National Laboratories

• Will be used for advanced

engineering simulations

RED STORM

• $90 million project plans to use

more than 10,000 AMD Opteron

processors

• Will feature a simple building

block approach with

HyperTransport technology

that is designed to enable easy

implementation and reduce

engineering, design, and

component costs

57


In-Band/Out-of-Band Management

•Eliminates need to physically visit

server

– Remote power down

– Remote power up (Out-Of-Band only)

– Hard reset (Out-Of-Band only)

•Available during all server states

(setup, boot, OS, or halted)

•Secured against malicious attacks

– Multi-layer passwords (for remote

power features)

– SSL authentication

58


Internal View of 1u 2P Server

Callouts in the figure:

• Half-length PCI-X 64-bit/66 MHz slot

• Full-length PCI-X 64-bit/133 MHz slot

• Dual Gb Ethernet

• U320 SCSI controller

• AMD-8131 PCI-X bridge

• AMD-8111 southbridge

• Service processor

• 2 AMD Opteron CPUs

• Memory slots, 4 DIMMs (called out twice)

• HDD bays (x2)

• CD-ROM and floppy bay

• 465 W power supply, 250K hours MTBF

• PCI/memory cooling fans

• Power supply/memory cooling fans

• CPU cooling fans

59


Other Development Tools

– Absoft will be bringing their full set of FORTRAN toolsets to the AMD64 architecture on both Linux and Windows®

• Potential beta testers should send email to: opteronbeta@absoft.com

• Beta available June 2003

– MigraTEC’s source code migration tool, 64Express, is now available to aid in the migration of C/C++ code from 32-bit to 64-bit – Available Now

– MigraTEC’s cross-platform tool, 32Direct, is now available to assist in cross-platform migrations (e.g. Solaris to Linux) – Available Now

– Etnus has announced 32-bit support of x86-64 with their TotalView

distributed debugging product
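A typical hazard when migrating C/C++ code from 32-bit to 64-bit is code that assumes a pointer fits in an `int`. The sketch below is not tied to any of the tools listed above; it only uses Python's standard `ctypes` module to report the C type sizes on whatever host runs it. The helper names are illustrative.

```python
import ctypes


def c_type_sizes() -> dict[str, int]:
    """Sizes in bytes of a few C types as seen by the host's C ABI."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "void*": ctypes.sizeof(ctypes.c_void_p),
    }


def pointer_fits_in_int() -> bool:
    """True only where storing a pointer in an int loses no bits."""
    sizes = c_type_sizes()
    return sizes["void*"] <= sizes["int"]


for name, size in c_type_sizes().items():
    print(f"sizeof({name}) = {size}")
print("pointer fits in int:", pointer_fits_in_int())
```

On a 64-bit Linux host the pointer is typically 8 bytes while `int` stays at 4, so any code that round-trips an address through an `int` silently truncates it.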

60


Other Development Tools

– ATLAS (Automatically Tuned Linear Algebra Software)

– ATLAS has incorporated optimized 64-bit Linux routines into its 3.5.0 Developer release - http://math-atlas.sourceforge.net/ - Available Now

• Further 64-bit optimizations are forthcoming

– Scyld Computing has announced their intent to support the AMD64 architecture with their Beowulf product around the time of the AMD Opteron™ processor launch

•MPICH

– MPICH is available via the open-source community and Linux

distributions

61


Other Development Tools

– Announced 32-bit support with Vampir/Vampirtrace for the AMD64

architecture

– Announced support for AMD64 with their Distributed Debugger

Tool (DDT)

• This is the first graphical software debugger to support AMD64 with a 64-bit OS

• Commercial release now available

•Blackdown Java

– Announced that their J2SE Version 1.4.2 will support the AMD64 architecture on the Linux OS

• Blackdown is based on Sun’s HotSpot technology

62


Other Development Tools

•Announced their EnfuZion cluster management product will

have support for both 32-bit and 64-bit OSes on the AMD64

architecture

•Announced their support for 64-bit versions of their popular

message passing implementations – MPIPro (1.2) and

ChaMPIon Pro (2.1)

•Announced the release of the NAGWare F95 compilers for 64-bit Linux

– Available via www.nag.com/f95AMD

63


AMD64 Computing Strategy (3)

• BIOS is standard x86 32-bit code

– Transfer to 64-bit mode is done by OS loader. No extra Requirements.

•Legacy Mode

AMD64 processors run any 32-bit legacy OS with leading edge performance

– Fully compatible with existing 32-bit systems and software

• Compatibility Mode under 64-bit OS

64-bit OS runs existing 32-bit Apps with leading edge performance

– Processor core provides full x86 compatibility at full speed. No application

recompile required, no emulation layer

– OS provides thunking layer at kernel-call boundary

• 64-bit Mode under 64-bit OS

– Migrate only where warranted, and at user’s pace to fully exploit AMD64

– Even Apps not needing 64-bit addressing can still enjoy performance

enhancements from recompiling into 64-bit

64


Compatibility Mode

•Provides a mode where existing applications can run

unchanged under Long Mode

•Selected on a code-segment basis (CS.L=0)

– Uses far transfer rather than a full mode switch

• Faster than mode switch

•Application-level code runs unchanged

– Legacy segmentation

– Legacy address and data size defaults

•System aspects use 64-bit mode semantics

– Interrupts and exceptions use Long Mode

handling

– Paging aspects use Long Mode semantics

• No support for v86
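The per-code-segment selection can be illustrated by decoding the relevant descriptor bits. In the 8-byte code-segment descriptor, L is bit 53 and D (default operand size) is bit 54. The sketch below assumes the processor is already in Long Mode unless told otherwise; the function name and the returned strings are illustrative, and the two sample descriptors are flat ring-0 code segments chosen for the example.

```python
def cs_mode(descriptor: int, long_mode_active: bool = True) -> str:
    """Classify the operating mode selected by a code-segment descriptor."""
    l_bit = (descriptor >> 53) & 1  # CS.L: 64-bit code segment
    d_bit = (descriptor >> 54) & 1  # CS.D: default operand/address size
    if not long_mode_active:
        # Outside Long Mode the L bit has no effect; only D matters.
        return "legacy 32-bit" if d_bit else "legacy 16-bit"
    if l_bit:
        # L=1 with D=1 is a reserved combination.
        return "reserved" if d_bit else "64-bit mode"
    # CS.L=0 under Long Mode selects compatibility mode.
    return "compatibility mode (32-bit)" if d_bit else "compatibility mode (16-bit)"


print(cs_mode(0x00AF9B000000FFFF))  # L=1, D=0
print(cs_mode(0x00CF9B000000FFFF))  # L=0, D=1
```

Because the mode is a property of the code segment, a far transfer that loads a different CS descriptor is enough to move between 64-bit and compatibility mode, which is why no full mode switch is needed.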

65
