The availability of RPM (Red Hat Package Manager) under Red Hat Linux served as a great impetus for the selection of Red Hat Linux as the base OS. The OS is installed on the compute nodes using the Preboot Execution Environment (PXE), in which the client nodes perform a network boot to obtain the OS. Rocks recompiles the available Red Hat RPMs and uses them in the Rocks package.

Configuring the compute nodes for seamless integration with the cluster. Like other open source and commercial cluster packages, Rocks uses a master node (called a front-end node in Rocks terminology) for centralized deployment and management of a cluster. The Rocks front-end node helps administrators specify the cluster-wide private network configuration and networking parameters, including the IP addresses assigned to the compute nodes. These parameters are specified during the front-end node installation process. The front-end node then provides the IP addresses to the compute nodes when they contact it during a PXE network boot.

Constructing a database of cluster-wide information. Many services at and above the OS level, such as job schedulers and the Dynamic Host Configuration Protocol (DHCP) server, require global knowledge of the cluster to generate service-specific configuration files. The front-end node of a Rocks cluster maintains a dynamic MySQL database back end, which stores the definitions of all the global cluster configurations. This database forms the core of the NPACI Rocks package and can be used to generate database reports that create service-specific configuration files such as /etc/hosts and /etc/dhcpd.conf.
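As a concrete illustration, the following minimal Python sketch renders /etc/hosts lines and dhcpd.conf host stanzas from a handful of node records. The field names and sample addresses are hypothetical stand-ins for rows that a report against the front-end node's MySQL database might return; this is not the actual Rocks report code.

```python
# Hypothetical node records standing in for rows from the front-end
# node's database; the schema here is illustrative, not the Rocks schema.
nodes = [
    {"name": "compute-0-0", "ip": "10.1.255.254", "mac": "00:11:22:33:44:55"},
    {"name": "compute-0-1", "ip": "10.1.255.253", "mac": "00:11:22:33:44:56"},
]

def hosts_entries(nodes):
    """Render one /etc/hosts line per node."""
    return "\n".join(f'{n["ip"]}\t{n["name"]}.local {n["name"]}' for n in nodes)

def dhcpd_entries(nodes):
    """Render one ISC dhcpd.conf host stanza per node."""
    stanza = (
        "host {name} {{\n"
        "\thardware ethernet {mac};\n"
        "\tfixed-address {ip};\n"
        "}}"
    )
    return "\n".join(stanza.format(**n) for n in nodes)

if __name__ == "__main__":
    print(hosts_entries(nodes))
    print(dhcpd_entries(nodes))
```

Regenerating both files from the same database is what keeps the DHCP server and name resolution consistent as nodes are added or replaced.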
Delivering middleware libraries and tools that build programs to run on the cluster. Rocks delivers code using two types of CD. The Rocks Base CD contains the core Rocks package. Rocks Roll CDs are add-on software packages that augment the cluster with specific capabilities. The Rocks Roll packages, designed to integrate seamlessly with the base Rocks installation, provide a mechanism that enables administrators and vendors to add software with extra functionality to Rocks clusters. Administrators can create their own Rolls, independent of the Rocks Base CD, to contain their domain-specific software and applications.

The following components come packaged in a separate HPC Roll CD: middleware (for example, MPICH,² a portable implementation of the Message Passing Interface [MPI], and Parallel Virtual Machine [PVM]³); cluster performance monitoring software such as Ganglia;⁴ and common performance benchmark applications such as Linpack. Other Rolls provided by NPACI include Intel® compilers,⁵ the Maui⁶ scheduler, and Portable Batch System (PBS)⁷ resource manager functionality.

² For more information about MPICH, visit www-unix.mcs.anl.gov/mpi/mpich.
³ For more information about PVM, visit www.csm.ornl.gov/pvm.
⁴ For more information about Ganglia, visit ganglia.sourceforge.net.
⁵ For more information about Intel compilers, visit www.intel.com.
⁶ For more information about Maui, visit www.clusterresources.com/products/maui.
⁷ For more information about PBS, visit www.openpbs.org.

Monitoring the cluster. Cluster monitoring is an important task that helps system administrators proactively understand and troubleshoot issues occurring in a cluster. Rocks is packaged with open source software tools such as Ganglia, which helps provide in-band global monitoring of the entire cluster. Ganglia is an open source tool developed by the University of California at Berkeley and the San Diego Supercomputer Center under National Science Foundation (NSF) NPACI funding.

Managing the cluster. Efficient tools for cluster management are an important part of any cluster computing package. NPACI Rocks provides easy-to-use command-line interface (CLI) commands for adding, replacing, and deleting nodes from the cluster. Rocks treats compute nodes as soft-state machines (machines that store no application information or data on their hard drives) and uses fresh OS reinstallation as a way to ensure uniformity across all the nodes in the cluster. This OS-reinstallation approach works well for large clusters because the OS is installed rapidly and automatically. As a result, no time is wasted performing exhaustive checks to debug problems that might exist on a system with an already-installed OS. For cluster-wide operations, NPACI Rocks comes with a parallel command, cluster-fork, that is capable of executing query-based operations; a minimal sketch of the idea follows.
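The sketch below captures the idea behind such a parallel command: fan one command out to every compute node and gather the results. The node names and the choice of ssh as the transport are assumptions for illustration; the real cluster-fork utility draws its node list from the cluster database.

```python
# Minimal stand-in for a cluster-wide command such as cluster-fork:
# run the same command on every compute node and collect the output.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["compute-0-0", "compute-0-1", "compute-0-2"]  # hypothetical names

def run_on_node(node, command):
    """Run `command` on `node` via ssh and return (node, output)."""
    try:
        result = subprocess.run(
            ["ssh", node, command],
            capture_output=True, text=True, timeout=30,
        )
        return node, (result.stdout or result.stderr).strip()
    except subprocess.TimeoutExpired:
        return node, "timed out"

def cluster_run(command):
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        for node, output in pool.map(lambda n: run_on_node(n, command), NODES):
            print(f"{node}: {output}")

if __name__ == "__main__":
    cluster_run("uptime")  # a typical cluster-wide health query
```

Because the nodes are contacted concurrently, a query such as uptime completes in roughly the time taken by the slowest node rather than the sum of all nodes.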
Rocks cluster installation

NPACI Rocks cluster installation consists of two parts: front-end node installation and compute node installation. Figure 1 shows a typical Rocks cluster layout.

[Figure 1. A typical Rocks cluster layout: the front-end node, with one NIC on the public network and a second NIC on the private cluster switch, connects to compute nodes 0 through n.]

Front-end node installation

Front-end node installation requires the administrator to install the Rocks Base CD along with any of the optional Rocks Roll CDs. As of Rocks 3.3.0, the Rocks Base CD, the HPC Roll CD, and the Kernel Roll CD are required to build a functional Rocks-based HPC cluster.

A front-end node typically has two network interface cards (NICs). One NIC connects to the external, public network. The other NIC connects to the private network of the cluster. The interactive Rocks front-end node installation allows the administrator to configure the external-network NIC. It also allows the administrator to specify the network configuration details (such as IP address and subnet mask) of the private-network NIC. These details are used to assign IP addresses to the client nodes of the cluster in the latter part of the cluster installation process.

Rocks is self-contained, including the necessary tools, scripts, and services for the subsequent installation phases. Rocks front-end node installation takes place on bare-metal hardware; the Rocks CDs install the OS on the front-end node during the installation procedure.

Compute node installation

The rest of the Rocks cluster is installed from the front-end node using a text-based utility called insert-ethers (see Figure 2). To build the compute nodes, administrators launch the insert-ethers utility from the front-end node. The insert-ethers utility is completely database driven. It allows the administrator to configure each node as a compute node, to be used for computational purposes; a Parallel Virtual File System (PVFS)⁸ node, to be used only for I/O purposes; or a combination compute and PVFS node, to be used for both computational and I/O purposes.

[Figure 2. Steps to install a Rocks cluster: (1) the administrator installs the front-end node (Linux with Rocks); (2) the administrator configures the front-end node; (3) the administrator launches the insert-ethers utility; (4) a compute node is PXE-booted; (5) the insert-ethers utility discovers the node's MAC address; (6) Rocks starts the OS installation. The front-end node's MySQL database records the compute and PVFS nodes.]

⁸ For more information about PVFS, visit www.parl.clemson.edu/pvfs.

The insert-ethers utility is also used to populate the nodes table in the MySQL database. To this end, the utility continuously parses the Linux OS log files (such as the /var/log/messages file) to extract DHCPDISCOVER messages and Media Access Control (MAC) addresses. When a compute node is PXE-booted, its DHCP request is sent to the front-end node. When the front-end node finds the MAC address in the parsed files, and the MAC address is not present in the database, the front-end node accepts the new node and installs it using the Rocks kickstart mechanism. However, if the MAC address is already present in the database, the front-end node reinstalls the compute node, assigning it the same IP address it previously had. Hence, the insert-ethers utility merely reimages a known compute node by matching the information already present in the database. Front-end node scripts in Rocks generate the kickstart file on the fly based on information, such as architecture and CPU count, provided by the compute node through the Rocks-supplied, modified Red Hat Anaconda installer.
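A minimal sketch of that log-scraping step appears below. It assumes the DHCPDISCOVER lines follow the usual ISC dhcpd syslog format ("DHCPDISCOVER from <MAC> via <interface>"); the actual insert-ethers implementation differs.

```python
# Sketch of the discovery step insert-ethers performs: scan
# /var/log/messages for DHCPDISCOVER lines and extract the MAC
# address of each requesting node.
import re

DISCOVER = re.compile(
    r"DHCPDISCOVER from ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)

def new_macs(log_path, known_macs):
    """Yield MAC addresses seen in DHCPDISCOVER lines but not yet known."""
    seen = {mac.lower() for mac in known_macs}
    with open(log_path) as log:
        for line in log:
            match = DISCOVER.search(line)
            if match and match.group(1).lower() not in seen:
                mac = match.group(1).lower()
                seen.add(mac)
                yield mac

if __name__ == "__main__":
    # A MAC absent from the database would trigger a fresh install;
    # a known MAC would trigger a reinstall with the node's previous IP.
    for mac in new_macs("/var/log/messages", known_macs={"00:11:22:33:44:55"}):
        print("new node discovered:", mac)
```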
Rocks is designed to make compute node installation easy. Rocks can also detect NICs (such as Myricom Myrinet⁹ cards) on the compute nodes and compile the correct drivers for these NICs during installation. Rocks is built on XML and the Python¹⁰ programming language. This design, combined with the MySQL database back end, gives administrators great flexibility to modify compute node settings such as disk partitioning; enable selected services such as Remote Shell (RSH); and install third-party packages. An administrator can customize the target node environment by modifying select XML files in Rocks and simply re-creating a new Rocks distribution, which can then be used for installation.

⁹ For more information about Myrinet, visit www.myri.com.
¹⁰ For more information about Python, visit www.python.org.

Rocks cluster management and monitoring

Rocks uses the basic mechanism of OS reinstallation to update the nodes in the cluster. If a hardware failure occurs, or if a hard drive or NIC must be replaced, the compute node should be reinstalled from the front-end node; the front-end node's MySQL database is automatically updated if required. If additional software or updates need to be installed on the compute nodes, the administrator should add them to the front-end node and create an updated Rocks distribution. All nodes should then be reinstalled using the updated distribution to help maintain consistency across the cluster. Reinstallation is fast and helps reduce or eliminate administrator time spent troubleshooting and debugging failed nodes.

The Rocks CLI is also used to replace and remove nodes from a cluster. An administrator can create a naming scheme that helps identify nodes by their cabinet and rack locations, as the sketch below illustrates.
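For example, Rocks by default names compute nodes in a compute-<cabinet>-<rank> style, so a hostname encodes where the machine physically sits. The short sketch below generates hostnames in that style; the cabinet count and nodes-per-cabinet values are hypothetical.

```python
# Illustrative location-based naming helper; the counts are hypothetical.
def node_names(cabinets, nodes_per_cabinet, prefix="compute"):
    """Generate hostnames that encode each node's cabinet and rank."""
    return [
        f"{prefix}-{cabinet}-{rank}"
        for cabinet in range(cabinets)
        for rank in range(nodes_per_cabinet)
    ]

print(node_names(2, 3))
# ['compute-0-0', 'compute-0-1', 'compute-0-2',
#  'compute-1-0', 'compute-1-1', 'compute-1-2']
```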