
IBM System Blue Gene Solution
Problem Determination Guide

Learn detailed procedures through illustrations

Use sample scenarios as helpful resources

Discover GPFS installation hints and tips

ibm.com/redbooks

Front cover

Octavian Lascu
Peter F Custerson
Marty Fullam
Ravi K Komanduri
Dr. Thanh V Lam
Sean Saunders
Chris Stone
Shinsuke Ueyama
Dino Quintero


International Technical Support Organization

IBM System Blue Gene Solution: Problem Determination Guide

October 2006

SG24-7211-00


Note: Before using this information and the product it supports, read the information in “Notices” on page ix.

First Edition (October 2006)

This edition applies to Version 1, Release 2, Modification 1 of the Blue Gene/L driver (V1R2M1_020_2006-060110).

© Copyright International Business Machines Corporation 2006. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents

Notices
Trademarks

Preface
The team that wrote this redbook
Become a published author
Comments welcome

Chapter 1. Introduction
1.1 Blue Gene/L system overview
1.2 Hardware components of the Blue Gene/L system
1.2.1 Racks
1.2.2 Midplane
1.2.3 Compute (processor) card
1.2.4 I/O card
1.2.5 Node card
1.2.6 Service card
1.2.7 Fan card
1.2.8 Clock card
1.2.9 Bulk power modules/enclosure
1.2.10 Link card
1.2.11 Rack, power module, card, and fan naming conventions
1.3 Blue Gene/L networks
1.3.1 Service network
1.3.2 Functional network
1.3.3 Three dimensional torus (3D torus)
1.3.4 Collective (tree) network
1.3.5 Global barrier and interrupt network
1.4 Service Node
1.4.1 DB2 database
1.4.2 Service Node system processes
1.5 Front-End Node
1.6 External file systems
1.7 Blue Gene/L system software
1.7.1 I/O node kernel
1.7.2 I/O kernel ramdisk
1.7.3 I/O kernel CIOD daemon
1.7.4 Compute Node Kernel



1.7.5 Microloader
1.8 Boot process, job submission, and termination
1.9 System discovery
1.9.1 The bglmaster process
1.9.2 SystemController
1.9.3 Discovery
1.9.4 PostDiscovery
1.9.5 CableDiscovery
1.10 Discovering your system
1.10.1 Discovery logs
1.11 Service actions
1.11.1 PrepareForService
1.11.2 EndServiceAction
1.12 Turning off the system
1.13 Turning on the system

Chapter 2. Problem determination methodology
2.1 Introduction
2.2 Identifying the installed system
2.2.1 Blue Gene/L Web interface (BGWEB)
2.2.2 DB2 select statements of the DB2 database on the SN
2.2.3 Network diagram
2.2.4 Service Node
2.2.5 Front-End Nodes
2.2.6 Control system server logs
2.2.7 File systems (NFS and GPFS)
2.2.8 Job submission
2.2.9 Racks
2.2.10 Midplanes
2.2.11 Clock cards
2.2.12 Service cards
2.2.13 Link cards
2.2.14 Link card chips
2.2.15 Link summary
2.2.16 Node cards
2.2.17 I/O cards
2.2.18 Compute or processor cards
2.3 Sanity checks for installed components
2.3.1 Check the operating system on the SN
2.3.2 Check communication services on the SN
2.3.3 Check that BGWEB is running
2.3.4 Check that DB2 is working
2.3.5 Check that BGLMaster and its child daemons are running



2.3.6 Check the NFS subsystem on the SN
2.3.7 Check that a block can be allocated using mmcs_console
2.3.8 Check that a simple job can run (mmcs_console)
2.3.9 Check the control system server logs
2.3.10 Check remote shell
2.3.11 Check remote command execution with secure shell
2.3.12 Check the network switches
2.3.13 Check the physical Blue Gene/L racks configuration
2.4 Problem determination methodology
2.4.1 Define the problem
2.4.2 Identify the Blue Gene/L system
2.4.3 Identify the problem area
2.5 Identifying core Blue Gene/L system problems
2.6 Identifying IBM LoadLeveler jobs to BGL jobs (CLI)

Chapter 3. Problem determination tools
3.1 Introduction
3.2 Hardware monitor
3.2.1 Collectable information
3.2.2 Starting the tool
3.2.3 Checking the results
3.3 Web interface
3.3.1 Starting the tool
3.3.2 Checking the result
3.4 Diagnostics
3.4.1 Test cases
3.4.2 Starting the tool
3.4.3 Checking the results
3.5 MMCS console
3.5.1 Starting the tool
3.5.2 Checking the results

Chapter 4. Running jobs
4.1 Parallel programming environment
4.2 Compilers
4.2.1 The blrts tool chain
4.2.2 The IBM XLC/XLF compilers
4.3 Submitting jobs using built-in tools
4.3.1 Submitting a job using MMCS
4.3.2 The mpirun program
4.3.3 Example of submitting a job using mpirun
4.4 IBM LoadLeveler
4.4.1 LoadLeveler overview

Contents v


4.4.2 Principles of operation in a Blue Gene/L environment
4.4.3 How LoadLeveler plugs into Blue Gene/L
4.4.4 Configuring LoadLeveler for Blue Gene/L
4.4.5 Making the Blue Gene/L libraries available to LoadLeveler
4.4.6 Setting Blue Gene/L specific environment variables
4.4.7 LoadLeveler and the Blue Gene/L job cycle
4.4.8 LoadLeveler job submission process
4.4.9 LoadLeveler checklist
4.4.10 Updating LoadLeveler in a Blue Gene/L environment

Chapter 5. File systems
5.1 NFS and GPFS
5.1.1 I/O node boot sequence
5.1.2 Additional scripts in I/O node boot sequence
5.2 NFS
5.2.1 How NFS plugs into a Blue Gene/L system
5.2.2 Adding an NFS file system to the Blue Gene/L system
5.2.3 NFS problem determination methodology
5.2.4 NFS checklists
5.3 GPFS
5.3.1 When to use GPFS
5.3.2 Features and concepts of GPFS
5.3.3 GPFS requirements for Blue Gene/L
5.3.4 GPFS supported levels
5.3.5 How GPFS plugs in
5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD)
5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO)
5.3.8 Cross mounting the GPFS file system on to Blue Gene/L cluster
5.3.9 GPFS problem determination methodology
5.3.10 GPFS Checklists
5.3.11 References

Chapter 6. Scenarios
6.1 Introduction
6.2 Blue Gene/L core system scenarios
6.2.1 Hardware error: Compute card error
6.2.2 Functional network: Defective cable
6.2.3 Service network: Defective cable
6.2.4 Service Node functional network interface down
6.2.5 SN service network interface down
6.2.6 The /bgl file system full on the SN (no GPFS)
6.2.7 The / file system full on the SN
6.2.8 The /tmp file system is full on the SN



6.2.9 The ciodb daemon is not running on the SN
6.2.10 The idoproxy daemon not running on the SN
6.2.11 The mmcs_server is not running on the SN
6.2.12 DB2 not started on the SN
6.2.13 The bglsysdb user OS password changed (Linux)
6.2.14 Uncontrolled rack power off
6.3 File system scenarios
6.3.1 Port mapper daemon not running on the SN
6.3.2 NFS daemon not running on the SN
6.3.3 GPFS pagepool (wrongly) set to 512MB on bgIO cluster nodes
6.3.4 Secure shell (ssh) is broken
6.3.5 The /bgl file system full (Blue Gene/L uses GPFS)
6.3.6 Installing new Blue Gene/L driver code (driver update)
6.3.7 Duplicate IP addresses in /etc/hosts
6.3.8 Missing I/O node in /etc/hosts
6.3.9 Adding an extra alias for the SN in /etc/hosts
6.4 Job submission scenarios
6.4.1 The mpirun command: scenarios description
6.4.2 The mpirun command: environment variables not set
6.4.3 The mpirun command: incorrect remote command execution (rsh) setup
6.4.4 LoadLeveler: scenarios description
6.4.5 LoadLeveler: job failed
6.4.6 LoadLeveler: job in hold state
6.4.7 LoadLeveler: job disappears
6.4.8 LoadLeveler: Blue Gene/L is absent
6.4.9 LoadLeveler: LoadLeveler cannot start

Chapter 7. Additional topics
7.1 Cluster Systems Management
7.1.1 Overview of CSM
7.1.2 Monitoring the Blue Gene/L database with CSM
7.1.3 Customizing the monitoring capabilities of CSM
7.1.4 Defining your own CSM monitoring constructs
7.1.5 Miscellaneous related information
7.1.6 Conclusion
7.2 Secure shell
7.2.1 Basic cryptography
7.2.2 Secure shell basics
7.2.3 Sample configuration in a cluster environment
7.2.4 Using ssh in a Blue Gene/L environment

Appendix A. Installing and setting up LoadLeveler for Blue Gene/L



Installing LoadLeveler on SN and FENs
Obtaining the rpms
Installing the rpms
Setting up the LoadLeveler cluster
Enabling Blue Gene/L capabilities in LoadLeveler
Setting Blue Gene/L specific environment variables

Appendix B. The sitefs file
The /bgl/dist/etc/rc.d/init.d/sitefs file

Appendix C. The ionode.README file
/bgl/BlueLight/ppcfloor/docs/ionode.README file

Abbreviations and acronyms

Related publications
IBM Redbooks
Other publications
Online resources
How to get IBM Redbooks
Help from IBM

Index



Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.



Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

eServer
ibm.com®
AIX 5L
AIX®
Blue Gene®
DB2®
IBM®
LoadLeveler®
NUMA-Q®
Power PC®
PowerPC®
POWER
POWER4
POWER5
Redbooks
Redbooks (logo)
Sequent®
System p
Tivoli®
1350

The following terms are trademarks of other companies:

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.


Preface

This IBM® Redbook is intended as a problem determination guide for system administrators in a High Performance Computing environment. It can help you find a solution to issues that you encounter on your IBM eServer Blue Gene® system.

This redbook presents an architectural overview of the IBM eServer Blue Gene Solution and some of the principles that were used to design this revolutionary supercomputer. It describes the hardware and software environment that composes this solution, along with a short description of each component and how to identify it in an installed system.

This redbook also includes a problem determination methodology that we developed during our residency, along with the problem determination tools that are available with the basic IBM eServer Blue Gene Solution. It also discusses additional software components that are required for integrating your Blue Gene system in a complex computing environment. These components include file systems (NFS and GPFS) and job submission tools (mpirun and IBM LoadLeveler®).

This redbook also describes a GPFS installation procedure that we used in our test environment and several scenarios, developed following the proposed problem determination methodology, that describe possible issues and their resolution.

Finally, this redbook includes a short introduction to integrating your Blue Gene system in a High Performance Computing environment managed by IBM Cluster Systems Management, as well as an introduction to how you can use secure shell in such an environment.

The team that wrote this redbook

This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization (ITSO), Austin Center.

Octavian Lascu is a Project Leader at the ITSO, Poughkeepsie Center. He writes extensively and teaches IBM classes worldwide on all areas of IBM System p and Linux® clusters. Before joining the ITSO, Octavian worked in IBM Global Services Romania as a software and hardware Services Manager. He holds a Master's Degree in Electronic Engineering from the Polytechnical Institute in Bucharest and is also an IBM Certified Advanced Technical Expert in AIX/PSSP/HACMP. He has worked with IBM since 1992.

Peter F Custerson is a Product Support Specialist based in Farnborough, United Kingdom. He has worked for IBM for nine years, four years for the former Sequent® organization and five years for the IBM UK Unix Support Centre. He currently is the High Performance Computing (HPC) Technical Advisor and concentrates on customer support issues with the HPC product set for our region's HPC customers, including Cluster 1600, 1350, and Blue Gene customers. He holds an honors degree in Computer Studies from the University of Glamorgan in the United Kingdom.

Marty Fullam is a software engineer on the Cluster Systems Management development team. He is the architect and lead developer for the CSM Blue Gene support project. He joined IBM in Poughkeepsie, New York, in 1982, working on electronic design automation tools. He holds a B.E. in Electrical Engineering from SUNY Stonybrook and an M.S. in Computer Engineering from Syracuse University.

Ravi K Komanduri is a software engineer in the High Performance Computing group of IBM Systems and Technology Labs, India. He joined IBM in 2004 and has about 5 1/2 years of total experience in High Performance Computing and Grid technologies. His areas of expertise include parallel programming, benchmarking and performance analysis, and developing software tools on clusters. He is currently involved in functional testing of the Blue Gene control system. He holds a bachelor's degree in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, India.

Dr. Thanh V Lam is a software engineer in Cluster System Test. He leads a team of test engineers in testing cluster software and hardware on multiple platforms. His areas of expertise include LoadLeveler, high performance computing, parallel applications, and Blue Gene. He joined the IBM High Performance Supercomputing Lab, Kingston, New York, in 1988 and started working in service processing for the early scalable parallel systems known as SP. He holds a Doctor of Professional Study in Computing degree from Pace University, White Plains, New York.

Sean Saunders is an HPC Product Support Specialist working for ITS in IBM UK. He joined IBM in 2000, initially working on NUMA-Q® systems, and now mainly supports Cluster 1600, 1350, and Blue Gene systems, including the HPC software stack. He also supports AIX® and Linux. He holds a B.Sc. (Hons) degree in Computer Science from Kingston University, England.



Chris Stone is a Senior IT Specialist based at IBM Hursley Park, United Kingdom. He works as part of the High Performance Computing Services Team based in the UK, mainly on storage systems for customers. He has been working for IBM for over 20 years and in that time has gained a wide range of experience in monitor hardware development, software development, and customer services. He has 7 years of experience in designing and installing GPFS storage systems for customers and has recently gained experience with Blue Gene systems by leading the installation of two Blue Gene systems in Europe. He holds an Honours degree in Electrical and Electronic Engineering from Bristol University in the UK.

Shinsuke Ueyama is a software engineer for Engineering and Technology Services in IBM Japan. He joined IBM in 2005, mainly working on supporting Blue Gene systems, and recently experienced the installation of a Blue Gene system in Japan. He holds a Master's degree in Information Science and Technology from The University of Tokyo, Japan.

Dino Quintero is a Consulting IT Specialist at the ITSO in Poughkeepsie, New York. Before joining the ITSO, he worked as a Performance Analyst for the Enterprise Systems Group and as a Disaster Recovery Architect for IBM Global Services. His areas of expertise include disaster recovery and IBM System p clustering solutions. He is an IBM Certified Professional on IBM System p technologies and is also certified on System p system administration and System p clustering technologies. Currently, he leads technical teams that deliver IBM Redbook solutions on System p clustering technologies and technical workshops worldwide.

Thanks to the following people for their contributions to this project:

Steve Mearns
IBM Portsmouth, UK

Randy A. Brewster
Puneet Chaudhary
Richard Coppinger
Alexander Druyan
Bruce Hempel
Steve Normann
Edwin Varella
IBM Poughkeepsie, NY

Lynn A Boger
Mark Campana
Cathy Cebell
Thomas A. Budnik
Darwin Dumonceaux
Frank Ingram
Randal Massot
Mike Nelson
Jeff Parker
Karl Solie
Mike Woiwood
IBM Rochester, MN

Tom Engelsiepen
IBM San Jose, CA

Mark Mendell
IBM Toronto, Canada

Marc B Dombrowa
David M. Singer
IBM Yorktown Heights, NY

CSM for Blue Gene Development Team:
Ling Gao
IBM Poughkeepsie, NY

ITSO Editing team:
Ella Buslovich
IBM Poughkeepsie, NY
Debbie Willmschen
IBM Raleigh, NC

Become a published author

Join us for a two- to six-week residency program! Help write an IBM Redbook dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You'll team with IBM technical professionals, Business Partners or customers.

Your efforts will help increase product acceptance and customer satisfaction. As a bonus, you'll develop a network of contacts in IBM development labs, and increase your productivity and marketability.

Find out more about the residency program, browse the residency index, and apply online at:

ibm.com/redbooks/residencies.html



Comments welcome

Your comments are important to us!

We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

► Use the online Contact us review redbook form found at:
  ibm.com/redbooks

► Send your comments in an e-mail to:
  redbook@us.ibm.com

► Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. HYTD Mail Station P099
  2455 South Road
  Poughkeepsie, NY 12601-5400





Chapter 1. Introduction

This book discusses diagnostic and problem determination methodologies for an IBM eServer Blue Gene Solution (also known as Blue Gene/L). In this chapter, we give a brief overview of Blue Gene/L architecture and describe the networks and the system boot process.



1.1 Blue Gene/L system overview

In this book, we investigate and describe symptoms, techniques, and methodologies for tackling issues that you might encounter while using your Blue Gene/L system. However, we do not describe in detail the hardware or the process of porting and running applications on Blue Gene/L. For that purpose, we recommend the following IBM Redbooks:

► Blue Gene/L: Hardware Overview and Planning, SG24-6796
► Blue Gene/L: System Administration, SG24-7178
► Unfolding the IBM eServer Blue Gene Solution, SG24-6686

To begin using this book and solving any issues that you have with your system, you should have an understanding of your Blue Gene/L hardware, software, network, and boot process, as well as the location codes and LED status reported by the system. This type of information is described in this chapter. Later chapters discuss problem solving and include checklists for the basic Blue Gene/L system as well as other components, such as GPFS and LoadLeveler.

1.2 Hardware components of the Blue Gene/L system

The current configuration of Blue Gene/L is built from dual-core processors placed in pairs on a compute card, together with 1 GB of RAM (512 MB for each dual-core CPU), as shown in Figure 1-1 on page 3. The configuration includes:

► 16 compute cards installed in a node card (32 processors)
► 16 node cards (512 processors) installed into a dual-sided midplane (1/2 rack)
► Two midplanes installed in a rack (1024 processors)

You can link the racks together to a maximum of 64 racks (65536 processors).
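These totals follow directly from the card counts. The following minimal Python sketch (ours, not part of the original publication; the constant names are illustrative only) recomputes them:

# Recompute Blue Gene/L processor and memory totals from the packaging counts
# listed above (each "processor" is one dual-core PowerPC chip with 512 MB RAM).
CHIPS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_CARD = 16
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
MAX_RACKS = 64
RAM_PER_CHIP_GB = 0.5  # 512 MB per chip

chips_per_rack = (CHIPS_PER_COMPUTE_CARD * COMPUTE_CARDS_PER_NODE_CARD
                  * NODE_CARDS_PER_MIDPLANE * MIDPLANES_PER_RACK)
print(chips_per_rack)                     # 1024 processors per rack
print(chips_per_rack * MAX_RACKS)         # 65536 processors in a 64-rack system
print(chips_per_rack * RAM_PER_CHIP_GB)   # 512.0 GB of RAM per rack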



Figure 1-1 Blue Gene/L System. The figure shows the packaging hierarchy: chip (2 processors, 2.8/5.6 GF/s, 4 MB cache); compute card (2 chips, 1x2x1, 5.6/11.2 GF/s, 1.0 GB RAM); node card (32 chips, 4x4x2, 16 compute cards and 0-2 I/O cards, 90/180 GF/s, 16 GB RAM); rack (32 node cards, 2.8/5.6 TF/s, 512 GB RAM); system (64 racks, 64x32x32, 180/360 TF/s, 32 TB RAM).

The following paragraphs describe each of the hardware elements in a Blue Gene/L system.

1.2.1 Racks

The hardware that we discuss in this chapter is installed in racks. The current maximum number of racks in a Blue Gene/L system is 64. Each rack has its own location code that is seen in system logs and RAS events. You need to learn these locations to determine problems efficiently.

Figure 1-2 on page 4 and Figure 1-3 on page 5 show where the hardware components are installed in the rack and the location codes. These two figures show a front and a back view for each side of the midplane, and each rack has two midplanes installed.

Note: In the hardware descriptions that we include here, you see two location codes per item. The first code refers to how the Blue Gene/L system software references the location, and the second code is how the hardware is labeled. (Hardware references begin with the letter J.)



Figure 1-2 Card positions in the front of the rack with location names

Figure 1-3 Card positions in the back of the rack with location names



1.2.2 Midplane

As the name suggests, the midplane sits in the middle of the rack (in a vertical position). There are two midplanes per rack: position 0 is at the bottom, and position 1 is at the top. All node, link, service, and fan cards plug into the midplane (see Figure 1-4).

The midplane also provides the communications infrastructure through which the components talk to each other. The network services provided by the node cards, service cards, and midplane are discussed in 1.3, “Blue Gene/L networks” on page 21.

Figure 1-4 Midplane position in the rack



1.2.3 Compute (processor) card

The compute (processor) card, shown in Figure 1-5, is the basic building block of a Blue Gene/L system. It is composed of two 700 MHz 32-bit PowerPC® 440x5 dual-core processors and 1 GB of RAM (512 MB per dual-core CPU).

These nodes are the workhorses of the system, on which applications run. One compute card provides two or four nodes, depending on whether you select to run in communication co-processor (co) or virtual node (vn) mode. In co mode, one core of each PowerPC chip runs the application, while the other handles the message passing. In vn mode, both cores run the application and perform the message passing duties.
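To make the co/vn distinction concrete, here is a short Python illustration (ours, not from the book; the function name is hypothetical) that derives the node count of one compute card from the mode:

def nodes_per_compute_card(mode):
    # Each compute card carries two dual-core chips. In co-processor (co) mode
    # each chip appears as one node (the second core handles message passing);
    # in virtual node (vn) mode both cores run the application.
    chips_per_card = 2
    if mode == "co":
        return chips_per_card * 1
    if mode == "vn":
        return chips_per_card * 2
    raise ValueError("mode must be 'co' or 'vn'")

print(nodes_per_compute_card("co"))  # 2 nodes per compute card
print(nodes_per_compute_card("vn"))  # 4 nodes per compute card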

The node runs the compute node kernel (CNK), a proprietary IBM kernel that is optimized for the Blue Gene/L environment. This kernel is single-user and single-process, supports up to two threads, and does not implement paging (in this configuration, virtual memory is limited to the real memory). It uses a subset of about 40% of the supported Linux system calls. The CNK is also known as the Blue Gene Light Runtime System (BLRTS).

Figure 1-5 Compute (processor) card

1.2.4 I/O card

The I/O card is similar to the compute card, except that these nodes have more memory (currently 2 GB per card, or 1 GB per node) and have a specific use within the system. The I/O card runs a Linux version (a 2.4 uni-processor kernel) known as the mini-control program (MCP). The MCP has been altered from a standard Linux distribution to include support for the Blue Gene/L hardware (including an altered kernel). There is one instance of the MCP per chip, giving two nodes per I/O card.

Important: Even though each chip on the I/O card is a dual-core CPU, only one core is used to run the I/O MCP.



As an application runs on a compute node, its I/O is routed to the outside world through the integrated communication hardware on the I/O card. The I/O card handles I/O operations on behalf of groups of compute nodes. This grouping of compute nodes assigned to an I/O card is known as its processor set, or pset.
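As a rough illustration of the compute-to-I/O ratio (a sketch of ours, assuming an I/O-rich node card with both I/O card slots populated), the pset size can be derived from the card counts given earlier:

# Compute nodes served by each I/O node on one I/O-rich node card
# (assumption: both I/O card slots are populated).
compute_cards_per_node_card = 16
compute_nodes = compute_cards_per_node_card * 2  # one node per chip (co mode)
io_cards_per_node_card = 2                       # 0-2 I/O cards are possible
io_nodes = io_cards_per_node_card * 2            # one MCP instance per chip

print(compute_nodes // io_nodes)  # 8 compute nodes per I/O node in this case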

You can tell the difference between the compute and I/O cards by looking at the edge connectors. The compute card has a total of six edge connectors, one more than the I/O card (see Figure 1-6).

Figure 1-6 I/O card

1.2.5 Node card

The node card is the location into which the compute and I/O cards are inserted. Sixteen compute cards are added to each node card, as well as up to two optional I/O cards. Each card slot inside the node card has its own unique location code, as shown in Figure 1-7 on page 9.

Note: If every I/O node position in the system has a card installed, the system is known as I/O rich.

The node cards are then installed into a midplane, eight in the front and eight in the back. This installation is repeated for the second midplane, giving a total of 32 node cards per rack, each with its own unique location code.



Figure 1-7 Node card block diagram showing location codes

Note: Each of the locations has a specific use. For example:
– J1: Midplane connector
– C0 → CF or J2 → J17: Compute cards
– I0, I1 or J18, J19: I/O cards

The node card has LED status indicators on the front (Figure 1-8 on page 10). The center LED panel shows whether the card is linked to the midplane, whether the IDo link is working, and whether the card itself or something plugged into the card has a fault.



Figure 1-8 Node card center LED panel

Table 1-1 Node card status LEDs

LED name/color   Indication
LINK/Green       ON - There is an active link from the service network into the node card. The card can be monitored and controlled.
                 OFF - The control link is missing. All the other LEDs might contain inaccurate information because the service network connection is used to set them.
4/Green          ON (flashing) - Card is operating.
                 OFF - Card is down. Node cards should only be removed when this LED is OFF.
3/Green          FLASHING - Un-initialized.
2/Green          FLASHING - Un-initialized.
1/Green          FLASHING - Un-initialized.
0/Amber          ON - Card has a fault.
                 OFF - Card has no fault.
                 FLASHING - Card needs human interaction.

Note: If LED 0 is on after everything is initialized, one of the power modules might have failed. In this case, the card is operational but no longer has its redundant power.



In addition to the center LED panel, each RJ45 jack marked 1, 2, 3, or 4 has status LEDs. (Figure 1-8 on page 10 shows two RJ45 jacks, marked 2 and 3; the left green LED in 3 is on.)

Table 1-2 Node card RJ45 LEDs and indications

LED name/color/position                    Indication
RJ45 Gbit ethernet link LED/Green/Left     ON - Active Gbit ethernet link.
                                           OFF - No active Gbit ethernet link.
RJ45 Gbit ethernet link LED/Green/Right    FLASHING - Indicates traffic.
                                           OFF - No traffic.

1.2.6 Service card

The service card provides the major control functions for each midplane:

► Provides an interface to the network that controls the fans and cooling through the I²C network. (See also 1.3.1, “Service network” on page 22.)
► Distributes the clock signal to all node cards on the midplane. The clock card is connected directly to the front of each service card. (See 1.2.8, “Clock card” on page 15.)
► Controls the ethernet switch for the midplane TCP/IP network.
► Controls the boot sequence of each midplane.
► Delivers persistent power to the link cards, so that the torus and collective networks continue to function in the event of a power outage and additional racks in the Blue Gene/L system continue to work. See 1.3.3, “Three dimensional torus (3D torus)” on page 26 and 1.3.4, “Collective (tree) network” on page 26 for more details.
► Connects devices on the midplane to the service node through the service network. Each device that is inserted into the midplane has an IDo chip that acts as a bridge between the network and the other hardware interfaces. (For more details, see 1.3.1, “Service network” on page 22.)

The service card also has its own set of LEDs to help with problem determination, as illustrated in Figure 1-9.

Figure 1-9 Front of the service card



Status LEDs are located on the right-hand side of the service card. Table 1-3 explains these LEDs.

Table 1-3 Service card status LEDs

LED name/color   Indication
Power/Green      ON - 3.3 volt persistent power to the midplane is present. This power is used to run the service card and to run the service path logic on all other link cards, node cards, and fan modules in the midplane. This LED should always be on during normal operations.
                 OFF - 3.3 volt persistent power to the midplane is missing. Service cards should only be removed when this LED is OFF.
4/Green          ON or FLASHING - Card is operating.
                 OFF - Card is down.
3/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
2/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
1/Amber          OFF - Card initialized or no rack power.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.
0/Amber          ON - Card has a fault.
                 OFF - Card has no fault.
                 FLASHING - Card un-initialized or needs to be brought up after a power cycle.

The service card also hosts 100 Mbps and 1 Gbps network ports. The 100 Mbps port is used in the discovery process, and the 1 Gbps port is used for talking to the hardware when the discovery process has completed. For more information about the discovery process, see 1.9.3, “Discovery” on page 43. Each RJ45 jack on the service card has status LEDs that are similar to the ones on the node card (see Table 1-2 on page 11).



1.2.7 Fan card

The cooling of the Blue Gene/L system is achieved through banks of fans. Each fan unit contains three individual fans and two status LEDs on the front (Figure 1-10).

Figure 1-10 Fan unit front (showing LEDs) and fan unit removed from rack

The fans draw the air through the intake plenum and through the rack, and release it out through the output plenum towards the ceiling (Figure 1-11). It is the plenums that give the Blue Gene/L its unique shape.

Figure 1-11 Rack cooling



There are 20 fan units (clusters) per rack: 10 are installed in the front and 10 in the back, each with its own location code. (For the location codes, see Figure 1-2 on page 4.) Figure 1-12 shows the fan clusters installed in the rack.

Figure 1-12 Fans installed in a rack - exhaust plenum removed for clarity

Table 1-4 shows the significance of the fan status LEDs.

Table 1-4 Fan status LEDs

Fan Good (Green, left)   Fan Fault (Amber, right)   Indication
OFF                      OFF                        Not powered; can be in this mode during the first few seconds of rack power up.
ON                       ON                         Autonomous mode; host communication failure.
ON                       OFF                        Fan is working normally.
OFF                      ON                         One or more fans in the module is operating slowly or not at all. The fan assembly is bad and needs to be looked at by hardware support.
FLASHING                 ON or OFF                  Identification mode, used to identify a specific fan unit.



1.2.8 Clock card

The clock card is attached physically to the bottom of the rack and is connected to the midplanes in the system through special cables that connect to the coaxial ports located on the front of each service card (Figure 1-13). Because Blue Gene/L is a massively parallel system, the clocks on all nodes that are involved in running a job must be synchronized (especially for any MPI job). The clock card logic provides the necessary synchronization for all node cards.

Figure 1-13 Clock card

1.2.9 Bulk power modules/enclosure

The bulk power modules (BPMs) are located on top of each rack. There are seven modules per rack (three in the front and four in the back of the rack), connected in an n+1 redundancy scheme. The modules are inserted into the bulk power enclosure (BPE), as shown in Figure 1-14, which also houses the circuit breaker module that is used to turn the rack on and off.

Figure 1-14 Front view of the BPE (three BPMs and a circuit breaker)



The BPMs also have status LEDs, as shown in Figure 1-15.

Figure 1-15 BPM (LEDs on top left)

Table 1-5 details the significance of these LEDs.

Table 1-5 BPM status LEDs

LED Name/Color/Position   Indication
AC Good/Green/Left        ON - AC input to the power module is present and within specification. OFF - There is a problem with the AC supply.
DC Good/Green/Middle      ON - DC input to the power module is present and within specification. OFF - There is a problem with the DC supply.
Fault/Amber/Right         ON - Replace the power module. OFF - Operating normally (as long as AC Good and DC Good are ON).
All LEDs                  OFF - No power.



1.2.10 Link card

The link card is used to connect the different Blue Gene/L internal networks between compute (processor) cards on different midplanes. It allows the Blue Gene/L to expand beyond the physical midplane limit. The link card also provides connectors for X, Y, and Z dimension cabling through the torus cables that are plugged into it (as shown in Figure 1-16). The Z cables go between midplanes that are located in the same rack, the Y cables go between midplanes in the same row, and the X cables go between midplanes in different rows.

Figure 1-16 Direction of X, Y, and Z cabling

Note: The process of cabling a Blue Gene/L is beyond the scope of this redbook and is covered by the installation team when the system is installed. What you need to know is the locations and socket names on the link card (as illustrated in Figure 1-17).



Figure 1-17 Link card that shows the location codes of X, Y, and Z cable connections



1.2.11 Rack, power module, card, and fan naming conventions

Each card is named according to its position in the rack. Figure 1-18, Figure 1-19, and Figure 1-20 depict the way a unique system location is built for each component.

Racks: Rxx
– Rack row (0-F) and rack column (0-F)

Power modules: Rxx-Px
– Power module (0-7): 0-3 left to right facing the front, 4-7 left to right facing the rear
– Note: Position 0 is reserved for the circuit breaker

Midplanes: Rxx-Mx
– Midplane (0-1): 0 = Bottom, 1 = Top

Clock cards: Rxx-K
– One clock card per rack

Figure 1-18 Hardware naming convention, 1 of 3

Fan assemblies: Rxx-Mx-Ax
– Fan assembly (1-A): 1 = Bottom front, 9 = Top front (odd numbers); 2 = Bottom rear, A = Top rear (even numbers)

Link cards: Rxx-Mx-Lx
– Link card (0-3): 0 = Bottom front, 1 = Top front; 2 = Bottom rear, 3 = Top rear

Fans: Rxx-Mx-Ax-Fx
– Fan (0-2): 0 = Tailstock, 2 = Midplane

Service cards: Rxx-Mx-S
– Service card, one per rack

Figure 1-19 Hardware naming convention, 2 of 3

Node cards: Rxx-Mx-Nx
– Node card (0-F): 0 = Bottom front, 7 = Top front; 8 = Bottom rear, F = Top rear

Compute cards: Rxx-Mx-Nx-C:Jxx
– Compute card (J02-J17): even numbers are on the right, odd numbers on the left; lower numbers are toward the midplane, upper numbers toward the tailstock

I/O cards: Rxx-Mx-Nx-I:Jxx
– I/O card (J18-J19): 18 = Right, 19 = Left

Figure 1-20 Hardware naming convention, 3 of 3

1.3 Blue Gene/L networks

Blue Gene/L has five networks that connect the internal components and the system to the outside world. The networks are:

► Service network
– JTAG
– I²C
► Functional Ethernet
► Torus
► Collective (tree)
► Barrier and Global interrupt



1.3.1 Service network

The service network provides a gateway into the Blue Gene/L system, so that the Service Node can control and monitor the hardware. The Service Node communicates with the Service Card using 100 Mbps or 1 Gbps connections. An Ethernet switch located in the Service Card provides two Fast Ethernet and two Gigabit Ethernet ports on the front of the Service Card. The internal switch also provides 20 additional ports, which are used to connect to the IDo chips located on the 16 node cards and four link cards hosted on the midplane (Figure 1-21).

Figure 1-21 Service Network

The hardware on the node and link cards is unable to talk directly to the Service Network. These components communicate using their own protocols, and the IDo chip acts as a bridge between the service network and these communication protocols. The two protocols are:

► JTAG
► I²C



JTAG

Each processor has an interface onto a control system. The protocol used is called the Joint Test Action Group (JTAG) interface, which is an Institute of Electrical and Electronics Engineers (IEEE) 1149.1 standard. The IDo chip converts the 100 Mbps Ethernet bus into 40 JTAG buses: two for each Compute Node (32 in all, one for control of each processor), two for each of the I/O nodes (4 in all, one for control of each processor), and four for the Gigabit Ethernet transceivers that are associated with the I/O nodes. On Blue Gene/L, JTAG provides:

► Hardware control to turn on the CPU and start its clock.
► Hardware diagnostics and debugging.
► Delivery of the microloader to each CPU to start the boot process.
► A mailbox function for recording messages from the MCP or CNK, which are used for recording RAS events.

Figure 1-22 illustrates node card links to the JTAG network.

Figure 1-22 Node card links to the JTAG network



I²C

The Inter-Integrated Circuit (I²C) bus (and protocol) is used to interface with the fan control logic and temperature sensors in the Blue Gene/L system (Figure 1-23). Data such as fan speed and voltages are also reported using this 2-wire serial protocol.

Figure 1-23 Service network showing JTAG and I²C connections

1.3.2 Functional network

The functional network connects the Blue Gene/L I/O nodes, Service Node, Front-end Nodes (FENs), and file system providers together in one network. The functional network:

► Provides system software and application data access to the I/O and Compute nodes, because there is no persistent data storage inside the Blue Gene/L racks. This access is provided through the I/O nodes over the functional network, in the form of NFS and GPFS mounts from external sources. (For more information, see Chapter 5, "File systems" on page 211.)




► Communicates stdin, stdout, and stderr from the Compute Nodes back to where the application was submitted, through the ciod and ciodb processes that are running on the I/O Node and Service Node (as shown in Figure 1-24). For more information, see 1.4, "Service Node" on page 29.

Figure 1-24 Functional network



1.3.3 Three dimensional torus (3D torus)

The three dimensional torus (3D torus) is the first of the three specialized networks implemented on the Blue Gene/L system to enable high performance parallel computing. The 3D torus network is used for general purpose, point-to-point message passing and for multicast operations to a selected class of nodes when an application is running.

Each processor is directly connected to six neighbor processors, two in each dimension (X, Y, and Z: X+1, X-1, Y+1, Y-1, Z+1, Z-1), as shown in Figure 1-25. The tori are implemented in hardware by the node cards, midplanes, link cards, and torus cables.

Figure 1-25 4x4x4 (64) node torus

For more information about the torus network, see Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

1.3.4 Collective (tree) network

The collective network is the second Blue Gene/L specialized network. It is used for one-to-all, all-to-one, and all-to-all communication. It connects all compute nodes in the shape of a tree, and any node can be the tree root (originating point). Any compute node can send messages up or down the tree structure, and that message can stop at any level (see Figure 1-26).

Figure 1-26 Collective network

In a system with as many processors as Blue Gene/L, it is impractical to provide each processor with its own external connection. Instead, I/O nodes with dedicated external connections handle external I/O operations on behalf of groups of compute nodes. This grouping of compute nodes to an assigned I/O node is known as its processor set, or pset.

The I/O nodes are connected to the collective network but do not participate in global messaging; they just handle I/O requests. All compute nodes exist under their associated I/O node in the collective (tree) network.

1.3.5 Global barrier and interrupt network

The third specialized network is the global barrier and interrupt network. A global barrier is a way of synchronizing groups of compute nodes: when a barrier is raised by an application, the nodes wait until everyone reaches a certain position or condition, so that they can all continue. An interrupt is an asynchronous signal that indicates the need for attention (for example, an error condition).

Every node on the Blue Gene/L system is connected to the global barrier and interrupt network through four inputs and four outputs. Each input and output pair forms a channel in the network. Each of these channels can be independently programmed as a global logical "OR" or "AND" statement, depending on the polarity of the signals.

On each node card, the outputs of eight of the compute connections, together with the outputs of the optional I/O connections, are connected to form a "dot-OR" network. Each node card therefore has four "dot-OR" networks. These are then connected into the node card's IDo chip.

The IDo chip on the Node Card samples all the "dot-OR" networks on that card and passes any signals on to other "dot-OR" networks in the same card or on to other node cards.

The global barrier and interrupt network on a midplane is divided into quadrants, each with four node cards. One of these node cards serves as the head of the quadrant.

The outputs of the three non-head node cards are connected together and fed into the IDo chip of the quadrant head along with the quadrant head's "dot-OR" networks. The output of each quadrant head is connected to the IDo chips on the link cards (see Figure 1-27). A link card handles each global interrupt and barrier channel.

The output signal from a node is called the up signal and carries the node's contribution all the way to the top of the partition. The combined signals are then fed back to all the nodes of the partition. This input signal is called the down signal.



Figure 1-27 Global barrier and interrupt network

1.4 Service Node

The Service Node is the control system for the Blue Gene/L racks. It is an IBM System p system (stand-alone or LPAR) running SUSE Linux Enterprise Server 9 (SLES 9) and has connections to the Service and Functional networks. A DB2 database is installed and running on the Service Node, and it contains the current state of the Blue Gene/L as well as jobs and system configuration. The Blue Gene/L specific set of daemons runs on the Service Node, utilizing the database and performing activities such as booting, job submission, and hardware control.

1.4.1 DB2 database

The Service Node runs a DB2 database, which is used to store the following four categories of data about the Blue Gene/L:



► Configuration data: Includes system configuration data, operational data, and historical data.

– System configuration data, which includes a representation of the physical system in database tables. All system hardware and connections are recorded in this category of data and are only altered when a component fails or is replaced.

Each component of the system is represented in a database table with its relevant information. All the entries in the tables are identified by the hardware component's unique serial number, as follows:

Machine: Generated value
Midplane: Service Card's IDo chip License Plate (unique identifier for each chip)
Node Card: Card's IDo chip License Plate
Processor Card: Processor Card's serial number and position in the Node Card (for example, J03, J04) or ECID (Electronic Chip ID, the unique identifier for each chip)
Node: Compute or I/O Node ECID
Link Card: IDo chip License Plate
Service Card: IDo chip License Plate
Link Chip: Chip's ECID
IDo: Chip License Plate

Note: The configuration database is empty when the system is first installed. It is populated by the discovery process as described in 1.9.3, "Discovery" on page 43.

– Operational data, which includes the status of what is currently in use by applications and the status of the jobs themselves. The Midplane Management Control System (MMCS) interacts with this database to schedule jobs and allocate blocks.

– Historical data, which keeps track of hardware changes on the system and shows the history of what has run, when it has run, and on what hardware it has run.

► Environment data: Periodically records the values for all sensors in the system. Voltage levels, fan speeds, and so forth are recorded here.

► RAS data: This data is very important in terms of problem determination. The system records all Reliability, Availability, and Serviceability (RAS) information. It is critical to monitor RAS information as an indicator of system health. For more information about monitoring RAS, see 3.2, "Hardware monitor" on page 114 and "RAS events" on page 126. (A sample query against this data is shown after this list.)

► Diagnostic data: Contains the results from diagnostic tests on the hardware.
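Because these categories live in ordinary DB2 tables, you can inspect them directly with SQL. The following is only a minimal sketch for pulling recent FATAL RAS events: the BGLEVENTLOG table is referenced later in this chapter, but the SEVERITY and EVENT_TIME column names used here are assumptions that might differ on your driver level, so verify them first with db2 "describe table bgleventlog".

source /discovery/db.src
db2 "select * from bgleventlog where severity = 'FATAL' order by event_time desc fetch first 20 rows only"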

1.4.2 Service Node system processes

The following paragraphs describe the Blue Gene/L system processes that run on the Service Node.

mmcs_db_server
The Midplane Management Control System (MMCS) server process is responsible for the management of blocks. Blocks are partitions (sets of compute and I/O nodes) of the Blue Gene/L in which jobs run. The mmcs_db_server process configures blocks at boot time and identifies what physical hardware should be used and in what configuration. It also polls the database for block actions and starts the boot processes.

ciodb
After the blocks are booted, ciodb manages the job launch to the block. It then handles passing back stdin, stdout, and stderr for each job. The ciodb daemon talks to the ciod process running on the I/O node.

idoproxydb
The idoproxydb daemon handles hardware-related communication through the Service Network. It communicates with the IDo chips located on the Service, Link, and Node Cards.

bglmaster
The bglmaster process is the parent process for the other three system processes. It starts all three of the main system processes (idoproxy, mmcs_db_server, and ciodb) and restarts them if a process ends for any reason. It can also provide information about the latest status of the spawned processes.
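If you need a quick check that bglmaster and the processes it spawns are actually running, standard Linux tools are sufficient. This is only a sketch; the exact process names can vary slightly by driver level.

ps -ef | egrep 'bglmaster|mmcs_db_server|ciodb|idoproxy' | grep -v grep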

Additional software
For the Service Node to function, additional software is required. For more information, see Unfolding the IBM eServer Blue Gene Solution, SG24-6686.



1.5 Front-End Node

A Front-End Node is another IBM System p running SUSE Linux Enterprise Server 9 (SLES 9). Multiple Front-End Nodes can be installed. This is where users log in and submit jobs to the Blue Gene/L. IBM compilers and Blue Gene/L compiler extensions are installed so that users can cross-compile code to run on the Blue Gene/L hardware. The user then submits the job to the system through a job scheduler.

The Front-End Node is needed because compilations and handling job I/O from many users can have a severe effect on the performance of the Service Node. You can find more information about compiling and submitting jobs in Chapter 4, "Running jobs" on page 141.

1.6 External file systems

As we explained in 1.3.2, "Functional network" on page 24, the Blue Gene/L system does not have any persistent storage (disk) directly attached. Storage is provided through external file systems such as NFS or GPFS (as illustrated in Figure 1-28). These file systems are mounted by the I/O nodes, and the Compute nodes perform I/O through the collective network. You can find more information in Chapter 5, "File systems" on page 211.



Figure 1-28 Communication in Blue Gene/L

1.7 Blue Gene/L system software

So far we have talked about the physical hardware and the software that runs on the Service Node. Now, we move on to the software that actually runs on the Blue Gene/L hardware (compute and I/O nodes).

1.7.1 I/O node kernel

When a block is booted, each I/O node within that block receives the same boot image and configuration. This is in fact a port of Linux with specific patches to support the Blue Gene/L hardware. This altered kernel is also known as the Mini-Control Program (MCP). It is seen as linuximg in the block definition and has a small specialized shell, called BusyBox, that provides a subset of commands and command options. The I/O node boot scripts run the BusyBox commands.



BusyBox is open source software. You can learn more about BusyBox at:
http://www.busybox.net/

1.7.2 I/O kernel ramdisk

The ramdisk is a stripped-down UNIX® file system which contains just the root user, configuration files, and binaries for the services that need to be started. This file system is mounted by the MCP at boot time. It is customized dynamically, so updates to startup files and services are seen immediately by the I/O nodes when they are next booted. The ramdisk is specified by ramdiskimg in the block definition.

1.7.3 I/O kernel CIOD daemon

The Compute node IO Daemon (ciod) is started on the I/O node by the initialization scripts. It controls applications on the Compute Nodes and provides I/O services to them. It interacts with ciodb on the Service Node. The ciod daemon waits for a connection on TCP port 7000. When it receives the connection, it reads commands from ciodb on the socket. The commands that are sent are:

VERSION - Establish protocol
LOGIN - Set up job information
LOAD - Load application
START - Start running the job
KILL - End running job

After the application is running, ciod also reports output from the Compute Nodes back to ciodb. CONSOLE_TEXT messages are stdout and stderr output, and CONSOLE_STATUS is the return status of the application.
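When a job launch hangs, one quick (unofficial) check is whether ciod on a booted I/O node is reachable on TCP port 7000 from the Service Node. This is only a sketch: the I/O node address shown is hypothetical, and it assumes nc (netcat) is installed; telnet to the same port works equally well.

# Replace 172.16.1.10 with the functional-network address of one of your I/O nodes
nc -z -w 5 172.16.1.10 7000 && echo "ciod is listening" || echo "no response on port 7000"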

1.7.4 Compute Node Kernel

For information about the Compute Node Kernel, see 1.2.3, "Compute (processor) card" on page 7. It is seen as blrtsimg in the block definition.



1.7.5 Microloader

The microloader is used to boot inactive processors into a state where they can receive the CNK (compute nodes) or MCP (I/O nodes). It is seen as mloaderimg in the block definition. Example 1-1 shows the block output and the boot images specified.

Example 1-1 Block definition showing image definitions

mmcs$ list bglblock R000_128
OK
==> DBBlock record
_blockid = R000_128
_numpsets = 0
_numbps = 0
_owner =
_istorus = 000
_sizex = 0
_sizey = 0
_sizez = 0
_description = Generated via genSmallBlock
_mode = C
_options =
_status = F
_statuslastmodified = 2006-03-31 15:08:11.992582
_mloaderimg = /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
_blrtsimg = /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
_linuximg = /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
_ramdiskimg = /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
_debuggerimg = none
_debuggerparmsize = 0
_createdate = 2006-03-06 17:56:20.600708
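To reproduce this listing on your own system, you can open the MMCS console and list one of your own blocks. A minimal sketch, using the console invocation that appears later in 1.10, "Discovering your system" (R000_128 is just the block from the example above; substitute your own block name):

cd /bgl/BlueLight/ppcfloor/bglsys/bin
source /discovery/dbprofile
./mmcs_db_console
mmcs$ list bglblock R000_128
mmcs$ quit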



1.8 Boot process, job submission, and termination

The following steps outline the boot process, job submission, and termination:

1. When a user submits a job using mpirun, llsubmit, or submit_job, depending on the command, a new block might be created according to the user's specification or an existing block might be reused.
2. The selected block in the BGLBLOCK database table is set to C for configure. An entry is made in the BGLJOB table with the status of Q for queued.
3. The mmcs_db_server process continually polls the BGLBLOCK database table looking for blocks in the C for configure state.
4. When the mmcs_db_server process finds a block in the configure state, the boot process is started by changing the status of the block to A for allocating. The BGLEVENTLOG table is monitored for any FATAL RAS events. If any are recorded during the boot process, the block is D for de-allocated and then F for freed.
5. Block information is updated with the block owner, which is set to the user ID that was used to submit the job. The mmcs_db_server process then uses idoproxy to establish connections to the IDo chips on the cards where the block is to be booted. If any IDo connections fail, a FATAL RAS event is created in the BGLEVENTLOG database table.
6. IDo commands are used to initialize the chips on the I/O and compute cards.
7. If the block spans multiple midplanes, Link Training is performed on the link chips. Signal patterns are sent between the chips. The patterns are used so that the chips can lock onto each other and synchronize when they recognize the signal.

Figure 1-29 illustrates these steps.



Figure 1-29 Booting a block - initial steps



After the chips are initialized, the microloader that is specified in the block definition is loaded onto the SRAM area of the I/O and compute cards (see Figure 1-30). A checksum is performed to show that the code has arrived as expected. Each processor is then started and the microloader executes. Failure to load the microloader generates a FATAL RAS event.

Figure 1-30 Microloader distributed to processors

The microloader then loads, over the Service Network, the MCP and ramdisk onto the I/O nodes and the CNK onto the compute nodes, performing checksums on each. The loads are performed in parallel, and a start command is sent to each microloader when completed. The microloader then gives up control and starts the loaded code. The MCP and CNK boot, and the nodes become active entities.

Figure 1-31 MCP and CNK loaded



The next steps in the process are:

1. After the mmcs_db_server process has finished sending the start commands, it sets the status of the block to B for booting.
2. During this process, the nodes begin Tree training. Collective and Gigabit Ethernet drivers are loaded, and the MCP and CNK establish links to each other. Training involves sending signal patterns for the nodes to identify each other:
– A synchronization pattern is sent out on the torus and collective networks (torus across midplanes, collective between I/O and Compute Nodes)
– Nodes take turns in sending and receiving signals
– Each node looks out for a particular bit stream
– After the stream is found, the nodes synchronize
– After every node has found its counterparts, training completes
– Failure in the process generates a FATAL RAS event
3. I/O nodes start the Gigabit Ethernet connection to the functional network and NFS mount /bgl from the Service Node. Initialization scripts start on the I/O nodes, and ciod starts.
4. The ciod daemon sends a message down the collective network and waits for the Compute Nodes to respond.
5. When the nodes respond, ciod generates the RAS event CIOD initialized.
6. The mmcs_db_server process waits for all I/O nodes to generate CIOD initialized, and then changes the block to I for initialized.

Figure 1-32 illustrates this process.



Figure 1-32 Initializing the block


The next steps in this process are:

1. When the block is set to I for initialized, mpirun changes the status of the job in the BGLJOB table from Q for queued to S for starting.
2. Then, ciodb polls for jobs in status S. When it finds one, it establishes a connection to ciod on the I/O Node on port 7000. ciod receives the LOAD command from ciodb and sends the application over the collective network to the Compute Nodes.
3. When all Compute Nodes have received the application, the START command is issued. The BGLJOB STATUS is set to R for running.
4. Then, ciodb forwards STDIN to ciod, and ciod forwards STDOUT and STDERR back to ciodb. ciod handles the file I/O on behalf of the Compute Nodes. ciodb continues to poll ciod for job completion, which happens when the job completes or is killed.
5. Finally, ciodb marks the BGLJOB STATUS as T for terminated. ciodb then closes the connection to ciod.

Figure 1-33 illustrates these steps.



Figure 1-33 Starting the job

For a list of block and job states, see Table 4-3 on page 162 and Table 4-4 on page 162.

Logs of the various system processes and I/O nodes are described in 2.2.6, "Control system server logs" on page 61.
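If you want to watch these state transitions yourself, you can poll the two tables directly from DB2. The following is only a sketch: BGLBLOCK and BGLJOB are the table names used in the steps above, but the exact column names (BLOCKID, JOBID, STATUS) are assumptions — confirm them with db2 "describe table" on your system.

source /discovery/db.src
# Block states (C, A, B, I, ...)
db2 "select blockid, status from bglblock"
# Job states (Q, S, R, T, ...)
db2 "select jobid, status from bgljob"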




1.9 System discovery

System discovery is the process of finding all the hardware and communication links in a Blue Gene/L system and initializing them. The discovery process is also responsible for populating and updating the configuration database held on the Service Node. If a component is replaced, it is rediscovered and placed in the database. The old entries are marked as missing, allowing tracking of replaced hardware. For discovery to work, you need to start several processes. We discuss these processes in this section.

1.9.1 The bglmaster process

The bglmaster process is the master process that starts the Blue Gene/L system processes, which provide an interface for talking to the hardware after it is discovered, through the idoproxy daemon.

1.9.2 SystemController

When the racks are powered on, they are in an un-initialized state. In an un-initialized state, we are unable to communicate with the system, because the IDo chips do not have an IP address; we are, therefore, unable to communicate over the Service Network. To discover the system, we use the SystemController process. This process finds the IDo chips and allocates to each of them an IP address that was predefined at installation time in the Service Node database. This operation is done over the 100 Mbps Ethernet connections on the Service Card.

1.9.3 Discovery

After the SystemController has initialized the IDo chips, the Discovery process itself can start to talk to them through the 1 Gbps network connections on each Service Card. There are separate Discovery processes for each row of racks in your environment. Discovery turns on the hardware, finds the components, and populates the relevant database tables (or updates them in the case of a power cycle or hardware change, because the entries are already present). If a device fails to respond to discovery, it is marked as missing. When completed, control of each component is passed over to the idoproxy daemon.

1.9.4 PostDiscovery

After Discovery has populated the database, PostDiscovery checks that the data is valid and cleans the database. It then adds location information for each component to the tables.



1.9.5 CableDiscovery

CableDiscovery is run after the Discovery process is completed. As the name suggests, it looks at the current configuration in the database and discovers Link Cards and the connected data cables. This process only needs to be performed on initial system installation or when a data cable or Link Card is replaced.

1.10 Discovering your system

In this section we provide a procedure that you can use to discover information about your Blue Gene/L system. Log on to your Service Node and execute the following steps:

1. Start the idoproxy.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start

2. Start SystemController.
cd /discovery
./SystemController start

To view SystemController's messages, issue this command:
./SystemController monitor

3. Start a discovery process for each row of BGL racks.
./Discovery0 start //This is for the first row of BGL racks.
./Discovery1 start //This is for the second row of BGL racks.
...

To view Discovery0 messages:
./Discovery0 monitor

To view Discovery1 messages:
./Discovery1 monitor
...

4. Start PostDiscovery.
./PostDiscovery start

To view PostDiscovery messages:
./PostDiscovery monitor

5. Use DB2 queries or the Web page (if available) to verify that all hardware reports in, as described in 2.2, "Identifying the installed system" on page 57. (A sample verification query is shown after these steps.)

6. After you have checked all information, stop discovery for each of the racks:
./Discovery0 stop //This is for the first row of BGL racks.
./Discovery1 stop //This is for the second row of BGL racks.

7. Stop PostDiscovery.
./PostDiscovery stop

8. Restart bglmaster to restart the system processes.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster restart

9. Set the status of the Link Cards to good, ready for CableDiscovery.
source /discovery/dbprofile
./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ pgood_linkcards all
OK
mmcs$ quit

10. Start CableDiscovery.
./bglmaster stop
cd /discovery
./CableDiscoveryAll

To view CableDiscovery messages:
./CableDiscovery monitor

CableDiscovery should end with:
DiscoverCables ended

11. Start the Blue Gene/L system processes.
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start
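As a quick way to carry out the verification in step 5 from the command line, you can count what discovery has populated. This is a sketch only; the BGLNODECARD table and its STATUS column appear later in this chapter (see Example 1-4), where 'M' marks missing hardware.

source /discovery/db.src
# Node cards that discovery found and that are not marked missing
db2 "select count(*) from bglnodecard where status <> 'M'"
# Any hardware still marked missing needs investigation
db2 "select location, status from bglnodecard where status = 'M'"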



1.10.1 Discovery logs

SystemController, Discovery, PostDiscovery, and CableDiscovery all keep logs of their activity. These logs are created in /bgl/BlueLight/logs/BGL. Example 1-2 presents the activity logs as we observed them on our system.

Example 1-2 Discovery process logs

supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep CurrentLog
lrwxrwxrwx 1 root root 58 Mar 31 14:04 CurrentLog-Discovery0 -> /bgl/BlueLight/logs/BGL/Discovery0-2006-03-31-14:04:28.log
lrwxrwxrwx 1 root root 61 Mar 31 14:04 CurrentLog-PostDiscovery -> /bgl/BlueLight/logs/BGL/PostDiscovery-2006-03-31-14:04:43.log
lrwxrwxrwx 1 root root 64 Mar 31 14:04 CurrentLog-SystemController -> /bgl/BlueLight/logs/BGL/SystemController-2006-03-31-14:04:32.log
supportsn:/bgl/BlueLight/logs/BGL # ls -l | grep Cable
-rwxrwxr-x 1 root bgladmin 16493 Mar 8 12:04 CableDiscoveryAll-2006-03-08-12:03:22.log
-rw-r--r-- 1 root root 6294 Mar 28 14:09 CableDiscoveryAll-2006-03-28-14:07:47.log
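The CurrentLog-* symlinks shown above always point at the newest log for each process, so a simple way to follow discovery while it runs is to tail them:

cd /bgl/BlueLight/logs/BGL
tail -f CurrentLog-SystemController CurrentLog-Discovery0 CurrentLog-PostDiscovery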

1.11 Service actions

When your system has a problem that requires a part to be replaced or that requires the system to be shut down, you need to run a service action. A service action prepares the specified location by powering it down in a controlled manner. Parts can then be removed from the rack, or the rack itself can be turned off through the circuit breaker in the BPE. When the service action is complete, we can then close the service action, which turns the hardware back on and brings the location back into production. Two commands are provided to allow this to be done: PrepareForService and EndServiceAction.

1.11.1 PrepareForService

The syntax for this command is:

PrepareForService LocationString [Verbose] [FORCE]

Where:

► LocationString is the location string of the part/card that needs to be serviced. Location codes are shown in 1.2.1, "Racks" on page 3.
► Verbose provides extra command output.
► FORCE is a keyword that indicates that a new Service Action should be started even if there is already an existing active service action for the resource.

Supported locations are:

► R00-M1-N0: a specific NodeCard
► R00-M1-N: all NodeCards in the Midplane
► R11-M0: all Node and LinkCards in the Midplane
► R37-M0-Ax: an individual Fan Module
► R01: a whole rack
► R20-P1: a Bulk Power Module
► R00-M1-L3: a LinkCard. This card requires that the tool turns off ALL Link and NodeCards in the neighborhood of the specified LinkCard. The neighborhood is defined as those cards that are in either the same row or the same column as the specified LinkCard.

Figure 1-18 on page 19, Figure 1-19 on page 20, and Figure 1-20 on page 21 illustrate these locations.

At the end of the PrepareForService command, you are given a service action ID that must be noted for later use to return the hardware to production. Example 1-3 shows how PrepareForService is used for a compute node replacement.

Example 1-3 PrepareForService on a Node Card

bglsn:/discovery # ./PrepareForService R00-M0-N0
Logging to /bgl/BlueLight/logs/BGL/PrepareForService-2006-03-31-11:25:09.log
Mar 31 11:25:10.169 EST: PrepareForService started
Mar 31 11:25:35.363 EST: Freed any blocks using R000
Mar 31 11:25:43.288 EST: Disabled this NodeCard's ethernet port on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), Port (11)
Mar 31 11:25:43.681 EST: Marked NodeCard (203231503833343000000000594c31304b34323630304b35) missing
Mar 31 11:25:43.711 EST: Deleted node hardware attrs for 34 nodes
Mar 31 11:25:43.712 EST: Card has been successfully powered off!
Mar 31 11:25:43.736 EST: Proceed with service on part (mLctn(R00-M0-N0), mCardSernum(203231503833343000000000594c31304b34323630304b35), mLp(FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A), mIp(10.0.0.25), mType(4))
Mar 31 11:25:43.736 EST:
Mar 31 11:25:43.750 EST: Service Action ID 19
Mar 31 11:25:43.755 EST: MyShutdownHook - Exiting PrepareForService
Mar 31 11:25:43.756 EST: +++ This logfile is closed +++

As Example 1-3 shows, the Node Card is turned off and marked as missing in the database, together with the 34 nodes that are plugged into it (32 Compute Nodes and 2 I/O Nodes). Example 1-4 confirms this by querying the database.

Example 1-4 Checking hardware is removed from the Service Node database

bglsn:/discovery # db2 "select location,status from bglnodecard where location = 'R00-M0-N0'"

LOCATION                         STATUS
-------------------------------- ------
R00-M0-N0                        M

1 record(s) selected.

bglsn:/discovery # db2 "select location,status from bglprocessorcard where location like 'R00-M0-N0%'"

LOCATION                         STATUS
-------------------------------- ------
R00-M0-N0-C:J02                  M
R00-M0-N0-C:J03                  M
R00-M0-N0-C:J04                  M
R00-M0-N0-C:J05                  M
R00-M0-N0-C:J06                  M
R00-M0-N0-C:J07                  M
R00-M0-N0-C:J08                  M
R00-M0-N0-C:J09                  M
R00-M0-N0-C:J10                  M
R00-M0-N0-C:J11                  M
R00-M0-N0-C:J12                  M
R00-M0-N0-C:J13                  M
R00-M0-N0-C:J14                  M
R00-M0-N0-C:J15                  M
R00-M0-N0-C:J16                  M
R00-M0-N0-C:J17                  M
R00-M0-N0-I:J18                  M

17 record(s) selected.



There is a dedicated table in the database for the service actions. Actions on the same component cannot be done until the previous service action is closed. You can obtain open service actions by looking for entries in BGLSERVICEACTION with a status of O (for Open), as shown in Example 1-5.

Example 1-5 Showing open service actions

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # source /discovery/db.src

Database Connection Information
Database server        = DB2/LINUXPPC 8.2.3
SQL authorization ID   = BGLSYSDB
Local database alias   = BGDB0

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where status='O'"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        O

1 record(s) selected.

If for some reason you end up with a service action for a component that is not really disabled for service, you can complete it by manually updating the database entry for that service action to C (for Closed), as shown in Example 1-6. This situation can happen if a service action is initialized, but the system is then brought up by another method, such as discovery, rather than by using EndServiceAction.

Example 1-6 Completing a service action manually

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where status='O'"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        O

1 record(s) selected.

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "update bglserviceaction set STATUS='C' where ID=19"
DB20000I The SQL command completed successfully.

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2 "select ID,LOCATION,STATUS from bglserviceaction where ID=19"

ID          LOCATION                         STATUS
----------- -------------------------------- ------
         19 R00-M0-N0                        C

1 record(s) selected.

PrepareForService logs its invocations to /bgl/BlueLight/logs/BGL. The naming convention for log files begins with "PrepareForService" followed by the date and time (as shown in Example 1-7).

Example 1-7 PrepareForService logs

bglsn:/bgl/BlueLight/logs/BGL # ls PrepareForService*
PrepareForService-2006-03-16-10:45:40.log
PrepareForService-2006-03-30-14:06:40.log
PrepareForService-2006-03-16-10:46:09.log
PrepareForService-2006-03-30-14:07:07.log
PrepareForService-2006-03-16-10:47:01.log

1.11.2 EndServiceAction

The syntax for this command is:

EndServiceAction id [Verbose] [Wait / NoWait]

Where:

► id is the service action ID that was reported by PrepareForService.
► Wait indicates that the command should wait for the component and subcomponents to become active.
► NoWait indicates that the command should return after turning on the component, without waiting for it and its subcomponents to become active in the database.

Example 1-8 shows the return to service of the node card that we disabled in 1.11.1, "PrepareForService" on page 46.

Before starting EndServiceAction, you need to start the systemcontroller, discovery, and postdiscovery processes. These processes are required when the hardware comes back online, because the hardware needs to be rediscovered and marked available instead of missing in the database. When all of the expected hardware is back online, you can stop the systemcontroller, discovery, and postdiscovery processes. A consolidated sketch of this sequence follows.
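Pulling together commands that appear elsewhere in this chapter, the sequence around EndServiceAction typically looks like the following sketch (service action ID 19 is the one from Example 1-3; substitute your own ID, and repeat the Discovery commands for each row of racks):

cd /discovery
./SystemController start
./Discovery0 start
./PostDiscovery start
./EndServiceAction 19
# After all of the expected hardware reports back as active:
./Discovery0 stop
./SystemController stop
./PostDiscovery stop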



Example 1-8 EndServiceAction on a Node card

bglsn:/discovery # ./EndServiceAction 19
Mar 31 13:57:41.376 CST: EndServiceAction started
Mar 31 13:57:55.069 CST: Disabled this NodeCard's ethernet port on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA), Port (11)
Mar 31 13:57:55.453 CST: Marked NodeCard (203231503833343000000000594c31324b35323135303148) missing
Mar 31 13:57:55.500 CST: Deleted node hardware attrs for 36 nodes
Mar 31 13:57:55.517 CST: Card has been successfully powered off!
Mar 31 13:57:55.596 CST: Powered Off NodeCard (203231503833343000000000594c31324b35323135303148)
-snip-
Mar 31 13:59:21.252 CST: Enabled all of the NodeCard ethernet ports on ServiceCard (R00-M0-S, FF:F2:9F:15:1E:55:00:0D:60:EA:E1:AA)
Mar 31 13:59:21.302 CST: Changed Midplane R00-M0's status from 'E' to 'A'
Mar 31 14:00:21.363 CST: @ Still waiting for 16 NodeCards in R00-M0 to become active
Mar 31 14:07:21.498 CST:
Mar 31 14:08:21.528 CST: All of R00-M0's NodeCards are active
Mar 31 14:09:21.621 CST: @ Still waiting for 16 compute Processor cards in R00-M0 to become active
Mar 31 14:11:21.669 CST: All of R00-M0's NodeCards are active
Mar 31 14:11:21.684 CST: All of R00-M0's compute Processor cards are active
Mar 31 14:11:21.706 CST: All of R00-M0's compute Nodes are active
Mar 31 14:11:21.707 CST:
Mar 31 14:11:35.531 CST: Enabling the PortB on each of this Midplane's LinkCards - to indicate that PortA is powered on
Mar 31 14:11:36.537 CST: Enabling this Midplane's Port B output drivers
Mar 31 14:11:37.726 CST: Enabling this Midplane's Port A input receivers
Mar 31 14:11:38.946 CST: Ended Service Action Id 19 for R00-M0-N0
Mar 31 14:11:38.952 CST: MyShutdownHook - Exiting EndServiceAction
Mar 31 14:11:38.955 CST: +++ This logfile is closed +++



EndServiceAction logs its invocations to /bgl/BlueLight/logs/BGL. The naming convention for log files starts with "EndServiceAction" followed by the date and time, as shown in Example 1-9.

Example 1-9 EndServiceAction logs

bglsn:/bgl/BlueLight/logs/BGL # ls EndServiceAction*
EndServiceAction-2006-03-16-11:06:08.log
EndServiceAction-2006-03-30-14:30:27.log
EndServiceAction-2006-03-16-11:15:04.log
EndServiceAction-2006-03-30-14:31:16.log
EndServiceAction-2006-03-16-11:49:05.log
EndServiceAction-2006-03-30-14:31:45.log

1.12 Turning off the system

When turning off your Blue Gene/L system, you need to be careful and ensure that you do so in a controlled manner. Simply switching off racks can leave the system in a state that might make it difficult to get it back to operational. To turn off your system, follow these steps:

1. Prepare each individual rack for service using the following commands (a scripted sketch follows these steps):
cd /discovery
./PrepareForService Rxx

This preparation has the effect of shutting down all the Blue Gene/L hardware. Repeat the command, changing xx for each rack in your system. After the PrepareForService command is finished, note the serviceactionid that is displayed at the end of the command output for each command.

2. Stop the Control System using the following commands:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster stop

3. Stop DB2® on the Service Node using the following commands:
su - bglsysdb
db2 force application all
db2 terminate
db2stop

4. Shut down and turn off the Service Node.

5. Turn off the racks.
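If you have many racks, step 1 can be scripted. This is only a sketch, assuming a small system with racks R00, R01, R10, and R11 (substitute your own rack list); you still need to record the Service Action ID that each PrepareForService invocation prints:

cd /discovery
for rack in R00 R01 R10 R11; do
    ./PrepareForService $rack | tee -a /tmp/shutdown-service-actions.log
done
# Review /tmp/shutdown-service-actions.log and note each "Service Action ID"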



1.13 Turning on the system

Having turned off your system properly, you can turn it back on using these steps:

1. Turn on the racks.

2. Turn on and boot the Service Node.

3. Check that DB2 has started. If it has not, start it with the following commands:
su - bglsysdb
db2start

For more details on getting DB2 started and starting it automatically when the system boots, see 2.3.4, "Check that DB2 is working" on page 87.

4. Start the processes that rediscover your hardware and initialize it, using the following commands:
cd /bgl/BlueLight/ppcfloor/bglsys/bin
./bglmaster start
cd /discovery
./SystemController start
./Discovery0 start

Repeat the discovery command for each row of racks in your system, changing the last digit to each row number:
./Discovery1 start
./Discovery2 start

5. Start the PostDiscovery process to check the discovered configuration and add position information.
./PostDiscovery start

6. End the service action that was used to shut down the system.
cd /discovery
./EndServiceAction X

Repeat the EndServiceAction for each rack, using the previously saved serviceactionid.

7. Use DB2 queries or a Web page (if available) to verify that all hardware reports in, as described in 2.2, "Identifying the installed system" on page 57.

8. After the last EndServiceAction completes and all the hardware is shown, stop all the processes previously launched for the system discovery using the following commands:
./SystemController stop
./Discovery0 stop

Repeat the discovery command for each row of racks in your system, changing the last digit to each row number.
./Discovery1 stop
./Discovery2 stop

9. Stop PostDiscovery.
./PostDiscovery stop

10. Restart the Blue Gene/L system processes, so that the system can go back into production.
./bglmaster restart



Chapter 2. Problem determination methodology

This chapter discusses how to identify the various components in an IBM System Blue Gene Solution system. It also includes a list of core Blue Gene/L sanity checks that you can use to ensure that your system is working properly.

This chapter also provides a problem determination methodology that can help you identify the cause of Blue Gene/L system problems. Following the methodology helps you quickly find the issue, identify the Blue Gene/L system, and identify the problem area so that you can confirm and fix the problem.



2.1 Introduction

Whenever you have to work with a complex system, it is essential that you obtain the actual system configuration. This chapter provides a list of tasks that enable you to determine the system configuration for your Blue Gene/L system. We also provide a set of checks to ensure that the components in your core Blue Gene/L system are functioning correctly, and we present what we consider the core of the Blue Gene/L system to be.

Due to the numerous components in a Blue Gene/L system, we consider that the best way to approach a problem is to separate it into different problem areas. The methodology that we discuss here provides a process for this approach that includes three distinct areas:

► Defining the problem.
► Identifying the Blue Gene/L system.
► Identifying the problem area.

This approach allows someone with little Blue Gene/L experience to assess quickly where the problem lies and, perhaps more importantly, to determine whether there is a problem at all. Such problem determination is the key to maintaining a successful running system. The methodology then points to particular check lists that show how to practically determine the problem in each of the areas.

In this book we discuss the Blue Gene/L system in two different ways:

1. Core Blue Gene/L
– Blue Gene/L racks
– Service Node, including the Blue Gene/L processes, NFS, and DB2
– Network switches for the Service and Functional Networks

2. Complex Blue Gene/L
– Core Blue Gene/L
– Front-End Nodes
– MPI
– GPFS
– LoadLeveler

Note: For our discussion, we assume that readers already have a working knowledge of the Linux operating system and TCP/IP. This knowledge is a prerequisite for understanding the environment that we use and the tools that we present.



2.2 Identifying the installed system

You can determine the Blue Gene/L system configuration with a combination of the following tools and actions:

► Blue Gene/L Web interface (BGWEB)
► DB2 select statements of the DB2 database on the SN
► Standard operating system (Linux) commands
► Visually inspecting the hardware

We discuss these tools and actions in the following paragraphs.

2.2.1 Blue Gene/L Web interface (BGWEB)<br />

BGWEB is installed on the Service Node (SN). To connect to this service, point<br />

your browser to the following URL:<br />

http:///web/index.php<br />

Figure 2-1 gives an example of the BGWEB home page.<br />

Figure 2-1 BGWEB home page on the SN<br />

Chapter 2. <strong>Problem</strong> determination methodology 57


The Configuration section displays the structure of the system and expands further into more detail for each physical component of the core Blue Gene/L system.

Note: You can run BGWEB from a Front-End Node (FEN). However, this is not supported officially. BGWEB requires a DB2 client to interface with the DB2 database on the SN and a Web server configured on the FEN.

2.2.2 DB2 select statements of the DB2 database on the SN

You can run SQL select statements to query information that is stored in the DB2 database. Select statements are useful for querying the number of different components in the system. Example 2-1 shows a DB2 select statement that displays the number of compute node cards in a system.

Example 2-1 DB2 select command displaying the number of compute nodes

# db2 "select count(*) num_of_compute_node_cards from BGLPROCESSORCARD
  where ISIOCARD = 'F' and STATUS <> 'M'"

NUM_OF_COMPUTE_NODE_CARDS
-------------------------
                       64

  1 record(s) selected.

In this example, the user needs to create the appropriate execution environment by sourcing /discovery/db.src. This script sets up the default database environment and connects to the database (to run DB2 commands). Running this script should produce an output similar to that shown in Example 2-2.

Example 2-2 Sourcing the /discovery/db.src file

bglsn:/bgl/BlueLight/logs/BGL # source /discovery/db.src

   Database Connection Information

 Database server        = DB2/LINUXPPC 8.2.3
 SQL authorization ID   = BGLSYSDB
 Local database alias   = BGDB0

Note: The /discovery/db.src script is a copy of the /bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/db.src file. In addition, the directory /bgl/BlueLight/V1R2M1_020_2006-060110 represents the driver version that we used for this redbook.
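If you run this kind of query often, it can be convenient to wrap the environment setup and a few of the counts used in this chapter into one small script. The following is a minimal sketch, assuming the /discovery/db.src location and the database views and tables shown in the examples in this chapter; adapt the paths and views to your installation.

#!/bin/bash
# Hypothetical helper: source the DB2 environment and print a few
# component counts used throughout this chapter.
source /discovery/db.src

echo "Compute node cards:"
db2 "select count(*) from BGLPROCESSORCARD where ISIOCARD = 'F' and STATUS <> 'M'"

echo "I/O node cards:"
db2 "select count(*) from BGLPROCESSORCARD where ISIOCARD = 'T' and STATUS <> 'M'"

echo "Active racks (base partitions):"
db2 "select BPID, STATUS from BGLBASEPARTITION where STATUS <> 'M'"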



2.2.3 Network diagram

The system administrator needs to understand the network configuration of the Blue Gene/L system. (For information about the functions of both the service and functional networks, see 1.3, “Blue Gene/L networks” on page 21. These networks are required in both a core and a complex Blue Gene/L system configuration.) It is important to have an up-to-date diagram of the network that connects the Blue Gene/L system. This diagram should include the IP addresses of the system and network switches.

2.2.4 Service Node

There can be only one SN per installed Blue Gene/L system. (Refer to 1.4.2, “Service Node system processes” on page 31 for a more detailed description.) You can check the IP configuration for the service and functional networks on the SN in the following way:

► Service network

  a. Using the Blue Gene/L Web interface, at the BGWEB home page, click Database Browser at the bottom of the page.

  b. Click the TBGLIDOPROXY database table link. A new page is displayed showing the idoproxy configuration as shown in Figure 2-2.

Figure 2-2 The DB2 database table TBGLIDOPROXY (Service network)

  c. Use the command line on the SN as shown in Example 2-3.

Example 2-3 Showing the service network using the DB2 CLI

# db2 "select PROXYID,PROXYIPADDRESS from TBGLIDOPROXY"

PROXYID          PROXYIPADDRESS
---------------- ------------------------------------------------------
BlueGene1        10.0.0.1

  1 record(s) selected.

Then check the IP range in /etc/hosts:

# grep 10.0.0.1 /etc/hosts
10.0.0.1        bglsn_sn.itso.ibm.com bglsn_sn

► Functional network

  a. Using the Blue Gene/L Web interface, at the BGWEB home page, click Database Browser at the bottom of the page.

  b. Click the TBGLIPPOOL database table link. A new page is displayed showing the IP range for the I/O nodes, as shown in Figure 2-3.

Figure 2-3 Data from DB2 database table TBGLIPPOOL

  c. Using the command line on the SN, you can obtain the addresses for the functional network that are stored in the DB2 database and check the associated IP labels in /etc/hosts, as shown in Example 2-4.

Example 2-4 Functional network info using the DB2 CLI

# db2 "select IPADDRESS from TBGLIPPOOL"

IPADDRESS
-----------------------------------------------------------------------
172.30.2.1
172.30.2.10
..snip..

# grep 172.30.2 /etc/hosts
172.30.2.1      ionode1
172.30.2.2      ionode2
172.30.2.3      ionode3
..snip..

  d. You can then compare these IP addresses to the output from /sbin/ip ad, as shown in Example 2-5.

Example 2-5 Network interface configuration on the SN

# ip ad
..snip..
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0
    inet6 fe80::20d:60ff:fe4d:28ea/64 scope link
       valid_lft forever preferred_lft forever
..snip..
5: eth3: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff
    inet 172.30.1.1/16 brd 172.30.255.255 scope global eth3
    inet6 fe80::211:25ff:fe08:3090/64 scope link
       valid_lft forever preferred_lft forever

2.2.5 Front-End Nodes

The Blue Gene/L system can have one or more Front-End Nodes (FENs). There does not seem to be any way to identify FENs apart from knowing that they are separate components within the Blue Gene/L configuration. FENs are nodes where user jobs are submitted through LoadLeveler or mpirun. So, the only way to identify the FENs is to know the components from which the jobs are submitted.

These components should be included in the topology diagram. A possible check would be to see whether MPI support and LoadLeveler are installed using the rpm -qa | grep mpi command. If LoadLeveler is installed, all the FENs should be listed in the llstatus output, as shown in Figure 4-12 on page 172. Refer to 1.5, “Front-End Node” on page 32 for more information.
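The following is a minimal sketch of such a check, run on a node that you suspect is an FEN. It only combines the rpm and llstatus checks mentioned above; the package name patterns are illustrative and may differ on your installation.

#!/bin/bash
# Rough FEN identification: look for MPI support and LoadLeveler,
# and, if LoadLeveler is present, list the machines it knows about.
echo "MPI-related packages:"
rpm -qa | grep -i mpi

echo "LoadLeveler packages (pattern is illustrative):"
rpm -qa | grep -i loadl

if command -v llstatus >/dev/null 2>&1; then
    echo "LoadLeveler machine list (FEN candidates):"
    llstatus
else
    echo "llstatus not found; LoadLeveler does not appear to be installed here."
fi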

2.2.6 Control system server logs

There are a number of logs that are generated on the SN. These logs are called the control system server logs. Table 2-1 on page 62 through Table 2-8 on page 64 show the logs that are generated and their purpose. There are also logs generated for diagnostics. We discuss these logs further in 3.4, “Diagnostics” on page 131.

The default location of the control system server logs is the /bgl/BlueLight/logs directory. In this directory, there are two directories:

► /bgl/BlueLight/logs/BGL

  This directory includes all the logs for BGLMaster and its child daemons. There is also a log for each I/O node in the Blue Gene/L system. These logs are written by the I/O nodes through the /bgl NFS mount because the I/O nodes do not have any persistent storage.

► /bgl/BlueLight/logs/diags

  There is a time-stamped directory for every diagnostic run on the Blue Gene/L system. The directory looks similar to:

  /bgl/BlueLight/logs/diags/2006-0307-17:40:08_R000
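To get a quick overview of these logs on a running system, something like the following can help. This is a minimal sketch that assumes the default log locations and the *-current.log naming convention described in the tables that follow.

#!/bin/bash
# Quick look at the control system server logs on the SN.
cd /bgl/BlueLight/logs/BGL || exit 1

echo "Current log symbolic links for the control system daemons:"
ls -l *-current.log 2>/dev/null

echo "Ten most recently updated logs (daemon and I/O node logs):"
ls -lt | head -11      # first line of ls -lt is the 'total' line

echo "Most recent diagnostics runs:"
ls -dt /bgl/BlueLight/logs/diags/* 2>/dev/null | head -5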

Table 2-1 Control system log for BGLMaster

BGLMaster
Name of log: <hostname>-bglmaster-current.log, a symbolic link to <hostname>-bglmaster-<timestamp>.log
Example: bglsn-bglmaster-2006-0330-14:56:20.log
Description: Shows the initialization of BGLMaster and its child daemons, which involves parsing the /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init file. Also logs the status of its child daemons.

Table 2-2 Control system log for idoproxy

idoproxy (BGLMaster child daemon)
Name of log: <hostname>-idoproxydb-current.log, a symbolic link to <hostname>-idoproxydb-<timestamp>.log
Example: bglsn-idoproxydb-2006-0330-14:56:20.log
Description: Shows the initialization of idoproxy, the complete name of the process, the log generated, the Blue Gene/L driver version, the IP address, and the ports it opens. Also provides information about each block that is booted in the Blue Gene/L system.

Table 2-3 Control system log for ciodb

ciodb (BGLMaster child daemon)
Name of log: <hostname>-ciodb-current.log, a symbolic link to <hostname>-ciodb-<timestamp>.log
Example: bglsn-ciodb-2006-0330-14:56:20.log
Description: Shows the initialization of ciodb, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides useful information about submitted jobs, including the Blue Gene/L job ID and the I/O nodes (with IP addresses) used for the job.

Table 2-4 Control system log for mmcs_db_server

mmcs_db_server (BGLMaster child daemon)
Name of log: <hostname>-mmcs_db_server-current.log, a symbolic link to <hostname>-mmcs_db_server-<timestamp>.log
Example: bglsn-mmcs_db_server-2006-0330-14:56:20.log
Description: Shows the initialization of mmcs_db_server, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides very useful information associating the booted block with the I/O node location codes, their log files, host names, and IP addresses. It also provides useful runtime debug data.

Table 2-5 Control system log for monitorHW

monitorHW (BGLMaster child daemon)
Name of log: <hostname>-monitorHW-current.log, a symbolic link to <hostname>-monitorHW-<timestamp>.log
Example: bglsn-monitorHW-2006-0323-15:47:47.log
Description: Shows the initialization of monitorHW, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides information about the monitoring that has taken place.

Table 2-6 Control system log for perfmon

perfmon (BGLMaster child daemon)
Name of log: <hostname>-perfmon.pl-current.log, a symbolic link to <hostname>-perfmon.pl-<timestamp>.log
Example: bglsn-perfmon.pl-2006-0329-18:24:38.log
Description: Shows the initialization of perfmon, the log generated, the complete name of the process, and the Blue Gene/L driver version. Also provides performance data on running Blue Gene/L jobs. This daemon did not seem to be used at the time this redbook was written.

Table 2-7 Control system log for I/O nodes

I/O node log (one for each I/O node in the Blue Gene/L system)
Name of log: <rack>-<midplane>-<node card>-I:<I/O chip location>.log
Example: R00-M0-N0-I:J18-U01.log
Description: Shows the startup process of the I/O node, including the loading of the MCP, the startup scripts with their output, and the partition the I/O node is associated with at boot. When the I/O node is shut down, all the messages from the shutdown scripts and other shutdown messages are displayed. These files are appended to for every boot and shutdown.

Table 2-8 The updateSchema.pl script log

updateSchema.pl
Name of log: updateSchema-<timestamp>.log
Example: updateSchema-2006-04-01-13:18:29.log
Description: Shows updateSchema.pl updating the schema on the Blue Gene/L database from the new driver version. This is only done when a driver update is applied.

2.2.7 File systems (NFS and GPFS)

Depending on your Blue Gene/L system configuration, you might have more than one file system available to the I/O nodes. Although Network File System (NFS) is required for the system to function (refer to 1.8, “Boot process, job submission, and termination” on page 36), General Parallel File System (GPFS) can also be used by the I/O nodes for reading and writing a job's data. You need to be aware of the components that serve NFS and GPFS on the functional network:

► NFS

  The SN exports (NFS) the /bgl directory over the functional network. Refer to 2.3.6, “Check the NFS subsystem on the SN” on page 90 for more information.

  The I/O nodes can also mount NFS file systems that are exported from servers other than the SN or FENs. In this case, neither the SN nor the FENs are required to mount these file systems. However, the SN needs to know about them because it has to pass the information down to the I/O nodes using the /bgl/dist/etc/rc.d/rc3.d/S10sitefs file. If this file exists, you should check for a line that looks similar to the one shown in Example 2-6.

Example 2-6 Additional NFS file systems to be mounted by the I/O nodes

bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs
..
# Mount a scratch file system...
mountSiteFs $SITEFS /bgscratch /bgscratch tcp,rsize=32768,wsize=32768,async
..

► GPFS

  As previously mentioned, if a sitefs file exists, then it is possible to check whether GPFS has been enabled to run on the I/O nodes. Check for a line that looks similar to the one shown in Example 2-7.

Example 2-7 The sitefs configuration for a Blue Gene/L system using GPFS

bglsn_# cat /bgl/dist/etc/rc.d/rc3.d/S10sitefs
..
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
..

  To identify whether GPFS is in use, run /usr/lpp/mmfs/bin/mmlscluster and /usr/lpp/mmfs/bin/mmlsconfig on the SN. The I/O nodes in your Blue Gene/L system as well as your SN should be listed. You must configure GPFS to mount the GPFS file systems automatically when the GPFS daemon is started, on the SN as well as on all I/O nodes that are part of the cluster. For more information about GPFS running on the Blue Gene/L system, refer to Chapter 5, “File systems” on page 211.
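A minimal sketch of such a GPFS check on the SN follows. It only uses the two commands named above plus grep; the I/O node host-name pattern (ionode) is taken from the /etc/hosts example earlier in this chapter and may differ on your system.

#!/bin/bash
# Rough check of whether GPFS is configured for the SN and the I/O nodes.
MMFSBIN=/usr/lpp/mmfs/bin

if [ ! -x $MMFSBIN/mmlscluster ]; then
    echo "GPFS does not appear to be installed on this SN."
    exit 0
fi

echo "GPFS cluster membership (the SN and the I/O nodes should be listed):"
$MMFSBIN/mmlscluster

echo "I/O node entries in the cluster (host-name pattern is illustrative):"
$MMFSBIN/mmlscluster | grep -i ionode

echo "GPFS configuration:"
$MMFSBIN/mmlsconfig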

Note: The /bgl/dist/etc/rc.d/rc3.d/S10sitefs file is a symbolic link to /bgl/dist/etc/rc.d/init.d/sitefs. Be aware that this file might also be found under /bgl/BlueLight/ppcfloor/dist/etc/rc.d, although this is not the advised location for it.

2.2.8 Job submission

You can run jobs on the Blue Gene/L system in three different ways:

► You can run the submit_job command at the mmcs_console on the SN. Only the root user on the SN can run this command.

► You can run the mpirun command from a FEN or the SN. You can run this command as an authorized, non-root user.

► You can run the llsubmit command from a FEN or the SN. This command is a LoadLeveler command, and you can run it as an authorized, non-root user.

Note: It is not very likely that the submit_job command will be used as part of daily activity on the system. Using this command may be appropriate during certain system verification actions; however, we do not encourage its use.

You need to be able to identify how jobs are submitted on your system. It is likely that the jobs will be submitted by MPI or LoadLeveler programs from the FENs. For more information about running jobs with the submit_job command, refer to 2.3.8, “Check that a simple job can run (mmcs_console)” on page 96, and for information about mpirun and llsubmit, refer to Chapter 4, “Running jobs” on page 141.

2.2.9 Racks

Here are two ways to determine the number of racks in the Blue Gene/L system:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration to display the racks that are configured in the IBM System Blue Gene Solution system (Figure 2-4 on page 67).

Figure 2-4 Configuration of Blue Gene/L in BGWEB

Note: If a piece of hardware has a red box around it, this part of the system, or a hardware component within it, is missing or has a hardware error.

► Using a DB2 select statement:

  In Example 2-8, the number of records in the BGLBASEPARTITION view represents the number of racks that are available to the SN since the last system Discovery was performed.

Example 2-8 Listing the active racks in the Blue Gene/L system

# db2 "select BPID,STATUS from BGLBASEPARTITION where STATUS <> 'M'"

BPID STATUS
---- ------
R000 A

  1 record(s) selected.


2.2.10 Midplanes

Note: The select statements and database views in this section query for records with the STATUS field not equal to M (Missing), using the operator <>. These records can also have values of E (Error) and A (Available).

Each rack contains two midplanes, which are only detected when a service card is plugged into the midplane and then connected by Ethernet to the service network. You can determine the number of midplanes that are available to the Blue Gene/L system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration to display the midplanes that are configured in the IBM System Blue Gene Solution system (Figure 2-4 on page 67).

► Using a DB2 select statement:

  In Example 2-9, the number of records in the BGLMIDPLANE view represents the number of midplanes that are available to the system since the last system Discovery was performed. This example shows only one midplane being used.

Example 2-9 Listing the midplanes available to the Blue Gene/L system

# db2 "select LOCATION,POSINMACHINE,STATUS from BGLMIDPLANE where
  STATUS <> 'M'"

LOCATION                         POSINMACHINE STATUS
-------------------------------- ------------ ------
R00-M0                           R000         A

  1 record(s) selected.

2.2.11 Clock cards

Each rack has one clock card. Therefore, the number of clock cards is the same as the number of racks in the Blue Gene/L system. There is no way to identify the clock cards that are available to the Blue Gene/L system through BGWEB or DB2. The only way to check the clock cards is to inspect the bottom of each rack manually. For detailed information about the clock card, refer to 1.2.8, “Clock card” on page 15.


2.2.12 Service cards

Blue Gene/L has one service card per midplane. Therefore, there is a maximum of two service cards per rack. You can determine the number of service cards that are available to the Blue Gene/L system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane link in which you are interested. A new page displays a table of the cards within that midplane. Figure 2-5 shows the page that is loaded. You can see the service card at the bottom of the table that is displayed.

Note: To identify the cards within a given midplane, Blue Gene/L: System Administration, SG24-7178 advises you to query the relevant DB2 table. For example, if you want to find the number of service cards in a midplane, query the TBGLSERVICECARD table and use the midplane serial number to ensure that you only find the cards in that midplane. In addition to the DB2 database tables, there are database views that do some of this work for you.

Figure 2-5 BGWEB table showing the cards within a midplane

► Using a DB2 select statement:

  In Example 2-10, the number of service cards is represented by the NUMSERVICECARDS field in the BGLSERVICECARDCOUNT view. This value is the number of service cards, per midplane, that are available to the system since the last system Discovery was performed. BGLSERVICECARDCOUNT generates its information by querying the database alias BGLSERVICECARD.

Example 2-10 Listing the number of service cards in the Blue Gene/L system

# db2 "select * from BGLSERVICECARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMSERVICECARDS
---------------------------------------------------- ---------------
x'203937503631353900000000594C31304B35303238303036'                1

  1 record(s) selected.

Note: BGLSERVICECARDCOUNT is a database view and does not use the STATUS <> 'M' clause in its SQL statement the way that the other count database views available on the system do. Be aware that there should only be one service card per midplane.

2.2.13 Link cards

A full Blue Gene/L rack has eight link cards (four link cards per midplane). Here are two ways to determine the number of link cards that are available to the system:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the link cards identified in the Type column.

► Using a DB2 select statement:

  Example 2-11 shows the number of link cards, represented by the NUMLINKCARDS field in the BGLLINKCARDCOUNT view. This value is the number of link cards, per midplane, that are available to the system since the last system Discovery was performed.

Example 2-11 Listing the link cards available to the Blue Gene/L system

# db2 "select * from BGLLINKCARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMLINKCARDS
---------------------------------------------------- ------------
x'203937503631353900000000594C31304B35303238303036'             4

  1 record(s) selected.

2.2.14 Link card chips

As previously discussed, there are four link cards per midplane. In addition, each link card contains six link chips. These chips are used to link signals between compute processors on different midplanes. For more information about the link card and chips, refer to Figure 1-17 on page 18.

You can determine the number of link chips that are available to the system in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the link cards in the table. To view the link card details, click one of the link card hardware names. Figure 2-6 shows the BGWEB link card detail.

Figure 2-6 BGWEB table showing the link chips on a link card

Figure 2-7 shows the remainder of the same BGWEB page that is displayed in Figure 2-6, which gives details on the connection status of each link chip. Figure 2-7 shows an example of a 2x2 system; therefore, it has X, Y, and Z cables between the midplanes and racks that use all six chips.


Figure 2-7 BGWEB table showing cable connection data for a link card

► Using a DB2 select statement:

  In Example 2-12, the number of available link chips is represented by the LINKCHIPS field in the output from the DB2 statement. This value is the number of link chips, per link card, that are available to the system since the last system Discovery was performed. The SERIALNUMBER field represents the serial number of the individual link cards. This example also shows the status of the link cards.

Example 2-12 Listing the link chips per link card on the system

# db2 "SELECT serialNumber, (select count(*) from BGLLinkchip where
  BGLLinkcard.serialNumber=BGLLinkchip.CardSerialNumber and
  BGLLinkcard.status <> 'M') linkchips, status FROM BGLLinkcard where
  STATUS <> 'M'"

SERIALNUMBER                                         LINKCHIPS   STATUS
---------------------------------------------------- ----------- ------
x'203937503438383700000000594C31314B35303630303146'            6 A
x'203937503438393200000000594C31344B3530353930314B'            6 A
x'203937503438393200000000594C31354B35303737303044'            6 A
x'203937503438383700000000594C33304B35323336303034'            6 A

  4 record(s) selected.


2.2.15 Link summary

You can view the configuration of the X, Y, and Z cables on the Blue Gene/L system from the BGWEB interface. This view provides the best way to get a visual idea of the 3D torus cabling of your system. You can view a link summary in two ways:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration.

  b. Click Link Summary at the top of the page (Figure 2-8 and Figure 2-9).

Figure 2-8 BGWEB Link Summary output for the X and Y links between racks

Figure 2-9 BGWEB Link Summary output for Z links between midplanes

Note: In our environment for this redbook, we did not have a system with X, Y, and Z cables to test. So, the output in Figure 2-8 and Figure 2-9 is from another system with more than four node cards in its configuration.

► Using a DB2 select statement:

  Example 2-13 shows a DB2 select statement that identifies the Z links in a Blue Gene/L system. This particular example is a complex statement taken from the BGWEB source, but it does give a clear view of the link between the midplanes.

Example 2-13 Displaying the Z links between two midplanes

# db2 "SELECT CHAR(LEFT(SUBSTR(source,4,3),3),3) AS source,
  CHAR(LEFT(SUBSTR(destination,4,3),3),3) AS destination FROM
  bglsysdb.TbglLink WHERE source LIKE 'Z_%' AND sourceport = 'P2' FOR
  READ ONLY WITH UR"

SOURCE DESTINATION
------ -----------
000    001
001    000

  2 record(s) selected.

  Example 2-14 gives a simpler DB2 select statement that shows the links within the Blue Gene/L system from ASIC to ASIC on each link card in use. To determine that the ASIC values 2 and 3 are for the Z cables, refer to Figure 1-17 on page 18.

Example 2-14 Displaying the links within the Blue Gene/L system

# db2 "select
  FROMLCLOCATION,FROMASIC,TOLCLOCATION,TOASIC,NUMBADWIRES,STATUS from
  tbglcable"

FROMLCLOCATION FROMASIC TOLCLOCATION TOASIC NUMBADWIRES STATUS
-------------- -------- ------------ ------ ----------- ------
R00-M1-L0      2        R00-M0-L0    2                0 A
R00-M1-L0      3        R00-M0-L0    3                0 A
R00-M1-L1      2        R00-M0-L1    2                0 A
R00-M1-L1      3        R00-M0-L1    3                0 A
R00-M1-L2      2        R00-M0-L2    2                0 A
R00-M1-L2      3        R00-M0-L2    3                0 A
R00-M1-L3      2        R00-M0-L3    2                0 A
R00-M1-L3      3        R00-M0-L3    3                0 A
R00-M0-L0      2        R00-M1-L0    2                0 A
R00-M0-L0      3        R00-M1-L0    3                0 A
R00-M0-L1      2        R00-M1-L1    2                0 A
R00-M0-L1      3        R00-M1-L1    3                0 A
R00-M0-L2      2        R00-M1-L2    2                0 A
R00-M0-L2      3        R00-M1-L2    3                0 A
R00-M0-L3      2        R00-M1-L3    2                0 A
R00-M0-L3      3        R00-M1-L3    3                0 A

  16 record(s) selected.

2.2.16 Node cards

The number of node cards can vary for each Blue Gene/L system. You can determine the number of node cards that are available in two ways:

► Using the Blue Gene/L Web interface:

  At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded. You can see the node cards identified in the Type column of the table.

► Using a DB2 select statement:

  In Example 2-15, the number of node cards is represented by the NUMNODECARDS field in the BGLNODECARDCOUNT view. This value is the number of node cards, per midplane, that are available to the system since the last system Discovery was performed.

Example 2-15 Listing the number of node cards per midplane

# db2 "select * from BGLNODECARDCOUNT"

MIDPLANESERIALNUMBER                                 NUMNODECARDS
---------------------------------------------------- ------------
x'203937503631353900000000594C31304B35303238303036'             4

  1 record(s) selected.

Note: There are no DB2 database views that separate the numbers of I/O cards and compute (processor) cards in the Blue Gene/L system. The database view BGLPROCESSORCARDCOUNT only displays the total number of processor cards on the system, including I/O cards.

2.2.17 I/O cards

A node card can have one or two I/O cards installed. If it has two I/O cards installed, it is called an I/O rich node card. You can determine the number of I/O cards that are available to the system in two ways:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration.

  b. Click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded.

  c. Click the individual node cards listed in the table to view the processor cards. Figure 2-10 and Figure 2-11 show the full page. The full page displays the cards, with the I/O nodes identified by a Yes in the Is IO Card column.

Figure 2-10 Top of the page for the Node card view in BGWEB page

Figure 2-11 Node card view in BGWEB page (I/O node card marked as Yes)

Figure 2-12 shows the detailed view of an I/O node, which shows the two chips on the card.

Figure 2-12 I/O node card detail showing the IP addresses of the I/O nodes

► Using a DB2 select statement:

  The I/O nodes in a node card within a particular midplane are linked to the node card by their serial number (license plate). In Example 2-16, the IONODECARD field is the I/O node count per node card. This example also shows the status of the node cards.

Example 2-16 The number of I/O node cards per midplane

# db2 "SELECT serialNumber, (select count(*) from BGLProcessorCard
  where BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber
  and BGLProcessorCard.status <> 'M' and isiocard = 'T')
  ionodecard, status FROM BGLNodeCard where STATUS <> 'M'"

SERIALNUMBER                                         IONODECARD STATUS
---------------------------------------------------- ---------- ------
x'203231503833343000000000594C31304B34323635303134'           1 A
x'203231503833343000000000594C31304B3432383030314B'           1 A
x'203231503833343000000000594C31304B34323630304B35'           1 A
x'203937503538373400000000594C31304B34313534303032'           1 A

  4 record(s) selected.


2.2.18 Compute or processor cards

Compute cards are also referred to as processor cards. Each node card holds 16 compute cards. Thus, you can use the number of node cards in a midplane to determine the number of compute cards using the following equation:

(Number of node cards) x 16 = (number of compute cards)

The number of compute nodes within the system is two times the number of compute cards. Therefore, in a midplane that has four node cards, the number of compute cards is 4 x 16 = 64 compute cards, resulting in 64 x 2 = 128 compute nodes in the system.

Here are two ways to determine the number of compute cards that are available to the system:

► Using the Blue Gene/L Web interface:

  a. At the BGWEB home page, click Configuration, and then click the midplane links. A new page displays a table of the cards within that midplane. Figure 2-5 on page 69 shows the page that is loaded.

  b. Click the individual node cards that are listed in the table to view the compute (processor) cards (Figure 2-10 on page 78 and Figure 2-11 on page 78).

  c. Click the processor card links to expand the detail for each processor card, as shown in Figure 2-13.

Figure 2-13 Processor card detail view through BGWEB interface

► Using a DB2 select statement:

  The compute cards within a particular midplane are linked to their node card by its serial number (license plate). In Example 2-17, the COMPUTECARDS field is the compute card count per node card. This example also shows the status of the node cards.

Example 2-17 The number of compute cards per node card

# db2 "SELECT serialNumber, (select count(*) from BGLProcessorCard
  where BGLNodeCard.serialNumber=BGLProcessorCard.nodeCardSerialNumber
  and BGLProcessorCard.status <> 'M' and isiocard = 'F')
  computecards, status FROM BGLNodeCard where STATUS <> 'M'"

SERIALNUMBER                                         COMPUTECARDS STATUS
---------------------------------------------------- ------------ ------
x'203231503833343000000000594C31304B34323635303134'            16 A
x'203231503833343000000000594C31304B3432383030314B'            16 A
x'203231503833343000000000594C31304B34323630304B35'            16 A
x'203937503538373400000000594C31304B34313534303032'            16 A

  4 record(s) selected.

Note: To find specific STATUS values in the database records, change the not equal to operator (<>), as used with the STATUS field in the DB2 examples, to equals (=).

Another way of quickly checking the number of processor cards that serve as compute nodes is to run the DB2 command shown in Example 2-18.

Example 2-18 Checking the number of processor cards

# db2 "select count(*) num_of_compute_node_cards from BGLPROCESSORCARD
  where ISIOCARD = 'F' and STATUS <> 'M'"

NUM_OF_COMPUTE_NODE_CARDS
-------------------------
                       64

  1 record(s) selected.

To check the number of processor cards that serve as I/O node cards, run the DB2 command shown in Example 2-19.

Example 2-19 Checking the node cards with I/O nodes

# db2 "select count(*) num_of_io_node_cards from BGLPROCESSORCARD where
  ISIOCARD = 'T' and STATUS <> 'M'"

NUM_OF_IO_NODE_CARDS
--------------------
                   4

  1 record(s) selected.
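If you want to capture this inventory in one pass, the counts from 2.2.9 through 2.2.18 can be collected with a small script such as the following. This is a minimal sketch that assumes the DB2 environment is set up by sourcing /discovery/db.src, and it uses only the views and tables shown in the examples above.

#!/bin/bash
# Hypothetical inventory sketch: summarize the hardware that the DB2
# database currently knows about (records with STATUS <> 'M').
source /discovery/db.src

echo "== Racks (base partitions) =="
db2 "select BPID, STATUS from BGLBASEPARTITION where STATUS <> 'M'"

echo "== Midplanes =="
db2 "select LOCATION, POSINMACHINE, STATUS from BGLMIDPLANE where STATUS <> 'M'"

echo "== Service, link, and node cards per midplane =="
db2 "select * from BGLSERVICECARDCOUNT"
db2 "select * from BGLLINKCARDCOUNT"
db2 "select * from BGLNODECARDCOUNT"

echo "== Compute and I/O node cards =="
db2 "select count(*) compute_cards from BGLPROCESSORCARD where ISIOCARD = 'F' and STATUS <> 'M'"
db2 "select count(*) io_cards from BGLPROCESSORCARD where ISIOCARD = 'T' and STATUS <> 'M'"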

2.3 Sanity checks for installed components

If the core components of your Blue Gene/L system are not running properly, your system will not function correctly. We recommend that you follow this list of checks to ensure that the system is in a healthy state:

► Check the operating system on the SN
► Check communication services on the SN
► Check that BGWEB is running
► Check that DB2 is working
► Check that BGLMaster and its child daemons are running
► Check the NFS subsystem on the SN
► Check that a block can be allocated using mmcs_console
► Check that a simple job can run (mmcs_console)
► Check the control system server logs
► Check remote shell
► Check remote command execution with secure shell
► Check the network switches
► Check the physical Blue Gene/L racks configuration

Note: One component in the system can prevent another from running correctly. For example, if DB2 is not running, it might be down because of an operating system or communications issue and, therefore, a job cannot run.


Figure 2-14 illustrates the core Blue Gene/L configuration that we used to provide examples of the system checks in this section.

Figure 2-14 Diagram of a core Blue Gene/L configuration. The Service Node runs SLES9 PPC 64-bit with the Blue Gene/L system processes, DB2, BGWEB, and NFS, and has three network interfaces (eth0, eth1, and eth5) configured as 10.0.0.1/255.255.0.0, 172.30.1.1/255.255.0.0, and 192.168.00.49/255.255.255.0. These interfaces connect through the service network switch, the functional network switch, and a public LAN switch to Blue Gene rack 00 (front half of midplane R00-M0, with its service card, clock card (master/slave), Gbit/ido connections, and node cards 0 through 3).

2.3.1 Check the operating system on the SN

You need to perform two checks for the operating system on the SN:

1. Check the /etc/passwd and /etc/shadow files for the required Blue Gene/L users. In a core Blue Gene/L configuration, without FENs, we only need the users root, bglsysdb, and bgdb2cli, as shown in Example 2-20.

Example 2-20 Blue Gene/L user checking

# egrep "root|bglsysdb|bgdb2cli" /etc/passwd /etc/shadow
/etc/passwd:root:x:0:0:root:/root:/bin/bash
/etc/passwd:bglsysdb:x:1000:1000::/dbhome/bglsysdb:/bin/bash
/etc/passwd:bgdb2cli:x:1001:1001::/dbhome/bgdb2cli:/bin/bash
/etc/shadow:root:$1$5zzzJBvz$XTd9evpJ8d1cVvDw5c3hV/:13210:0:10000::::
/etc/shadow:bglsysdb:$1$SwI1..4e$iGNeJ3bSSOHXD1Dy5TM250:13222:0:99999:7:::
/etc/shadow:bgdb2cli:$1$Lyzz.trF$npMmXlHv5.XPf.ijiBFGC1:13213:0:99999:7:::

2. Check for any full or nearly full file systems using the /bin/df command. In a non-customized SN installation, with DB2 and the Blue Gene/L code installed, the file system layout looks similar to the output shown in Example 2-21. Full file systems can cause problems with many system processes.

Example 2-21 File system checking on the SN

# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb3             70614928   4752552  65862376   7% /
tmpfs                  1898508         8   1898500   1% /dev/shm
/dev/sda4               489452     95872    393580  20% /tmp
/dev/sda1              9766544   4075948   5690596  42% /bgl
/dev/sda2              9766608    712428   9054180   8% /dbhome

If you discover issues during the check of the operating system, you can take the following corrective actions:

1. If a user does not exist in /etc/passwd and /etc/shadow, then you need to create the user.

2. If any of the file systems are full or nearly full, you need to clean that file system (remove unnecessary files) or add disk space to ensure that the operating system, database, and other Blue Gene/L processes are not affected.
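As a simple way to spot file systems that are approaching full, a sketch along the following lines can be run on the SN (the 90% threshold is an arbitrary choice for illustration):

#!/bin/bash
# Warn about any mounted file system that is more than 90% used.
THRESHOLD=90
df -kP | awk -v limit=$THRESHOLD 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= limit)
        printf "WARNING: %s (%s) is %s%% full\n", $6, $1, use
}'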

2.3.2 Check communication services on the SN

The service and functional networks are required for the Blue Gene/L system to function correctly. To check these networks, perform the following communication checks:

1. Verify that both the carrier and the network are up with the /usr/sbin/ethtool command. Example 2-22 shows the output for a working interface on the SN. Note the Speed, Duplex, and Link detected fields.

Example 2-22 Ethernet adapter characteristics

# /usr/sbin/ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

2. Check the TCP/IP configuration using the /sbin/ip ad command. Example 2-23 shows the loopback, eth0, and eth1 interfaces configured and up on the SN.

Example 2-23 IP configuration on the SN

# /sbin/ip ad
1: lo: <LOOPBACK,UP> mtu 16436 qdisc noqueue
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0
    inet6 fe80::20d:60ff:fe4d:28ea/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:11:25:08:30:90 brd ff:ff:ff:ff:ff:ff
    inet 172.30.1.1/16 brd 172.30.255.255 scope global eth1
    inet6 fe80::211:25ff:fe08:3090/64 scope link
       valid_lft forever preferred_lft forever
..snip..

3. Verify that the IP configuration is correct on the network interfaces. Refer to the network diagram for the system as discussed in 2.2.3, “Network diagram” on page 59.


4. Use the /bin/ping command to check network connectivity of the SN interfaces for the functional and service networks from another system.

Note: It is likely that there will be another system on the functional network, such as an FEN. However, for the service network, you might need to connect a test system to perform this check.

5. Check the lights on the network interfaces on the SN.

If you discover issues during the check of the communication services, you can take the following corrective actions:

1. If the /usr/sbin/ethtool output is not as expected, then check the settings on the interfaces and ensure that the Ethernet cables are plugged in correctly at the SN and the switch.

2. The /sbin/ip ad command should show the interfaces as UP. If they are not in the UP state, check the configuration files and activate them using the appropriate commands or scripts (/etc/init.d/network or /sbin/ifconfig).
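To exercise the functional network from the SN side, a quick connectivity sweep over the I/O node addresses defined in /etc/hosts can also be useful. The following is a minimal sketch that assumes the ionode naming convention shown in 2.2.4; adjust the pattern to match your own /etc/hosts entries, and keep in mind that only I/O nodes that are currently booted will answer.

#!/bin/bash
# Ping every I/O node address listed in /etc/hosts (entries named ionode*).
awk '/ionode/ {print $1, $2}' /etc/hosts | while read ip name; do
    if ping -c 1 -w 2 "$ip" >/dev/null 2>&1; then
        echo "OK       $name ($ip)"
    else
        echo "NO REPLY $name ($ip)"   # not booted, or a network problem
    fi
done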

2.3.3 Check that BGWEB is running

As mentioned at the beginning of 2.2, “Identifying the installed system” on page 57, BGWEB is a very useful tool for gathering data on many aspects of the Blue Gene/L system. To check whether BGWEB is working properly, follow these steps:

1. From the SN or a remote system, try to connect to BGWEB using the following URL (substitute your SN host name):

http://<SN hostname>/web/index.php

Note: There might be a firewall between the system where the Web page is loaded and the SN. A tunnel can be created to forward port TCP:80 on the SN to a local port (in this case, the local TCP:5519 port), as sketched after this note. The URL would then need to change to something like:

http://localhost:5519/web/index.php
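A minimal sketch of such a tunnel follows. The SN host name, the user, and the local port 5519 are only illustrative; any user that can log in to the SN and any free local port will do.

# On the workstation where the browser runs, forward local port 5519
# to port 80 on the SN through SSH, then browse to
# http://localhost:5519/web/index.php while the tunnel is open.
ssh -N -L 5519:localhost:80 <user>@<SN hostname>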

2. Check that the apache server processes are running on the SN using the /bin/ps command:

ps -ef | grep httpd

3. Check whether apache is configured to start automatically at boot with the /sbin/chkconfig command:

# chkconfig --list apache
apache    0:off  1:off  2:off  3:on  4:off  5:on  6:off

If you discover issues during the check of BGWEB, you can take the following corrective actions:

1. If apache is not running, then try to start it using the start script /etc/rc.d/apache on the SN:

/etc/rc.d/apache start

2. If apache has not been configured to start when the SN boots, run (as root) the /sbin/chkconfig command:

# chkconfig -s apache 35

3. If there are issues with the apache configuration, check the /etc/httpd/httpd.conf file.

2.3.4 Check that DB2 is working

The DB2 database needs to be running on the SN because the Blue Gene/L system relies on it to operate. To check that DB2 is working properly, follow these steps:

1. Check that you can connect to the database instance as shown in Example 2-2 on page 58.

2. Check that the DB2 user exists by connecting to the SN using the following command:

# ssh bglsysdb@<SN hostname>

Although highly unlikely, if secure shell is not configured, use the /usr/bin/telnet command.

3. Check that the bglsysdb user's password is the same as the password in the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file. In Example 2-24, the database_password field in the db.properties file represents the password for the user bglsysdb.

Example 2-24 Output of the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file

# cat db.properties
database_name=bgdb0
database_user=bglsysdb
database_password=bglsysdb
...

If you discover issues during the check of DB2, you can take the following corrective actions:

1. If DB2 is not running, you need to start it. Run the commands:

# su - bglsysdb
# /dbhome/bglsysdb/sqllib/adm/db2start

2. If required, update the /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties file to reflect the configuration shown in Example 2-24 on page 87.

3. If DB2 will not start, even though the user exists and the password for bglsysdb is configured correctly in the db.properties file, go through the steps in 2.3.1, “Check the operating system on the SN” on page 83 and 2.3.2, “Check communication services on the SN” on page 84.

4. If DB2 was not started automatically when the SN was booted, then check that the database instance has been set to start automatically with the /dbhome/bglsysdb/sqllib/adm/db2set command:

# /dbhome/bglsysdb/sqllib/adm/db2set -i bglsysdb
DB2COMM=tcpip
DB2AUTOSTART=YES

If the DB2AUTOSTART field has the value YES, then it is set to start automatically. If this is not set, you can enable it using the /opt/IBM/db2/V8.1/instance/db2iauto command:

# /opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb

Also, check that the db2fmcd entry has not been commented out of or deleted from the /etc/inittab file:

# grep db2fmcd /etc/inittab
fmc:2345:respawn:/opt/IBM/db2/V8.1/bin/db2fmcd #DB2 Fault Monitor Coordinator

Note: The previous DB2 commands can be run as the root user or the DB2 user bglsysdb.

5. If after all these checks DB2 still does not work correctly, you need to call your DB2 support.
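A minimal sketch of a quick end-to-end DB2 check follows. It uses only the database name, user, and table shown earlier in this chapter; the choice of TBGLIDOPROXY as the test table is arbitrary, and it assumes the local instance owner can connect to bgdb0 without supplying a password.

#!/bin/bash
# Verify that the bgdb0 database accepts connections and answers a query.
su - bglsysdb -c '
    db2 connect to bgdb0 &&
    db2 "select count(*) from TBGLIDOPROXY" &&
    db2 connect reset
' || echo "DB2 check failed; see 2.3.4 for corrective actions."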

2.3.5 Check that BGLMaster and its child daemons are running

BGLMaster and three of its child daemons must be running for the Blue Gene/L system to operate properly. These daemons are:

► idoproxy
► ciodb
► mmcs_server

To ensure that BGLMaster is running properly, follow these steps:

1. Check the status of BGLMaster on the SN using the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status

Example 2-25 shows the expected output from BGLMaster on a system that is not using the hardware (monitor) or performance (perfmon) monitoring daemons.

Example 2-25 Checking the BGLMaster status

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status
idoproxy started [18622]
ciodb started [18623]
mmcs_server started [18624]
monitor stopped
perfmon stopped

2. We advise you to double-check that there is only one process for BGLMaster and each of its child daemons by running the following command:

# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"

If there is more than one instance of a process, this needs to be cleaned up, because BGLMaster is only aware of the process it most recently started for each daemon name. It is also unaware of daemon processes that were not started by the bglmaster command.

If you discover issues during the check of BGLMaster, you can take the following corrective actions:

1. If you need to start a particular child daemon, use the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start <daemon name>

Caution: Restarting BGLMaster terminates all communications between the SN and the racks. This action terminates all running jobs and booted partitions on the Blue Gene/L system.

2. If you need to restart BGLMaster, you can run the following command:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster restart

3. If there are still issues with BGLMaster, we suggest that you run through this sequence of commands:

# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster stop
# ps -ef | egrep "BGLMaster|idoproxy|mmcs_db_server|ciodb"
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start
# /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster status

Note: You can edit the file /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster.init to determine what is started automatically when you run /bgl/BlueLight/ppcfloor/bglsys/bin/bglmaster start. For more information, see Blue Gene/L: System Administration, SG24-7178.
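The two checks above can be combined into a small script. This is a minimal sketch that relies on the bglmaster status output format and the process names shown in the ps example; both may vary slightly between driver levels.

#!/bin/bash
# Confirm the required child daemons report "started" and that there is
# at most one instance of each control system process.
BGLBIN=/bgl/BlueLight/ppcfloor/bglsys/bin

for daemon in idoproxy ciodb mmcs_server; do
    $BGLBIN/bglmaster status | grep "^$daemon " | grep -q started \
        || echo "WARNING: $daemon is not reported as started"
done

for process in BGLMaster idoproxy mmcs_db_server ciodb; do
    count=$(ps -ef | grep "$process" | grep -v grep | wc -l)
    if [ "$count" -gt 1 ]; then
        echo "WARNING: $count processes match $process; check for duplicates"
    fi
done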

2.3.6 Check the NFS subsystem on the SN

NFS is an integral part of the Blue Gene/L boot process, as described in 1.8, “Boot process, job submission, and termination” on page 36. It can also be used for the reading and writing of data by the jobs that are submitted on the system. (This is discussed in 2.2.7, “File systems (NFS and GPFS)” on page 64 and in more depth in Chapter 5, “File systems” on page 211.) Thus, NFS needs to be functioning correctly on the SN; otherwise, a block will not be able to boot.

To check that the NFS subsystem is functioning properly, check that the /bgl file system is exported from the SN over the functional network. The /usr/sbin/showmount -e command shows which file systems are exported and also gives a good indication of whether NFS is working:

# showmount -e bglsn
Export list for bglsn:
/bgl 172.30.0.0/255.255.0.0

If the showmount command returns an error, the message can help identify which part of the NFS subsystem is at fault:

1. showmount -e points to a potential issue with the port mapper service:

# showmount -e
mount clntudp_create: RPC: Port mapper failure - RPC: Unable to receive

Possible fix:

# /etc/init.d/portmap restart
# /etc/init.d/nfsserver restart

2. showmount -e points to a potential issue with rpc.mountd or nfsd:

# showmount -e
mount clntudp_create: RPC: Program not registered

Possible fix:

# /etc/init.d/nfsserver restart

Note: Refer to the checklists in Chapter 5, “File systems” on page 211 for more NFS-related checks.
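A minimal sketch that gathers the same information in one pass is shown below. It assumes the SN host name bglsn used in the examples above and relies only on standard NFS utilities (rpcinfo, showmount, exportfs).

#!/bin/bash
# Quick NFS health check on the SN.
SN=bglsn    # illustrative host name; use your own SN name

echo "RPC services registered with the port mapper (expect mountd and nfs):"
rpcinfo -p "$SN" | egrep "portmapper|mountd|nfs"

echo "Exported file systems (expect /bgl on the functional network):"
showmount -e "$SN"

echo "Exports as seen on the SN itself:"
/usr/sbin/exportfs -v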

2.3.7 Check that a block can be allocated using mmcs_console

Note: The Blue Gene/L blocks are also referred to as partitions.

A good way to ensure that the core Blue Gene/L system is working correctly is to check whether a block can be booted. This check ensures that the communication between the SN and the racks, and the files used during the boot process, are functioning. (These topics are covered in 1.8, “Boot process, job submission, and termination” on page 36.)

To check that a block can be allocated, follow these steps:

1. Connect to the mmcs_db_console on the SN:

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$

2. After you are connected to the console, try to allocate a predefined block. The following is an example of a successful boot of the block called R000_128:

mmcs$ allocate R000_128
OK
mmcs$

It is possible to list the predefined blocks and the currently booted blocks within the mmcs_db_console using the list bglblock command. The list command queries the DB2 database alias BGLBLOCK.

3. Monitor the block from the bglsn-bgdb0-idoproxydb-current.log and bglsn-bgdb0-mmcs_db_server-current.log logs that are located in /bgl/BlueLight/logs/BGL.

4. Check the block state from the mmcs_db_console with the list bglblock command, as shown in Example 2-26.

Example 2-26 Checking a block (R000_128)

mmcs$ list bglblock R000_128
OK
==> DBBlock record
  _blockid = R000_128
  _numpsets = 0
  _numbps = 0
  _owner =
  _istorus = 000
  _sizex = 0
  _sizey = 0
  _sizez = 0
  _description = Generated via genSmallBlock
  _mode = C
  _options =
  _status = I
  _statuslastmodified = 2006-04-01 15:08:30.896873
  _mloaderimg = /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
  _blrtsimg = /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
  _linuximg = /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
  _ramdiskimg = /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
  _debuggerimg = none
  _debuggerparmsize = 0
  _createdate = 2006-03-06 17:56:20.600708


The _status field shows the correct status of the block. (Table 4-3 on<br />

page 162 explains these values.)<br />

You can also gather the block information from the BGWEB interface. At the BGWEB home page, click Runtime to show the currently booting or initialized dynamic or predefined blocks. Predefined block information is displayed regardless of the current status. Figure 2-15 shows an example of both dynamic and predefined blocks.

Figure 2-15 BGWEB Block Information page

An additional check on the core system that you can perform to ensure that a block can be allocated is to verify that the I/O nodes are running and are connected to the functional network. Take the following steps:

1. Go to the point where the predefined block shows that it has been allocated in /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log. Example 2-27 shows the text that is generated when block R000_128 is allocated at the mmcs_db_console.

Example 2-27 Text generated in mmcs_db_server log when a block is initialized

..snip..
Apr 05 17:20:57 (I) [1090516192] test1 allocate R000_128
Apr 05 17:20:57 (I) [1090516192] test1 DBMidplaneController::addBlock(R000_128)
..snip..
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 BlockController::connect()
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N0-I:J18-U01 log file: /bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U01.log
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N0-I:J18-U11 log file: /bgl/BlueLight/logs/BGL/R00-M0-N0-I:J18-U11.log
..snip..
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N3-I:J18-U01 log file: /bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U01.log
Apr 05 17:20:57 (I) [1090516192] test1:R000_128 I/O node: R00-M0-N3-I:J18-U11 log file: /bgl/BlueLight/logs/BGL/R00-M0-N3-I:J18-U11.log
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::load_microloader() loading microloader /bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() starting mailbox polling
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() starting microloader boot image on 136 nodes
Apr 05 17:20:58 (I) [1092621536] test1:R000_128 MailboxMonitor thread starting
Apr 05 17:20:58 (I) [1090516192] test1:R000_128 BlockController::start_microloaders() startMicroLoader starting 136 nodes
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 DBBlockController::boot(): making switch settings for block R000_128
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 DBBlockController::boot(): completed switch settings for block R000_128
Apr 05 17:20:59 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
Apr 05 17:21:03 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
Apr 05 17:21:07 (I) [1090516192] test1:R000_128 BlockController::load() loading program /bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 BlockController::start() starting 136 nodes
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 128 nodes with entry point 0x00000290
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 starting 8 nodes with entry point 0x00800000
Apr 05 17:21:08 (I) [1090516192] test1:R000_128 DBBlockController::waitBoot(R000_128)
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacting I/O node {119} at address 172.30.2.4:7000
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacted I/O node {119} at address 172.30.2.4:7000
..snip..
Apr 05 17:21:36 (I) [1092621536] test1:R000_128 DBBlockController: contacted I/O node {34} at address 172.30.2.5:7000
Apr 05 17:21:38 (I) [1090516192] test1:R000_128 DBBlockController::waitBoot(R000_128) block initialization successful
..snip..

2. Using the date pattern from when the block was allocated (Example 2-27 on page 93), it is possible to create a for loop to identify the I/O nodes that were used for the block and to ensure that they are responding to a ping over the network, as shown in Example 2-28.

Example 2-28 Checking the I/O nodes for the booted blocks

# for i in `grep "R000_128" /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log | \
  grep "Apr 05 17:2" | grep contact | awk '{print $14}' | cut -f1 -d':' | uniq | sort`; \
  do echo Pinging ionode IP $i; ping -c1 $i >/dev/null 2>&1; echo Return Code:$?; done
Pinging ionode IP 172.30.2.1
Return Code:0
Pinging ionode IP 172.30.2.2
Return Code:0
Pinging ionode IP 172.30.2.3
Return Code:0
Pinging ionode IP 172.30.2.4
Return Code:0
Pinging ionode IP 172.30.2.5
Return Code:0
Pinging ionode IP 172.30.2.6
Return Code:0
Pinging ionode IP 172.30.2.7
Return Code:0
Pinging ionode IP 172.30.2.8
Return Code:0

Note: This test can be useful when the booting of a block is hanging. If a block cannot boot and there is no indication where the problem might be, try to boot node card sized blocks that are generated by the gensmallblock command to isolate the hardware problem.



2.3.8 Check that a simple job can run (mmcs_console)

If a block can be booted from the mmcs_db_console, then a good test is to see whether you can run a simple job on the block using a program such as hello.rts. You can find an example of this code in Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

To check that a simple job can run, follow these steps:

1. When the block has booted successfully, in this case R000_128, select the block to run the job, using select_block:

mmcs$ select_block R000_128
OK
mmcs$

Note: There is no need to select the block if the previously run command at the mmcs prompt was allocate.

2. Submit the job using submit_job:

mmcs$ submit_job /bglscratch/test1/hello.rts /bglscratch/test1
OK
jobId=257

Note: It is also possible to select the block and submit the job at the mmcs_db_console prompt with the submitjob command.

3. Monitor the job from bglsn-bgdb0-mmcs_db_server-current.log and bglsn-bgdb0-ciodb-current.log, which are located in /bgl/BlueLight/logs/BGL.

4. Check the status of the job from the mmcs_db_console with the list bgljob command. In Example 2-29, job 257 has a status of T (Terminated).

Example 2-29 Checking the status of a job

mmcs$ list bgljob 257
OK
==> DBJob record
_jobid = 257
_entrydate = 2006-03-20 14:55:56.497880
_username = root
_blockid = R000_128
_jobname = Mon Mar 20 14:55:53 2006 .520.1096840416
_executable = /bgl/applications/Examples/hello.rts
_outputdir = /bgl/applications/Examples
_status = T
_errtext =
_action = D
_exitstatus = 0
_mode = C
_starttime = 2006-03-20 14:55:56.258832
_nodesused = 128
_strace = -1
_stdininfo = 1024
_stdoutinfo = 1024
_stderrinfo = 1024

The list bgljob command in the mmcs_db_console queries the DB2 alias BGLJOB. In the same way, you can use the list bgljob_history command to gather data on previously run jobs. You can query the BGLJOB and BGLJOB_HISTORY aliases directly using the /dbhome/bgdb2cli/sqllib/bin/db2 command.
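For example, a minimal sketch of such a query follows. It assumes that the BGLJOB alias exposes the same fields that the console record shows (jobid, username, blockid, status); confirm the exact column names with the Database browser described in Chapter 3:

# /dbhome/bgdb2cli/sqllib/bin/db2 "select jobid, username, blockid, status from bgljob"

The same statement run against BGLJOB_HISTORY returns jobs that have already finished.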

You can also retrieve job information from the BGWEB interface. At the BGWEB home page, click Runtime, and then click Job Information. Figure 2-16 and Figure 2-17 present examples of job information views through the BGWEB interface.



Figure 2-16 BGWEB showing job Information page

Figure 2-17 BGWEB showing detailed Job ID information



An additional check that you can perform is to map the Blue Gene/L job ID to a block using the BGWEB interface:

1. Click Runtime and then Block Information.

2. Select the link for a particular block. It gives details on the block, including the current jobs that are running on it, as shown in Figure 2-18.

Figure 2-18 BGWEB showing block details and jobs for block data

2.3.9 Check the control system server logs

There are certain logs that you can check, depending on what information is required. For information about the control system server logs, see 2.2.6, “Control system server logs” on page 61. There, we include an explanation of the information that is available in each log file.



To check the control system server logs, do the following:

► Block monitoring

The block can be monitored from bglsn-bgdb0-idoproxydb-current.log and bglsn-bgdb0-mmcs_db_server-current.log, which are located in /bgl/BlueLight/logs/BGL. Run the following command on each log:

# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-idoproxydb-current.log
# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log

► Job monitoring

When the job is submitted, you can monitor it by checking the bglsn-bgdb0-ciodb-current.log. Use the following command:

# tail -F /bgl/BlueLight/logs/BGL/bglsn-bgdb0-ciodb-current.log
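When scanning these logs after the fact rather than following them live, a simple filter can make problems stand out. This is only a sketch; confirm the exact severity tags used in your own log files (the informational entries shown earlier in this chapter are tagged (I)):

# grep -iE "error|fail" /bgl/BlueLight/logs/BGL/bglsn-bgdb0-mmcs_db_server-current.log | tail -20

The same filter can be applied to the idoproxydb and ciodb logs.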

2.3.10 Check remote shell

In a complex Blue Gene/L system, to execute commands on remote machines (FENs), you can implement rsh and rshd. Here, we show what you need to check to ensure that rsh is configured correctly:

1. Check that the SN and FENs can talk to each other using /usr/bin/rsh in both directions across the functional network. (A loop that automates this check across all FENs is sketched after this checklist.)

– From the SN, use the following commands:

Use rsh date to check the FEN, which should be repeated for each FEN.

# rsh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

Use rsh date to check the SN itself.

# rsh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

– From the FENs, use the following commands:

Use rsh date to check the SN.

# rsh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

Use rsh date to check the FEN, which should be repeated for each FEN.

# rsh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006



2. Check that rsh is enabled with the /sbin/chkconfig command. The following example shows the expected output when rsh is enabled:

# /sbin/chkconfig --list rsh
xinetd based services:
rsh: on

You can double-check that rsh is enabled by looking at the /etc/xinetd.d/rsh file and verifying that the disable field in the stanza is not set to yes. Example 2-30 shows the contents of the /etc/xinetd.d/rsh file with rshd enabled.

Example 2-30 Checking rshd stanza in xinetd configuration

# cat /etc/xinetd.d/rsh
# default: off
# description:
# The rshd server is a server for the rcmd(3) routine and,
# consequently, for the rsh(1) program. The server provides
# remote execution facilities with authentication based on
# privileged port numbers from trusted hosts.
#
service shell
{
    socket_type     = stream
    protocol        = tcp
    flags           = NAMEINARGS
    wait            = no
    user            = root
    group           = root
    log_on_success  += USERID
    log_on_failure  += USERID
    server          = /usr/sbin/tcpd
#   server_args     = /usr/sbin/in.rshd -L
    server_args     = /usr/sbin/in.rshd -aL
    instances       = 200
    disable         = no
}

3. Check the ~/.rhosts file for each user or the /etc/hosts.equiv file. For further details, refer to “Remote command execution setup” on page 151.

Note: If the rsh checks are done after a job has failed, you need to run them as the owner (user) of the job that failed.
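To avoid repeating the step 1 commands by hand on a system with several FENs, a small loop can run the same check against every host. This is a sketch, not part of the original procedure; the host list reuses the functional-network names from the examples above and must be adjusted to your own configuration:

# for h in bglsn_fn bglfen1_fn bglfen2_fn; do echo -n "$h: "; rsh $h date; done

Any host that hangs or returns a permission error points to the name resolution, ~/.rhosts, or /etc/hosts.equiv checks that follow.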



If there is an issue with rsh, then check the following:

1. Check for correct name resolution; host names should resolve identically from all nodes involved. Name resolution should be local (/etc/hosts). A quick way to verify this is sketched after this list.

2. Check that the ~/.rhosts file has the correct IP labels (host names) and users, or that /etc/hosts.equiv has the correct IP labels (host names).

3. Enable rsh on the SN and the FENs, if required, with the /sbin/chkconfig command:

# chkconfig rsh on
# chkconfig --list rsh
xinetd based services:
rsh: on
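As a supplementary check for item 1, getent queries the same name resolution path that rsh uses and can be run on the SN and on each FEN; the output should be identical everywhere. The host names are the ones used in the earlier examples and might differ on your system:

# for h in bglsn_fn bglfen1_fn; do getent hosts $h; done

If a name resolves to different addresses on different nodes, or does not resolve at all, correct /etc/hosts before rechecking rsh.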

2.3.11 Check remote command execution with secure shell

In a complex Blue Gene/L configuration, you can implement secure shell (ssh) for remote command execution. Although you can also use ssh for remote command execution between the SN and I/O nodes, in this section we only cover checks for ssh between the SN and FENs.

Note: For details about setting up secure shell, see “Configuring ssh and scp on SN and I/O nodes” on page 237 and 7.2, “Secure shell” on page 395.

To check remote command execution with ssh, follow these steps:

1. Check that the SN and FENs can talk to each other in both directions through the functional network using /usr/bin/ssh. (A non-interactive test that is useful for scripting is sketched at the end of this section.)

– From the SN:

Use ssh date to check the FEN, which should be repeated for each FEN.

# ssh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

Use ssh date to check the SN itself.

# ssh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

– From the FENs:

Use ssh date to check the SN.

# ssh bglsn_fn date
Mon Apr 3 14:41:44 EDT 2006

Use ssh date to check the FEN, which should be repeated for each FEN.

# ssh bglfen1_fn date
Mon Apr 3 14:41:44 EDT 2006

2. Check that sshd is enabled with the /sbin/chkconfig command. The following output shows sshd enabled:

# chkconfig --list sshd
sshd    0:off  1:off  2:off  3:on  4:off  5:on  6:off

3. For further checks, refer to 7.2, “Secure shell” on page 395.

Note: If the checks are being done after a job has failed, they need to be run as the owner (user) of the job that failed.

If there is an issue with ssh, check the following:

1. Ensure that the ~/.ssh/known_hosts and ~/.ssh/authorized_keys files have the correct entries for root and for all users able to submit jobs.

2. Enable sshd on the SN and the FENs, if required, with the /sbin/chkconfig command:

# chkconfig sshd 23
# chkconfig --list sshd
sshd    0:off  1:off  2:on  3:on  4:off  5:off  6:off

For further actions, refer to 7.2, “Secure shell” on page 395 and to the ssh man pages.
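The following sketch is a non-interactive way to confirm that key-based ssh works for a given user without being stopped by a password or host-key prompt. It is a supplementary check, and the host name is taken from the earlier examples:

# ssh -o BatchMode=yes -o ConnectTimeout=5 bglfen1_fn date

If this command fails while an interactive ssh succeeds, the usual causes are a missing entry in ~/.ssh/authorized_keys on the target or an unaccepted host key in ~/.ssh/known_hosts on the source.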

2.3.12 Check the network switches

A good understanding of your network switches and the topology that is used for the Blue Gene/L system configuration is required to solve any problems related to the network switches. (See also 2.2.3, “Network diagram” on page 59.) We recommend that you check the network switches for errors and that you perform a general health check on them. Refer to the network switches documentation and get additional help from your site network administration team.

If there are any issues with the switches, you need to resolve them to ensure reliable connectivity and network performance.



2.3.13 Check the physical Blue Gene/L racks configuration

It is also useful to perform a physical inspection of the Blue Gene/L racks to help identify any obvious problems:

1. Check that the status lights on the service cards and the node cards are in the expected state, as explained in 1.2, “Hardware components of the Blue Gene/L system” on page 2.

2. Check that all cables are connected to the interfaces and are plugged into the correct place.

3. Check the clock card cables on the clock card and the service card.

4. Ensure that the master/slave switch is set to the correct setting.

If there are any issues with the physical configuration of the Blue Gene/L racks, then these issues need to be addressed by qualified service technicians.

Note: Node cards that are connected to a powered switch with a carrier will not show that they have a carrier until they are discovered by the discovery process. Refer to 1.9, “System discovery” on page 43 for more information.

2.4 Problem determination methodology

In this section, we describe a methodology that you can use to locate issues in a Blue Gene/L system. We developed this methodology by first building a core Blue Gene/L system, as listed in “Core Blue Gene/L” on page 56, and injecting errors into that system to simulate faults that might occur. In the process of investigating these errors, we developed this practical problem determination methodology.

We then created a more complex Blue Gene/L system with FENs, LoadLeveler, MPI, and GPFS, as listed in “Complex Blue Gene/L” on page 56. We then evolved the methodology further to ensure that these components were also addressed.



Figure 2-19 illustrates this methodology. In this illustration, the steps of the methodology are in the circles and the check lists are in the squares.

Figure 2-19 Problem determination methodology for Blue Gene/L system

The three steps in our methodology are:

1. Define the problem.
2. Identify the Blue Gene/L system.
3. Identify the problem area.

If you cannot find the problem area in one of the check lists where you think the problem has occurred, then we recommend that you run through the check lists for the other relevant components that you have in your system. If the cause of the problem is not obvious, we recommend that you start with 2.5, “Identifying core Blue Gene/L system problems” on page 108 for help with identifying the problem.


This methodology also allows someone with little experience of a Blue Gene/L system to assess where the problem lies quickly and, perhaps more importantly, to determine whether there is a problem at all.

Following the methodology, you can find the appropriate checks to confirm in more detail where the problem lies. If it is not clear where an issue is in a complex system, then you can break down the problem into smaller chunks and check that each part of the system is working correctly.

We also recommend that your support or administration staff document the diagnosis and resolution steps for any problem that you encounter to help with potential problems in the future.

Problems on the Blue Gene/L system generally fall into the following categories:

► Block initialization
► Runtime issues
► General hardware problems
► Results from diagnostics
► Monitoring tool outputs

You can then check the appropriate components to ensure that they are in a functional state using individual checklists for each component in the current Blue Gene/L environment. This methodology is demonstrated in Chapter 6, “Scenarios” on page 265 through a number of different scenarios.

We recommend that you follow the seven steps in the following list to ensure that you determine the problem, resolve it, and document it correctly. The most important steps are to get a clear description of the problem symptoms and to understand the system that you are dealing with.

We recommend that you:

1. Define the problem.
2. Identify the Blue Gene/L system.
3. Identify the problem area.

The first three points are the three main areas that we cover in this book. However, a normal problem determination methodology extends also to the following steps:

4. If possible, check to see if the issue has occurred before.
5. Generate an action plan to resolve the issue.
6. After you have determined the problem, take corrective measures.
7. Document the problem for future administration purposes.



2.4.1 Define the problem

When you have answered the following questions, you should have enough data to indicate where to start looking for the problem:

► What is the description of the problem?
► Are there logs or output to demonstrate the problem?
► Is the problem reproducible?
► Is this a hardware or a software problem?

2.4.2 Identify the Blue Gene/L system

The Blue Gene/L system might be running within a simple or complex configuration, and it is important that you know the system. A process for understanding which configuration you have is described in 2.2, “Identifying the installed system” on page 57.

2.4.3 Identify the problem area

When you have answered the questions in 2.4.1, “Define the problem” on page 107, go through the following high-level checks to pinpoint the problem area on the Blue Gene/L system:

► Has anything changed recently on the system?

► Where are we seeing the problem?

The questions listed in 2.4.1, “Define the problem” on page 107 should give you a good indication of where to start looking. As mentioned in 2.3, “Sanity checks for installed components” on page 82, the functionality of one part of the Blue Gene/L is dependent on another part functioning correctly. Therefore, even though the initial starting point in trying to determine a problem area might seem obvious, it does not mean this is what is causing the overall problem.

The following list includes the main components that make up a complex Blue Gene/L environment. Each component has a separate checklist to identify whether it is working as expected:

– Software

2.5, “Identifying core Blue Gene/L system problems” on page 108
4.4.9, “LoadLeveler checklist” on page 186
“The mpirun checklist” on page 166
5.3.10, “GPFS Checklists” on page 255
5.2.4, “NFS checklists” on page 218



– Hardware

3.2, “Hardware monitor” on page 114
3.4, “Diagnostics” on page 131
2.3.12, “Check the network switches” on page 103
2.3.13, “Check the physical Blue Gene/L racks configuration” on page 104

2.5 Identifying core Blue Gene/L system problems

Here is a basic checklist that you can use to reveal core Blue Gene/L system-related issues:

► Check what has changed on the system.

► Perform some basic checks from the SN (a quick process-level check that covers several of these items is sketched after this list):

– Check that DB2 is working
– Check that BGWEB is running
– Check that BGLMaster and its child daemons are running
– Check that a block can be allocated using mmcs_console
– Check that a simple job can run (mmcs_console)

► Check the control system server logs.

► Check the RAS Events or use the RAS drill down in the BGWEB for relevant errors from when the issue occurred. See Chapter 3, “Problem determination tools” on page 113 for more information.

► In the BGWEB, check the Runtime link to obtain information about the job that has been run and the blocks in use. Examples are shown in Figure 2-16 on page 98 and Figure 2-17 on page 98. For more information, see Chapter 3, “Problem determination tools” on page 113.

► Check the Configuration link in the BGWEB for hardware issues. Examples are shown in 2.2, “Identifying the installed system” on page 57. For more information, see Chapter 3, “Problem determination tools” on page 113.

► If there is no indication where the problem might be, we recommend that you use the following sequence of checks:

– 2.3.1, “Check the operating system on the SN” on page 83
– 2.3.2, “Check communication services on the SN” on page 84
– 2.3.6, “Check the NFS subsystem on the SN” on page 90
– 2.3.12, “Check the network switches” on page 103
– 2.3.13, “Check the physical Blue Gene/L racks configuration” on page 104
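The process-level check mentioned above can be done with ps before going into the individual procedures. This is only a sketch: the daemon names used here are the ones that appear in the log file names and text of this chapter (db2, mmcs_db_server, idoproxydb, ciodb, bglmaster), and the exact process names should be confirmed against your own installation:

# ps -ef | egrep "db2|mmcs_db_server|idoproxydb|ciodb|bglmaster" | grep -v grep

If one of the expected daemons is missing from the output, start with the corresponding check in the list above.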



2.6 Identifying IBM LoadLeveler jobs to BGL jobs (CLI)

This section gives an overview of how to identify the jobs that are submitted to Blue Gene/L by IBM LoadLeveler. The tasks and actions that we present here are not intended to replace the procedures that are described in the LoadLeveler documentation. Rather, we provide some tips.

We start by setting the ‘-verbose’ option in the LoadLeveler commands file (.cmd), as shown in Example 2-31.

Example 2-31 LoadLeveler job command file with ‘-verbose’ option

bglsn:/bglscratch/test1 # cat ior-gpfs.cmd
#@ job_type = bluegene
#@ executable = /usr/bin/mpirun
#@ bg_size = 128
##@ bg_partition = R000_128
##@ arguments = -verbose 2 -exe /bglscratch/test1/hello-file-2.rts -args 6
#@ arguments = -verbose 2 -exe /bglscratch/test1/applications/IOR/IOR.rts -args -f /bglscratch/test1/applications/IOR/ior-inputs
##@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out
##@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err
#@ output = /bglscratch/test1/ior-gpfs.out
#@ error = /bglscratch/test1/ior-gpfs.err
#@ environment = COPY_ALL
##@ notification = error
##@ notify_user = loadl
#@ class = small
#@ queue

Then, you can check the job’s stderr(2) file for detailed information, as shown in Example 2-32.

Example 2-32 Verbose information in job’s stderr(2) file

..snip..
BE_MPI (Debug): Adding job bglfen1.itso.ibm.com.16.0 to the DB...
BRIDGE (Debug): rm_get_jobs() - Called
BRIDGE (Debug): rm_get_jobs() - Completed Successfully
SCHED_BRIDGE (Debug): Partition RMP24Mr104437181 - No BG/L job assigned to this partition
BRIDGE (Debug): rm_add_job() - Called
BRIDGE (Debug): rm_add_job() - Completed Successfully
BE_MPI (Debug): Job bglfen1.itso.ibm.com.16.0 was successfully added to the DB
BE_MPI (Debug): Quering the DB job ID
BE_MPI (Debug): DB job ID is 199
..snip..

Here, we can see the interaction between LoadLeveler and the Blue Gene/L database:

► LL job: bglfen1.itso.ibm.com.16.0
► Partition ID: RMP24Mr104437181
► BGL job ID: 199

You can map the information in this list to the job information, as reported by the llq command on the Front-End Node, as shown in Example 2-33.

Example 2-33 LoadLeveler queue information (llq command)

test1@bglfen1:/bglscratch/test1/applications/IOR> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.12.0             test1      3/24 09:45  R  50  small        bglsn
bglfen1.13.0             test1      3/24 09:46  R  50  small        bglfen1
bglfen1.15.0             test1      3/24 09:46  R  50  small        bglfen2
bglfen1.16.0             test1      3/24 09:47  R  50  small        bglfen1
bglfen1.17.0             test1      3/24 09:48  I  50  small

Also, you can identify the job that is running (Example 2-33) in the log file ~/loadl/log/StarterLog, from which we extracted the lines shown in Example 2-34.

Example 2-34 Job log file

..snip..
03/24 09:46:23 TI-0 bglfen1.13.0 Prolog not run, no program was specified.
03/24 09:46:23 TI-0 bglfen1.13.0 run_dir = /home/loadl/execute/bglfen1.itso.ibm.com.13.0
03/24 09:46:23 TI-0 bglfen1.13.0 Sending request for executable to Schedd
03/24 09:46:23 TI-0 03/24 09:46:23 TI-0 bglfen1.13.0 User environment prolog not run, no program was specified.
03/24 09:46:23 TI-0 LoadLeveler: 2539-475 Cannot receive command from client bglfen1.itso.ibm.com, errno =2.
03/24 09:46:23 TI-0 bglfen1.13.0 llcheckpriv program exited, termsig = 0, coredump = 0, retcode = 0
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending READY status to Startd
03/24 09:46:23 TI-0 bglfen1.13.0 Main task program started (pid=9064 process count=1).
03/24 09:46:23 TI-0 bglfen1.itso.ibm.com.13.0 Sending RUNNING status to Startd
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene partition id RMP24Mr104230153.
03/24 09:46:24 TI-0 bglfen1.13.0 Blue Gene Job Name bglfen1.itso.ibm.com.13.0.
..snip..

Here, we have the following relevant information:

► LoadLeveler ID: bglfen1.13.0
► LoadLeveler long job name: bglfen1.itso.ibm.com.13.0
► Partition name: RMP24Mr104230153

To cross-check with the Blue Gene/L database, we can query the TBGLJOB and TBGLJOB_HISTORY tables. The TBGLJOB table shows the currently running jobs, and the TBGLJOB_HISTORY table shows the jobs that have finished.

You can get all the job statistics from these tables, including the BGL job ID, using the DB2 statements shown in Example 2-35.

Example 2-35 Querying the DB2 database for LoadLeveler jobs

bglsn:/bglscratch/test1 # db2 "select jobid,jobname,blockid from TBGLJOB_HISTORY where jobname = 'bglfen1.itso.ibm.com.15.0'"

JOBID       JOBNAME                    BLOCKID
----------- -------------------------- ----------------
        198 bglfen1.itso.ibm.com.15.0  RMP24Mr104232176

1 record(s) selected.





Chapter 3. Problem determination tools

In this chapter, we introduce problem determination tools that are available on the Blue Gene/L system.

We start with an overview of each tool and then provide more detail on each tool in individual sections. We include information about when to use the tool according to the problem determination methodology and how to interpret the output of the tool.

We discuss the following tools:

► Hardware monitor
► Web interface
► Diagnostics
► MMCS console

In addition to these tools, Blue Gene/L provides operational logs, which are explained in 2.2.6, “Control system server logs” on page 61.



3.1 Introduction

This chapter explains how to monitor and analyze your system by using the available problem determination tools. Among these tools, we focus on the hardware monitor, the Web interface, the diagnostics suite, and the MMCS (Midplane Management Control System) console.

Each section contains subsections that illustrate how to start the tool and how to check the results.

3.2 Hardware monitor

The hardware monitor is a tool that is used to capture information about a specified piece of hardware of the Blue Gene/L system. After you start the hardware monitor on the running system, it keeps monitoring and storing the environmental information.

3.2.1 Collectable information

The hardware monitor gathers information about the following Blue Gene/L rack hardware:

► Fans
► Bulk power modules
► Service cards
► Node cards
► Link cards

For each of the devices in the previous list, multiple data items are available. For example, monitoring the fans captures temperature, voltage, speed, and status flags. See Table 3-1 for the complete list of the collectable data. Note that the data for fan modules and bulk power modules is collected by monitoring the service cards.

The hardware monitor stores all of this information in the DB2 database. It is accessible through the Environmental and Database sections in the Web interface, or you can query the database directly using SQL commands.
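If you prefer the SQL route, the DB2 system catalog can be used to find the relevant tables before querying them. This is a sketch only; the TBGL prefix is inferred from the table names shown elsewhere in this book, and the actual names should be confirmed with the Database browser described in 3.3, “Web interface”:

# db2 "select tabname from syscat.tables where tabname like 'TBGL%' order by tabname"

The returned table names can then be queried with ordinary SELECT statements, as shown later in Example 3-3.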



Table 3-1 The complete list of the collectable data

Hardware component    Data collected
Service card          Temperature, voltage, status flags
Link card             Temperature, voltage, power, status flags
Node card             Temperature, voltage, status flags
Fan modules           Temperature, voltage, speed, status flags
Bulk power modules    Temperature, voltage, status flags

Hardware monitor and RAS events

Among the information stored in the DB2 database, the hardware monitor generates a reliability, availability, and serviceability (RAS) event when it finds status information outside the normal range. Those events can be examined by issuing a query to the DB2 database, or they can be viewed using the Web browser interface.

Specifying the MONITOR facility on the RAS Event Query page returns the events generated by the hardware monitor. See 3.3, “Web interface” on page 119 for details about the Web interface.

The hardware monitor also records error messages in its own log file. When it is created, the file is stored in the directory /bgl/BlueLight/logs/BGL as -monitorHW-.log. The hardware monitor writes error messages when it cannot contact or recognize a piece of hardware (for example, when hardware has been removed, added, or replaced).

3.2.2 Starting the tool

Here, we discuss briefly how to start the hardware monitor. For more detailed information, see Blue Gene/L: System Administration, SG24-7178.

You can start the hardware monitor manually from the command line, or you can start it automatically with bglmaster.



Using the GUI

If you have a graphical user interface (GUI) on the Service Node (SN), run the commands shown in Example 3-1 to start the hardware monitor.

Example 3-1 Starting the hardware monitor

cd /bgl/BlueLight/ppcfloor/bglsys/bin
./startmon

This opens a new window, which is shown in Figure 3-1.

Figure 3-1 Initial screen of the hardware monitor

To start monitoring on the hardware:

1. Click Settings → Start Monitoring.
2. In the Start Monitoring window, click the drop-down list box and select ALL LINK CARDS.
3. Click Update.
4. Repeat this procedure for all available cards.


Using the console

You can also start the hardware monitor without a GUI. Issuing the commands in Example 3-2 allows you to start monitoring from a console.

Example 3-2 Start monitoring row 0 from a console

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./startmon --row 0 --autostart --nodisplay

Example 3-2 launches the hardware monitor for row 0 of the Blue Gene/L system, monitoring all the cards with the default interval time. The default time interval is 5 minutes for service cards and 30 minutes for the other cards.

Tip: You can use the --row --autostart option for the GUI startup as well. This option opens the window with all the cards monitored with the default interval time.

3.2.3 Checking the results

There are two ways of checking the environmental data that is collected by the hardware monitor. One is by using the Web interface, and the other is by using the GUI window of the hardware monitor. Both methods provide the same information, because they refer to the same DB2 tables.

To check the collected data through the Web interface:

1. Point a Web browser to:

http:///web/index.php

2. Click Environmental.
3. Select one of the information types from the top drop-down list box.
4. Specify the range for the date.
5. Specify the location if needed.
6. Click send.

This procedure returns a table that includes the data that was collected by the hardware monitor in the specified time interval. Figure 3-2 shows an example of querying the temperature of a service card.



Figure 3-2 Querying the environmental data

Also, as previously mentioned, the hardware monitor generates an RAS event in particular circumstances, such as when a hardware component cannot be contacted. Figure 3-3 on page 119 illustrates one of the error messages that is reported on the RAS event pages.

Note: It is a good idea to check both the environmental and the RAS event pages, because some information might come up in only one of these two pages.

Figure 3-3 RAS events generated by the hardware monitor

3.3 Web interface

The Web interface is one of the major tools that you can use for problem determination. It provides a number of ways to analyze your Blue Gene/L system. All the information is stored in the DB2 database that is running on the SN, and a Web browser interface gives an easy way of selecting the necessary information.

The Web interface provides the following basic sections:

► Configuration
► Runtime
► Environmental
► RAS events
► Diagnostics test results
► Database browser



3.3.1 Starting the tool

Figure 3-4 shows the top page of the Web interface.

Figure 3-4 Top page of the Web interface

Tip: Whenever the Blue Gene logo is shown at the top, left corner of a Web page, clicking the logo brings your Web browser back to the top page.

For the Web interface, there is no particular procedure that is required to start or to stop the tool. However, there are certain checks to make sure that the tool is working correctly:

1. Make sure that the DB2 database is running, because all the information accessed through the Web interface is stored in the DB2 database. If required, use the appropriate procedures to start the database, as shown in 2.3.4, “Check that DB2 is working” on page 87.

2. Check whether the Apache Web server is correctly configured and running, including the PHP module. Make sure that the Apache Web server starts when the system restarts and that the name resolution and Apache directories exist as they are defined in the configuration file. For details, see 2.3.3, “Check that BGWEB is running” on page 86.
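A quick command-line confirmation that the Web server answers on the SN can be made with curl, assuming curl is available. This is a supplementary sketch; the host name bglsn is taken from the examples in this book, so substitute your own SN name:

# curl -sI http://bglsn/web/index.php | head -1
HTTP/1.1 200 OK

Any response other than 200 points back to the Apache and PHP checks above.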



3.3.2 Checking the result

When the Web server and DB2 are working correctly, the Web interface is accessible by pointing a Web browser to:

http:///web/index.php

This top page provides links to six basic categories of information with a short description.

Configuration

The Configuration page provides information about the current status of hardware. It includes four subcategories:

► Hardware browser
► Link summary
► Service actions
► Problem monitor

Hardware browser

The hardware browser provides a Web page that gives you a simple and effective way to find which hardware is recognized and which is marked as missing.

If a piece of hardware is in a missing state, then that particular hardware is not available as a system resource. Hardware is marked missing, for example, when a service action is performed (maintenance operation) or when a piece of hardware is in an error state.

Throughout the Web pages, each piece of hardware is represented by its name, based on the naming convention (as listed in 1.2.11, “Rack, power module, card, and fan naming conventions” on page 19). Also, when the hardware browser finds a system resource in a missing state, it highlights the name of the missing system with a red box, as shown in Figure 3-5 on page 122.

Note: If you know the name of the particular hardware in which you are interested, you can enter the name of that hardware in the text box titled Find Hardware, and then press Enter. This jumps to the detailed information page for that hardware.



Figure 3-5 Missing hardware in a red box

The hardware browser provides the exact location when you find a system resource in a missing state and helps you to investigate the cause of that state.

You can also use information that you obtain from the hardware browser with other tools to concentrate on a specific issue. For example, you can use the location of the hardware and search for RAS events that are related to that location.

Link summary

The link summary page gives an overall picture of how the Blue Gene/L system is wired. It shows which midplane is connected to which other midplane in the X, Y, and Z directions.

This page helps you to make sure that all the data cables are connected, detected, and configured correctly.

Service actions

The service actions page shows service actions in progress and the history of completed service actions. A service action consists of two phases: PrepareForService and EndServiceAction. Between these two actions, a hardware service engineer can remove or replace defective or suspect hardware.



An entry of a service action in progress means that PrepareForService has been performed and the ID for the service has been assigned. To complete the service, EndServiceAction must be performed for that ID.

Using service actions allows a specified piece of hardware to be turned off and on for maintenance purposes.

Problem monitor

The problem monitor generates a list of hardware that has been detected with some sort of problem that needs to be solved. It gives the location of the hardware and a short description of the problem.

A location is presented as a link (see Figure 3-6) and helps you to jump quickly to the detailed page for the piece of hardware in question.

Figure 3-6 Problem monitor showing a midplane is in error state

Runtime

The Runtime page provides the current status of jobs and blocks. It includes four subcategories:

► Block information
► Job information
► Midplane information
► Utilization

For our discussion, we focus on the block information and the job information.



Block information

The block information page provides a table of available blocks and their current status (Figure 3-7). The Status and Job Status columns contain the information that might need attention.

Figure 3-7 List of blocks page and related information

For example, if a block is in Booting status for a long time, there might be a problem booting that block. Alternatively, if a job is in the Ready to start status for a long time, this might indicate that there are problems with loading the job.

Tip: Clicking the name of each column title returns a new table that is sorted by the selected column.

For the complete list of the types of status for a block, see Table 4-3 on page 162, and for the complete list of the types of status for a job, see Table 4-4 on page 162.

Note: A block that is created by LoadLeveler acts differently. For a job submitted through LoadLeveler, if a predefined block is not specified, one block is created and allocated dynamically. The dynamically created block remains in the table until the LoadLeveler job ends. Then, it is released (freed), and information about this block is lost.



Clicking a specific block ID reveals detailed information for that block. The Overlapping Blocks table provides useful information, especially if the users of the Blue Gene/L system are using mpirun to submit their jobs (see Figure 3-8).

It is also helpful to check the Hardware used by this block table, especially after you have changed the configuration of the system or you have replaced an I/O card. The table tells you whether a particular block is using the correct hardware.

Figure 3-8 Overlapping blocks and hardware in use

Job information

Job information is provided in two main tables. One includes information for the current jobs, while the other includes the history of jobs submitted. As noted earlier, the status column is also important for the job information table. If a job is in Error status, for example, it is a good idea to click the job ID to see the details of the job. Checking the Error Text can help you to determine the problem.



In addition, the Show RAS events for this job link at the bottom of the page (Figure 3-9) jumps to the RAS event page, showing only the RAS events related to the job in question.

Figure 3-9 Detailed job information

Environmental

You can use the Environmental page to view hardware information that is collected by the hardware monitor, as described in 3.2, “Hardware monitor” on page 114. Refer to that section for more information.

RAS events

The RAS event page allows you to search all RAS events that are stored in the DB2 database. Because the Blue Gene/L system generates a vast amount of events, the RAS event page provides an interface that accepts a number of variables so that you can pick up only the desired information. Specifying a time range, block, job ID, and location are typical examples (Figure 3-10 on page 127).

This RAS event page is one of the most frequently used tools for problem determination, because it gives you an idea of when and where the problem occurred. For example, the Entry Data suggests the cause of the problem.



Figure 3-10 Entry Data in the RAS event query result

Another feature of the RAS event page is the RAS drill down, which is accessible from the link located at the bottom of the page. This drill down provides the number of fatal or advisory RAS events that occurred in a specified period of time for each location.

Clicking the link in front of the location name expands the table to show more information. Compute card, for example, provides a link to a table of RAS events that are specific to the compute card when you fully expand a row in the drill down table, as shown in Figure 3-11 on page 128.



Figure 3-11 RAS drill down showing the number of errors and their location

Diagnostics test results

The Diagnostics test results page shows the results of a diagnostic test in various ways. The first page provides a brief test result for each block. If the last result column is a color other than green (Figure 3-12), you should look into the details by clicking the block ID.

Figure 3-12 The top page for Diagnostics



For detailed information about the diagnostic tests, including how to run the diagnostics, see 3.4, “Diagnostics” on page 131.

Details of the diagnostic test results are provided in the form of a table. Each row in the table represents a single test case with its test results. A number in the result column shows the count of hardware that passed or failed the test. All of the numbers are in the form of a link, and clicking one of the numbers brings you to a more detailed test result that is focused on each piece of hardware (Figure 3-13).

Figure 3-13 Testcase details

Each piece of hardware has individual log files for every test case that is performed in a diagnostic test. Those log files give you an idea of what each test case is looking for. Moreover, if you look into a log file of hardware that is reported as failed, you might find the reason for the failure marked in red, as shown in Figure 3-14.



Figure 3-14 Testcase highlighting the cause of failure

The diagnostic test results page for each block ID also provides two links at the bottom of the page:

► Show all failed tests for 
► Full log for run on 

Although both links are self-explanatory, checking the red and blue lines for the second link is useful. This second link provides a summary of the test at the end and also provides information about the current status of the system. As an example, the diagnostics test can mark a midplane unavailable (or M for missing), depending on the test result.

Database browser

The database browser page provides the complete list of database tables and views that are used in the Blue Gene/L system. All table and view names are presented as links. Clicking them shows you the structure of the definition and the actual data stored in the DB2 database.

Although the Database browser page allows you to look into each table or view, it is not always the best tool to explore the content of the database. For example, if the total number of currently active compute cards for the entire system is in question, there is a better way to obtain this information. Because the table and view names listed in the database browser page are the ones actually used in the system, looking up those names and using DB2 commands directly can help in some situations (see Example 3-3).

Example 3-3 Sample DB2 command with one of the names from database browser

$ db2 "select count(*) from TBGLPROCESSORCARD where status='A'"

1
-----------
         68

1 record(s) selected.



3.4 Diagnostics

The diagnostics suite performs a number of tests on the Blue Gene/L rack to determine system health. The test suite consists of two major sections:

► Blue Gene/L Compute and I/O nodes (BLC ASICs) for a midplane
► Blue Gene/L Link chips (BLL ASICs) for a whole rack

Because all data related to the Blue Gene/L system is stored in the DB2 database, the results of diagnostics are also stored in the database. These results are accessible through the Web interface Diagnostics section. For information about how to view these results, refer to “Diagnostics test results” on page 128.

A system administrator can run diagnostics anytime they are necessary. However, we recommend that you use diagnostics especially in the following circumstances:

► When you encounter a problem and cannot isolate whether the problem is caused by a software or hardware error, run diagnostics. Because the diagnostics suite checks the entire hardware, it helps you to isolate and determine what the problem is. In this case, the diagnostics suite is used for error detection.

► When you replace hardware, run diagnostics immediately after the replacement. The main focus of running the tests is not only on the newly inserted hardware but also on all the hardware in a rack. The diagnostics suite might find a piece of hardware with a few correctable errors across multiple test cases. This type of error might indicate a piece of hardware that requires attention. In this case, the diagnostics suite is used as a precaution.



3.4.1 Test cases

The diagnostics suite includes a number of test cases. Each of the following tables shows a list of test cases. Table 3-2 shows the list for BLC, which is for compute and I/O nodes. Table 3-3 on page 133 shows the list for BLL, which is for link chips. The tables also include a short description of each of the test cases.

Table 3-2 List of available testcases for BLC

Test case: Description

blc_powermodules: Queries the status of each power module on a node card.

blc_voltages: Queries the 1.5V and 2.5V power rails on a node card.

blc_temperatures: Queries all temperature sensors on a node card.

bs_trash_0: Generates random instructions and executes them on PPC instruction unit 0.

bs_trash_1: Same as bs_trash_0 but using unit 1 instead of unit 0.

dgemm160: Tests the floating point unit on the BLC ASIC.

dgemm160e: Extended diagnostics based on dgemm160 for testing the floating point unit.

dgemm3200: Tests both the floating point unit and the memory subsystem on the BLC ASIC for problems not found by the earlier tests.

dgemm3200e: Extended diagnostics based on dgemm3200 for testing the floating point unit and memory subsystem.

dr_bitfail: Writes to and reads from all DDR memory locations and flags all failures. Performs a simple memory test and prints a log description that attempts to identify the failing component. Identifies specific failing bits by ASIC and DRAM pin when possible.

emac_dg: Tests the Ethernet function in loopback on the BLC ASIC. Specifically checks the BLC ASICs, the PHYs, and the connection between the two.

gi_single_chip: Tests whether the global interrupt port is accessible and whether the global interrupt wires can be forced to 0 and 1 using the local global interrupt loopback.

gidcm: Tests whether communication through the global interrupt barrier network is fully functional. One of the compute nodes sends out signals in a specific pattern and the rest of the nodes receive the signals.

linpack: Runs single node Linpack. Aims to examine the hardware health using a program that an ordinary user would submit.

mem_l2_coherency: Exercises all paths from L2 to the connected memory and I/O modules for testing the BLC on-chip cache function.

ms_gen_short: Runs memory pattern tests, a complete memory test that checks the BLC ASIC function and the external DRAM.

power_module_stress: Makes the compute cards use as much power as possible and checks whether the power modules can cope with the power surge.

ti_abist: SRAM ABIST (SRAM Array Built-In Self Test) tests all on-chip SRAM arrays.

ti_edramabist: EDRAM ABIST (EDRAM Array Built-In Self Test) tests all on-chip Embedded DRAM arrays.

ti_lbist: LBIST (Logic Built-In Self Test) tests the on-chip logic for operation at frequency, using random patterns.

tr_connectivity: Checks basic connectivity of the collective network.

tr_loopback: Tests the on-chip functionality of the collective unit on the BLC ASIC.

tr_multinode: Thorough check of the collective network.

ts_connectivity: Checks basic connectivity of the torus network.

ts_loopback_0: Tests the on-chip functionality of the torus unit on BLC ASIC 0.

ts_loopback_1: Same as ts_loopback_0 but using unit 1 instead of unit 0.

ts_multinode: Thorough check of the torus network.

Table 3-3   List of available test cases for BLL

Test case             Description
bll_lbist             LBIST (Logic Built-In Self Test) tests the on-chip logic for operation at frequency, using random patterns.
bll_lbist_linkreset   Resets and re-initializes the link chips.
bll_lbist_pgood       Resets the link chips to a simple state after the system is turned on.
bll_powermodules      Queries the status of each power module on a link card.
bll_temperatures      Queries all temperature sensors on a link card.
bll_voltages          Queries the 1.5V and 2.5V power rails on a link card.


3.4.2 Starting the tool

There are several ways to run the diagnostic suite on the Blue Gene/L system. Among those, we recommend that you use the following scripts:

/bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh
/bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh

Both of these scripts accept a midplane or rack identifier as an argument (see Example 3-4). If none is specified, the scripts use R00 as the default rack identifier and R000 as the default midplane identifier.

Example 3-4   Specifying an identifier for the scripts

$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_rack_diags.ksh R01
$ /bgl/BlueLight/ppcfloor/bglsys/bin/submit_midplane_diags.ksh R010

The scripts also accept optional arguments. For a list of acceptable arguments, issue the command ./rundiag -h from the /bgl/BlueLight/ppcfloor/bgldiag/ directory, or for detailed instructions (the manual page), issue ./rundiag -help from the same directory.

You can also run a subset of tests from the diagnostic suite by choosing which test to run from a menu in interactive mode, as shown in Example 3-5.

Example 3-5   Invoking the diagnostic suite menu

$ cd /bgl/BlueLight/ppcfloor/bgldiags/common
$ ./rundiag -host localhost -block <blockID>
--- Blue Gene/L System Diagnostic Console ---
1    Power, Packaging & Cooling tests
2    Link chip diagnostic
3    Compute & IO node BIST engine tests
4    Compute & IO single node bootstrap tests
5    Compute & IO single node kernel tests
6    Compute & IO node global interrupt tests
7    Compute & IO multi-node tests
8    IO node only tests
9    Compute & IO node BLADE exercisers (e.g. dgemm3200e)
10   Compute & IO node power modules stress test
11   Compute & IO node BLRTS tests
19   Enter remarks or notes for this run (accepts also: 'remarks' or 'r')
20   Print the current remarks or notes for this run (accepts also: 'printremarks' or 'p')
exit exit (short: e,q)

Note: Running the diagnostics suite using the provided scripts creates its own block to run the tests. The block includes all the available I/O nodes that are installed in the system. For certain configurations, for example when one I/O card is installed per node card but only one Ethernet cable is connected to the node card, the diagnostic tests might cause a problem for this block. Because an I/O card supports two Ethernet cables and the diagnostic block expects to have two Ethernet connections, the block might fail to boot with a no ethernet link error message. Thus, depending on your configuration, some test cases might be unable to execute.

If you have such a configuration, use the rundiag command with the -block option instead of the script. Make sure that you specify a bootable (or valid) block for the option.

Tip: You can run test cases interactively from the menu, or you can add the -batch option to run test cases in batch mode.
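For instance, a batch run on a specific block might look like the following sketch. The -host and -block options are taken from Example 3-5 and -batch from the tip above; R000_128 is simply the block name used elsewhere in this book, and any further arguments that select particular test cases in batch mode are described by ./rundiag -help:

$ cd /bgl/BlueLight/ppcfloor/bgldiags/common
$ ./rundiag -host localhost -block R000_128 -batch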

For more detailed information, refer to the diagnostics documentation that comes with each driver release, which is located in /bgl/BlueLight/ppcfloor/bgldiags/doc.

3.4.3 Checking the results

The results of the diagnostic suite are stored in the DB2 database and in log files. You can use the Web interface for easy access to the content that is stored in the database. (See “Diagnostics test results” on page 128 for more information.)

You can also check the results in the log files. The log files are stored in the /bgl/BlueLight/logs/diags/ directory (on the SN). Each time the diagnostic suite runs, it creates a directory whose name consists of the start time, followed by the block ID and the midplane or rack identifier. Inside this directory, the diagnostics store various log files, scripts, and a directory for each individual test case.

While the Web interface provides a log file for each node in each test case, it is still worth looking at the files in the diags directory. Here is a typical example of such a case.

Assume that one of the test cases has failed and a large number of nodes has reported errors. You might first want to look into the individual error logs through the diagnostics test results page. Although it depends on the problem that you encounter, often some nodes return a different error message from the others. In such a case, instead of clicking each link for a log file on the Web interface, it is sometimes more efficient to look into the files under the logs directory. Some files might have a larger file size, which indicates that the particular node produced a larger amount of error messages.

Because those log files are plain text, if you are looking for a particular message, you can use tools such as awk, sed, grep, and so on to filter messages.
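As a minimal sketch of that approach (the run directory name and the search string below are placeholders, not actual output), you could sort a run’s log files by size and then search them for a suspect message:

# Change into the directory created for a particular diagnostics run; its name
# encodes the start time, the block ID, and the midplane or rack identifier.
$ cd /bgl/BlueLight/logs/diags/<run_directory>

# List the log files largest first; unusually large files often belong to the
# nodes that produced the most error messages.
$ ls -lS

# Show which files contain a particular message (the string is only an example).
$ grep -rl "parity error" .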

3.5 MMCS console

3.5.1 Starting the tool

The Midplane Management Control System (MMCS) console is a tool that provides various commands to control and maintain blocks and jobs (MMCS_DB). Although this tool is not designed for problem determination, it is an important and useful tool for obtaining the current status of the system.

To launch the MMCS console, issue the following commands on the SN:

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./mmcs_db_console

These commands open a shell with a prompt that starts with mmcs$. If launching mmcs_db_console does not provide the mmcs$ prompt and instead exits with an error message, check the following:

► mmcs_db_server, idoproxydb, and ciodb must be running on the SN.
► A path to the DB2 client libraries must be included in your PATH environment variable. Running the following command updates the PATH variable:

  $ . ~bgdb2cli/sqllib/db2profile

3.5.2 Checking the results

Most of the commands return their results to the console, just like an ordinary shell. Some of the commands that affect the system status might update the DB2 database as well.

To list all the available commands in mmcs_db_console, type help at the mmcs$ prompt. This command returns a list of commands and their syntax. If you know the command for which you are looking, then help <command> shows the syntax and a short description of the desired command.
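For example, to check the syntax of the allocate command that is used later in Example 3-8, you could type the following (the exact wording of the help output depends on the driver level):

mmcs$ help allocate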

Among the many useful commands that are provided by the MMCS console, we focus on the ones that are particularly helpful for problem determination.

Interacting with nodes

Generally, opening an interactive login shell on an I/O node is not desirable (because of the amount of memory required to open a shell). However, the MMCS console provides a tool, called write_con, which is useful if you have to look into an I/O node. The write_con utility allows you to submit a command to run on the I/O node. By using the target option, nodes and IDo chips can be specified.

Example 3-6 shows how to send the hostname command to all I/O nodes in the block. The same example also demonstrates the usage of the target option. The {i} option specifies all I/O nodes of the selected block. Each node is represented by the number in curly braces. In the example, the first I/O node returned its host name, ionode4, and it was recognized as node 119 in the system.

Example 3-6   Example of using write_con and the target option

mmcs$ redirect R000_128 on
OK
mmcs$ {i} write_con hostname
OK
mmcs$ Apr 06 18:59:41 (I) [1079031008] {119}.0: h
Apr 06 18:59:41 (I) [1079031008] {119}.0: ostname
ionode4
$
Apr 06 18:59:41 (I) [1079031008] {102}.0: h
Apr 06 18:59:41 (I) [1079031008] {102}.0: ostname
ionode3
$
Apr 06 18:59:41 (I) [1079031008] {17}.0: h
Apr 06 18:59:41 (I) [1079031008] {17}.0: ostname
ionode2

The use of the target option also allows you to specify one particular I/O node. For example, to send the hostname command to the I/O node that is recognized as #17, use {17} instead of {i}, as shown in Example 3-7.

Example 3-7   Targeting a specific node

mmcs$ redirect R000_128 on
OK
mmcs$ {17} write_con hostname
OK
mmcs$ Apr 06 19:33:34 (I) [1079031008] {17}.0: h
Apr 06 19:33:34 (I) [1079031008] {17}.0: ostname
ionode2

Tip: If you need to know the physical location of a given node number, the locate command provides a list for the selected block.

The redirect command, also used in Example 3-7, enables the output of the sent command to be displayed in the MMCS console. The output is also recorded in a log file for the specified node. The log file is located in the /bgl/BlueLight/logs/BGL directory.

Booting a block and submitting a job

You can use the MMCS console to check whether a block boots successfully and a job runs correctly. Example 3-8 illustrates booting a block and submitting a simple job, which is a good check when isolating a problem.

Example 3-8   A sequence of submitting a job

$ cd /bgl/BlueLight/ppcfloor/bglsys/bin
$ ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ list_blocks
OK
mmcs$ allocate R000_128
OK
mmcs$ list_blocks
OK
R000_128 root(1) connected
mmcs$ submit_job /bgl/hello/hello.rts /bgl/hello
OK
jobID=285
mmcs$ list_jobs
OK
mmcs$ free R000_128
OK
mmcs$ quit

After issuing the list_jobs command, do not forget to check the output of the program and confirm that it ran as expected.


Chapter 4. Running jobs

This chapter describes the parallel programming, compilers, and job submission environment on the Blue Gene/L system. We also briefly discuss the Message Passing Interface (MPI), which is the foundation for large-scale scientific and engineering applications. The topics covered are:

► Parallel programming environment
► Compilers
► Job submission mechanisms (how they plug into the Blue Gene/L environment)
  – Submit job (submit_job command) from the Midplane Management Control System (MMCS)
  – mpirun command (a stand-alone program for submitting jobs)
  – LoadLeveler, which is the IBM job management for AIX and Linux (batch queuing system)


4.1 Parallel programming environment

The Message Passing Interface (MPI) is a parallel programming environment that has been ported and modified to suit the Blue Gene/L system. In Blue Gene/L terminology, each partition or block is a set of compute nodes and I/O nodes. The Compute Node Kernel (CNK) is a very lightweight kernel that was developed specifically for the Blue Gene/L system and supports a very limited set of system calls (about 35% of the Linux kernel system calls).

All of the I/O calls (such as open, read, write, and so forth) for the compute nodes are shipped to the I/O nodes, which perform the actual work. Also, the compute nodes cannot be reached directly from the external network; they can be reached only through the I/O nodes. At any instant, only a single user can allocate and use a partition, in the sense that no context switching is possible on the compute nodes.

Each compute node is seen as a single process or thread by any application that is executing code on the system. Multiple compute nodes form a virtual communication network when an application executes. This concept is not specific to the Blue Gene/L system; any parallel application forms such a virtual set of executing processes. The MPI_COMM_WORLD argument that is specified in MPI library calls binds all of these processes into one group.

There are three basic communication modes in MPI:

► Point-to-point communication (for example, MPI_Send/MPI_Recv and so forth)
► Collective communication (for example, MPI_Scatter, MPI_Barrier, and so forth)
► Collective communication and computation (for example, MPI_Reduce and so forth)

Note: Certain non-blocking communication calls (for example, MPI_Isend/MPI_Irecv) exist in the MPI library that are out of the scope of our discussion here. These are simply point-to-point communication calls.

Let us briefly look at each of the modes:

► Point-to-point communication mode describes the communication between processes, which in our case takes place between compute nodes (one compute node runs one process at a point in time). In order to achieve fast communication between the compute nodes, a three-dimensional torus (3D torus) network was designed into the Blue Gene/L system. Each compute node is connected to its six neighbor compute nodes using the torus network, as shown in Figure 1-25 on page 26 (see also the short code sketch after this list).

► With collective communication mode, the communication for I/O calls and some MPI calls is passed through the collective network (see Figure 1-26 on page 27). However, one of the MPI calls, MPI_Barrier, uses a different network (see Figure 1-27 on page 29) that is implemented in the Blue Gene/L system. This is because this call specifically requires every process to reach a synchronized state before processing can continue. In summary, in order to achieve high bandwidth and low latency, MPI on the Blue Gene/L system is designed to take advantage of the underlying network topology.

► Collective communication and computation mode is similar in nature to the collective communication mode. However, it depends on the implementation details of MPI, which are out of the scope of our discussion here.
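The following minimal C sketch is not taken from the Blue Gene/L documentation; it simply illustrates, under standard MPI assumptions, the point-to-point and collective calls named in the list above. Rank 0 sends one integer to rank 1, and all ranks then synchronize with MPI_Barrier. It assumes the job runs with at least two processes and would be compiled in the same way as the sample program shown later in this chapter.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Point-to-point: rank 0 sends one integer to rank 1 */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... and rank 1 receives it */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Rank 1 received %d\n", value);
    }

    /* Collective: every rank waits here until all ranks have arrived */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}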

Example 4-1 shows the sample Hello world! program C code, which we use throughout different sections in this chapter when doing compilation and job submission on the system.

Example 4-1   Sample “Hello world!” program in C

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numprocs;   /* Number of processors */
    int MyRank;     /* Processor number */

    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* Find this processor number */
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

    /* Find the number of processors */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    printf("Hello world! from processor %d out of %d\n", MyRank, numprocs);

    /* Shut down MPI */
    MPI_Finalize();

    printf("I am the Root process\n");
    return 0;
}

The following steps explain the program shown in Example 4-1 for users who do not have experience with MPI:

1. The MPI environment is initialized using the MPI_Init call, by which all the processes recognize each other and start simultaneously.

2. MPI_Comm_rank gives the unique identity of each process within MPI_COMM_WORLD (the set of processes), and MPI_Comm_size gives the information about how many processes are in MPI_COMM_WORLD. MPI_COMM_WORLD is a dynamic parameter that is initialized and updated during the application execution. We explain this in 4.3.2, “The mpirun program” on page 150.

3. Each process prints the Hello world! message along with its identification and the total number of processes for the current execution.

4. The MPI_Finalize call terminates the MPI environment and all processes except the main process that executes this program (the root process).

   Note: The root process (rank zero, also called the master process or thread) of an MPI application should not be confused with the operating system root user ID.

5. The root process executes the printf() function (C programming call), and the program returns.

Note: For details about the MPI programming standard, refer to:
http://www.mcs.anl.gov/mpi


4.2 Compilers

The compiler is one of the major software packages that needs to be discussed before we move on to the execution environment. Depending on your site and the size of your Blue Gene/L system, you can install a number of FENs to balance the load of the user community. (Refer to Unfolding the IBM eServer Blue Gene Solution, SG24-6686, for an overview of the required software on the Front-End Nodes (FENs) and the Service Node (SN).)

To use the XLC/XLF compilers on the FENs (Linux PPC64 platform), you need a formal license agreement. Because Blue Gene/L has a different processor architecture (PowerPC 440) and there is no way to compile a job on the compute node, applications need to be cross-compiled. The XLC/XLF compilers require additional add-on RPMs for the Blue Gene/L system in order to compile user applications. On the SN, the Blue Gene/L control system RPMs are installed under the /bgl directory (a shared file system across the FENs and I/O nodes). The CNK supports applications that are compiled either with the blrts-gnu-gcc/g77 compilers or with the blrts_xlc/xlf compilers.

The cross-compiler environment can be summarized by the following required components:

► A Front-End Node running SuSE SLES 9 on PPC64 (POWER4 and POWER5)
► PowerPC-Linux-GNU to generate PowerPC-blrts-GNU
► The GNU tool chain for Blue Gene/L
► The IBM XL cross compilers for Blue Gene/L

Currently, to build binaries (executables) for Blue Gene/L, the IBM XL compilers require the following:

► Installation of the IBM XLC V7.0/XLF V9.1 compilers for SuSE SLES9 Linux/PPC64
► Installation of the Blue Gene/L add-on that includes Blue Gene/L versions of the XL run-time libraries, compiler scripts, and configuration files.
  – The GNU Blue Gene/L tool chain:
    gcc, g++, and g77 v3.2
    binutils (as, ld, and so forth) v2.13
    GLIBC v2.2.5
  – Blue Gene/L support is supplied through patches. You apply the patches and build the tool chain; IBM supplies scripts to download, patch, and build everything.

Note: For further reference on the compilers, refer to Unfolding the IBM eServer Blue Gene Solution, SG24-6686.

4.2.1 The blrts tool chain

The blrts tool chain is the GNU tool chain, but it has been adapted so that it generates code and programs that run in the Blue Gene/L environment. The following packages (open source and GPL license) are required for building the tool chain:

► binutils-2.13
► gcc-3.2
► gdb-5.3
► glibc-2.2.5
► glibc-linuxthreads-2.2.5

Because these RPMs can be downloaded from open source community Web sites, IBM provides the patches and a script for building the tool chain. This script, /bgl/BlueLight/ppcfloor/toolchain/buildBlrtsToolChain.sh, applies the patches and builds the blrts-gnu directory, where the compilers (powerpc-bgl-blrts-gnu-gcc/g77), debuggers (gdb), and so forth are installed.

The default cross compiler used for generating Blue Gene/L code is powerpc-bgl-blrts-gnu-gcc.

Note: Refer to the readme file for the tool chain, which is available on the customer site, when downloading updated drivers.

Compilation process using blrts-gnu-gcc/g77

You compile and link user applications on the FENs only. In Example 4-2, we compile the sample Hello world! program using the blrts-gnu-gcc compiler. This is a parallel program that requires the MPI libraries and header files to be included while generating the executable code.

Example 4-2   Compiling the “Hello world!” program using blrts-gnu-gcc

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc -o hello-world.rts hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts
test1@bglfen1:~/Examples/codes>

Tip: To simplify the syntax of the compile command line, you can supply the compiler options, the include files, and the libraries from a makefile.
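A minimal makefile sketch follows; it is not shipped with the driver, and the file and variable names are our own choices. It simply packages the compiler path, include path, and libraries from Example 4-2 so that typing make produces hello-world.rts (note that the recipe lines must begin with a tab character):

CC      = /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gcc
CFLAGS  = -I/bgl/BlueLight/ppcfloor/bglsys/include
LDFLAGS = -L/bgl/BlueLight/ppcfloor/bglsys/lib
LIBS    = -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts

hello-world.rts: hello-world.c
	$(CC) $(CFLAGS) -o $@ hello-world.c $(LDFLAGS) $(LIBS)

clean:
	rm -f hello-world.rts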

In this example, we compiled the hello-world.c program using the powerpc-bgl-blrts-gnu-gcc compiler and linked it against the MPI library (libmpich.rts), the run-time library (librts.rts), and the messaging layer and devices layer libraries (libmsglayer.rts, libdevices.rts) that are specific to the Blue Gene/L environment.

Note: The include directory path is required for compiling MPI programs because it contains the mpi.h file (plus additional header files).

Table 4-1 includes a list of the libraries that are required to compile and link applications.

Table 4-1   Libraries and their associated RPMs in the Blue Gene/L driver

Libraries          RPMs (under which they are present)
libmpich.rts.a     bglmpi-2006.1.2-1
libmsglayer.rts    bglmpi-2006.1.2-1
librts.rts.a       bglcnk-2006.1.2-1
libdevices.rts.a   bglcnk-2006.1.2-1

You can also compile and link parallel applications using the mpicc command, as shown in Example 4-3. Because MPI is already installed as a separate package (RPM), installing the Blue Gene/L driver adds the following scripts:

► mpicc (C compiler)
► mpicxx (C++ compiler)
► mpif77 (FORTRAN compiler)

Example 4-3   Compiling the “Hello world!” program using the mpicc command

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> mpicc -o hello-world.rts hello-world.c
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts
test1@bglfen1:~/Examples/codes> file hello-world.rts
hello-world.rts: ELF 32-bit MSB executable, PowerPC or cisco 4500, version 1 (embedded), statically linked, not stripped

If the compile process is successful, move the executable into the /bgl or /bglscratch file system (depending on which file systems are shared across the SN, FENs, and I/O nodes) in order to make the binary available for job execution. For further information about mpicc, refer to the /bgl/BlueLight/ppcfloor/bglsys/bin/mpicc script.

Note: The versions of the previously mentioned RPMs vary depending on the Blue Gene/L driver release.

4.2.2 The IBM XLC/XLF compilers

This section discusses the add-ons and extra RPMs that are essential for compiling applications using the IBM XLC/XLF compilers. The basic XLC/XLF compilers for SLES9 are installed on the FENs as a set of three RPMs for each of the C and Fortran compilers (with two of the RPMs common to both). The Blue Gene/L add-on RPMs are required to wrap around the generally available XLC/XLF compilers on the FEN systems because they are tuned especially for the Blue Gene/L version of the PowerPC 440.

When applications are compiled with these compilers, they can take advantage of the underlying processor architecture and, in turn, yield good timing and performance results. Also, depending on the application, you can experiment with optimization flags for best results.

The XL compilers require the blrts tool chain in order to function. Some of the blrts tool chain features are:

► The assembler and linker from the blrts tool chain are used to create programs that run in the Blue Gene/L environment.
► The run-time libraries are built with the blrts tool chain.
► Run-time routines from the blrts tool chain (glibc) are used by applications.
► Binary compatibility is maintained with the gcc compiler in the blrts tool chain (that is, you can link .o files from the XL compilers and the gcc compilers and they should run).
► Many of the other tools in the blrts tool chain (that is, gdb, gmon, and so forth) are supported.

There are four XLC/XLF RPMs that you must install on the FENs for compiling applications on the system:

► bgl-vacpp-7.0.0-5.ppc64.rpm (C/C++ compiler for Blue Gene/L)
► bgl-xlf-9.1.0-5.ppc64.rpm (Fortran compiler for Blue Gene/L)
► bgl-xlmass.lib-4.3.0-5.ppc64.rpm (MASS mathematical library for Blue Gene/L)
► bgl-xlsmp.lib-1.5.0-5.ppc64.rpm (dummy SMP library in case the -qsmp option has been used - for pre-compiled code)

You can download these RPMs from the Blue Gene/L customer site.

Tip: The xlmass and xlsmp options are common across the C and Fortran compilers. Because the XLC/XLF compilers support the -qsmp option, the dummy SMP library is provided to avoid errors in programs that require the use of this flag. These versions can be updated at a later stage. Refer to the IBM compiler download site for the latest information about the releases.

Compiling using blrts_xlc/xlf

With the IBM blrts_xlc/xlf compilers, user applications are compiled on the FENs. Then, the executable files are copied to the shared file system (NFS-mounted /bgl or /bglscratch) that is mounted on the I/O nodes. After the executable files are available there, the process follows compiler steps similar to those described in “Compilation process using blrts-gnu-gcc/g77” on page 146 and illustrated in Example 4-4. The advantage of using the blrts_xlc/xlf compilers over the GNU compilers is that the IBM compilers provide numerous additional flags that can optimize and tune large-scale applications on the Blue Gene/L system.

Example 4-4   Compiling the “Hello World!” program using the blrts_xlc compiler

test1@bglfen1:~/Examples/codes> ls
hello-world.c
test1@bglfen1:~/Examples/codes> blrts_xlc -o hello-world.rts hello-world.c -I/bgl/BlueLight/ppcfloor/bglsys/include -L/bgl/BlueLight/ppcfloor/bglsys/lib -lmpich.rts -lrts.rts -lmsglayer.rts -ldevices.rts
test1@bglfen1:~/Examples/codes> ls
hello-world.c hello-world.rts

4.3 Submitting jobs using built-in tools

You can submit a job on a Blue Gene/L system in three ways:

► Using the submit_job command from the Midplane Management Control System (MMCS)
► Using the mpirun command (a stand-alone program for submitting jobs)
► Using LoadLeveler, which is the IBM job management for AIX and Linux (batch queuing system)

4.3.1 Submitting a job using MMCS

The system administrator can submit jobs on Blue Gene/L by logging on to the MMCS console. As shown in 2.2.8, “Job submission” on page 66, the main idea of providing this access for the administrator is to check the status of the partitions, I/O nodes, and IP addresses, and to know which file systems have been mounted on the I/O node(s), and so on. For submitting a job, refer to 2.3.8, “Check that a simple job can run (mmcs_console)” on page 96.

4.3.2 The mpirun program

The mpirun program is a stand-alone program that is used to execute parallel applications on the system. It is a command that is bundled in the MPI RPM of the Blue Gene/L driver set, named bglmpi-2006.1.2-1 (for the Blue Gene/L driver version used in this book). This RPM is also installed along with the control system on the SN. Users are not allowed to log on to the SN, so mpirun is designed so that the way it operates through the FENs and the SN is transparent to the user.

The mpirun program is divided into two components: a front-end mpirun (which executes on the FEN) and a back-end mpirun (which executes on the SN). This distinction is made to ensure secure access to the control system database. Normally, the DB2 client is installed on the SN only.

Let us consider an example in which we submit a job using mpirun on the Front-End Node. This node communicates with the back-end mpirun, which queries the database about the partition state (in our case, a predefined partition). Depending on the query result, it decides whether to go ahead and boot the partition (or, if it has already been allocated, to return a message). See Figure 4-1.

Figure 4-1   Front-end and back-end mpirun communication (the front-end mpirun on a FEN communicates through rsh/ssh with the back-end mpirun on the SN, which runs under rshd/sshd)

Remote command execution setup

In this section, we discuss the two remote execution environments that we used in our sample environment, which is shown in Figure 4-2:

► Remote shell (rsh/rshd)
► Secure shell (ssh/sshd)

Figure 4-2   Sample environment used for this redbook (one SN and two FENs running SLES9 on 64-bit PPC, connected through the service and functional network switches to Blue Gene/L rack 00, midplane R00-M0)

Remote shell (rsh/rshd)

The system administrator must enable the rshd service (part of the xinetd system) on the SN and the FENs so that users can execute remote commands between the SN and the FENs.

The system administrator can set up the remote shell server (rshd) on the SN and FENs by first checking whether rsh is enabled, using the /sbin/chkconfig command as follows:

# /sbin/chkconfig --list rsh
xinetd based services:
    rsh: on

If the status is off, you can enable the service as follows:

# /sbin/chkconfig --set rsh on ; /etc/init.d/xinetd start rsh

Note: xinetd stores the child daemons’ configuration files in the /etc/xinetd.d directory. Alternatively, you can check whether rshd is enabled in the rsh configuration file that is located in this directory. If the disable = no line shown in the following example is present, or if this line is missing altogether, then rshd is enabled.

# cat /etc/xinetd.d/rsh
# default: off
# description:
# The rshd server is a server for the rcmd(3) routine and,
# consequently, for the rsh(1) program. The server provides
# remote execution facilities with authentication based on
# privileged port numbers from trusted hosts.
#
service shell
{
    socket_type     = stream
    protocol        = tcp
    flags           = NAMEINARGS
    wait            = no
    user            = root
    group           = root
    log_on_success  += USERID
    log_on_failure  += USERID
    server          = /usr/sbin/tcpd
#   server_args     = /usr/sbin/in.rshd -L
    server_args     = /usr/sbin/in.rshd -aL
    instances       = 200
    disable         = no
}

To allow remote commands to execute without a password prompt between the SN and FENs, the remote shell server (rshd) must be aware of the identities that are allowed for these operations. We tested two ways to implement this type of execution:

► A system-wide implementation (/etc/hosts.equiv, which could pose a potential security threat)
► A user-based implementation, which requires populating the ~/.rhosts file in the users’ home directories (permission 600)

Example 4-5 shows the contents of the /etc/hosts.equiv file and the ~/.rhosts file that we created for our environment (one SN, two FENs, service network, functional network, and public network). Figure 4-2 on page 151 illustrates this environment.

Example 4-5   The rsh setup using /etc/hosts.equiv

test1@bglfen1:~> cat /etc/hosts.equiv
#
# hosts.equiv    This file describes the names of the hosts which are
#                to be considered "equivalent", i.e. which are to be
#                trusted enough for allowing rsh(1) commands.
#
# hostname
bglfen1
bglfen2
bglfen1_fn
bglfen2_fn
bglsn
bglsn_fn

Example 4-6 shows the ~/.rhosts file for user test1 on the SN and FENs. We recommend this method, because it allows for more access granularity.

Example 4-6   rsh setup on a per-user basis (user test1)

test1@bglfen1:~> cat ~/.rhosts
bglfen1 test1
bglfen2 test1
bglsn test1
bglfen1_fn test1
bglfen2_fn test1
bglsn_fn test1

Note: A similar configuration is performed on the Service Node and the Front-End Nodes (the ~/.rhosts file or the /etc/hosts.equiv file is distributed to the SN and FENs). Best cluster practices require that you set up the same user identities (user ID and name) on all nodes (in our case, test1).

After remote shell is set up, check that it is working as expected. You have to test remote command execution through rsh from each node to every other node. Check that the SN and FENs can talk to each other using /usr/bin/rsh in both directions across the functional network:

► From the SN:
  – Using the date command, check the connection to a FEN. Repeat this for each FEN.
    test1@bglsn_fn:~> rsh bglfen1_fn date
    Mon Apr 3 14:41:44 EDT 2006
  – Using the date command, check the SN itself.
    test1@bglsn_fn:~> rsh bglsn_fn date
    Mon Apr 3 14:41:44 EDT 2006

► From the FENs:
  – Using the date command, check the connection to the SN.
    test1@bglfen1_fn:~> rsh bglsn_fn date
    Mon Apr 3 14:41:44 EDT 2006
  – Using the date command, check the connection to the other FENs. Repeat this for each FEN.
    test1@bglfen1_fn:~> rsh bglfen2_fn date
    Mon Apr 3 14:41:44 EDT 2006

If there is any problem executing the date command across nodes, check the rshd server on the SN and FENs and the /etc/hosts.equiv or ~/.rhosts files.

Secure shell (ssh/sshd)

An alternate way for remote command execution (needed by the job submission process if your environment requires enhanced security) is secure shell (ssh/sshd). You can use secure shell with mpirun: the mpirun command has an option for specifying which remote shell to use when submitting a job (-shell). We recommend that you set up secure shell on the SN and FENs in such a way that remote command execution is allowed (for the designated users) but logging in to the Service Node is not.

Note: This behavior (remote command execution allowed but login not allowed) is desired for security reasons. Remote shell (rsh/rshd) allows this behavior because another daemon, usually either rlogind or telnetd, services the login requests. However, by default, the secure shell server (sshd) allows both remote command execution and login; you can alter this behavior to serve your purpose. Refer to 7.2.4, “Using ssh in a Blue Gene/L environment” on page 406 and the no-pty option in the authorized_keys file.
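As a rough, generic OpenSSH illustration of that technique (this snippet is not taken from the referenced section, and the key material is a shortened placeholder), prefixing a public key with the no-pty option in the user's ~/.ssh/authorized_keys file on the SN allows remote command execution with that key while refusing an interactive terminal:

# ~/.ssh/authorized_keys on the SN (one entry per line; key shortened)
no-pty ssh-rsa AAAAB3NzaC1yc2EAAA...placeholder... test1@bglfen1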

Environment setup for mpirun

You can use the mpirun command to submit (parallel) jobs to the Blue Gene/L system. You must set up the correct environment for mpirun to function properly. This section presents the tasks that you need to accomplish to set up mpirun correctly.

Front-End Nodes (FENs)

You can set up the Front-End Node in two ways:

► Export the MMCS_SERVER_IP variable in ~/.bashrc or ~/.cshrc (depending on the user shell) using one of the following commands:
  – export MMCS_SERVER_IP=SN_IP_Addr   (in ~/.bashrc)
  – setenv MMCS_SERVER_IP SN_IP_Addr   (in ~/.cshrc)

► Use one of the numerous options of the mpirun command. For example, you can use the -host option to specify the SN when submitting a job instead of setting the MMCS_SERVER_IP environment variable.

In addition to MMCS_SERVER_IP, there are other variables that you can set in the user’s shell to control several aspects of job execution (however, most options can be overridden with mpirun command line arguments when submitting jobs from the FENs).

Service Node

There are three basic settings that are required in the user environment in order to execute a job successfully using mpirun:

► BRIDGE_CONFIG_FILE

  The bridge configuration file lists the images that are required to boot a dynamic partition on the Blue Gene/L system. When submitting jobs, you can use either the -partition or the -shape command line argument. You can use the -partition option only if you have a predefined partition stored in the Blue Gene/L database. If you want to specify the shape of the partition (the configuration of the Blue Gene/L internal networks) instead, use the -shape parameter (for a dynamically generated partition).

► DB_PROPERTY

  This setting is used by the mpirun back-end to access the database on the SN when checking the availability of the requested partition and its state. For information about block states, refer to Table 4-3 on page 162.

► Sourcing the db2profile

  The db2profile (script file) sets the DB2 database environment variables (including, but not limited to, the binaries and library path for the back-end mpirun). You should set these variables for every user on the system in their ~/.bashrc or ~/.cshrc files, as shown in Example 4-7.

Example 4-7   Contents of the bridge config file and the db.properties file

test1@bglfen1:~> cat bridge_config_file.txt
BGL_MACHINE_SN
BGL_MLOADER_IMAGE
BGL_BLRTS_IMAGE
BGL_LINUX_IMAGE
BGL_RAMDISK_IMAGE

test1@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> cat db.properties
database_name=bgdb0
database_user=bglsysdb
database_password=bglsysdb
# database_password=db24bgls
database_schema_name=bglsysdb
system=BGL
min_pool_connections=1
# Web Console Configuration
mmcs_db_server_ip=127.0.0.1
mmcs_db_server_port=32031
mmcs_max_reply_size=8029
mmcs_max_history_size=2097152
mmcs_redirect_server_ip=default
mmcs_redirect_server_port=32032

The default bridge configuration file is /bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config.

These environment variables are required for every user on the Blue Gene/L system before submitting jobs. Example 4-8 shows a sample ~/.bashrc file for user test1.

Example 4-8   Simple ~/.bashrc file

test1@bglfen1:~> cat ~/.bashrc
hstnm=`hostname`
if [ $hstnm = "bglsn" ]
then
    export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt
    export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
    source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
fi

Job setup validation

After you have configured the remote shell and set up the environment, you can test user applications using the test that is provided with mpirun. Example 4-9 shows an mpirun check across the FENs and SN. An Exit status: 0 message indicates that the environment is configured properly.

Example 4-9   mpirun using the -only_test_protocol argument

test1@bglfen1:/bglscratch/test1> mpirun -only_test_protocol -exe /bglscratch/test1/hello-world.rts -np 128 -verbose 1
FE_MPI (Info) : Initializing MPIRUN
FE_MPI (Info) : Scheduler interface library loaded
FE_MPI (WARN) : ======================================
FE_MPI (WARN) : = Front-End - Only checking protocol =
FE_MPI (WARN) : = No actual usage of the BG/L Bridge =
FE_MPI (WARN) : ======================================
BE_MPI (WARN) : ======================================
BE_MPI (WARN) : = Back-End - Only checking protocol  =
BE_MPI (WARN) : = No actual usage of the BG/L Bridge =
BE_MPI (WARN) : ======================================
BRIDGE (Info) : The machine serial number (alias) is BGL
FE_MPI (Info) : Back-End invoked:
FE_MPI (Info) : - Service Node: bglsn
FE_MPI (Info) : - Back-End pid: 6806 (on service node)
FE_MPI (Info) : Preparing partition
FE_MPI (Info) : Adding job
FE_MPI (Info) : Job added with the following id: 123
FE_MPI (Info) : Starting job 123
FE_MPI (Info) : Waiting for job to terminate
FE_MPI (Info) : BG/L job exit status = (0)
FE_MPI (Info) : Job terminated normally
BE_MPI (Info) : Starting cleanup sequence
BE_MPI (Info) : == BE completed ==
select: Interrupted system call
FE_MPI (Info) : == FE completed ==
FE_MPI (Info) : == Exit status: 0 ==

Note: The mpirun command has numerous command line arguments. To find out about them in detail, refer to the mpirun user guide that was installed during your Blue Gene/L system setup.

Job tracking using the Web interface

After the job environment has been configured, you can proceed to job submission. When you have installed and configured the Blue Gene/L drivers, you can use the Web interface to check your system. You can access the Web interface by pointing your Web browser to the following URL:

http://<SN hostname>/web/index.php

For more details about the Web interface, see 3.3, “Web interface” on page 119.

Note: On the SN, the system administrator should set up the Web interface (Web server configuration) to allow browsing from the local network. For more information, refer to 2.3.3, “Check that BGWEB is running” on page 86.

System administrators define the set of partitions (blocks) based on the Blue Gene/L system configuration at their site. For example, on a four-rack system, the set of predefined blocks can include 32-node blocks, 128-node blocks, 2-rack, and 4-rack blocks, apart from the normal midplane and rack configurations. This setup is very much dependent on the site configuration. When this configuration is defined, the user community can browse the Web site to see the availability of the partitions and their states.

Figure 4-3 shows the Web interface.

Figure 4-3   Web interface

In Figure 4-3, clicking Runtime displays the total number of available predefined partitions and the dynamically created partitions on the system. Each row contains information about a partition’s state, as shown in Figure 4-4.

Figure 4-4   Partitions available on the system

Note: When you submit a job using mpirun with the -shape option or through IBM LoadLeveler, the partition is created dynamically and the name of the allocated block starts with RMP.

Table 4-2 briefly describes the columns shown in Figure 4-4.

Table 4-2   Description of the job information Web page

Column                Description
Block ID              Indicates the name of the partition or block
Owner                 Indicates who created the block
Description           Indicates how the block was created (using LoadLeveler or predefined in the database)
Status                Indicates the status of the block
Time status updated   Indicates the last time the status of the block was modified
Time created          Indicates the time that the block was created
Size                  Indicates the size of the block (32, 128, 512, 1024, 2048, and so forth)
Job status            Indicates the job’s current status (refer to Table 4-4 on page 162)

Before submitting a job, you should check the job and block Web interface to see which partitions are free for use. Another way to check the availability of the blocks is to use the DB2 command line interface on the SN (which can be done only by the system administrator), as shown in Example 4-10.

Example 4-10   The DB2 CLI command to get the block information on the SN

bglsn:~ # db2 'connect to bgdb0 user bglsysdb using bglsysdb'

   Database Connection Information

 Database server        = DB2/LINUXPPC 8.2.3
 SQL authorization ID   = BGLSYSDB
 Local database alias   = BGDB0

bglsn:~ # db2 "select substr(blockid,1,16)blockid,STATUS,OWNER from BGLBLOCK"

BLOCKID          STATUS OWNER
---------------- ------ ---------------------------
R000_128         I      root
R000_J102_32     F
R000_J104_32     F
R000_J106_32     F
R000_J108_32     F

  5 record(s) selected.

The Web page shown in Figure 4-4 also includes links to Job Information and Block Information. Table 4-3 describes the block states.

Table 4-3   Block states on the Blue Gene/L system

Block state      Description
(E)rror          An initialization error has occurred. You must issue an allocate, allocate_block, or free command to reset the block.
(F)ree           The block is available to allocate.
(A)llocated      The block has been allocated by a user. IDo connections have been established, but the block has not been booted.
(C)onfiguring    The block is in the process of being booted. This is an internal state for communicating between LoadLeveler and MMCS DB.
(B)ooting        The block is in the process of being booted but has not yet reached the point at which CIOD has been contacted on all I/O nodes.
(I)nitialized    The block is ready to run application programs. CIODB has contacted CIOD on all I/O nodes.
(D)e-allocating  The block is in the process of being freed. This is an internal state for communicating between LoadLeveler and MMCS DB.

Until now, we have discussed the partition states. Initially, when mpirun boots the partition, the job state is not known. However, after the block is booted, the job status changes from Queued to Start, then to Running and, finally, to Terminated. Table 4-4 describes the job status types after the block has booted.

Table 4-4   Blue Gene/L job states (without IBM LoadLeveler)

Job status     Description
(E)rror        An initialization error occurred. The _errtext field contains the error message.
(Q)ueued       The job has been created in the BGLJOB database table but has not yet started.
(S)tart        The job has been released to start but has not yet started running.
(R)unning      The job is running.
(T)erminated   The job has ended.
(D)ying        The job has been killed but has not yet ended.

Note: The job states shown in Table 4-4 differ from those of jobs submitted through LoadLeveler. Refer to 4.4, “IBM LoadLeveler” on page 167 for more information.

4.3.3 Example of submitting a job using mpirun

Each job or application runs in its own block and takes over the entire block. (A block cannot be shared, entirely or partially, at any time between two different jobs.) Therefore, each job can take anywhere from 32 compute nodes (the smallest block size) up to the entire system.

Note: The details of the job submission process differ depending on whether you are using a job scheduler such as LoadLeveler, directly invoking mpirun, or submitting jobs on the SN through mmcs_db_console.

Submitting a job to the Blue Gene/L system using mpirun requires several command line arguments, as shown in Example 4-11. In this example, we use rsh, which is the default remote command program.

Example 4-11   Job submission using the mpirun command (by default uses rsh)

test1@bglfen1:/bglscratch/test1> mpirun -partition <blockID> -np <number of processes> -exe <executable> -cwd <working directory> -args <program arguments>


Figure 4-5 shows a diagram of the job submission process using mpirun.

Figure 4-5   Job submission process

The job submission process is as follows:<br />

1. The user edits, compiles, and so forth the job code on one of the FENs.<br />

2. The user moves the job to the Cluster Wide File System (CWFS). The CWFS<br />

can be any supported type of file system, such as the NFS or GPFS that is<br />

available to the FENs, the SN, and the I/O nodes. The job and work files can<br />

reside permanently on the CWFS or can be copied there from the FEN’s local<br />

file system.<br />

3. The job scheduler assigns a block to the user.<br />

4. Alternatively to step 3, the system administrator working through the SN can<br />

assign a block to the user.<br />

5. The SN initializes the block (that is, it boots the Compute Nodes and I/O nodes).

6. The block’s I/O nodes load the job code and data from the CWFS and begin<br />

its execution.<br />

7. All job I/O runs to and from the CWFS. The job executes on the Compute<br />

Node. Its data travels across the collective network to the I/O node and<br />

across the Functional Network (Gigabit Ethernet) to the CWFS.<br />



Job runtime information (Web interface)<br />

After the job is submitted using mpirun, you can view its status through the

Web interface. Figure 4-6 shows information about the jobs’ state.<br />

Figure 4-6 Information about each running job<br />

Table 4-5 explains the columns shown in Figure 4-6.<br />

Table 4-5 Job information columns<br />

Column                 Description

Job ID                 The ID of the job currently running on the system

Block ID               The partition on which the job is running

User name              The user who is running the job

Job name               The mpirun job name (indicates from which FEN the job was submitted)

Mode                   The mode under which the job is or will be running (mpirun option)

Status                 The job status (see the discussion of Table 4-4)

Status last modified   The time of the last state change (for example, from Q to S to R to T)



Users and system administrators can check the job status and obtain detailed<br />

information by following the Job ID link. Figure 4-7 shows a sample job (ID 240).<br />

Figure 4-7 Detailed description of a sample job<br />

The following paragraphs provide an overview of the mpirun checklist for the<br />

required options to submit a job on the Blue Gene/L system.<br />

The mpirun checklist<br />

Use this mpirun checklist for the required options to submit a job on the Blue<br />

Gene/L system:<br />

► Check the MMCS_SERVER_IP environment variable on the FEN (echo $MMCS_SERVER_IP). If this variable is empty or is not set to the SN IP address, set it. Refer to the FEN

environment setup as described in “Environment setup for mpirun” on<br />

page 155.<br />



► Check the bridge configuration and database properties files and their corresponding environment variables ($BRIDGE_CONFIG_FILE and $DB_PROPERTY). If these variables are empty or the files have the wrong contents, correct the files and then set the variables.

► Check whether the db2profile file (which is in the<br />

/bgl/BlueLight/ppcfloor/bglsys/bin directory) is sourced on the SN.<br />

► Set and verify the remote command execution environment as described in<br />

“Remote shell (rsh/rshd)” on page 151 or “Secure shell (ssh/sshd)” on<br />

page 154.<br />

► Test the environment by using mpirun with the -only_test_protocol option to check the job submission path through the FEN and SN (refer to Example 4-9 and the sketch that follows).
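For example, the following minimal check (a sketch from our test environment; the partition name R000_128 is only an example, so substitute a partition that exists on your system) verifies the variables and exercises the FEN-to-SN path without booting a partition or running a job:

# Confirm that the Blue Gene/L related variables are set (empty output means not set).
echo "MMCS_SERVER_IP=$MMCS_SERVER_IP"
echo "BRIDGE_CONFIG_FILE=$BRIDGE_CONFIG_FILE"
echo "DB_PROPERTY=$DB_PROPERTY"
# Exercise only the mpirun front-end and back-end communication path.
mpirun -partition R000_128 -only_test_protocol -verbose 2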

4.4 IBM LoadLeveler<br />

IBM LoadLeveler is a software package that provides utilities for job submission<br />

and scheduling (workload management) on various UNIX platforms. Blue<br />

Gene/L is one of the platforms supported by IBM LoadLeveler.<br />

Due to the characteristics of the Blue Gene/L platform, to use LoadLeveler, in<br />

addition to the basic LoadLeveler knowledge, you need information that is<br />

specific to the Blue Gene/L environment. You can find information about<br />

LoadLeveler in the official IBM manual IBM LoadLeveler Using and<br />

Administering <strong>Guide</strong>, SA22-7881. LoadLeveler software comes separately from<br />

the Blue Gene/L software. Depending on the system type, a different version of<br />

LoadLeveler is installed. For the latest documentation, see:<br />

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=<br />

/com.ibm.cluster.loadl.doc/llbooks.html<br />

In this section, we present an overview of how LoadLeveler works in the Blue<br />

Gene/L environment. This information is essential for identifying and analyzing<br />

problems.<br />

4.4.1 LoadLeveler overview<br />

A LoadLeveler cluster is a collection of nodes (stand-alone systems and LPARs)<br />

that you can use to run your computing jobs. LoadLeveler manages the defined<br />

resources and schedules jobs based on resource availability and workload<br />

characteristics. Jobs are run on the nodes. LoadLeveler uses a central instance<br />

for managing the entire cluster. Thus, this can be considered a management<br />

domain type of cluster.<br />



A node that is part of a LoadLeveler cluster has a file structure that contains the LoadLeveler code and configuration. Some nodes in the cluster are assigned special functions that are carried out by daemons. The most important daemons are Master, Negotiator (also known as the Central Manager), Schedd, and Startd.

Figure 4-8 illustrates the daemons and their relationship on a single-node<br />

LoadLeveler cluster. The job to be submitted goes to the Central Manager, which<br />

dispatches it to the Scheduler. After analyzing the job characteristics<br />

(requirements) and checking for available resources, the Scheduler sends the<br />

job to the Start daemon. The Start daemon starts the Starter process to run the<br />

job.<br />

In the case of a parallel job, the Start daemon initiates multiple Starters. Each Starter

runs a parallel task. However, this is true only in the case of “classic” parallel<br />

systems, which run their jobs on SMP machines (each node running a full copy<br />

of the operating system), clustered by a traditional clustering infrastructure. For<br />

Blue Gene/L, LoadLeveler works in a different way, because it does not interact<br />

with Compute nodes (which do not provide the kernel support for running<br />

multiple processes at the same time) or I/O nodes.<br />


Figure 4-8 LoadLeveler daemons on a single-node cluster<br />

To expand a single-node cluster into a multi-node cluster, multiple instances of<br />

the daemons can run on each node. Not all the daemons are required to run on<br />

every node. Depending on which daemons are running on a node, the node has<br />

a specific role in a LoadLeveler cluster.<br />
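To see which role a particular node plays, you can check which LoadLeveler daemons are running on it. A minimal check (the LoadL_* process names shown are the usual daemon names and can vary by release):

# List the LoadLeveler daemons running on this node.
ps -ef | grep LoadL_ | grep -v grep
# Typical daemons: LoadL_master, LoadL_negotiator, LoadL_schedd, LoadL_startd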



The most important node is the Central Manager node, which runs the Central<br />

Manager daemon. There is only one Central Manager instance in a LoadLeveler<br />

cluster, even though one (or more) alternate Central Managers can be defined<br />

for failover and recovery purposes. The remainder of the nodes in the cluster<br />

have the option of running the Scheduler daemon, the Start daemon, or both.<br />

Figure 4-9 shows LoadLeveler daemons running on different nodes in a<br />

multi-node cluster.<br />

Figure 4-9 A multi-node LoadLeveler cluster

Note: The LoadLeveler Central Manager daemon serves as the single point of<br />

control, storage, and management of the cluster and job information. This<br />

daemon must be running for the LoadLeveler cluster to function.<br />

It is possible to have a mixed LoadLeveler cluster, that is, a cluster in which nodes run different operating systems. The operating systems supported by LoadLeveler are IBM AIX, SUSE Linux, and Red Hat Linux. For details on versions

that are supported, check the IBM LoadLeveler readme files that come with the<br />

product.<br />

Note: In a mixed cluster, the binary files are not compatible between different<br />

operating systems. You must configure the cluster so that jobs are scheduled<br />

on the appropriate nodes. See IBM LoadLeveler Using and Administering<br />

<strong>Guide</strong>, SA22-7881 for further information.<br />



4.4.2 Principles of operation in a Blue Gene/L environment<br />

In a Blue Gene/L environment, the LoadLeveler code runs on the SN and several FENs. These nodes form a LoadLeveler cluster that consists of Linux nodes with the usual LoadLeveler configuration (and nothing specific to Blue Gene/L).

Because LoadLeveler does not run on the Blue Gene/L system or the I/O nodes,<br />

it relies on the bridge API from the SN to access these nodes. (Figure 4-10<br />

illustrates that the Blue Gene/L system is not actually part of the LoadLeveler<br />

cluster.) LoadLeveler calls the bridge API functions through the control server to<br />

set up the Blue Gene/L partitions. After the partitions are created, the job<br />

information is passed to the mpirun front-end and back-end programs, which<br />

have the role of submitting the job to the Blue Gene/L system.<br />

Figure 4-10 The Blue Gene/L nodes are outside of the LoadLeveler cluster

The LoadLeveler Central Manager daemon (CM or negotiator) has been<br />

enhanced with bridge API calls for Blue Gene/L operations. It uses the bridge<br />

API function calls to query the partition information from the Blue Gene/L<br />

database and uses these function calls to carry out operations such as:<br />

► Adding or removing a partition record<br />

► Sending operations to a service controller<br />

► Checking the status of the partitions<br />



The system administrator has to define the Central Manager on the SN when configuring the LoadLeveler cluster. Thus, the SN must be part of the LoadLeveler cluster. Aside from the Blue Gene/L file structure and the database, the LoadLeveler file structures and locations are set up similarly to a regular LoadLeveler cluster.

Note: Although you can configure an alternate Central Manager (as in a<br />

regular LoadLeveler cluster), failover to a node other than the SN is not<br />

desirable. As a result of a failover, the alternate Central Manager no longer

has access to the Blue Gene/L control server and database.<br />

The LoadLeveler Scheduler daemon (schedd) schedules jobs according to<br />

workloads and resources on each node in the cluster. In a Blue Gene/L system,<br />

schedd treats mpirun jobs as common jobs (not Blue Gene/L specific) that run on<br />

the SN or FENs. However, the SN is reserved for Blue Gene/L administrative<br />

workload. Therefore, it is not desirable that schedd runs on the SN. The system<br />

administrator can choose to run schedd on all or some FENs. Thus, schedd is not<br />

shown in Figure 4-10 and Figure 4-11 on page 171.<br />

The scheduler (schedd) does not schedule the job on the Blue Gene/L compute<br />

nodes. It decides which FEN runs the mpirun and passes the job information to<br />

the LoadLeveler Start daemon (startd) on that node.<br />

Figure 4-11 Central Manager accesses Blue Gene/L nodes through the bridge API

Usually, the FENs are where users submit jobs into the LoadLeveler queue. Not<br />

all of the FENs need to run schedd either. However, the LoadL_admin file<br />

specifies at least one node as the public scheduler (schedd). The LoadLeveler<br />



Start daemon (startd) runs on each FEN. The startd daemon receives mpirun<br />

information from schedd. Then, startd starts the LoadLeveler Starter process<br />

(starter), which starts the mpirun job on the local node.<br />

Figure 4-12 shows the output from the LoadLeveler llstatus command with the<br />

LoadLeveler cluster running.<br />

Figure 4-12 The llstatus output on a Blue Gene/L system<br />

4.4.3 How LoadLeveler plugs into Blue Gene/L<br />

The LoadLeveler software works similarly on many platforms. The following<br />

tasks apply specifically to Blue Gene/L:<br />

► Configuring LoadLeveler for Blue Gene/L<br />

► Making the Blue Gene/L libraries available to LoadLeveler<br />

► Setting Blue Gene/L specific environment variables<br />

► Using Blue Gene/L specific keywords in job command file<br />

In the following sections, we discuss these tasks in more detail.<br />

4.4.4 Configuring LoadLeveler for Blue Gene/L<br />

To enable LoadLeveler to recognize a Blue Gene/L system, the LoadL_config file<br />

must contain specific keywords, as shown in Example 4-13 with the<br />

recommended values. Only the LoadLeveler system administrator can change<br />

these keywords in LoadL_config, and they should not be changed while<br />

LoadLeveler is running. See also IBM LoadLeveler Using and Administering<br />

<strong>Guide</strong>, SA22-7881.<br />



Example 4-13 Blue Gene/L specific configuration keywords<br />

BG_ENABLED = true<br />

BG_CACHE_PARTITIONS = true

BG_ALLOW_LL_JOBS_ONLY = false<br />

BG_MIN_PARTITION_SIZE = 32<br />

The keyword BG_ENABLED is essential. Setting it to true tells LoadLeveler that this<br />

is a Blue Gene/L cluster. LoadLeveler then uses the Blue Gene/L bridge API to<br />

talk with the Blue Gene/L control system.<br />

Setting the keyword BG_CACHE_PARTITIONS to true tells LoadLeveler to reuse<br />

existing partitions which have been previously allocated by LoadLeveler.<br />

The keyword BG_ALLOW_LL_JOBS_ONLY is set to false to allow users to run mpirun

programs without using LoadLeveler.<br />

The keyword BG_MIN_PARTITION_SIZE specifies that the smallest number of<br />

computing nodes allowed in a partition is 32.<br />
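To confirm which values are actually configured, you can inspect the configuration file directly. The following is a sketch; the path assumes the common default location of LoadL_config in the loadl user's home directory, which might differ on your system:

# Display the Blue Gene related keywords from the LoadLeveler configuration.
grep -i "^BG_" /home/loadl/LoadL_config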

4.4.5 Making the Blue Gene/L libraries available to LoadLeveler<br />

The Blue Gene/L bridge API library is provided with the Blue Gene/L code,<br />

together with the DB2 libraries. In order for LoadLeveler to access these libraries,<br />

the system administrator needs to set up appropriate symbolic links.<br />

On the SN or FEN, these libraries are usually referenced in the /usr/lib64 or /usr/lib directories. However, the actual library binary can reside in another location. Instead of copying the binaries to /usr/lib64 or /usr/lib, using symbolic

links avoids the situation where two different binaries of the same library exist in<br />

two separate locations.<br />

Note: On the other hand, symbolic links might be broken when there are changes in software levels. Broken links are sometimes not detected and can cause

problems.<br />

Therefore, the libraries used by LoadLeveler are listed for reference purposes.<br />

You should check the links and the binary checksums when users encounter<br />

errors. A brief description of each library is provided to help identify problems.<br />

Note: The library directory paths and names can be system-specific. You can<br />

use the ldconfig and ldd commands to check library links and dynamic<br />

dependencies.<br />
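For example, the following commands (a sketch that uses the library names listed in this section) confirm that the links resolve and that the dynamic dependencies of the bridge library can be found:

# List the library links; ls -L reports an error for any link whose target is missing.
ls -lL /usr/lib64/libbglbridge.so /usr/lib64/libbgldb.so /usr/lib64/libdb2.so /usr/lib64/libllapi.so
# Show the dynamic dependencies of the bridge library; "not found" entries indicate a problem.
ldd /usr/lib64/libbglbridge.so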



The binaries are located in /bgl/BlueLight/ppcfloor/bglsys/lib64. Their symbolic<br />

links are in /usr/lib64. The following group of libraries belongs to the 64-bit code<br />

version:<br />

libbgldb.so, libtableapi.so   Provide the interface to the database tables on the SN for the bridge API.

libbglmachine.so              Provides the interface to the Blue Gene/L hardware.

libbglbridge.so               Provides a 64-bit bridge API library that is used to access the state of and send orders to the Blue Gene/L system.

libsaymessage.so              Provides a 64-bit library that is used by Blue Gene/L software for producing log messages.

Some libraries come in both 64-bit and 32-bit versions (from both DB2 and<br />

LoadLeveler). Although the library names are the same, you should link the 64-bit version in /usr/lib64 and the 32-bit version in /usr/lib.

We use libdb2.so as an example. You can find the file libdb2.so in<br />

/opt/IBM/db2/V8.1/lib64 and /opt/IBM/db2/V8.1/lib. Create proper links to point to<br />

the appropriate version.<br />

The libdb2.so library is a standard 64-bit DB2 client library. It is used by 64-bit<br />

programs to connect to the DB2 database for queries and updates.<br />

The LoadLeveler libraries that are located in /opt/ibmll/LoadL/full/lib are:<br />

► libllapi.so<br />

► libsched_if.so<br />

► libsched_if32.so<br />

► libllpoe.so<br />

The first two libraries are in 64-bit format and need to be linked to /usr/lib64. The<br />

remaining two libraries are in 32-bit format and need to be linked to /usr/lib.<br />

libllapi.so                        Provides LoadLeveler's 64-bit API library. It is used by the LoadLeveler daemons, commands, and external 64-bit programs that need to access the LoadLeveler API.

libsched_if.so, libsched_if32.so   Include the interfaces between mpirun and LoadLeveler. The mpirun program uses these API calls to get job parameters from LoadLeveler (the partition in which to start, and so on).

libllpoe.so                        Provides a 32-bit version of libllapi.so. Although the binary name is libllpoe.so, point the link to /usr/lib/libllapi.so.



Note: In a conventional LoadLeveler installation (using rpm and the install_ll<br />

script that is provided), libllpoe.so, libsched_if.so, and libsched_if32.so are<br />

copied to the appropriate directories and do not need to be linked.

Example 4-14 shows a sample script that sets up the required links.

Example 4-14 LoadLeveler script to set up required links

DVR_DIR=/bgl/BlueLight/ppcfloor<br />

cd /usr/lib64<br />

ln -f -s /opt/IBM/db2/V8.1/lib64/libdb2.so.1 libdb2.so.1<br />

ln -f -s libdb2.so.1 libdb2.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbgldb.so.1 libbgldb.so.1<br />

ln -f -s libbgldb.so.1 libbgldb.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libtableapi.so.1 libtableapi.so.1<br />

ln -f -s libtableapi.so.1 libtableapi.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbglmachine.so.1 libbglmachine.so.1<br />

ln -f -s libbglmachine.so.1 libbglmachine.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libbglbridge.so.1 libbglbridge.so.1<br />

ln -f -s libbglbridge.so.1 libbglbridge.so<br />

ln -f -s $DVR_DIR/bglsys/lib64/libsaymessage.so.1 libsaymessage.so.1<br />

ln -f -s libsaymessage.so.1 libsaymessage.so<br />

ln -f -s /opt/ibmll/LoadL/full/lib/libllapi.so libllapi.so.1<br />

ln -f -s libllapi.so.1 libllapi.so<br />

cd /usr/lib<br />

ln -f -s /opt/IBM/db2/V8.1/lib/libdb2.so.1 libdb2.so.1<br />

ln -f -s libdb2.so.1 libdb2.so<br />

4.4.6 Setting Blue Gene/L specific environment variables<br />

You can use the following variables in the Blue Gene/L environment. You set<br />

these variables to point to the appropriate configuration files and SN. The<br />

variables are:<br />

► BRIDGE_CONFIG_FILE<br />

► DB_PROPERTY<br />

► MMCS_SERVER_IP<br />



Because these variables are user specific, you can set them in the ~/.bashrc file

under the user’s home directory. Example 4-15 shows the content of the<br />

~/.bashrc that we used in our test environment.<br />

Example 4-15 Setting up environment variables with ~/.bashrc<br />

test1@bglsn_~/>cat ~/.bashrc<br />

# .bashrc<br />

# User specific aliases and functions<br />

# Source global definitions<br />

if [ -f /etc/bashrc ]; then<br />

. /etc/bashrc<br />

fi<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

. /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

export MMCS_SERVER_IP=bglsn.itso.ibm.com<br />

The bridge.config file contains the Blue Gene/L system serial number and the<br />

names and locations of the images to be loaded onto compute and I/O nodes.<br />

The db.properties file contains DB2 database information and the Blue Gene/L<br />

Web console configuration.<br />

Note: These environment variables are required for the LoadLeveler<br />

commands to work properly. Checking the values of these variables and the<br />

contents of the configuration files is an important task in problem<br />

determination.<br />
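A quick way to perform this check is shown in the following sketch, which only confirms that the variables are set, that the files they point to are readable, and that the SN is reachable:

# Verify the environment variables and the files they reference.
echo "BRIDGE_CONFIG_FILE=$BRIDGE_CONFIG_FILE"
test -r "$BRIDGE_CONFIG_FILE" || echo "ERROR: bridge.config missing or unreadable"
echo "DB_PROPERTY=$DB_PROPERTY"
test -r "$DB_PROPERTY" || echo "ERROR: db.properties missing or unreadable"
echo "MMCS_SERVER_IP=$MMCS_SERVER_IP"
ping -c 1 "$MMCS_SERVER_IP"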

4.4.7 LoadLeveler and the Blue Gene/L job cycle<br />

LoadLeveler is a complex job scheduling subsystem. A job can go through<br />

multiple stages from the time it is submitted into the queue until it finishes. A job<br />

in the queue can frequently be seen in one of these states: Idle, Starting,<br />

Running, Held, Remove Pending, and so on. For detailed information, see IBM

LoadLeveler Using and Administering <strong>Guide</strong>, SA22-7881.<br />



Table 4-6 provides a quick overview of the LoadLeveler job states, which are<br />

referred to in the problem determination process.<br />

Table 4-6 Summary of job states in LoadLeveler<br />

(I)dle
  Brief description: The job is being considered to run on a system, although no system has been selected.
  Remarks: The system here is a FEN, not the Blue Gene/L system or nodes.

User (H)old
  Brief description: The job has been put on user hold.
  Remarks: There are many reasons why a job might be put on hold.

(ST)arting
  Brief description: The job is starting: it has been dispatched, received by the target system, and the job environment is being set up.
  Remarks: Job information is being passed to mpirun.

(R)unning
  Brief description: The job is running: it has been dispatched and started on the designated system.
  Remarks: LoadLeveler has completed passing the job information to mpirun. See the (different) Blue Gene/L job states.

Remove Pending (RP)
  Brief description: The job is in the process of being removed.
  Remarks: The mpirun job is being removed according to the Blue Gene/L job status.

A LoadLeveler job in Blue Gene/L goes through different states. During the job<br />

starting process, the job information is passed to the mpirun front end. At this<br />

point, the job is in Running state in the LoadLeveler queue. However, the mpirun<br />

process picks up the job from there and starts the tasks on a Blue Gene/L<br />

partition. Then, the Blue Gene/L job goes through different states in the partition.

Note: The Blue Gene/L job status is different from the LoadLeveler job states<br />

(see Table 4-6 and Table 4-4 on page 162).<br />

The user who submits a job into the LoadLeveler queue usually checks the job

status through LoadLeveler commands. However, using LoadLeveler commands<br />

you cannot see the Blue Gene/L partition and job status. The system<br />

administrator has access to the Blue Gene/L service console to check the Blue<br />

Gene/L database.<br />

The user and system administrator have two different views of the job. Seeing<br />

where the job is in its life cycle can help determine its status. Also, if a job fails,<br />

the user and system administrator have to trace back to determine where the<br />

failure has occurred.<br />



4.4.8 LoadLeveler job submission process<br />

Throughout the steps in the LoadLeveler job submission process, we use a<br />

simple mpirun job running a “Hello world!” program. Example 4-16 lists a sample<br />

job command file that we used in our environment. See “Job command file” on

page 198 for brief descriptions of the keywords that this file uses.<br />

Example 4-16 A sample LoadLeveler job command file named hello.cmd<br />

#@ job_type = bluegene<br />

##@ executable = /usr/bin/mpirun<br />

#@ executable = /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />

#@ bg_size = 128<br />

#@ arguments = -verbose 4 -exe /bgl/hello/hello.rts<br />

#@ output = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error = /bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />

#@ environment = COPY_ALL:BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config:DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties:MMCS_SERVER_IP=bglsn.itso.ibm.com

#@ notification = error<br />

#@ notify_user = loadl<br />

#@ class = small<br />

#@ queue<br />

Note: In the steps in this process, we assume that the user is logged in to one<br />

of the FENs.<br />

The following steps detail the LoadLeveler job’s life cycle:<br />

Step 1: Submitting the job<br />

A user submits a job from an FEN using the following LoadLeveler command:<br />

llsubmit hello.cmd<br />

The command returns a message that indicates that the job has been submitted.<br />

The job is associated with a LoadLeveler Job ID, and the job information is sent<br />

to the LoadLeveler Central Manager (CM). The CM daemon runs on the SN and<br />

receives job information through IP communication (the IP port on which the CM<br />

is listening). This TCP/IP port is specified in the LoadL_config file.<br />

As shown in Example 4-17, the llq command does not display the Blue Gene/L<br />

job information at this point, because the job has only been queued into<br />

LoadLeveler.<br />



Example 4-17 Output of the llq -b command<br />

loadl@bglfen1:/bgl/loadl> llq -b<br />

Id Owner Submitted LL BG PT Partition Size<br />

________________________ __________ ___________ __ __ __ ________________ _____<br />

bglfen1.47.0 loadl 3/29 12:14 I<br />

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

Figure 4-13 shows the LoadLeveler job submission process on Blue Gene/L.<br />

Figure 4-13 Job submission process

Step 2: Getting Blue Gene/L information<br />

The LoadLeveler CM receives the job information (at this point, the status of the<br />

job is I for idle). Meanwhile, the CM retrieves a snapshot of the Blue Gene/L<br />

system using the bridge API (from the Blue Gene/L database). The information<br />

requested by CM includes:<br />

► A list of nodes<br />

► The locations of those nodes

► The status of those nodes<br />

The bridge API calls are responded to by the control server, which accesses the<br />

DB2 database for this information. The control server daemon also updates the<br />

database with the status of the Blue Gene/L hardware.<br />



Figure 4-14 shows the process of LoadLeveler retrieving information from the<br />

Blue Gene/L database through the bridge API.<br />


Figure 4-14 Getting Blue Gene/L information<br />

Step 3: Deciding on the partition to use<br />

With the information received from the previous step, the CM constructs a 3D<br />

model of the Blue Gene/L system in memory. The CM then uses the 3D model to<br />

determine the partition to use for the job. Part of this decision process is whether<br />

to reuse an existing partition or to create a new partition. Figure 4-15 illustrates<br />

this process.<br />

Note: LoadLeveler has full responsibility for manipulating the partitions that it

creates for running jobs.<br />



Figure 4-15 Deciding on the partition

Step 4: Updating the Blue Gene/L database<br />

The chosen partition for the job is either an existing one or a new one. If it is a<br />

new partition, it is created dynamically. The CM uses the bridge API to insert the<br />

partition record into the database (see Figure 4-16). At this point in the process,<br />

nothing happens (yet) on the Blue Gene/L system.<br />

Figure 4-16 Updating the Blue Gene/L database


Steps 5 and 6: Initializing the partition<br />

For the dynamically created partition, the CM uses the bridge API to change the<br />

partition state to allocating (A). This change in state triggers the booting of the<br />

partition under the control of the Blue Gene/L daemons (see Figure 4-17). The<br />

state of the Blue Gene/L partition now changes to I for initializing and<br />

LoadLeveler, represented by the root user on the SN, is the owner of this<br />

dynamic partition.<br />

Note: The state of the job in the LoadLeveler queue is idle (I), but the state of the

Blue Gene/L partition is initializing (also I). Although the two states are both<br />

abbreviated to I, the meaning of the state is obviously different. At the end of<br />

this step, the job in the LoadLeveler queue changes to ST for “starting,” which is

a transition state, for a very short time.<br />


Figure 4-17 Initializing the partition<br />

Step 7: Launching mpirun<br />

LoadLeveler goes ahead and schedules the job on one of the FENs. The Start<br />

daemon (startd) receives the job and initiates a Starter process. The Starter<br />

launches the mpirun front-end process, which communicates with the mpirun<br />

back-end process on the SN (Figure 4-18).<br />

At this point, the job status in the LoadLeveler queue is changed to running (R).<br />

LoadLeveler is basically done passing the job to mpirun, which now has the

responsibility to run the job.<br />



Note: The state of the mpirun job in the LoadLeveler queue is running (R).

However, the state of the Blue Gene/L partition is now changing from<br />

initializing (I) to ready (R).<br />

This step can be thought of as the point at which a user invokes the mpirun command (without using LoadLeveler). However, the mpirun front-end process needs to perform a small additional check: by calling the LoadLeveler API, it verifies that LoadLeveler is installed and can run. One of the checks is to make sure that the LoadLeveler configuration allows mpirun to run jobs outside of LoadLeveler. See the discussion of the BG_ALLOW_LL_JOBS_ONLY configuration

parameter in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172.<br />

Figure 4-18 LoadLeveler launching mpirun



Steps 8 and 9: Starting the parallel job<br />

The back-end mpirun process uses the bridge API to set the partition to ready (R)<br />

state in the database, which triggers the control daemon to execute the job on<br />

the partition (see Figure 4-19).<br />

Figure 4-19 Starting the parallel job

Note: Again, the job state in the LoadLeveler queue is running (R), while the partition state that represents the job is ready (also R). Both states are

abbreviated as R but obviously have a different meaning.<br />



Step 10: Waiting for the job to complete<br />

The mpirun back-end process uses the bridge API to poll for the job status until<br />

the job is complete (see Figure 4-20). During this time, mpirun monitors job<br />

activities.<br />

Figure 4-20 Waiting for the job to complete

Note: LoadLeveler is not aware of what is going on with the job that is running<br />

in the Blue Gene/L partition. Thus, the job status in the LoadLeveler queue

remains in the running (R) state during this period.<br />

Steps 11 and 12: Cleaning the partition<br />

After mpirun receives “Job complete” status, the mpirun processes terminate,<br />

and the LoadLeveler Starter process terminates as well. The startd daemon<br />

receives the job status from the Starter process and sends it to schedd, which reports

the job status back to the CM on the SN. The CM uses the bridge API to set the<br />

state of the partition to F for free, which means that it will not be reused. At this<br />

point, the job cycle completes (see Figure 4-21).<br />



Figure 4-21 Cleaning the partition

Note: If LoadLeveler is configured to reuse partitions, then the partition is not freed. Instead, it is marked ready (R) to be reused for the next job (if it fits).

4.4.9 LoadLeveler checklist<br />


You can use the tasks presented in this section for scanning normal LoadLeveler<br />

status with attention to details. You can use these checks to spot any abnormal<br />

aspects and to investigate hard-to-find problems or pitfalls.<br />

The checklist includes:<br />

► LoadLeveler cluster and node status<br />

► LoadLeveler run queue<br />

► Job command file<br />

► LoadLeveler processes, logs, and persistent storage<br />

► LoadLeveler configuration keywords<br />

► Environment variables, network, and library links<br />

LoadLeveler cluster and node status<br />

The llstatus command displays the status of the LoadLeveler cluster.<br />

Figure 4-22 shows the following important information that is provided by the<br />

llstatus command:<br />

1. Blue Gene is present. This message means that LoadLeveler can talk with<br />

the Blue Gene/L control server.<br />



2. The Central Manager is defined on node bglsn.itso.ibm.com. This node is<br />

the Blue Gene/L service node. This message is a good indication that the CM<br />

is up and running.<br />

3. Scheduler daemons (Schedd) are available. They are ready to schedule jobs<br />

on two of the FENs.<br />

Tip: Schedd dispatches the mpirun jobs.

4. The job starting daemons (Startd) are idle. They are ready to start jobs that<br />

come their way. Startd forks a child process called Starter, which then starts<br />

the mpirun front-end process.<br />

Figure 4-22 llstatus command output<br />


loadl@bglfen1:/usr/lib64> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />


When comparing Figure 4-22 to Figure 4-23 on page 188, you can see the

following problems in the LoadLeveler cluster:<br />

1. Blue Gene is absent. This message indicates that LoadLeveler cannot talk<br />

to the control server. As a result, jobs are not able to run.<br />

2. The Central Manager is now served from node bglfen2.itso.ibm.com, which is a node other than the service node. As noted earlier, such a failover is not desirable in a Blue Gene/L cluster, so it is worthwhile to pay attention to the error messages.

3. One of the schedd daemons is down, which means that LoadLeveler is having<br />

some problems on node bglfen1.itso.ibm.com. However, this issue could also<br />

be normal if the system administrator previously decided not to run schedd on<br />

this node.<br />



4. One of the start daemons is down (indicated by the idle status), which also means that there are some problems with LoadLeveler on node

bglfen2.itso.ibm.com.<br />

5. One node is absent. Although LoadLeveler can function with missing (or<br />

absent) nodes, the individual nodes might have important roles in the cluster.<br />


loadl@bglfen1:/usr/lib64> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Down 0 0 Idle 0 0.00 1184 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com, but is unusable<br />

Alternate Central Manager is serving from bglfen2.itso.ibm.com<br />


The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is absent<br />

The following machines are absent<br />

bglsn.itso.ibm.com<br />

Figure 4-23 <strong>Problem</strong>s reported from the llstatus command<br />

In the worst case scenario, the llstatus command does not return any<br />

information but just error messages similar to those in Example 4-18.<br />

Example 4-18 Error messages reported from llstatus regarding LoadL_negotiator errors<br />

loadl@bglfen1:/usr/lib64> llstatus<br />

03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

03/29 16:52:50 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llstatus: 2512-301 An error occurred while receiving data from the<br />

LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />

The error messages should be interpreted accordingly. For example, the following error message could be interpreted in two ways:

2539-463 Cannot connect to bglsn.itso.ibm.com "LoadL_negotiator" on port<br />

9614. errno = 111<br />



This message could mean either that the LoadLeveler Negotiator (Central Manager) daemon is down or that LoadLeveler is not running at all.
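One quick way to narrow this down is to check the daemons and the Central Manager port on the SN, as in the following sketch (port 9614 is the port reported in the error message above and is site specific):

# On the SN, check whether the LoadLeveler daemons are running.
ps -ef | grep LoadL_ | grep -v grep
# Check whether anything is listening on the Central Manager port from the error message.
netstat -an | grep 9614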

The llstatus command in Blue Gene/L<br />

The llstatus command provides Blue Gene/L specific options (command flags),

such as -b, -B, -P, which display information that is related to Blue Gene/L. (See<br />

IBM LoadLeveler Using and Administering <strong>Guide</strong>, SA22-7881, for a complete<br />

reference on the llstatus command.) Using these option flags, you can check<br />

and compare the LoadLeveler Blue Gene/L related information with information<br />

from other sources such as the Web interfaces.<br />

Issuing the llstatus -b command shows the overall dimension of the Blue<br />

Gene/L system and jobs in the queue. In Example 4-19 and Example 4-20, the<br />

llstatus -b is issued on two different Blue Gene/L systems.<br />

Example 4-19 The llstatus -b command on a one-midplane Blue Gene/L system<br />

loadl@bglfen1:~> llstatus -b<br />

Name Base Partitions c-nodes InQ Run<br />

BGL 1x1x1 8x8x8 0 0<br />

Example 4-20 The llstatus -b command on a 20-rack Blue Gene/L system<br />

thanhlam@bgwfen1:~> llstatus -b<br />

Name Base Partitions c-nodes InQ Run<br />

BGL 5x4x2 40x32x16 0 0<br />

Issuing llstatus with the -b and -l options (combined) displays more detail of<br />

the Blue Gene/L system structure, including network switches and cabling.<br />

Example 4-21 shows output from llstatus -b -l on a one-midplane Blue

Gene/L system.<br />

Note: Because this is the minimum configuration possible (one midplane), all<br />

Blue Gene/L specific networks (Torus, Barrier, and Collective) are inside the<br />

midplane. Thus, there is no additional cabling (wiring).<br />

Example 4-21 The llstatus -b -l command on a single midplane system<br />

loadl@bglfen1:~> llstatus -b -l | more<br />

Total Blue Gene Base Partitions 1<br />

Total Blue Gene Compute Nodes 512<br />

Machine Size in Base Partitons X=1 Y=1 Z=1<br />

Machine Size in Compute Nodes X=8 Y=8 Z=8<br />

-- list of base partitions --<br />



Z = 0<br />

=====<br />

+------------+<br />

| R000|<br />

0 | |<br />

| |<br />

+------------+<br />

-- list of switches --<br />

Switch ID: X_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: X<br />

Switch Connections: NONE<br />

Switch ID: Y_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: Y<br />

Switch Connections: NONE<br />

Switch ID: Z_R000<br />

Switch State: UP<br />

Base Partition: R000<br />

Switch Dimension: Z<br />

Switch Connections: NONE<br />

-- list of wires --<br />

Wire Id: R000X_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_X<br />

ToComponent=X_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: R000Y_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_Y<br />

ToComponent=Y_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: R000Z_R000<br />

Wire State: UP<br />

FromComponent=R000 FromPort=MINUS_Z<br />

ToComponent=Z_R000 ToPort=PORT_S0<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: X_R000R000<br />

Wire State: UP<br />

FromComponent=X_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_X<br />



PartitionState=NONE Partition=NONE<br />

Wire Id: Y_R000R000<br />

Wire State: UP<br />

FromComponent=Y_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_Y<br />

PartitionState=NONE Partition=NONE<br />

Wire Id: Z_R000R000<br />

Wire State: UP<br />

FromComponent=Z_R000 FromPort=PORT_S1<br />

ToComponent=R000 ToPort=PLUS_Z<br />

PartitionState=NONE Partition=NONE<br />

Example 4-22 shows an extract from the output of llstatus -b -l on a 20-rack

Blue Gene/L system.<br />

Note: Because the output is very long, only part of it is captured in this example.

Example 4-22 The llstatus -b -l command on a 20-rack Blue Gene/L system<br />

thanhlam@bgwfen1:~> llstatus -b -l

Total Blue Gene Base Partitions 40<br />

Total Blue Gene Compute Nodes 20480<br />

Machine Size in Base Partitons X=5 Y=4 Z=2<br />

Machine Size in Compute Nodes X=40 Y=32 Z=16<br />

-- list of base partitions --<br />

Z = 1<br />

=====<br />

+----------------------------------------------------------------+<br />

| R011| R110| R311| R411| R210|<br />

3 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR21R31| wR411| wR21R31|<br />

|----------------------------------------------------------------|<br />

| R031| R130| R331| R431| R230|<br />

2 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR33| wR431| wR23|<br />

|----------------------------------------------------------------|<br />

| R021| R120| R321| R421| R220|<br />

1 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR22R32| wR421| wR22R32|<br />

|----------------------------------------------------------------|<br />

| R001| R100| R301| R401| R200|<br />

0 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR20R30| wR401| wR20R30|<br />

+----------------------------------------------------------------+<br />



Z = 0<br />

=====<br />

+----------------------------------------------------------------+<br />

| R010| R111| R310| R410| R211|<br />

3 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR21R31| wR410| wR21R31|<br />

|----------------------------------------------------------------|<br />

| R030| R131| R330| R430| R231|<br />

2 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR33| wR430| wR23|<br />

|----------------------------------------------------------------|<br />

| R020| R121| R320| R420| R221|<br />

1 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR22R32| wR420| wR22R32|<br />

|----------------------------------------------------------------|<br />

| R000| R101| R300| R400| R201|<br />

0 | READY| READY| READY| READY| READY|<br />

| wR0| wR1| wR20R30| wR400| wR20R30|<br />

+----------------------------------------------------------------+<br />

-- list of switches --<br />

........ >>>>> Omitted lines


LoadLeveler run queue<br />

LoadLeveler is based on a job queuing principle. To identify a particular job that a<br />

user has submitted, a job identification (job ID) is returned when the job is sent<br />

successfully by the llsubmit command, as shown in Example 4-24.<br />

Example 4-24 A job ID returned by the llsubmit command<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.53" has been submitted.<br />

loadl@bglfen1:/bgl/loadl> llq<br />

Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.53.0             loadl      4/3 09:32   I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

In this example, after the job is submitted successfully, the job number 53 shows<br />

up in the queue. The job ID bglfen1.53.0 is a unique identifier for this job. The<br />

number zero (0) is the job step identifier (jobstepid) in case of a job that contains<br />

more than one step.<br />

Note: The job ID that is returned by llsubmit has the long host name with the<br />

domain name, but the queue displays only the short host name with the job ID<br />

because of the limited number of characters that can fit on a line of the screen.<br />

However, LoadLeveler log files include the job ID with long host names<br />

(FQDN).<br />

In a queue that has hundreds or perhaps thousands of jobs, you can filter the llq<br />

command output with the job ID.<br />
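For example, you can pass the job ID directly to llq so that only that job is displayed (a simple filtering sketch; grep on the full listing works as well):

# Show only the job submitted from bglfen1 with ID 53.
llq bglfen1.53
# Alternatively, filter the complete queue listing.
llq | grep bglfen1.53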

The job ID is also used in other LoadLeveler commands such as llcancel. The<br />

llcancel command tells LoadLeveler to terminate the job if it is running and to<br />

remove it from the queue. For example:<br />

llcancel bglfen1.53<br />



You can use the llq -l command to get more information about a job.<br />

Another useful flag is -s, used as llq -s <job ID>. If a job is in idle (I) state,

using the -s flag tells the llq command to analyze and to display the reasons<br />

that the job cannot run at the moment (see Example 4-25).<br />

Example 4-25 Reasons for job in idle state<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.54" has been submitted.<br />

loadl@bglfen1:/bgl/loadl> llq<br />

Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.54.0             loadl      4/3 10:10   I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted<br />

loadl@bglfen1:/bgl/loadl> llq -s bglfen1.54.0<br />

=============== Job Step bglfen1.itso.ibm.com.54.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.54.0<br />

Job Name: bglfen1.itso.ibm.com.54<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: loadl<br />

Queue Date: Mon 03 Apr 2006 10:10:20 AM EDT<br />

Status: Idle<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

Submitting Cluster:<br />

Sending Cluster:<br />

Requested Cluster:<br />

Schedd History:<br />

Outbound Schedds:<br />

Submitting User:<br />

Execution Factor: 1<br />

Dispatch Time:<br />

Completion Date:<br />

Completion Code:<br />

Favored Job: No<br />

User Priority: 50<br />

user_sysprio: 0<br />

class_sysprio: 0<br />

group_sysprio: 0<br />

System Priority: -157448<br />

q_sysprio: -157448<br />



Previous q_sysprio: 0<br />

Notifications: Error<br />

Virtual Image Size: 472 kb<br />

Large Page: N<br />

Checkpointable: no<br />

Ckpt Start Time:<br />

Good Ckpt Time/Date:<br />

Ckpt Elapse Time: 0 seconds<br />

Fail Ckpt Time/Date:<br />

Ckpt Accum Time: 0 seconds<br />

Checkpoint File:<br />

Ckpt Execute Dir:<br />

Restart From Ckpt: no<br />

Restart Same Nodes: no<br />

Restart: yes<br />

Preemptable: no<br />

Preempt Wait Count: 0<br />

Hold Job Until:<br />

RSet: RSET_NONE<br />

Mcm Affinity Options:<br />

Cmd: /bgl/BlueLight/ppcfloor/bglsys/bin/mpirun<br />

Args: -verbose 2 -exe /bgl/hello/hello.rts<br />

Env:<br />

In: /dev/null<br />

Out: /bgl/loadl/out/hello.bglfen1.54.0.out<br />

Err: /bgl/loadl/out/hello.bglfen1.54.0.err<br />

Initial Working Dir: /bgl/loadl<br />

Dependency:<br />

Resources:<br />

Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />

Preferences:<br />

Step Type: Blue Gene<br />

Size Requested: 128<br />

Size Allocated:<br />

Shape Requested:<br />

Shape Allocated:<br />

Wiring Requested: MESH<br />

Wiring Allocated:<br />

Rotate: True<br />

Blue Gene Status:<br />

Blue Gene Job Id:<br />

Partition Requested:<br />

Partition Allocated:<br />

Error Text:<br />

Node Usage: shared<br />



Submitting Host: bglfen1.itso.ibm.com<br />

Schedd Host: bglfen1.itso.ibm.com<br />

Job Queue Key:<br />

Notify User: loadl<br />

Shell: /bin/bash<br />

LoadLeveler <strong>Group</strong>: No_<strong>Group</strong><br />

Class: small<br />

Ckpt Hard Limit: undefined<br />

Ckpt Soft Limit: undefined<br />

Cpu Hard Limit: undefined<br />

Cpu Soft Limit: undefined<br />

Data Hard Limit: undefined<br />

Data Soft Limit: undefined<br />

Core Hard Limit: undefined<br />

Core Soft Limit: undefined<br />

File Hard Limit: undefined<br />

File Soft Limit: undefined<br />

Stack Hard Limit: undefined<br />

Stack Soft Limit: undefined<br />

Rss Hard Limit: undefined<br />

Rss Soft Limit: undefined<br />

Step Cpu Hard Limit: undefined<br />

Step Cpu Soft Limit: undefined<br />

Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />

Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />

Comment:<br />

Account:<br />

Unix <strong>Group</strong>: loadl<br />

NQS Submit Queue:<br />

NQS Query Queues:<br />

Negotiator Messages:<br />

Bulk Transfer: No<br />

Step Adapter Memory: 0 bytes<br />

Adapter Requirement:<br />

Step Cpus: 0<br />

Step Virtual Memory: 0.000 mb<br />

Step Real Memory: 0.000 mb<br />

================= EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.54.0 ================<br />

Not enough resources to start now:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />



Not enough resources for this step as top-dog:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Not enough resources to start now:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Not enough resources for this step as top-dog:<br />

Quarter "Q1" of BP "R000" is busy.<br />

Quarter "Q2" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q3" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Quarter "Q4" of BP "R000" does not have a nodecard with > 0 io_nodes.<br />

Blue Gene/L specific information for the run queue<br />

The LoadLeveler job ID is also stored in the Blue Gene/L database with the job<br />

record. You can use the DB2 select command to retrieve job information, as<br />

shown in Example 4-26.<br />

Example 4-26 Retrieving jobid from DB2 database<br />

bglsn:~ # db2 "select jobid,blockid,jobname from tbgljob_history where<br />

username='loadl'"<br />

JOBID BLOCKID JOBNAME<br />

----------- ---------------- ------------------------------------------------<br />

184 RMP22Mr154151027 bglsn.itso.ibm.com.6.0<br />

183 RMP22Mr153314019 bglsn.itso.ibm.com.5.0<br />

185 RMP22Mr160740038 bglsn.itso.ibm.com.7.0<br />

202 RMP24Mr135201017 bglfen1.itso.ibm.com.18.0<br />

203 RMP24Mr151425028 bglfen1.itso.ibm.com.19.0<br />

229 R000_128 mpirun.2954.bglfen1<br />

230 R000_128 mpirun.3140.bglfen1<br />

255 RMP03Ap112830043 bglfen1.itso.ibm.com.53.0<br />

8 record(s) selected.<br />
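If you need to map a single LoadLeveler job step back to its Blue Gene job record, you can narrow the same query. The following is a minimal sketch that reuses the tbgljob_history table and columns shown in Example 4-26; the job step name is the one from the output above and will differ on your system.

bglsn:~ # db2 "select jobid,blockid from tbgljob_history where jobname='bglfen1.itso.ibm.com.53.0'"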

Note: The dynamic conditions of the Blue Gene/L partitions and the LoadLeveler nodes make it difficult for llq to diagnose all possible problems in such an environment. However, the reasons that are displayed can serve as hints and starting points for investigating live problems on the system.



Note: The JOBID column in the Blue Gene/L database is the Blue Gene job identifier. The LoadLeveler job ID is stored in the JOBNAME column of the database table.

You can view the same job information from the Blue Gene/L Web interface.<br />

From the Blue Gene/L home page, click Runtime and then Job Information<br />

(see Figure 4-24).<br />

Figure 4-24 Job information from the Web interface<br />

Job command file<br />

To submit a job using LoadLeveler, a job command file is required by the<br />

llsubmit command. We have used a sample job command file (Example 4-16 on<br />

page 178) throughout the discussions in this chapter. We kept the number of<br />

keywords in this file to a minimum. There are many other keywords available.<br />

See IBM LoadLeveler Using and Administering Guide, SA22-7881, for a complete reference on job command file keywords.



Note: In the job command file, each keyword starts on a new line that begins with two special characters, the number sign (#) and the “at” sign (@). The number sign (#) is the same character that starts a comment in shell scripting (bash). If job_type = serial, LoadLeveler executes the job command file as though it were a shell script. The @ character tells LoadLeveler that the line contains a job command keyword to be evaluated; the shell does not interpret the line because it is considered a comment.

The following list briefly describes each keyword in the file and, where possible, provides hints about what could go wrong with it:

► #@ job_type<br />

LoadLeveler supports three basic types of jobs:

– serial<br />

– parallel<br />

– bluegene<br />

In this book, we only discuss the bluegene job type.<br />

► #@ job_type = bluegene<br />

This line is mandatory in the job command file. Without this keyword,<br />

LoadLeveler does not understand other Blue Gene/L related keywords. In<br />

fact, this keyword tells LoadLeveler to use the bridge API to exchange<br />

information with the Blue Gene/L system. Example 4-27 shows the llsubmit<br />

command returning an error message when the job command file does not<br />

contain the keyword #@ job_type = bluegene.<br />

Example 4-27 Missing job_type in job command file<br />

loadl@bglfen1:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: 2512-585 The "bg_size" keyword is only valid for "job_type =<br />

BLUEGENE" job steps.<br />

llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />

► #@ executable = /usr/bin/mpirun<br />

This directive tells LoadLeveler that the executable for a Blue Gene/L job is<br />

the mpirun command, which is invoked by LoadLeveler to launch an MPI job<br />

on the Blue Gene/L nodes. If this keyword is missing or points to a different<br />

executable, LoadLeveler cannot find the mpirun program (and fails to submit<br />

the job).<br />



► #@ arguments<br />

This keyword contains the arguments that are passed to mpirun. They must<br />

be typed here exactly as they are entered on the mpirun command line. For<br />

example, to run the “Hello world!” program with mpirun on the command line,<br />

we used the following syntax:<br />

mpirun -exe /bglscratch/test1/hello-file.rts -args 10 -verbose 1<br />

In the LoadLeveler job command file, this line is translated to:<br />

#@ executable = /usr/bin/mpirun<br />

#@ arguments = -exe /bglscratch/test1/hello-file.rts -args 10<br />

-verbose 1<br />

Note: LoadLeveler does not validate the syntax of the command string that is passed to mpirun. If there is a problem with an argument or value, mpirun returns a message in the job’s stderr(2).

► #@ environment<br />

This keyword passes any environment variables that the job needs to set<br />

when it is running. The reserved word COPY_ALL copies all the user’s shell<br />

environment variables for the job (as displayed by the commands set or env).<br />

► #@ output and #@ error<br />

These two keywords contain the directory (directories) where the job’s<br />

stdout(1) and stderr(2) are sent. If these two keywords are not specified in<br />

the job command file, LoadLeveler redirects the job’s stderr(2) and<br />

stdout(1) to /dev/null.<br />

Note: These directories have to be writable for the user ID that runs the job. If<br />

the directories do not exist or are not accessible, the job is rejected by<br />

LoadLeveler.<br />

► #@ notification<br />

This keyword consists of a reserved word that indicates when LoadLeveler should notify the user ID specified in the #@ notify_user keyword.

► #@ notify_user<br />

This keyword contains the user ID that is going to receive LoadLeveler’s<br />

notification in case the notification condition is set (#@ notification).<br />

► #@ class<br />

Depending on how the LoadLeveler cluster is configured, the job can choose to run under a LoadLeveler class.



Note: If the class that is specified in the job command file is not defined in the LoadLeveler configuration, the job remains idle (I) in the queue. See “LoadLeveler configuration keywords” on page 205 for information about how to identify class definitions in the LoadLeveler configuration.

► #@ queue<br />

This keyword is usually the last keyword in a job command file. It is not set to<br />

any value. The command llsubmit returns an error if the #@ queue line is<br />

missing from the job command file (as shown in Example 4-28).<br />

Example 4-28 The llsubmit error when missing #@queue keyword<br />

loadl@bglsn:/bgl/loadl> llsubmit hello.cmd<br />

llsubmit: 2512-058 The command file "hello.cmd" does not contain any<br />

"queue" keywords.<br />

llsubmit: 2512-051 This job has not been submitted to LoadLeveler.<br />

Blue Gene/L specific keywords<br />

The following list provides the keywords that are specific to Blue Gene/L:<br />

► #@ bg_size<br />

This keyword contains an integer that specifies the number of compute nodes<br />

that are required for this job to run. This is equivalent to the argument -np<br />

for mpirun.<br />

► #@ bg_shape<br />

This keyword contains the reserved word, mesh or torus. It is equivalent to the<br />

argument flag -shape for mpirun.<br />

► #@ bg_partition<br />

This keyword can be set to a partition name. The partition has to be<br />

predefined. It is equivalent to the argument flag -partition for mpirun.<br />

Note: Use one of these keywords in the job command file instead of passing the mpirun argument flags -np, -shape, or -partition in the #@ arguments keyword. When mpirun receives the job information from LoadLeveler, it either creates a partition with the specified number of compute nodes and shape, or selects the specified predefined partition.

When debugging problems for jobs with a complex command file, start with a simple file as described in this section (a minimal sketch of such a file follows this paragraph). Make sure that the job can run with this file. Use the “Hello world!” program if necessary. Then, add keywords gradually to the same job command file until the problem is observed.
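The following is a minimal sketch of such a simple job command file, assembled from the keywords discussed in this section. The executable and scratch paths are the ones used in the examples in this chapter, and the class name and bg_size value are only illustrative; adjust them to your configuration.

#@ job_type = bluegene
#@ executable = /usr/bin/mpirun
#@ arguments = -exe /bglscratch/test1/hello-file.rts -args 10 -verbose 1
#@ output = /bgl/loadl/hello.out
#@ error = /bgl/loadl/hello.err
#@ notification = never
#@ class = small
#@ bg_size = 32
#@ queue

Submit the file with llsubmit hello.cmd and monitor it with llq. If this simple job runs, the basic environment is sound and the problem is likely in one of the keywords you add afterwards.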



LoadLeveler processes, logs, and persistent storage<br />

As discussed in 4.4.1, “LoadLeveler overview” on page 167, the cluster is<br />

managed and run by different LoadLeveler daemons. Figure 4-8 on page 168<br />

shows all the LoadLeveler daemons running in the background on a node.<br />

However, LoadLeveler is designed in such a way that not all daemons run on every node in the cluster. It is normal to have only a subset of the daemons running on each node.

In order to know for sure which LoadLeveler daemons are running, you can use<br />

the ps -ef command (filtered for LoadLeveler processes), as shown in<br />

Example 4-29. In our case, the Negotiator daemon (or CM) and the Master<br />

daemon are running on the SN.<br />

Example 4-29 LoadLeveler daemons running on Blue Gene/L SN<br />

loadl@bglsn:/bgl/loadl> ps -ef | grep LoadL<br />

loadl 27425 1 0 Apr01 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 27436 27425 0 Apr01 ? 00:03:17 LoadL_negotiator -f -c<br />

/tmp -C /tmp<br />

loadl 14892 11456 0 15:43 pts/34 00:00:00 grep LoadL<br />

Running the same command on an FEN shows the Master daemon, the<br />

Scheduler daemon, the Start daemon, and the Starter daemon running, as<br />

shown in Example 4-30.<br />

Example 4-30 LoadLeveler daemons running on Blue Gene/L FEN<br />

loadl@bglfen1:~> ps -ef | grep LoadL<br />

loadl 18931 1 0 Apr01 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 18940 18931 0 Apr01 ? 00:00:00 LoadL_schedd -f -c /tmp<br />

-C /tmp<br />

loadl 18941 18931 0 Apr01 ? 00:04:48 LoadL_startd -f -C /tmp<br />

-c /tmp<br />

loadl 820 18941 0 08:51 ? 00:00:00 LoadL_starter -p 131 -c<br />

/tmp -C /tmp<br />

loadl 1950 1891 0 13:45 pts/6 00:00:00 grep LoadL<br />



The two previous examples match the LoadLeveler cluster configuration as shown by the llstatus command in Example 4-31.

Example 4-31 Matching daemons with llstatus command<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 395 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.03 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

This configuration is defined in the LoadLeveler configuration files, which are<br />

described in “LoadLeveler configuration keywords” on page 205.<br />

The most important daemon is the Negotiator. If the Negotiator cannot start,<br />

LoadLeveler commands such as llstatus and llq will not work. To determine<br />

which node is supposed to be the CM, check the LoadL_admin file and look for<br />

the following line:<br />

central_manager = true<br />

Example 4-32 shows the lines extracted from the file LoadL_admin, in which CM<br />

is defined on node bglsn.itso.ibm.com.<br />

Example 4-32 CM defined in LoadL_admin<br />

bglsn.itso.ibm.com: type = machine<br />

schedd_host = true<br />

central_manager = true<br />
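A quick way to find the designated CM without opening the file in an editor is to search LoadL_admin directly. This is a minimal sketch; it assumes that LoadL_admin resides in the same /bgl/loadlcfg directory as LoadL_config (adjust the path for your installation). The -B 2 option shows the machine stanza that precedes the matching line.

loadl@bglsn:/bgl/loadlcfg> grep -B 2 "central_manager" LoadL_admin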

Note: In a common LoadLeveler configuration, there can be other nodes defined with central_manager = alt (these are alternate CMs). One of these nodes takes over the role of the CM if the designated CM fails. However, an alternate CM is not supported in a Blue Gene/L environment.



You can configure the LoadLeveler daemons to be respawned in case of abnormal termination. Therefore, to diagnose problems that happened in the past, you have to investigate each daemon’s log file. The log file names and locations are defined in the LoadL_config file, as shown in Example 4-33.

Example 4-33 Log file names and locations<br />

loadl@bglfen1:/bgl/loadlcfg> grep -i _log LoadL_config | grep -v MAX<br />

KBDD_LOG = $(LOG)/KbdLog<br />

STARTD_LOG = $(LOG)/StartLog<br />

SCHEDD_LOG = $(LOG)/SchedLog<br />

NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />

GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />

STARTER_LOG = $(LOG)/StarterLog<br />

MASTER_LOG = $(LOG)/MasterLog<br />

Also in LoadL_config, the $(LOG) variable can be defined as:
LOG = $(tilde)/log
where $(tilde) is the home directory of the LoadLeveler administrator or the user ID that starts LoadLeveler.
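To see where the logs actually are on a given node, you can list the resolved log directory and inspect the most recent entries. This is a minimal sketch that assumes the LoadLeveler home directory is /bgl/loadl, as suggested by the command prompts used in this chapter; adjust the path to your installation.

loadl@bglsn:~> ls -l /bgl/loadl/log
loadl@bglsn:~> tail -n 50 /bgl/loadl/log/NegotiatorLog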

If the log file names and locations are not defined in LoadL_config, they are set<br />

to the location that is specified when the command llinit is issued. The syntax<br />

of the llinit command is:<br />

llinit -local /home/loadl<br />

where the option flag -local specifies the location where the following three<br />

directories are created:<br />

► log<br />

► execute<br />

► spool<br />

In addition to the log directory, the llinit command also creates two other<br />

directories: execute and spool. These directories serve as persistent storage for<br />

job data and history information. Therefore, if LoadLeveler is stopped with jobs in the queue, job data and information are saved for the next time that LoadLeveler is started.

Depending on the state of a job at the time LoadLeveler stops, it can be<br />

restarted, resumed, or started when LoadLeveler starts next time.<br />

Note: For complete information regarding LoadLeveler processes and logs, see IBM LoadLeveler Using and Administering Guide, SA22-7881.



LoadLeveler configuration keywords<br />

The LoadL_config file includes the global LoadLeveler configuration keywords.<br />

To determine the location of this file, check the contents of the /etc/LoadL.cfg file,<br />

which contains the basic LoadLeveler configuration. Example 4-34 shows the<br />

contents of this file.<br />

Example 4-34 The contents of the /etc/LoadL.cfg file<br />

loadl@bglsn:~/log.save> cat /etc/LoadL.cfg<br />

LoadLUserid = loadl<br />

LoadLGroupid = loadl

LoadLConfig = /bgl/loadlcfg/LoadL_config<br />

This file resides on a common file system so that you can access it from any<br />

node in the cluster (that NFS mounts the /bgl file system). The most important<br />

Blue Gene/L keywords in the LoadL_config file are described in 4.4.4,<br />

“Configuring LoadLeveler for Blue Gene/L” on page 172. Some of the configuration keywords that define the log file names and locations are also discussed in “LoadLeveler processes, logs, and persistent storage” on page 202.

In addition to the global configuration file, LoadLeveler also uses a local<br />

configuration file that resides on a local file system on each node. This is<br />

specified in the global configuration file with the keyword LOCAL_CONFIG, as follows:
LOCAL_CONFIG = $(tilde)/LoadL_config.local
This local configuration file is needed in case the system administrator wants different nodes to have different configurations for the following:

► LoadLeveler daemons running<br />

► Job classes<br />

► Number of Starters<br />

To specify these parameters, the following keywords are used:<br />

► SCHEDD_RUNS_HERE<br />

If this keyword is set to FALSE, LoadLeveler does not start the Scheduler daemon on this node. If the majority of the nodes in the cluster are defined to run the Scheduler, you can set this keyword to TRUE in the global configuration file and then set it to FALSE in the local configuration file of each node that does not need to run the Scheduler.

Note: The setup value in the local configuration file overrides the one in the<br />

global configuration file.<br />



► STARTD_RUNS_HERE<br />

This keyword specifies whether LoadLeveler should start the Start daemon<br />

on the local node.<br />

Note: It is usually not desirable to run Scheduler and Start daemon on the<br />

Blue Gene/L SN.<br />

► CLASS<br />

To control the types of jobs that run on particular nodes, you can specify the CLASS keyword either in the global or in the local configuration file. For example:
CLASS = small(8) medium(5) large(2)
Unless a default class is defined in the LoadL_admin file, a job has to specify the keyword #@ class in its job command file to be able to run. The keyword is set to one of the class names. See Example 4-16 on page 178 for the use of the keyword #@ class.

► MAX_STARTERS<br />

This configuration keyword sets the maximum number of jobs that can run on the local node. Set it to a value that matches the capacity of the node. A sketch that combines these local configuration keywords follows this list.
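The following is a minimal sketch of what a LoadL_config.local file for an FEN might contain, combining the keywords described above. The values and class names are only illustrative and must match your LoadL_admin file and global configuration.

# LoadL_config.local on an FEN (sketch): run the Schedd and Startd here,
# allow up to two simultaneous starters, and accept small and medium jobs
SCHEDD_RUNS_HERE = TRUE
STARTD_RUNS_HERE = TRUE
MAX_STARTERS = 2
CLASS = small(2) medium(1)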

Blue Gene/L specific configuration keywords<br />

As discussed in 4.4.4, “Configuring LoadLeveler for Blue Gene/L” on page 172, there are special configuration keywords that enable Blue Gene/L functionality in LoadLeveler. It is also recommended that the Scheduler and Start daemons do not run on the Service Node. See 4.4.2, “Principles of operation in a Blue Gene/L environment” on page 170.

Environment variables, network, and library links<br />

This section explains the variables that are critical to the Blue Gene/L LoadLeveler software environment (running jobs).

Environment variables<br />

In 4.4.6, “Setting Blue Gene/L specific environment variables” on page 175, we<br />

discuss the environment variables that you need to set up for a user ID to start<br />

LoadLeveler. A simple way to check these variables in a UNIX shell is to issue<br />

the echo command, as shown in Example 4-35.<br />



Example 4-35 Checking environment variables<br />

loadl@bglfen1:/bgl/loadlcfg> echo $BRIDGE_CONFIG_FILE<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />

loadl@bglfen1:/bgl/loadlcfg> echo $DB_PROPERTY<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

loadl@bglfen1:/bgl/loadlcfg> echo $MMCS_SERVER_IP<br />

bglsn.itso.ibm.com<br />

If these variables are not set up correctly for the LoadLeveler administrator user ID, LoadLeveler will not start. Other users need to set up these variables once before they can submit jobs. However, note the following:
► The location of the configuration files to which these variables point can be changed for individual users.
► The contents of the configuration files for individual users can also be changed for various needs.

Another common variable is the PATH ($PATH). To access the LoadLeveler commands, users should have the directory where the LoadLeveler binaries are installed in their $PATH. For example:
/opt/ibmll/LoadL/full/bin/
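For example, a user could append that directory to the PATH in the shell profile; a minimal sketch for bash follows (which profile file to edit depends on your environment):

export PATH=$PATH:/opt/ibmll/LoadL/full/bin
which llq        # should now resolve to /opt/ibmll/LoadL/full/bin/llq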

Network and communications<br />

Because LoadLeveler is a cluster-managed subsystem, network communication between nodes is vital to its operations. The daemons use sockets to communicate with each other. Basic TCP/IP and sockets knowledge is helpful in problem determination.

The global LoadL_config file defines the port numbers that the daemons use, as<br />

shown in Example 4-36.<br />

Example 4-36 Network ports defined for LoadLeveler daemons<br />

# Specify port numbers<br />

MASTER_STREAM_PORT = 9616<br />

NEGOTIATOR_STREAM_PORT = 9614<br />

SCHEDD_STREAM_PORT = 9605<br />

STARTD_STREAM_PORT = 9611<br />

COLLECTOR_DGRAM_PORT = 9613<br />

STARTD_DGRAM_PORT = 9615<br />

MASTER_DGRAM_PORT = 9617<br />



For example, knowing how sockets work helps when a socket is closed abruptly: the port can remain unavailable for a certain time until all the traffic has quiesced. This situation might occur when you stop and restart LoadLeveler very quickly, without waiting for a short while (about 1 minute). Example 4-37 shows the NegotiatorLog, which includes messages indicating that the Negotiator cannot start and has to wait on port 9614.

Example 4-37 Negotiator daemon waiting on port 9614<br />

03/19 16:54:30 TI-1 *************************************************<br />

03/19 16:54:30 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

03/19 16:54:30 TI-1 *************************************************<br />

03/19 16:54:30 TI-1<br />

03/19 16:54:30 TI-1 LoadLeveler: LoadL_negotiator started, pid = 14176<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />

machine.<br />

03/19 16:54:30 TI-5 LoadLeveler: Delaying 1 seconds and retrying ...<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

03/19 16:54:30 TI-5 LoadLeveler: Batch service may already be running on this<br />

machine.<br />

03/19 16:54:30 TI-5 LoadLeveler: Delaying 2 seconds and retrying ...<br />

03/19 16:54:30 TI-5 LoadLeveler: 2539-479 Cannot listen on port 9614 for service<br />

LoadL_negotiator.<br />

One way to alleviate this problem is to set the TCP recycle attribute with the following command:
/sbin/sysctl -w net.ipv4.tcp_tw_recycle=1

The netstat command is also helpful for understanding the status of the sockets or ports. For example:
netstat -an | grep 9614
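Because all the daemon ports are defined in LoadL_config (Example 4-36), you can check them in one pass. A minimal sketch, assuming the port numbers shown above:

for port in 9605 9611 9613 9614 9615 9616 9617; do
    echo "== port $port =="
    netstat -an | grep ":$port "
done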

Library links<br />

Libraries are another vital resource that LoadLeveler needs to run. As shown in some scenarios, bad or broken links can cause problems with running LoadLeveler and submitting jobs. Verifying the links can be useful as a last resort when everything else fails to find the problem, for the following reasons:
► Links can be removed or broken.
► Libraries can be updated.
► A link can point to a different library for various users.



Blue Gene/L specific links<br />

Library links are used extensively with LoadLeveler setup as described in 4.4.5,<br />

“Making the Blue Gene/L libraries available to LoadLeveler” on page 173. When<br />

in doubt, check and validate all the library links. You can use a script similar to<br />

that shown in Example 4-14 on page 175.<br />

4.4.10 Updating LoadLeveler in a Blue Gene/L environment<br />

In a Blue Gene/L system, LoadLeveler is not part of the Blue Gene/L code<br />

distribution. You can update LoadLeveler code separately on the SN and FENs.<br />

Consider the following recommendations:<br />

► Check the code levels on all the nodes in the cluster to make sure that there<br />

are no version or level mismatches.<br />

► If in doubt, check the libraries and their symbolic links.<br />

► Note that some installation scripts copy the library binaries rather than creating a symbolic link. In this case, the checksum command can help validate the binary files (a small sketch follows this list).
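For example, you can compare the checksum of a copied library on an FEN against the version on the SN over ssh. This is a minimal sketch; the library file name and directory are only an illustration of the idea, so use the libraries and links that your installation actually references.

# on the FEN: checksum the local copy and the copy on the SN, then compare the values
cksum /opt/ibmll/LoadL/full/lib/libllapi.so
ssh bglsn cksum /opt/ibmll/LoadL/full/lib/libllapi.so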





Chapter 5. File systems<br />


This chapter provides an understanding of the steps that are required to fix problems with the file systems (persistent storage) that are supported on Blue Gene/L. Currently, both NFS and GPFS file systems are supported for use with Blue Gene/L, and in this chapter we discuss problem determination for both of these file system types.

For each of the file systems supported, we begin with a general introduction, and<br />

then we describe how the file system plugs in to Blue Gene/L. We discuss both<br />

the concepts and the steps that are required to configure the file systems. For<br />

each file system type, we then present a problem determination methodology<br />

that recommends a specific sequence of checks, including a checklist of steps<br />

that show how to make each of the checks, along with an explanation and<br />

suggested commands.<br />



5.1 NFS and GPFS<br />

In a basic configuration, Blue Gene/L requires an NFS file system, regardless of whether a GPFS file system is also used. Native access to a GPFS file system from a Blue Gene/L I/O node requires an available NFS file system so that the GPFS code on the I/O node can read the centrally held GPFS configuration files during node startup.

NFS is the most convenient to set up, because most operating systems provide the facilities to both create and mount an NFS file system. GPFS provides a more scalable solution for those configurations where high performance and large file system storage are needed (requirements higher than NFS can provide). You can configure GPFS with multiple storage server nodes that work together to provide aggregated performance (unlike NFS, where all the storage that belongs to an NFS file system must be attached physically to a single node).

GPFS currently supports a file size of up to 200 TB. This size limit is not the actual architectural limit (which in GPFS 2.3 is 2^99 bytes); rather, it is the limit of the configurations that have been tested.

Both NFS and GPFS file systems must be mounted on the I/O node<br />

automatically during the node boot process. This mounting is essential because<br />

each time a job is run, a new block might have to be allocated and, therefore, all the I/O nodes that belong to this block will be booted. To be able to run a job immediately, the file system must be available.

Understanding the I/O node boot sequence is key to understanding problem<br />

determination for Blue Gene/L file systems.<br />

5.1.1 I/O node boot sequence<br />

The IBM proprietary Compute Node Kernel (CNK) that runs on the compute node is a single-user, single-process runtime environment that has no paging mechanism. The compute node can communicate with the outside world only through the I/O node, and any executable program that runs on the compute node must be loaded from the I/O node through the Blue Gene/L internal collective network.

Depending on the configuration, a Blue Gene/L system includes a number of I/O<br />

nodes. The I/O node runs the Mini-Control Program (MCP), which is a cut-down Linux OS that runs a 32-bit PPC 2.4 uniprocessor kernel. The Compute Node I/O

daemon (ciod) that is loaded during MCP initialization and runs on the I/O node<br />

is responsible for handling the I/O calls made by the Compute nodes. The MCP,<br />

unlike the Compute Node Kernel, supports TCP/IP communication programs,<br />



such as NFS, ping, and other I/O related system functions that help with problem<br />

determination.<br />

Steps in the I/O node boot sequence<br />

Note: The variables that we use in this section are set during the I/O node<br />

system init (rc.sysinit) when it runs the /etc/rc.dist script that is built into the<br />

RAM disk image. Here are the contents of this file on our system:<br />

export<br />

BGL_DISTDIR="/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/dist"<br />

export BGL_SITEDISTDIR="/bgl/dist"<br />

export BGL_OSDIR="/bgl/OS/4.1"<br />

The steps of the I/O node boot sequence that we discuss in detail further in this<br />

section are:<br />

1. MCP kernel and ramdisk are loaded over the service network.<br />

2. The MCP launches the /sbin/init process from the ramdisk image, which reads /etc/inittab. The system init rule in inittab is coded to run /etc/rc.d/rc.sysinit.

3. The rc.sysinit is invoked from within the MCP ramdisk image. (You can find a<br />

copy of this file in the /bgl/BlueLight/ppcfloor/dist/etc/rc.d directory.) This<br />

script attempts to do the following:<br />

a. NFS mounts the /bgl directory from the Service Node (SN), or the directory that is defined by the BGL_EXPORTDIR variable, if that variable is set.

b. Runs the /etc/rc.dist script from the ramdisk image.<br />

4. The rc.sysinit2 is next invoked from the NFS mounted directory<br />

(/bgl/BlueLight/ppcfloor/dist/etc/rc.d) and does the following:<br />

a. Replaces empty /lib with symbolic link to $BGL_OSDIR/lib.<br />

b. Replaces empty /usr with symbolic link to $BGL_OSDIR/usr.<br />

c. Replaces empty /etc/rc.d/rc3.d with symbolic link under $BGL_DISTDIR.<br />

d. Loads the collective/tree network device drivers.<br />

e. Runs $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />

f. Runs $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts.<br />

The scripts that are run by default by these start scripts are found in the<br />

/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d directory and are listed below<br />

along with the jobs that they perform:<br />

nfs Starts the portmap daemon.<br />

xntpd Starts the network time protocol daemon.<br />



sshd (optional) Starts the secure shell daemon if required. This<br />

occurs if either GPFS_STARTUP=1 is set in<br />

/etc/sysconfig/gpfs on the I/O node OR the<br />

/etc/sysconfig/ssh is found.<br />

You can find this file on the SN as<br />

/bgl/BlueLight/ppcfloor/dist/etc/sysconfig/ssh<br />

gpfs Starts and mounts GPFS file systems if<br />

GPFS_STARTUP=1 is set in /etc/sysconfig/gpfs.<br />

syslog Starts syslog services.<br />

ciod Starts ciod.<br />

ibmcmp Starts the XL Compiler Environment for the I/O node.
g. Runs the $BGL_SITEDISTDIR/etc/rc.local script.

5. As each I/O node completes its MCP boot process, it looks for additional<br />

scripts to run. These additional scripts can be found in two separate<br />

directories documented in the following paragraphs.<br />

5.1.2 Additional scripts in I/O node boot sequence<br />

You can save scripts that you want invoked during the I/O node boot sequence in<br />

either of the following two directories:<br />

► /bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d - (installation dist directory)<br />

Warning: The /bgl/BlueLight/ppcfloor/ppc/dist/etc/rc.d/rc3.d directory is<br />

part of the installed software. Its contents are lost when you install a new<br />

driver or release.<br />

► /bgl/dist/etc/rc.d/rc3.d - (site dist directory)<br />

To be considered, the script’s file name must begin with the uppercase letter S<br />

(for start) or K (for kill), followed by two decimal digits (for example, S10mynfs,<br />

K10mynfs, and so forth) and a relevant name for the service.<br />

Note: The general rule is that the scripts starting with S are run at system init, and the scripts starting with K are run when the system is shut down.

At system initialization, the list of scripts starting with S is sorted by the subsequent two digits, which specify the relative order in which the I/O node runs the scripts. In a similar way, the kill scripts that start with the letter K are used when the I/O node is shut down as the associated block is freed. The scripts in both directories are sorted into a single list and then run one at a time in that order.




Warning: If a start script in the site dist directory has the same name as a<br />

start script in the installation dist directory, only the script in the installation dist<br />

directory is run.<br />

Let us assume that you have a script named /bgl/dist/etc/rc.d/rc3.d/S10mynfs<br />

that mounts additional file systems that contain your data. Because the script<br />

name begins with S10, it runs before S50ciod, which starts the ciod, and after<br />

S05nfs, which starts the port mapper. This sequence is correct, because your file<br />

systems are mounted before jobs can be started and after NFS is already<br />

running.<br />
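A minimal sketch of what such a hypothetical S10mynfs script could look like is shown below. The server address matches the file server used elsewhere in this chapter, but the export name and mount point are placeholders, and a production script should add retries and error handling as the sitefs example later in this chapter (Example 5-4) does.

#!/bin/sh
# /bgl/dist/etc/rc.d/rc3.d/S10mynfs (hypothetical example)
# Mount an additional NFS file system on the I/O node before ciod starts.
case "$1" in
start)
    mkdir -p /mydata
    mount -o rw,tcp,rsize=32768,wsize=32768 172.30.1.33:/mydata /mydata
    ;;
stop)
    umount /mydata
    ;;
esac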

5.2 NFS

Blue Gene/L requires an NFS file system regardless of whether a GPFS file system is also required. The default NFS file system, the /bgl directory, is exported from the SN.

5.2.1 How NFS plugs into a Blue Gene/L system<br />

It is important that the /bgl directory on the SN is NFS exported, because the I/O nodes must be able to mount this file system when a block is booted. While applications can be run from the /bgl directory, it is recommended that the /bgl file system is preserved for the installed Blue Gene/L code and that another file system is used for user applications. Figure 5-1 shows how NFS plugs in to a Blue Gene/L system.

Note: Even though the system can run without an additional NFS server (the SN provides the basic NFS services and file systems), we strongly recommend that you configure additional NFS servers, both to satisfy application performance and storage requirements and to avoid overloading the SN.



Figure 5-1 shows the NFS server (a p5 system with attached storage, 172.30.1.33/16) connected through the functional Ethernet (172.30.0.0/16) to the pSeries Service Node and to the Blue Gene I/O nodes, which act as NFS clients. The callouts in the figure read: 1. Create the NFS server with storage, and create and export the NFS file system (/bglscratch). 2. Export the NFS file system from the server to the Blue Gene functional network. 3. Check that the NFS file system can be mounted over the functional network to the Service Node. 4. Add a command to the sitefs script to mount the NFS file system (/bglscratch) when the I/O nodes boot.

Figure 5-1 Adding an NFS file system to the Blue Gene/L

5.2.2 Adding an NFS file system to the Blue Gene/L system<br />

This section provides an example of how to make a file system available through NFS to the Blue Gene/L system. The file system can be served by any file server that complies with the NFS V3 protocol. The file system is made available through the functional network and must be mounted by the I/O nodes, the SN, and the FENs that are used to compile and execute the jobs.

In our environment, we used an IBM System p 630 running SUSE SLES 9<br />

connected to an IBM DS4500 storage. Because it is outside the scope of this<br />

book, we do not present the basic operating system, storage (file systems), and<br />

networking configuration steps. Instead, we emphasize the steps to make the file<br />

system available to the Blue Gene/L system.<br />


Important: The File Server name in this section refers to either the SN or<br />

another system that is used to host the NFS file system (NFS-FS) that is used<br />

to run user jobs.


Here are the steps that are required to make an NFS file system available for<br />

running jobs on Blue Gene/L:<br />

1. Create the file system (NFS-FS) on the File Server system (172.30.1.33,<br />

p630n03_fn) and mount it on the File Server system.<br />

mount /bglscratch<br />

The File Server system could be the SN, one FEN, or another system.<br />

2. Export the NFS-FS from the File Server.<br />

Set USE_KERNEL_NFSD_NUMBER="64" in /etc/sysconfig/nfs.<br />

Add the following line to /etc/exports and then activate it:<br />

/bglscratch 172.30.0.0/255.255.0.0(rw,no_root_squash,async)

exportfs -a<br />

Now check the export on the FS by issuing the command:<br />

showmount -e<br />

Check that the NFS server is started.<br />

/etc/init.d/nfsserver status<br />

This should return:<br />

Checking for kernel based NFS server: running<br />

3. Check that this NFS-FS can be mounted and accessed on the SN.<br />

On the SN, issue the following command (172.30.1.33 is the File Server IP address on the functional network):

mount 172.30.1.33:/bglscratch /mnt<br />

Check that you can access the file system on the SN:

cd /mnt; touch foo<br />

4. Update the site customization script (sitefs) to enable the NFS-FS to be<br />

mounted when the I/O nodes boot. Then check that a job can access files on<br />

the NFS-FS when run. See sitefs entries in “Step 3 - Check that the NFS-FS<br />

is mounted when the block boots” on page 221.<br />

5.2.3 NFS problem determination methodology<br />

The methodology that we present here is intended to help with a wide variety of<br />

problems. The first sections cover the basics and the later sections cover the<br />

more unlikely and esoteric problem areas. If you think you already know in which<br />

area the problem lies, then we encourage you to go straight to that section.<br />

However, if you are unsure where the problem lies, we suggest that you use the<br />

methodology in the order presented here, because this approach often uncovers<br />

the simplest problems quickly and easily before you spend a long time looking for a solution to a presumed problem rather than the real one.

Check that the NFS-FS can be mounted on the SN<br />

After each step mentioned here, check whether you can mount the NFS-FS:<br />

► Check that the NFS-FS is exported from the Server, as described in “Step 1 -<br />

Check that NFS-FS is exported from the File Server” on page 218.<br />

► Check that you can ping the FS Server over the functional network.<br />

► If you still cannot mount the file system, then check error messages (screen,<br />

console, system log) in /var/log/messages.<br />

Check that the NFS-FS can be mounted on the I/O nodes<br />

Also check whether you can mount the NFS-FS on the I/O nodes:<br />

► First boot a block that uses the I/O node that has the problem.<br />

► Check if the NFS-FS can be mounted on the I/O node, as described in “Step 2<br />

- Check if the NFS-FS can be mounted on the I/O node” on page 220.<br />

► Check that you can ping the File Server’s IP address from the I/O node.<br />

► Check that the NFS-FS is mounted when the block boots, as described in “Step 3 - Check that the NFS-FS is mounted when the block boots” on page 221.

Check network tuning parameters<br />

Network tuning parameters are unlikely to prevent NFS from mounting. However, if you are experiencing performance or intermittent connection problems, this check might help solve the problem. See “Step 4 - Check the network tuning parameters on the SN” on page 223.

5.2.4 NFS checklists

In addition to the problem determination methodology, the following detailed checklists (steps) can aid in NFS problem determination.

Step 1 - Check that NFS-FS is exported from the File Server<br />

The best way to check if a file system is exported is to use the showmount<br />

command. Example 5-1 issues the showmount command on the SN to check that<br />

the /bgl directory is exported from the SN over the functional network.<br />



Example 5-1 Checking NFS exports<br />

bglsn:/tmp # showmount -e<br />

Export list for bglsn:<br />

/bgl 172.30.0.0/255.255.0.0<br />

Example 5-2 uses the showmount command from the SN to check an additional<br />

NFS server (that holds the user application code and data) to see what NFS file<br />

systems can be mounted on the SN (and on I/O nodes).<br />

Example 5-2 Checking exports for additional servers<br />

bglsn:/tmp # showmount -e p630n03_fn<br />

Export list for p630n03_fn:<br />

/nfs_mnt (everyone)<br />

/bglscratch (everyone)<br />

/bglhome (everyone)<br />

If the showmount command returns the following error then the rpc.mountd or nfsd<br />

services are not running:<br />

'mount clntudp_create: RPC: Program not registered'.<br />

To fix this issue, run the following command:<br />

/etc/init.d/nfsserver restart<br />

Another error returned by the showmount command might be the following<br />

message, which means that the portmap service is not running:<br />

'mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />

receive'<br />

To fix this issue run the following:<br />

/etc/init.d/portmap restart<br />

/etc/init.d/nfsserver restart<br />

You can use the following command to check the server:<br />

/etc/init.d/nfsserver status<br />

Checking for kernel based NFS server: running<br />

To check the port mapper service, you can use the following command:<br />

bglsn:/tmp # /etc/init.d/portmap status<br />

Checking for RPC portmap daemon: running<br />

Checking for kernel based NFS server: running<br />



Step 2 - Check if the NFS-FS can be mounted on the I/O node<br />

Use the mmcs_db_console to check which file systems are mounted on a particular I/O node (in our case, ionode4) using the mount command, and then use the same technique to check connectivity (ping) to the SN (172.30.1.1). Example 5-3 shows the commands that we used in a mmcs_db_console session with the write_con command (command lines are shown in bold font).

Example 5-3 Using mmcs_db_console to mount NFS file system on I/O node<br />

mmcs$ allocate R000_J108_32<br />

OK<br />

mmcs$ redirect R000_J108_32 on<br />

OK<br />

mmcs$ {i} write_con hostname<br />

OK<br />

mmcs$ Mar 29 13:42:35 (I) [1079301344] {17}.0: h<br />

Mar 29 13:42:35 (I) [1079301344] {0}.0: h<br />

Mar 29 13:42:35 (I) [1079301344] {0}.0: ostname<br />

ionode3<br />

$<br />

Mar 29 13:42:35 (I) [1079301344] {17}.0: ostname<br />

ionode4<br />

$<br />

mmcs$ {17} write_con hostname<br />

OK<br />

mmcs$ Mar 29 13:48:46 (I) [1079301344] {17}.0: h<br />

Mar 29 13:48:46 (I) [1079301344] {17}.0: ostname<br />

ionode4<br />

mmcs$ {17} write_con mount<br />

OK<br />

mmcs$ Mar 29 13:43:36 (I) [1079301344] {17}.0: m<br />

Mar 29 13:43:36 (I) [1079301344] {17}.0: ount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs<br />

(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />



172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

/dev/bubu_gpfs1 on /bubu type gpfs (rw)<br />

$<br />

mmcs$ {17} write_con ping -c 2 172.30.1.1<br />

OK<br />

mmcs$ Mar 29 13:44:07 (I) [1079301344] {17}.0: p<br />

Mar 29 13:44:07 (I) [1079301344] {17}.0: ing -c 2 172.30.1.1<br />

PING 172.30.1.1 (172.30.1.1) 56(84) bytes of data.<br />

64 bytes from 172.30.1.1: icmp_seq=1 ttl=64 time=0.126 ms<br />

Mar 29 13:44:08 (I) [1079301344] {17}.0: 64 bytes from 172.30.1.1:<br />

icmp_seq=2 ttl=64 time=0.098 ms<br />

Mar 29 13:44:08 (I) [1079301344] {17}.0:<br />

--- 172.30.1.1 ping statistics ---<br />

2 packets transmitted, 2 received, 0% packet loss, time 999ms<br />

rtt min/avg/max/mdev = 0.098/0.112/0.126/0.014 ms<br />

$<br />

From Example 5-3, you can see how to check for mounted file systems on an I/O<br />

node that is booted and also how to ping the SN from that node to check basic<br />

functional network connectivity. This technique (using the mmcs_db_console and<br />

the write_con commands) can also be used to mount the NFS-FS if it is NOT<br />

automatically mounted.<br />
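For example, to mount the scratch file system manually on I/O node {17} from the same console session, you could issue something like the following. This is only a sketch that reuses the server, mount options, and mount point shown in Example 5-3, and it assumes that the mount point already exists on the I/O node; adjust the values for your system.

mmcs$ {17} write_con mount -o rw,tcp,rsize=32768,wsize=32768 172.30.1.33:/bglscratch /bglscratch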

Step 3 - Check that the NFS-FS is mounted when the block boots<br />

You start by checking that the sitefs file has the correct entries (see<br />

Example 5-4). Next, check that the correct links are in place to invoke the sitefs<br />

file when I/O nodes are booted.<br />
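To verify the links, list the site rc3.d directory on the SN and confirm that the start script points at the sitefs file. This is a minimal sketch; the S40 prefix is only an example of the naming convention described in 5.1.2 and might differ on your system.

bglsn:~ # ls -l /bgl/dist/etc/rc.d/rc3.d/
# expect a start script (for example, S40sitefs) that is a symbolic link to ../init.d/sitefs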

The complete sitefs file that we used is shown in Appendix B, “The sitefs file” on<br />

page 423. Example 5-4 shows the relevant lines (in bold) from our sitefs file.<br />

Example 5-4 Sample sitefs file with /bglscratch file system<br />

bglsn:/bgl/dist/etc/rc.d/init.d # ls<br />

. .. sitefs<br />

bglsn:/bgl/dist/etc/rc.d/init.d # cat sitefs<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />



# It mounts a filesystem on /scratch, mounts /home for user files
# (applications), creates a symlink for /tmp to point into some directory
# in /scratch using the IP address of the I/O node as part of the directory
# name to make it unique to this I/O node, and sets up environment
# variables for ciod.

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />

# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />

# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#-------------------------------------------------------------------
mountSiteFs()
{
#............>...............

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />

# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.30.1.33<br />

# First reset status of this service<br />

rc_reset<br />

# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />



start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...<br />

mountSiteFs $SITEFS /bglscratch /bglscratch<br />

tcp,rsize=32768,wsize=32768,async<br />

##..............>>>>>>>> Omitted lines <<<<<<<<..............##


Step 4 - Check the network tuning parameters on the SN

Example 5-6 Recommended network tuning parameters<br />

# set UDP receive buffer default (and max, below) so that we don't drop packets

net.core.rmem_default = 1024000<br />

net.core.rmem_max = 8388608<br />

net.core.wmem_max = 8388608<br />

net.core.netdev_max_backlog = 3072<br />

# ARP cache area size to avoid Neighbour table overflow messages<br />

# defaults are 128, 512, 1024. For 64 racks they should be 512, 2048, and 4096.

net.ipv4.neigh.default.gc_thresh1 = 256<br />

net.ipv4.neigh.default.gc_thresh2 = 1024<br />

net.ipv4.neigh.default.gc_thresh3 = 2048<br />

# NFS tuning parameters<br />

net.ipv4.tcp_timestamps = 1<br />

net.ipv4.tcp_window_scaling = 1<br />

net.ipv4.tcp_sack = 1<br />

net.ipv4.tcp_rmem = 4096 87380 4194304<br />

net.ipv4.tcp_wmem = 4096 65536 4194304<br />

net.ipv4.ipfrag_low_thresh = 393216<br />

net.ipv4.ipfrag_high_thresh = 524288<br />

Important: If you have changed any of the network parameters, then you<br />

must run /etc/rc.d/boot.sysctl start or sysctl -p /etc/sysctl.conf for the<br />

settings to take effect immediately.<br />

To view the current settings for these parameters use:<br />

sysctl -A | grep net.<br />
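To confirm that the values from Example 5-6 are active, you can also query a few of them explicitly; a minimal sketch:

sysctl net.core.rmem_default net.core.rmem_max net.ipv4.neigh.default.gc_thresh2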

5.3 GPFS

GPFS stands for General Parallel File System. GPFS is a high-performance, scalable file system that is intended primarily for clusters of computers where a large number of processors require access to the same copy of the data (one of the basic requirements for parallel computing environments).

GPFS is based on a client-server model, with the server part responsible for managing the storage and the client part providing application access. The GPFS client software is highly efficient in handling data, so the CPU slice



required to read and write data is typically much less than for other file systems (like NFS).

Unlike NFS, where managing the storage associated with a file system is the<br />

responsibility of a single server (OS image), in GPFS the storage can be<br />

distributed among multiple servers, eliminating the single server bottleneck.<br />

Using GPFS it is easy to create file systems that store the data on many<br />

separate disks connected to many separate servers. In addition to performance,<br />

storage, and server load balancing, GPFS also provides excellent scalability, reliability, and high availability by providing continuous operation while adding or removing nodes, disks, and file systems.

Blue Gene/L is a highly scalable processing engine that is designed for highly<br />

parallel applications. It is likely that many parallel applications that are designed<br />

to run efficiently on Blue Gene/L will also benefit from the increased and scalable<br />

I/O performance that GPFS file systems can provide. By combining the<br />

scalability and performance of the Blue Gene/L processing platform with the<br />

scalability and I/O performance of a GPFS file system it is possible to provide a<br />

highly optimized computing environment to run parallel applications.<br />

This section provides a general overview of GPFS.<br />

5.3.1 When to use GPFS<br />

Because GPFS is a client-server application that requires additional knowledge and system administration skills, consider carefully whether a GPFS implementation is appropriate for your installation. Although the benefits of the product are significant, weigh the specific elements of your environment when deciding whether to implement it. The following considerations can help you make the correct decision:

► If you need a file system that can provide high performance (and a single<br />

server cannot deliver), then GPFS would be a good solution.<br />

► If you need a reliable file system for a cluster that is unaffected by the failure of a storage server or disk, then GPFS can be configured to provide such a system.

► If you want to allow parallel applications running on a cluster of machines to access a single file at the same time with tight control over data integrity (multiple application instances accessing the same data file concurrently), then GPFS has the appropriate architectural features and also a proven track record in providing these functions.



However, if the following conditions apply to your environment, then GPFS is not mandatory:

► File system performance offered by one NFS server system is adequate.<br />

► The files you are using are smaller than 2 GB.<br />

► You have no requirement to run parallel applications that write to the same<br />

file.<br />

► You do not intend to scale up in performance or storage capacity.<br />

5.3.2 Features and concepts of GPFS<br />

Some of the features of GPFS include:<br />

► File consistency<br />

GPFS uses a sophisticated token management system to provide data<br />

consistency while allowing multiple independent paths to the same file by the<br />

same name from anywhere in the cluster.<br />

► High recoverability and increased data availability<br />

Using GPFS replication, it is possible to configure GPFS to keep two copies of the data on separate groups of disks (failure groups); should a single disk or group of disks fail, access to the data is not lost.

GPFS is a journaling file system that creates separate journal files for each<br />

node. These logs record the allocation and modification of metadata, aiding in<br />

fast recovery and the restoration of data consistency in the event of node<br />

failure.<br />

► High I/O performance<br />

GPFS can provide high I/O performance and achieves this partly by striping<br />

the files across all the disks in the file system. Managing its own striping<br />

affords GPFS the control it needs to achieve fault tolerance and to balance<br />

load across adapters, storage controllers, and disks. Large files in GPFS are<br />

divided into equal sized blocks, and consecutive blocks are placed on<br />

different disks in a round-robin fashion.<br />

To exploit disk parallelism when reading a large file from a single-threaded<br />

application, whenever it can recognize a pattern, GPFS prefetches data into<br />

its buffer pool (pagepool), issuing I/O requests in parallel to as many disks as<br />

necessary to achieve the bandwidth of which the switching fabric is capable.<br />

GPFS recognizes sequential, reverse sequential, and various forms of strided<br />

access patterns.<br />

For parallel applications GPFS provides enhanced performance by allowing<br />

multiple processes or applications on all nodes in the cluster simultaneous<br />

access to the same file using standard file system calls. GPFS also allows<br />

concurrent reads and writes from multiple nodes. This is a key concept in<br />



parallel processing. Also useful for parallel applications is GPFS’s support of<br />

byte-range locks on file writes so that multiple clients can write to different<br />

byte-ranges within the same file at the same time.<br />

► Very large file and file system sizes<br />

The currently supported limits for both GPFS file system size and file size are 95 TB for Linux and 100 TB for AIX. These supported limits are confined to those configurations that have been tested. The architectural limit for GPFS, however, is 2 PB.
This is substantially more than most available file systems and can be a key advantage as data volumes and file sizes continue to increase.

► Cross cluster file system access<br />

GPFS allows users shared access to files in either the cluster where the file<br />

system was created, or other (remote) GPFS clusters. Each cluster in the<br />

network is managed as a separate cluster, while allowing shared file system<br />

access. When multiple clusters are configured to access the same GPFS file<br />

system, Open Secure Sockets Layer (OpenSSL) is used to authenticate<br />

cross-cluster network connections.<br />

5.3.3 GPFS requirements for Blue Gene/L<br />

Due to the internal structure of the Blue Gene/L system, adding GPFS support<br />

has been a challenging task. In this section we describe some of the major<br />

challenges and the solutions designed to overcome them.<br />

Tip: The GPFS implementation for Blue Gene/L exploits one of the main features of GPFS: cross-cluster mounting of GPFS file systems.
The configuration consists of two GPFS clusters: a storage “server” cluster (in fact, a common GPFS cluster with storage nodes) and a “client“ cluster, consisting of the Blue Gene/L I/O nodes and the SN.

Challenges for GPFS on Blue Gene/L
Challenges for GPFS on Blue Gene/L systems include:

► bgIO cluster - the SN is the only quorum node
► bgIO has no local storage; another cluster is required for storage (gpfsNSD)
► Blue Gene/L I/O nodes are diskless

GPFS code usually runs on AIX and Linux, on stand-alone machines that have dedicated OS disk(s) from which the system boots. Due to its tight integration with the OS, GPFS has been designed to store its code, configuration, and log files on the local disk(s). On Blue Gene/L, all the I/O traffic is done through the I/O nodes, and these have no boot/OS disks. GPFS, however, needs a place to store the files that it uses. These include:

► GPFS code and utilities
► GPFS configuration files (one individual set per cluster node)
► Console log files (one set per node)
► Syslog files
► Traces (if debug is needed)

In addition, due to the GPFS structure (clustering layer, storage abstraction layer, and file system device driver), each node can assume various roles (file system manager, configuration manager, and so on). Because the availability of the I/O nodes is dynamic (per-job block allocation and release), if one of these nodes were to assume a management role inside the cluster, this would cause huge performance problems. Performance would be affected by two factors:

► Cluster reconfiguration requires GPFS management role takeover, which can suspend I/O during such an operation.
► The additional load induced by the GPFS management roles creates an imbalance between the I/O nodes that are allocated for the same job.

Solutions for GPFS on Blue Gene/L
This section describes the approach that IBM development (GPFS and Blue Gene/L) takes to the challenges for GPFS on Blue Gene/L.

► Problem: Access to GPFS files for each I/O node

Solution: Because the SN is the only node in the bgIO cluster that has disks, the files needed by GPFS must be stored on those disks. The /bgl file system on the SN is used for this purpose. The Blue Gene/L I/O nodes access the GPFS files in this file system by means of NFS mounts.
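Conceptually, each I/O node's view of these files corresponds to an NFS mount of the SN's /bgl file system similar to the following sketch. This is illustrative only: on a real system the I/O node boot scripts perform the mount automatically, and the SN address shown is simply the example address used elsewhere in this chapter.

# sketch only - performed automatically by the I/O node boot scripts
mount -t nfs 172.30.1.1:/bgl /bgl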

► Problem: The I/O node bootup sequence must include GPFS handling

Solution: The /bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs script is provided and is installed when installing the Blue Gene/L RPMs.

When called during I/O node boot up, the $BGL_DISTDIR/etc/rc.d/init.d/gpfs script is responsible for creating the necessary symbolic links that allow the GPFS client code to find all the files that it normally uses.



Example 5-7 presents the directories and links that are created in the gpfs script if GPFS_STARTUP=1 is found in the /etc/sysconfig/gpfs file.

Example 5-7 Excerpt from 'gpfs' script

# Directories ....
/bgl/gpfsvar//var/mmfs/gen
/bgl/gpfsvar//var/mmfs/etc
/bgl/gpfsvar//var/mmfs/tmp
/bgl/gpfsvar//var/adm/ras

# Links......
ln -s /bgl/gpfsvar//var/mmfs /var/mmfs
ln -s /bgl/gpfsvar//var/adm/ras /var/adm/ras
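The empty path component in the excerpt (/bgl/gpfsvar//...) presumably stands for a per-I/O-node subdirectory, consistent with the per-node configuration and log files listed earlier. As a rough illustration only (assuming the per-node component is the node's host name, which is an assumption, not the shipped script's actual logic), the excerpt corresponds to shell commands along these lines:

# minimal sketch, not the shipped gpfs init script
NODE=$(hostname)                      # assumed per-node component
for d in var/mmfs/gen var/mmfs/etc var/mmfs/tmp var/adm/ras; do
    mkdir -p /bgl/gpfsvar/$NODE/$d    # per-node state directories on /bgl (NFS)
done
ln -s /bgl/gpfsvar/$NODE/var/mmfs /var/mmfs        # GPFS expects /var/mmfs locally
ln -s /bgl/gpfsvar/$NODE/var/adm/ras /var/adm/ras  # GPFS logs under /var/adm/ras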

► Problem: Preventing I/O nodes from assuming GPFS management roles (cluster manager, file system/stripe group manager)

Solution: The Blue Gene/L I/O cluster, referred to hereafter as bgIO, does not own any file system; rather, it cross-mounts it from another GPFS cluster (referred to hereafter as gpfsNSD).

This solution also clearly separates the administration of the GPFS file system storage from the administration of the bgIO cluster (SN plus I/O nodes). This has the advantage that both the Blue Gene/L system and the GPFS storage cluster (gpfsNSD) can be scaled independently, if required.

► Problem: The GPFS cluster for the Blue Gene/L system (bgIO) is unusual in that it contains only one quorum node, the SN. If this single quorum node goes down, then the cross-mounted GPFS file system, referred to hereafter as the GPFS-FS, will become unmounted.

Solution: This is consistent with the same dependency that the Blue Gene/L system has on the SN. Thus, if the SN is turned off (or becomes unavailable for any reason), the Blue Gene/L system cannot operate anyway. Therefore, it is acceptable that the GPFS-FS will also be unavailable.

5.3.4 GPFS supported levels

It is important that the GPFS packages that are installed for the Blue Gene/L I/O nodes match the level of the code that is installed for the Blue Gene/L driver itself. The installation levels must match because, for the Blue Gene/L I/O nodes, we do not have to build the portability layer; it is already provided by the Blue Gene/L GPFS RPMs.



Important: The GPFS code that is installed for the Blue Gene/L I/O nodes is different from that installed for the SN. The SN runs Linux on IBM System p 64-bit hardware and, therefore, uses GPFS for SUSE Linux (SLES 9) on IBM System p code. The I/O nodes use a PowerPC 440 CPU, which is 32-bit, so this hardware uses a special version of GPFS code that is specific to this environment.

This Blue Gene/L I/O node specific GPFS code can only be downloaded from the following IBM Web site (which is password protected):

https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE

Access to this IBM Web site is granted to organizations that have purchased the GPFS for Blue Gene/L code.

When the Blue Gene/L driver code level is updated, the GPFS code for the I/O nodes must be reinstalled at the correct level. The code must be at the correct level because, when the I/O nodes boot, they depend on files that exist under the following directory:

/bgl/BlueLight/ppcfloor/dist/etc/rc.d

When the Blue Gene/L driver is updated, the /bgl/BlueLight/ppcfloor symbolic link points to another directory that does not have the GPFS files that are required during the I/O node boot process. You have to re-install the GPFS for MCP (I/O nodes) code into the new Blue Gene/L driver directory, as sketched below.
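A hedged sketch of that reinstallation, reusing the rpm --root form shown later in "Installing the GPFS code for Blue Gene/L I/O nodes"; the value of NEW_DRIVER_ROOT is an assumption and must be replaced with the root directory that the updated driver level actually uses on your system:

# sketch only - point NEW_DRIVER_ROOT at the new driver tree
NEW_DRIVER_ROOT=/bgl/BlueLight/driver/ppc/bglsys/bin/bglOS
cd /tmp/gpfslpp_for_ionodes
rpm --root $NEW_DRIVER_ROOT --nodeps -ivh gpfs*.rpm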

The GPFS Portability Layer for the SN must be built on the SN and the Front-End Nodes (FENs) after the GPFS for SLES RPMs have been installed. The GPFS Portability Layer for the Blue Gene/L I/O nodes is shipped pre-built, so there is no need to build it for the I/O nodes after installing the GPFS for Blue Gene/L RPMs.

5.3.5 How GPFS plugs in

In this section we describe the steps needed to add a GPFS file system to the Blue Gene/L system. We present only the GPFS commands that enable an existing GPFS file system to be cross-mounted on to the Blue Gene/L system.

This section assumes that you follow the GPFS installation and administration procedures documented in the GPFS product manuals listed at the end of this chapter in 5.3.11, "References" on page 264 to build the GPFS storage cluster (gpfsNSD).



Figure 5-2 presents the three high-level steps needed to make a GPFS file system available on Blue Gene/L. The essential concept that we use to make this possible is the ability of GPFS 2.3 to allow one GPFS cluster to mount a GPFS file system that belongs to a remote GPFS cluster.

While it is possible to add the NSD (Network Shared Disk) storage servers directly to the bgIO cluster and provide a locally owned GPFS file system, this is not recommended, because the dynamic nature of the Blue Gene/L system might cause the GPFS cluster performance problems (see "Challenges for GPFS on Blue Gene/L" on page 227).

Figure 5-2 Plugging in GPFS in steps (diagram: 1. create the GPFS storage cluster gpfsNSD, with p5 servers 172.30.1.31-33 and IBM DS4500 storage, and create a GPFS file system, /gpfs1; 2. create the bgIO GPFS cluster with the Service Node and I/O nodes; 3. cross mount the GPFS file system /gpfs1 from gpfsNSD onto the Blue Gene cluster bgIO over the functional Ethernet, 172.30.0.0/16)

You can find the latest detailed instructions for installing GPFS on Blue Gene/L in the GPFS "How to" document, which is available at:

https://www14.software.ibm.com/webapp/iwm/web/preLogin.do?source=BGL-BLUEGENE

The three major steps that are required to enable a GPFS file system on a Blue Gene/L system are:

1. Create the GPFS file system on a remote cluster (gpfsNSD).
2. Create a GPFS cluster on Blue Gene/L (bgIO). This step creates the bgIO cluster with just the SN.
3. Cross mount the GPFS file system from gpfsNSD on to bgIO. This step includes adding the I/O nodes for Blue Gene/L to the bgIO cluster.



5.3.6 Creating the GPFS file system on a remote cluster (gpfsNSD)

Figure 5-3 shows the GPFS cluster that we built for our test environment. This GPFS cluster uses three nodes running AIX 5L V5.3 and has the GPFS file system (mounted as /gpfs1) using four LUNs that reside on 4+P RAID5 arrays from a DS4500 storage controller.

Figure 5-3 GPFS storage cluster (diagram: nodes p630n01_fn, p630n02_fn, and p630n03_fn on the functional Gbit Ethernet, attached to DS4500 and TotalStorage EXP710 storage; the GPFS storage cluster gpfsNSD owns the GPFS file system mounted as /gpfs1)

You can create the GPFS file system on a cluster using either AIX or Linux nodes. We do not discuss the process of creating this cluster in detail here. However, the remote cluster must conform to the following rules:

► The SN is not included in this cluster.
► All nodes in this storage cluster must be able to access the SN and all Blue Gene/L I/O nodes across the functional network.

5.3.7 Creating a GPFS cluster on Blue Gene/L (bgIO)

The GPFS code that is installed on the SN is different from that installed for the Blue Gene/L I/O nodes. In this section we create the GPFS cluster on the Blue Gene/L system with only one node, which is the SN. Figure 5-4 presents a diagram of our cluster before we installed GPFS and configured the bgIO cluster.

Figure 5-4 Blue Gene/L system before the bgIO cluster is created (diagram: the Service Node bglsn_fn, 172.30.1.1, an OpenPower server running SLES 9 PPC 64-bit with /bgl local, connected over the functional Gbit Ethernet, 172.30.0.0/16, to the I/O nodes, which mount /bgl over NFS)

The I/O nodes can only be added to this GPFS cluster when the block that they serve has been initialized and after the cross-mount of the GPFS file system. Here are the high-level steps that are required to create a GPFS cluster that uses only the Blue Gene/L SN:

1. Install the GPFS code for the SN, as described in "Installing the GPFS code for SN" on page 234.
2. Install the GPFS code for the Blue Gene/L I/O nodes, as described in "Installing the GPFS code for Blue Gene/L I/O nodes" on page 235.
3. Configure ssh and scp on all Blue Gene/L nodes, as described in "Configuring ssh and scp on SN and I/O nodes" on page 237.
4. Create the bgIO cluster, as described in "Creating the bgIO cluster" on page 244.



Installing the GPFS code for SN
Figure 5-5 illustrates the steps to install the GPFS code for the SN.

Figure 5-5 GPFS code install on the Blue Gene/L system (diagram: on the Service Node bglsn_fn, 172.30.1.1, SLES 9 PPC 64-bit, install the GPFS RPMs for PPC64 and compile the portability layer; into the local /bgl tree, served to the I/O nodes over the functional Gbit Ethernet, install the GPFS Blue Gene RPMs for MCP)

To install the code, follow these steps:

1. Create a new directory for the GPFS code for the SN and populate it with the correct GPFS RPMs.

You can do this by copying the self-extracting product image, gpfs_install-2.3.0-0_sles9_ppc64, from the GPFS for Linux on POWER CD-ROM to the new directory. Example 5-8 shows the commands that we used.

Example 5-8 Installing GPFS code on SN

root@bglsn_fn~/> mkdir -p /tmp/gpfslpp_for_servicenode/updates
root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode
root@bglsn_fn~/> cp /cdrom/*/gpfs_install-2.3.0-0_sles9_ppc64 .
root@bglsn_fn~/> ./gpfs_install-2.3.0-0_sles9_ppc64 --silent



After you accept the license agreement, the GPFS product installation images reside in the extraction target directory (in our case, /tmp/gpfslpp_for_servicenode). See Example 5-9.

Example 5-9 GPFS RPMs for SN

root@bglsn_fn~/> cd /tmp/gpfslpp_for_servicenode
root@bglsn_fn~/> ls gpfs.*
gpfs.base-2.3.0-11.sles9.ppc64.rpm
gpfs.docs-2.3.0-11.noarch.rpm
gpfs.gpl-2.3.0-11.noarch.rpm
gpfs.msg.en_US-2.3.0-11.noarch.rpm

2. Install the GPFS code for the SN:

cd /tmp/gpfslpp_for_servicenode
rpm -ivh gpfs*.rpm

3. Install any updates that are available. To do this, copy any update RPMs to the /tmp/gpfslpp_for_servicenode/updates directory and then issue the following commands:

cd /tmp/gpfslpp_for_servicenode/updates
rpm -Uvh gpfs*.rpm

4. Create the GPFS portability layer binaries. Follow the instructions in /usr/lpp/mmfs/src/README (on the SN). The files mmfslinux, lxtrace, tracedev, and dumpconv are installed in /usr/lpp/mmfs/bin after you have completed the instructions.

Installing the GPFS code for Blue Gene/L I/O nodes
To install the GPFS code for the Blue Gene/L I/O nodes, follow these steps:

1. Download the GPFS for MCP RPMs on to the SN.

Create a new directory for the GPFS code for the I/O nodes and copy the correct RPMs into it. You can download these RPMs from the secure Blue Gene/L software portal:

mkdir -p /tmp/gpfslpp_for_ionodes/updates

Attention: Make sure you do not mix the RPMs for PPC64 with the ones for the I/O nodes (PPC440, 32-bit)!



You should see a list similar to the following:

cd /tmp/gpfslpp_for_ionodes; ls gpfs.*
gpfs.base-2.3.0-11.ppc.rpm
gpfs.docs-2.3.0-11.noarch.rpm
gpfs.gplbin-2.3.0-11.ppc.rpm
gpfs.msg.en_US-2.3.0-11.noarch.rpm
gpfs.gpl-2.3.0-11.noarch.rpm

2. Install the GPFS code for the I/O nodes:

cd /tmp/gpfslpp_for_ionodes
rpm --root /bgl/BlueLight/driver/ppc/bglsys/bin/bglOS --nodeps -ivh gpfs*.rpm

Note: It is important to note the following rpm command argument:

--root <directory>

This argument forces the specified <directory> to be used as the root directory for the RPM installation.
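To confirm what was installed under that alternate root, rpm can be queried against the same root directory. A hedged example (the grep pattern is just illustrative):

rpm --root /bgl/BlueLight/driver/ppc/bglsys/bin/bglOS -qa | grep gpfs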

3. Install any updates that are available for the code. To do this, copy any update RPMs to the /tmp/gpfslpp_for_ionodes/updates directory and issue the following commands:

cd /tmp/gpfslpp_for_ionodes/updates
rpm -Uvh gpfs*.rpm



Configuring ssh and scp on SN and I/O nodes

Important: In this section, we use the following naming conventions:

► $BGL_SITEDISTDIR: normally points to the /bgl/dist directory.
► $BGL_DISTDIR: normally points to the /bgl/BlueLight/ppcfloor/dist directory.
► $BGL_SNIP is the SN's IP address on the functional network.
► $SN_HOSTNAME is the SN's host name on the functional network.
► $IONODE_IPS is a wildcarded IP address representing all I/O nodes.

For example, if the I/O nodes have IP addresses 172.30.100.1 through 172.30.100.128 and 172.30.101.1 through 172.30.101.128, a reasonable value for $IONODE_IPS would be 172.30.10?.*

► $IONODE_HOSTNAMES is a wildcarded host name representing all I/O nodes.

For example, if the I/O nodes have host names ionode1, ionode2, and so forth, a reasonable value for $IONODE_HOSTNAMES would be ionode*.

In these examples we have chosen to use the RSA2 type for ssh keys. You can choose other key types (RSA1, DSA). However, whichever type you choose, it is strongly recommended that you use the same key type for all ssh keys.

GPFS needs to execute commands and copy configuration files between all nodes in the cluster without being prompted for a password. For GPFS on Linux, the default remote command execution and copy programs are secure shell and secure copy (ssh/scp). This is why we have to prepare the nodes (SN and I/O nodes - see Figure 5-6).



Figure 5-6 Configuring ssh on SN and I/O nodes (for GPFS) (diagram: configure SSH for the Service Node, bglsn_fn, 172.30.1.1, SLES 9 PPC 64-bit with /bgl local, and configure SSH for the I/O nodes, which mount /bgl over NFS, all connected by the functional Gbit Ethernet)

To configure ssh and scp on the SN and I/O nodes, follow these steps:

1. Ensure that the host name that is associated with the SN is unique. Check both the /etc/hosts file and DNS.

Note: To avoid network and name resolution problems, we strongly recommend that you maintain consistent name resolution using local files (/etc/hosts). Even though you can use DNS, it is not useful to add DNS entries for I/O nodes, because they should not be accessible directly by the users.

2. In the /etc/hosts file, add an entry for each I/O node, and check for duplicate IP addresses or IP labels (names).

3. Copy this newly updated /etc/hosts file to the correct directory in the Blue Gene/L tree (to make it available to the I/O nodes):

cp /etc/hosts $BGL_SITEDISTDIR/etc/hosts
chmod 644 $BGL_SITEDISTDIR/etc/hosts



4. Create and verify the directories for the root user on the SN and the I/O nodes, as shown in Example 5-10.

Example 5-10 Directories for ssh client files

root@bglsn_fn~/> chmod 755 $BGL_SITEDISTDIR
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root
root@bglsn_fn~/> mkdir $BGL_SITEDISTDIR/root/.ssh
root@bglsn_fn~/> chmod 700 $BGL_SITEDISTDIR/root/.ssh

5. Check the ssh key pair for the root user on the I/O nodes. Check for both the private key file (/bgl/dist/root/.ssh/id_rsa) and the public key file (/bgl/dist/root/.ssh/id_rsa.pub). If the keys have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/root/.ssh/id_rsa -N ''

6. Check the ssh key pair for the root user on the SN. Check for both the private key file (~/.ssh/id_rsa) and the public key file (~/.ssh/id_rsa.pub). If these have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f /root/.ssh/id_rsa -N ''

7. Check the ssh key pair for the ssh daemon on the I/O nodes. Check for both the private key file (/bgl/dist/etc/ssh/ssh_host_rsa_key) and the public key file (/bgl/dist/etc/ssh/ssh_host_rsa_key.pub). If these have not been created, use the following command:

ssh-keygen -t rsa -b 1024 -f $BGL_SITEDISTDIR/etc/ssh/ssh_host_rsa_key -N ''

8. Check the ssh key pair for the ssh daemon on the SN. Check for both the private key file (/etc/ssh/ssh_host_rsa_key) and the public key file (/etc/ssh/ssh_host_rsa_key.pub). If these have not been created (most unlikely!), use the following command:

ssh-keygen -t rsa -b 1024 -f /etc/ssh/ssh_host_rsa_key -N ''

9. Create the authorized_keys file for all nodes in the bgIO cluster. Copy the root user's public key file from the SN to a temporary file (/tmp/authorized_keys). Then, append the root user's public key file from the I/O nodes to it:

cat /root/.ssh/id_rsa.pub >> /tmp/authorized_keys
cat $BGL_SITEDISTDIR/root/.ssh/id_rsa.pub >> /tmp/authorized_keys



Having created the /tmp/authorized_keys file, distribute it. Check whether either the SN or the I/O nodes already have an authorized_keys file. If one already exists, then append the /tmp/authorized_keys file to the end of the existing one:

cat /tmp/authorized_keys >> /root/.ssh/authorized_keys
cat /tmp/authorized_keys >> $BGL_SITEDISTDIR/root/.ssh/authorized_keys

10. Create the known_hosts file for all nodes in the bgIO cluster. Create a temporary known_hosts file for both the SN and the I/O nodes. Then combine these two files to create the /tmp/known_hosts_gpfs file, as shown in Example 5-11.

Example 5-11 Creating the known_hosts file

root@bglsn_fn~/> echo "$BGL_SNIP,$SN_HOSTNAME $(cat /etc/ssh/ssh_host_rsa_key.pub)" >> \
/tmp/known_hosts_sn
root@bglsn_fn~/> echo "$IONODE_IPS,$IONODE_HOSTNAMES $(cat \
/bgl/dist/etc/ssh/ssh_host_rsa_key.pub)" >> /tmp/known_hosts_io
root@bglsn_fn~/> cp /tmp/known_hosts_sn /tmp/known_hosts_gpfs
root@bglsn_fn~/> cat /tmp/known_hosts_io >> /tmp/known_hosts_gpfs

Note: The variables $BGL_SNIP, $SN_HOSTNAME, $IONODE_IPS, and $IONODE_HOSTNAMES are explained in the Important box at the beginning of "Configuring ssh and scp on SN and I/O nodes".

Example 5-12 shows one entry in the file that uses the wildcard character (*). This character saves having to add entries for every I/O node individually.

Example 5-12 The known_hosts file entry that uses wild card chars

172.30.2.*,ionode* ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAIEA27GK+WllP58rmK//LGhE4NKBHDdb30x4Kvrkb3ibbRs
41eHuLE3/KIV0IQkwi36F4hg5gRBC2vbBINaIJvwiybovpoL2gfpFTeRworWvVI3goBAJh/
/hIeT+J9sm+Iogxe2iQ6Q6TfsdPss4dkq3nvGM/HmUULsohgT3u494vVc= root@bglsn

After creating the /tmp/known_hosts_gpfs file, distribute it (see Example 5-13). Check whether either the SN or the I/O nodes already have a known_hosts file. If one already exists, then append the /tmp/known_hosts_gpfs file to the end of the existing one.



Example 5-13 Distributing known_hosts file

root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> /root/.ssh/known_hosts
root@bglsn_fn~/> touch $BGL_SITEDISTDIR/root/.ssh/known_hosts
root@bglsn_fn~/> cat /tmp/known_hosts_gpfs >> \
$BGL_SITEDISTDIR/root/.ssh/known_hosts

Attention: If the authorized_keys and known_hosts files already exist and you want to append to these files, check for duplicate entries. If there are duplicate entries, only the first occurrence is considered, so you can run into authentication problems!
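A hedged way to drop exact duplicate lines from these files (a sketch only; sort reorders the entries, which ssh and sshd tolerate, but review the result before relying on it):

sort -u /root/.ssh/authorized_keys -o /root/.ssh/authorized_keys
sort -u /root/.ssh/known_hosts -o /root/.ssh/known_hosts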

11. Test unprompted command execution.

Verify that the ssh files are configured properly by using ssh between all the bgIO cluster nodes without being prompted for a password or host key acceptance. This test requires the sshd daemon to be running on the I/O nodes to be tested. The simplest way to achieve this is to ensure that you have a sitefs file in the $BGL_SITEDISTDIR/etc/rc.d/init.d directory and that this sitefs file includes the following line:

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

If you do not have a sitefs file, then you can create one using the example found at the end of the following file:

$BGL_SITEDISTDIR/docs/ionode.README

For your convenience, this file is also included in Appendix C, "The ionode.README file" on page 431.

The sitefs script that we used for this book is shown in Appendix B, "The sitefs file" on page 423. The lines that are important to check for in your sitefs script are shown in bold in Example 5-14.

Example 5-14 The sitefs file with GPFS enabled

# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs
# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs



When the "GPFS_STARTUP=1" line is included in the sitefs script, and the sitefs script is also linked into the startup script files, the sshd daemon is started on the I/O nodes at bootup by the S16sshd startup script. Check that the following symbolic links are in place so that your sitefs file will be called during I/O node initialization:

ln -s $BGL_SITEDISTDIR/etc/rc.d/init.d/sitefs \
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs
ln -s $BGL_SITEDISTDIR/etc/rc.d/init.d/sitefs \
$BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs

Now you can boot a block and check that you can connect using ssh to one I/O node and then from this I/O node back to the SN. First, boot a block and establish the IP addresses of the I/O nodes by using the {i} write_con hostname command from the mmcs_db_console (see Example 5-15).

Example 5-15 Using mmcs_db_console to boot a block and check for I/O nodes

mmcs$ list_blocks
OK
R000_128 root(1) connected
mmcs$ select_block R000_128
OK
mmcs$ redirect R000_128 on
OK
mmcs$ {i} write_con hostname
OK
mmcs$ Mar 22 14:47:25 (I) [1079031008] {119}.0: h
Mar 22 14:47:25 (I) [1079031008] {102}.0: h
Mar 22 14:47:25 (I) [1079031008] {17}.0: h
Mar 22 14:47:25 (I) [1079031008] {17}.0: ostname
172.30.2.2
................. >> ...............
Mar 22 14:47:25 (I) [1079031008] {0}.0: ostname
172.30.2.1
$
Mar 22 14:47:25 (I) [1079031008] {34}.0: ostname
172.30.2.5
$

Example 5-15 shows that we have eight I/O nodes with IP addresses from 172.30.2.1 through 172.30.2.8. Now, check whether you can connect using ssh to one of these nodes from the SN and then back again, using both the IP address and the node name as listed in the /etc/hosts file:

bglsn:/tmp # ssh root@172.30.2.1
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

This shows that you have connected with ssh to the I/O node with the IP address 172.30.2.1. Now, ssh back to the SN using its IP address:

$ ssh root@172.30.1.1
Last login: Wed Mar 22 14:45:13 2006 from 192.168.100.60
bglsn:~ #
bglsn:~ # exit
logout
Connection to bglsn_fn.itso.ibm.com closed.

Now, show that you can also ssh between an I/O node and the SN using the alias names rather than the IP addresses, as shown in Example 5-16.

Example 5-16 Verifying ssh connection using IP labels

$ hostname
ionode1
$ ssh root@bglsn_fn.itso.ibm.com
Last login: Wed Mar 22 15:31:09 2006 from ionode1
bglsn:~ # ssh root@ionode1
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
$ hostname
ionode1
$ exit
Connection to ionode1 closed.
bglsn:~ # exit
logout
Connection to bglsn_fn.itso.ibm.com closed.
$ exit
Connection to ionode1 closed.
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #

This concludes our ssh tests; we have now confirmed that the ssh setup is correct.



Creating the bgIO cluster
After you have installed the GPFS packages and have configured remote command execution, you can create the GPFS cluster (bgIO). Figure 5-7 illustrates this step.

Figure 5-7 Creating the GPFS cluster named bgIO on the Blue Gene/L system

To create the bgIO cluster, follow these steps:

1. Create a GPFS node file called service.node that contains only the SN entry. Initially we have a single node in the bgIO cluster (Example 5-17).

Example 5-17 GPFS node definition file for bgIO cluster

bglsn:/tmp # echo "$SN_HOSTNAME:quorum" >> /tmp/service.node
bglsn:/tmp # cat service.node
bglsn_fn:quorum
bglsn:/tmp #

2. Use the /tmp/service.node file to create the bgIO cluster. Here is the command to be issued from the SN:

/usr/lpp/mmfs/bin/mmcrcluster -n service.node -p bglsn_fn -C bgIO \
-A -r /usr/bin/ssh -R /usr/bin/scp

Set the pagepool to 128M and any other GPFS configuration parameters that you might want to change at this time. Changing the pagepool from the default of 64 MB to 128 MB improves performance. Values larger than 128 MB can result in GPFS not being able to load (an I/O node has only 2 GB of RAM).

# mmchconfig pagepool=128M
# mmchconfig dataStructureDump=/var/mmfs/tmp



3. Check the bgIO cluster to verify the parameters, as shown in Example 5-18.

Example 5-18 The bgIO cluster configuration

bglsn:/tmp # /usr/lpp/mmfs/bin/mmlscluster

GPFS cluster information
========================
GPFS cluster name: bgIO.itso.ibm.com
GPFS cluster id: 12402351528774401789
GPFS UID domain: bgIO.itso.ibm.com
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
Primary server: bglsn_fn.itso.ibm.com
Secondary server: (none)

Node number Node name IP address Full node name Remarks
-----------------------------------------------------------------------------------
1 bglsn_fn 172.30.1.1 bglsn_fn.itso.ibm.com quorum node

4. Start the bgIO GPFS cluster using the mmstartup -a command, and check the /var/adm/ras/mmfs.log.latest file to ensure that GPFS has started (see Example 5-19); look for 'mmfsd ready'.

Example 5-19 The mmfs.log.latest file showing that GPFS is started and ready

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 14:14:30 EST 2006: mmstartup: Starting GPFS ...
bglsn:/mnt/chriss/gpfs/BGL # cat /var/adm/ras/mmfs.log.latest
Mon Mar 20 14:14:31 EST 2006 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
/bin/mv: cannot stat `/var/adm/ras/mmfs.log.previous': No such file or directory
Unloading modules from /usr/lpp/mmfs/bin
Loading modules from /usr/lpp/mmfs/bin
Module Size Used by
mmfslinux 268384 1 mmfs
tracedev 35552 2 mmfs,mmfslinux
Removing old /var/mmfs/tmp files:
Mon Mar 20 14:14:33 2006: mmfsd initializing. {Version: 2.3.0.10
Built: Jan 16 2006 13:07:54} ...
Mon Mar 20 14:14:34 EST 2006 /var/mmfs/etc/gpfsready invoked
Mon Mar 20 14:14:34 2006: mmfsd ready
Mon Mar 20 14:14:34 EST 2006: mmcommon mmfsup invoked
Mon Mar 20 14:14:34 EST 2006: /var/mmfs/etc/mmfsup.scr invoked
bglsn:/mnt/chriss/gpfs/BGL #

5.3.8 Cross mounting the GPFS file system on to Blue Gene/L cluster

Most of this section deals with authentication and the exchange of OpenSSL-based certificates (keys). The following steps are necessary to cross mount the GPFS file system on the Blue Gene/L:

1. Configure GPFS authentication on the gpfsNSD and bgIO clusters.
2. Mount the GPFS file system on the SN.
3. Add all the I/O nodes to the bgIO cluster.
4. Boot a block and check for automatic mount of the GPFS file system.

Configuring authentication on gpfsNSD and bgIO cluster
Figure 5-8 illustrates our environment before the bgIO and gpfsNSD clusters are authenticated with each other.

Figure 5-8 Configure GPFS authentication on both clusters



To configure authentication on the gpfsNSD and bgIO clusters, follow these steps:

1. Generate the SSL keys on the gpfsNSD and bgIO clusters.

On both the gpfsNSD and bgIO clusters, first ensure that GPFS is stopped and that the openssl packages are installed. On one node in the gpfsNSD cluster and on the SN, run the mmauth genkey command, as shown in Example 5-20. This command generates the public/private key pair, which is saved in the /var/mmfs/ssl directory.

Example 5-20 Generating GPFS cluster ssl keys

###### On one node in gpfsNSD cluster (p630n01):
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth genkey
Verifying GPFS is stopped on all nodes ...
Generating RSA private key, 512 bit long modulus
...............++++++++++++
...............++++++++++++
e is 65537 (0x10001)
id_rsa1 100% 497 0.5KB/s 00:00
id_rsa1 100% 497 0.5KB/s 00:00
mmauth: Command successfully completed

####### and on the Service Node:
bglsn:/root # /usr/lpp/mmfs/bin/mmauth genkey
Verifying GPFS is stopped on all nodes ...
Generating RSA private key, 512 bit long modulus
...............++++++++++++
...............++++++++++++
e is 65537 (0x10001)
id_rsa1 100% 497 0.5KB/s 00:00
id_rsa1 100% 497 0.5KB/s 00:00
mmauth: Command successfully completed

2. Set cipherList=AUTHONLY on both clusters.

On both the gpfsNSD and bgIO clusters, ensure that GPFS is stopped. On one node in each cluster, run the mmchconfig cipherList=AUTHONLY command, as shown in Example 5-21. This setting tells GPFS to authenticate and check authorization for network connections, which is required for cross-cluster communications.



Example 5-21 Telling clusters to authenticate cross-cluster connections

bglsn:/root # /usr/lpp/mmfs/bin/mmchconfig cipherList=AUTHONLY
Verifying GPFS is stopped on all nodes ...
mmchconfig: Command successfully completed
mmchconfig: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.

3. Exchange the SSL public keys between the clusters.

Copy the bgIO cluster public key to one of the nodes in the gpfsNSD cluster, and the gpfsNSD cluster public key to the SN. Then, add the keys to the authorization list on each cluster using the mmauth add command on both clusters, as shown in Example 5-22.

Example 5-22 Authenticating bgIO and gpfsNSD clusters

# On one node in gpfsNSD cluster:
[p630n01][/]> scp bglsn_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.bgIO
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth add bgIO -k ~/id_rsa.pub.bgIO

# and, on the Service Node:
bglsn:/ # scp p630n01_fn:/var/mmfs/ssl/id_rsa.pub ~/id_rsa.pub.gpfsNSD
bglsn:/ # /usr/lpp/mmfs/bin/mmauth add gpfsNSD -k ~/id_rsa.pub.gpfsNSD \
-n p630n01_fn,p630n02_fn

Note: As of the current GPFS version (V2.3), you need to specify the NSD nodes when authorizing the remote cluster; in our case, p630n01_fn and p630n02_fn.

4. Allow bgIO access to the GPFS-FS exported by gpfsNSD.

Start the GPFS daemons on the gpfsNSD cluster, and then use the mmauth grant command to allow bgIO access to the GPFS file system. In the case shown, the device name of the GPFS file system to which bgIO is granted access is gpfs1. The bgIO cluster name used with the mmauth grant command must be the actual name of the cluster, as shown by the mmlscluster command on the SN (in our case, bgIO).

[p630n01][/]> /usr/lpp/mmfs/bin/mmstartup -a
[p630n01][/]> /usr/lpp/mmfs/bin/mmauth grant bgIO -f gpfs1

Note: The cluster name is bgIO.itso.ibm.com, but the short name is allowed.
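If you want to confirm that the grant is in place before moving on, and your GPFS level provides the mmauth show subcommand (an assumption; the subcommand set varies by release), it lists the remote clusters and the file systems they are allowed to access when run on the gpfsNSD cluster:

[p630n01][/]> /usr/lpp/mmfs/bin/mmauth show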



5. Add the remote file system to the bgIO cluster.

On the bgIO cluster, ensure that GPFS is shut down. Then, run the mmremotefs add command. This command tells the bgIO cluster about the remote file system that it can mount. Note that these commands must be issued by the root user. The cluster name used with the mmremotefs add command after the -C parameter must be the actual name of the gpfsNSD cluster, as shown by the mmlscluster command run on the gpfsNSD cluster.

The local device name for the remote GPFS-FS is bubu_gpfs1, and /bubu is the local mount point for the file system:

bglsn:/root # /usr/lpp/mmfs/bin/mmremotefs add bubu_gpfs1 -f gpfs1 \
-C gpfsNSD -T /bubu

Mounting the GPFS file system on the SN
Figure 5-9 illustrates the cross-mounted file system that we provided for our Blue Gene/L system.

Figure 5-9 Cross-mount /gpfs1 from the gpfsNSD cluster on the SN



To mount the GPFS file system on the SN, follow these steps:

1. On the SN, start GPFS and ensure that the remote file system can be mounted, as shown in Example 5-23.

Example 5-23 Mounting the remote file system on bgIO

bglsn:/root # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 15:00:11 EST 2006: mmstartup: Starting GPFS ...
bglsn:/root # mount /bubu
bglsn:/root # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb3 70614928 4632540 65982388 7% /
tmpfs 1898508 8 1898500 1% /dev/shm
/dev/sda4 489452 50972 438480 11% /tmp
/dev/sda1 9766544 1997804 7768740 21% /bgl
/dev/sda2 9766608 698992 9067616 8% /dbhome
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu

2. Enable the remote file system from the gpfsNSD cluster (/gpfs1) to mount automatically over the local mount point (/bubu) when GPFS is started on the bgIO cluster. Use the mmremotefs update command, as shown in Example 5-24.

Example 5-24 Changing remote file system to automount at bgIO cluster startup

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmremotefs update bubu_gpfs1 -A yes
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmshutdown -a
......
bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmstartup -a
Mon Mar 20 15:08:12 EST 2006: mmstartup: Starting GPFS ...
bglsn:/mnt/chriss/gpfs/BGL # df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb3 70614928 4635404 65979524 7% /
tmpfs 1898508 8 1898500 1% /dev/shm
/dev/sda4 489452 50972 438480 11% /tmp
/dev/sda1 9766544 2005192 7761352 21% /bgl
/dev/sda2 9766608 698992 9067616 8% /dbhome
p630n03:/bglscratch 36700160 5952 36694208 1% /bglscratch
p630n03_fn:/nfs_mnt 104857600 11300064 93557536 11% /mnt
/dev/bubu_gpfs1 1138597888 2918400 1135679488 1% /bubu
bglsn:/mnt/chriss/gpfs/BGL #



Adding all the I/O nodes to the bgIO cluster
Figure 5-10 illustrates our complete GPFS on Blue Gene/L configuration. Adding all I/O nodes to the bgIO cluster is the last step before you can actually start using the GPFS file system for running user jobs.

Figure 5-10 Adding I/O nodes to the bgIO GPFS cluster

To add all the I/O nodes to the bgIO cluster, follow these steps:

1. Create a node definition file (ionodes) that contains a list of the I/O nodes to be added (their IP labels), as shown in Example 5-25.

Note: You can choose not to add all I/O nodes to the bgIO cluster. However, this means that certain I/O nodes will not be able to access the GPFS file systems. This is acceptable if you use manual block allocation or if the job submission system (automated scheduler) can be made aware of this configuration.



Example 5-25 Node definition file for I/O nodes

bglsn:/mnt/chriss/gpfs/BGL # cat /tmp/ionodes
ionode1
ionode2
ionode3
ionode4
ionode5
ionode6
ionode7
ionode8

2. Before you add the nodes to the bgIO cluster, ensure that a block is booted that contains all the I/O nodes you want to use with GPFS. Then add the nodes to the bgIO cluster using the mmaddnode command, as shown in Example 5-26.

Example 5-26 Adding the I/O nodes to the bgIO cluster

bglsn:/mnt/chriss/gpfs/BGL # /usr/lpp/mmfs/bin/mmaddnode -n /tmp/ionodes
Mon Mar 20 15:13:34 EST 2006: mmaddnode: Processing node ionode1
Mon Mar 20 15:13:35 EST 2006: mmaddnode: Processing node ionode2
Mon Mar 20 15:13:36 EST 2006: mmaddnode: Processing node ionode3
Mon Mar 20 15:13:37 EST 2006: mmaddnode: Processing node ionode4
Mon Mar 20 15:13:38 EST 2006: mmaddnode: Processing node ionode5
Mon Mar 20 15:13:39 EST 2006: mmaddnode: Processing node ionode6
Mon Mar 20 15:13:40 EST 2006: mmaddnode: Processing node ionode7
Mon Mar 20 15:13:41 EST 2006: mmaddnode: Processing node ionode8
mmaddnode: Command successfully completed
mmaddnode: Propagating the changes to all affected nodes.
This is an asynchronous process.

Attention: If some of the nodes are not available (not booted, bad network connection, ssh not configured, and so forth), you need to correct the situation and retry the mmaddnode command with the respective nodes.

Booting a block and checking automatic mount of the GPFS-FS
You now have to start GPFS on the newly added nodes. The best way to do this is to de-allocate the block and then allocate it again. In this way, you can check that the file system (/bubu) is mounted on the I/O nodes automatically. Example 5-27 shows the check performed for a small block that has just two I/O nodes, using mmcs_db_console.



Example 5-27 Using mmcs_db_console to check the GPFS file system on I/O nodes

mmcs$ allocate R000_J106_32
OK
mmcs$ select_block R000_J106_32
OK
mmcs$ redirect R000_J106_32 on
OK
mmcs$ {i} write_con df | grep bubu
OK
mmcs$ Apr 04 16:09:20 (I) [1083225312] {17}.0: d
Apr 04 16:09:20 (I) [1083225312] {17}.0: f | grep bubu
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu
$
Apr 04 16:09:20 (I) [1083225312] {0}.0: d
Apr 04 16:09:20 (I) [1083225312] {0}.0: f | grep bubu
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu
$

Finally, Figure 5-11 shows the bgIO cluster with the I/O nodes active and the GPFS file system mounted. The block is now ready for running jobs.

Figure 5-11 GPFS file system cross mounted to bgIO and I/O nodes added



5.3.9 GPFS problem determination methodology

The methodology presented in this section is intended to help with a wide variety of problems. It is set out in sections that first deal with the SN and then with the bgIO cluster. It is intended that you work through the methodology in the order that we present it and, if a check passes, proceed to the next check until you find the problem. If you think you already know in which area the problem lies, then by all means go straight to that section. If, however, you are unsure of exactly where the problem lies, use the methodology in the order presented, because this often uncovers the simplest problems quickly and easily, before you spend a long time looking for a solution to an assumed problem rather than the one you actually have.

Checking that the GPFS-FS can be mounted on the SN
Here is the methodology. All the commands in this section should be run on the SN as root.

► Check that GPFS is started on the SN, as described in "Checking that the GPFS is started" on page 255.
► Check the GPFS log files for problems, as described in "Checking the GPFS log files for problems" on page 256.
► Check that the GPFS-FS can mount on the SN, as described in "Checking that the GPFS-FS can mount on the SN" on page 257.
► Check that the GPFS-FS can mount on the I/O nodes, as described in "Checking that the GPFS-FS can mount on the I/O nodes" on page 257.
► Check that the GPFS-FS is configured on the SN, as described in "Checking that the file system is configured on the SN" on page 259.
► Check that the SN is authorized to mount the GPFS-FS, as described in "Checking that the SN is authorized to mount the GPFS-FS" on page 261.

Checking that the GPFS-FS can be mounted on the gpfsNSD cluster
Here is the methodology. You should run all of the commands in this section on one of the gpfsNSD cluster nodes as root.

► Check that GPFS is started on all nodes of the gpfsNSD cluster, as described in "Checking that the GPFS is started" on page 255.
► Check that the GPFS-FS is configured on the gpfsNSD cluster, as described in "Checking that GPFS-FS is configured on the gpfsNSD cluster" on page 261.
► Check that the GPFS-FS can mount on the gpfsNSD cluster, as described in "Checking that GPFS-FS can mount on the gpfsNSD cluster" on page 262.
► Check that the GPFS-FS disks are available on gpfsNSD, as described in "Checking that the GPFS-FS disks are available on gpfsNSD" on page 263.
► Check that the bgIO cluster is authorized to mount the GPFS-FS, as described in "Checking that bgIO cluster is authorized to mount the GPFS-FS" on page 263.

5.3.10 GPFS Checklists

This section includes the checklists for the GPFS checks referenced above.

Checking that the GPFS is started
Use the mmgetstate command to check whether GPFS is started on either the SN or any node in the gpfsNSD cluster. Example 5-28 shows three (all) active nodes in our gpfsNSD cluster.

Example 5-28 Checking GPFS node status

[p630n01][/]> mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 p630n01_fn active
2 p630n02_fn active
3 p630n03_fn active

The same command was run on the SN and shows that GPFS is started on the SN (see Example 5-29).

Example 5-29 The mmgetstate command on the SN

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 bglsn_fn active

If this command shows that GPFS is not active on all nodes, you need to start GPFS on all nodes using the mmstartup -a command on any node in the cluster. If only some of the nodes are inactive, start them individually as follows (in this case we started GPFS on the SN only):

bglsn:/tmp # mmstartup
Wed Mar 29 17:40:27 EST 2006: mmstartup: Starting GPFS ...



Checking the GPFS log files for problems
The GPFS log files for nodes with local storage are kept in the /var/adm/ras directory. The latest log is named mmfs.log.latest and shows the history of the messages since the last time that GPFS was started on that node.

Example 5-30 shows the latest GPFS log file on the SN. The 'mmfsd ready' line shows that GPFS was functioning properly on this node. The log also shows that the remote file system (local device bubu_gpfs1) has been mounted from the remote cluster known as gpfsNSD.

Example 5-30 Latest GPFS log on SN

bglsn:/var/adm/ras # cat mmfs.log.latest
Wed Mar 29 17:40:27 EST 2006 runmmfs starting
Removing old /var/adm/ras/mmfs.log.* files:
Unloading modules from /usr/lpp/mmfs/bin
Loading modules from /usr/lpp/mmfs/bin
Module Size Used by
mmfslinux 268384 1 mmfs
tracedev 35552 2 mmfs,mmfslinux
Removing old /var/mmfs/tmp files:
Wed Mar 29 17:40:30 2006: mmfsd initializing. {Version: 2.3.0.10
Built: Jan 16 2006 13:07:54} ...
Wed Mar 29 17:40:30 2006: OpenSSL library loaded
Wed Mar 29 17:40:30 EST 2006 /var/mmfs/etc/gpfsready invoked
Wed Mar 29 17:40:30 2006: mmfsd ready
Wed Mar 29 17:40:30 EST 2006: mmcommon mmfsup invoked
Wed Mar 29 17:40:30 EST 2006: /var/mmfs/etc/mmfsup.scr invoked
Wed Mar 29 17:40:30 EST 2006: mounting /dev/bubu_gpfs1
Wed Mar 29 17:40:31 2006: Waiting to join remote cluster p630n01_fn
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.31 p630n01_fn
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.31 p630n01_fn
Wed Mar 29 17:40:31 2006: Joined remote cluster gpfsNSD
Wed Mar 29 17:40:31 2006: Command: mount bubu_gpfs1
Wed Mar 29 17:40:31 2006: Connecting to 172.30.1.32 p630n02_fn
Wed Mar 29 17:40:31 2006: Connected to 172.30.1.32 p630n02_fn
Wed Mar 29 17:40:32 2006: Command: err 0: mount p630n01_fn:gpfs1
Wed Mar 29 17:40:32 EST 2006: finished mounting /dev/bubu_gpfs1

If you are experiencing problems with GPFS starting or not mounting the file system, then this is a good place to look.

GPFS log files for the I/O nodes are usually found under the following directory on the SN (or as specified in the $BGL_DISTDIR/etc/rc.d/init.d/gpfs script):

/bgl/gpfsvar//var/adm/ras
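The empty path component presumably stands for a per-I/O-node subdirectory (see the directories created in Example 5-7). A hedged way to check the most recent entries for one I/O node, assuming the per-node component matches the node's host name and that the log file follows the same mmfs.log.latest naming convention as on the SN (both assumptions, not statements from this guide):

IONODE=ionode1            # hypothetical node name from our examples
tail -n 20 /bgl/gpfsvar/$IONODE/var/adm/ras/mmfs.log.latest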



Checking that the GPFS-FS can mount on the SN
If the remote GPFS file system is not mounted, you need to identify it first. To find the remote file system to be mounted, use the mmremotefs command, and then try to mount it, as shown in Example 5-31.

Example 5-31 Checking remote file systems

bglsn:/tmp # mmremotefs show all
Local Name  Remote Name  Cluster name  Mount Point  Mount Options  Automount
bubu_gpfs1  gpfs1        p630n01_fn    /bubu        rw             yes
bglsn:/tmp # mount /bubu
mount: /dev/bubu_gpfs1 already mounted or /bubu busy
mount: according to mtab, /dev/bubu_gpfs1 is already mounted on /bubu

In our example, the /bubu file system was already mounted.

Checking that the GPFS-FS can mount on the I/O nodes

Note: Only attempt this test if the GPFS-FS can be mounted on the SN (see the previous check).

First, boot a block using the mmcs_db_console and check that GPFS has started on the nodes (Example 5-32).

Example 5-32 Checking GPFS on I/O nodes

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ list_blocks
OK
mmcs$ allocate R000_128
OK
mmcs$ quit
OK
mmcs_db_console is terminating, please wait...
mmcs_db_console: closing database connection
mmcs_db_console: closed database connection
mmcs_db_console: closing console port
mmcs_db_console: closed console port
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate -a
Node number Node name GPFS state
-----------------------------------------
1 bglsn_fn active
2 ionode1 active
3 ionode2 active
4 ionode3 active
5 ionode4 active
6 ionode5 active
7 ionode6 active
8 ionode7 active
9 ionode8 active
bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #

Then, connect using ssh to an I/O node and try to mount the GPFS file system on the I/O node, as shown in Example 5-33.

Example 5-33 Checking GPFS-FS can mount on I/O nodes

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode4
BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 7931 1644 5878 22% /
/dev/root 7931 1644 5878 22% /
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl
172.30.1.33:/bglscratch 36700160 71776 36628384 1% /bglscratch
$ /usr/lpp/mmfs/bin/mmgetstate
Node number Node name GPFS state
-----------------------------------------
5 ionode4 active
$ mount /bubu
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
rootfs 7931 1644 5878 22% /
/dev/root 7931 1644 5878 22% /
172.30.1.1:/bgl 9766544 2450872 7315672 26% /bgl
172.30.1.33:/bglscratch 36700160 71776 36628384 1% /bglscratch
/dev/bubu_gpfs1 1138597888 3204096 1135393792 1% /bubu



If you are unable to connect through ssh to an I/O node, or if the mmgetstate -a command shows that no I/O nodes are active, you need to investigate your sitefs script (see also Appendix B, "The sitefs file" on page 423) and ensure that the following two important steps have been executed:

► The following line was added to the /bgl/dist/etc/rc.d/init.d/sitefs file:

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

► The following symbolic links were added to ensure that sitefs is called at bootup:

bglsn:/bgl/dist/etc/rc.d/rc3.d # ls -als
total 0
0 drwxr-xr-x 2 root root 112 Mar 27 15:16 .
0 drwxr-xr-x 4 root root 96 Mar 27 14:52 ..
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 K90sitefs -> ../init.d/sitefs
0 lrwxrwxrwx 1 root root 16 Mar 27 14:52 S10sitefs -> ../init.d/sitefs
bglsn:/bgl/dist/etc/rc.d/rc3.d #

Checking that the file system is configured on the SN
To check that GPFS is configured, we ran both the mmlsconfig and the mmlscluster commands. Example 5-34 shows the following important information from the mmlsconfig command:

► Cluster name - bgIO.itso.ibm.com
► pagepool - we have set it to 128M (the maximum recommended value)
► cipherList set to AUTHONLY

Example 5-34 The mmlsconfig output<br />

bglsn:/tmp # mmlsconfig<br />

Configuration data for cluster bgIO.itso.ibm.com:<br />

------------------------------------------------clusterName<br />

bgIO.itso.ibm.com<br />

clusterId 12402351528774401789<br />

clusterType lc<br />

multinode yes<br />

autoload yes<br />

useDiskLease yes<br />

maxFeatureLevelAllowed 813<br />

cipherList AUTHONLY<br />

pagepool 128M<br />

File systems in cluster bgIO.itso.ibm.com:<br />

------------------------------------------<br />

(none)<br />
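If pagepool is not set as expected, it can be changed with the mmchconfig command. The following is a hedged sketch only (the 128M value is the maximum recommended for this configuration, and GPFS typically has to be restarted on the affected nodes before a new pagepool value takes effect in this release):<br />
mmchconfig pagepool=128M    # set the page pool size for the cluster nodes<br />
mmshutdown -a               # restart GPFS so that the new value takes effect<br />
mmstartup -a<br />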



Example 5-35 shows the output of the mmlscluster command and displays the<br />

following relevant information:<br />

► Cluster name - on the SN this is bgIO.itso.ibm.com<br />
► Remote shell command - on the SN this must be /usr/bin/ssh<br />
► Remote copy command - on the SN this must be /usr/bin/scp<br />
► The SN is the only quorum node<br />

Example 5-35 The mmlscluster output<br />

bglsn:/tmp # mmlscluster<br />

GPFS cluster information<br />

========================<br />

GPFS cluster name: bgIO.itso.ibm.com<br />

GPFS cluster id: 12402351528774401789<br />

GPFS UID domain: bgIO.itso.ibm.com<br />

Remote shell command: /usr/bin/ssh<br />

Remote file copy command: /usr/bin/scp<br />

GPFS cluster configuration servers:<br />

-----------------------------------<br />

Primary server: bglsn_fn.itso.ibm.com<br />

Secondary server: (none)<br />

Node number Node name IP address Full node name<br />

Remarks<br />

-----------------------------------------------------------------------<br />

--------<br />

1 bglsn_fn 172.30.1.1 bglsn_fn.itso.ibm.com<br />

quorum node<br />

2 ionode1 172.30.2.1 ionode1<br />

3 ionode2 172.30.2.2 ionode2<br />

4 ionode3 172.30.2.3 ionode3<br />

5 ionode4 172.30.2.4 ionode4<br />

6 ionode5 172.30.2.5 ionode5<br />

7 ionode6 172.30.2.6 ionode6<br />

8 ionode7 172.30.2.7 ionode7<br />

9 ionode8 172.30.2.8 ionode8<br />

Example 5-34 and Example 5-35 also reveal that GPFS is configured correctly. If<br />

you have problems with GPFS cluster configuration, refer to the installation<br />

instructions and to GPFS manuals (see 5.3.11, “References” on page 264).<br />



Checking that the SN is authorized to mount the GPFS-FS<br />

To check that the SN is set up to mount the remote GPFS file system, use the mmremotefs command, as shown in Example 5-36.<br />

Example 5-36 Checking authorized access for the remote file system<br />

bglsn:/tmp # mmremotefs show all<br />

Local Name Remote Name Cluster name Mount Point Mount Options<br />

Automount<br />

bubu_gpfs1 gpfs1 gpfsNSD /bubu rw<br />

yes<br />

This output shows that the SN is configured to mount the remote file system gpfs1 from the cluster named gpfsNSD as the local device bubu_gpfs1 on /bubu, for both read and write operations. If this is not correct, use the mmremotefs command on the SN (and the mmauth command on the storage cluster) to fix it; see also 5.3.8, “Cross mounting the GPFS file system on to Blue Gene/L cluster” on page 246, and the sketch that follows.<br />
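For reference, the remote mount definition on the SN side could be (re)created with mmremotefs along the following lines. This is a hedged sketch based on the values shown in Example 5-36; verify the exact syntax against the GPFS 2.3 Administration and Programming Reference:<br />
mmremotefs add bubu_gpfs1 -f gpfs1 -C gpfsNSD -T /bubu -A yes    # local device, remote device, owning cluster, mount point<br />
mmremotefs show all                                              # confirm the definition<br />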

Checking that GPFS-FS is configured on the gpfsNSD cluster<br />

To check that GPFS is configured on a cluster we run both the mmlsconfig<br />

command and the mmlscluster command. Example 5-37 shows the following<br />

important information from the mmlsconfig command:<br />

► Cluster name - gpfsNSD<br />

► pagepool - if not shown, it means it is set to default (64 MB)<br />

► cipherList set to AUTHONLY<br />

► local file system device name is /dev/gpfs1<br />

Example 5-37 mmlsconfig output on the gpfsNSD cluster<br />

[p630n01][/]> mmlsconfig<br />

Configuration data for cluster gpfsNSD:<br />

-----------------------------------------<br />
clusterName gpfsNSD<br />

clusterId 12402351657622744194<br />

clusterType lc<br />

multinode yes<br />

autoload no<br />

useDiskLease yes<br />

maxFeatureLevelAllowed 813<br />

cipherList AUTHONLY<br />

[p630n01_fn]<br />

File systems in cluster gpfsNSD:<br />

-----------------------------------<br />

/dev/gpfs1<br />



We also ran the mmlscluster command. Example 5-38 shows the following relevant information:<br />

► Cluster name - gpfsNSD<br />

► Remote shell command - /usr/bin/ssh<br />

► Remote copy command - /usr/bin/scp<br />

► Configured nodes - in our cluster all three nodes participate in quorum<br />

decisions<br />

Example 5-38 The mmlscluster output on the gpfsNSD cluster<br />

[p630n01][/]> mmlscluster<br />

GPFS cluster information<br />

========================<br />

GPFS cluster name: gpfsNSD<br />

GPFS cluster id: 12402351657622744194<br />

GPFS UID domain: gpfsNSD<br />

Remote shell command: /usr/bin/ssh<br />

Remote file copy command: /usr/bin/scp<br />

GPFS cluster configuration servers:<br />

-----------------------------------<br />

Primary server: p630n01_fn<br />

Secondary server: p630n02_fn<br />

Node number Node name IP address Full node name Remarks<br />

-----------------------------------------------------------------------------------<br />

1 p630n01_fn 172.30.1.31 p630n01_fn quorum node<br />

2 p630n02_fn 172.30.1.32 p630n02_fn quorum node<br />

3 p630n03_fn 172.30.1.33 p630n03_fn quorum node<br />

As Example 5-38 shows, GPFS is configured correctly. If you have problems with GPFS cluster configuration, refer to the installation and problem determination instructions found in the GPFS manuals (see 5.3.11, “References” on page 264).<br />

Checking that GPFS-FS can mount on the gpfsNSD cluster<br />
To mount the GPFS file system on the gpfsNSD cluster, check the locally configured file system device name from “Checking that GPFS-FS is configured on the gpfsNSD cluster” on page 261, and then use the mount command:<br />

[p630n01][/]> mount /dev/gpfs1<br />

GPFS: 6027-514 Cannot mount /dev/gpfs1 on /gpfs1: Already mounted.<br />

As you can see, in this case the file system was already mounted.<br />



Checking that the GPFS-FS disks are available on gpfsNSD<br />

Use the mmlsdisk command to check that the disks belonging to the GPFS-FS<br />

are available and sane. This is only required if the file system cannot be locally<br />

mounted on the gpfsNSD cluster (Example 5-39).<br />

Example 5-39 Checking disk availability<br />

[p630n01][/]> mmlsdisk /dev/gpfs1<br />

disk driver sector failure holds holds<br />

name type size group metadata data status<br />

availability<br />

------------ -------- ------ ------- -------- ----- -------------<br />

------------<br />

GPFS1_n01_b nsd 512 1 yes yes ready up<br />

GPFS2_n01_a nsd 512 1 yes yes ready up<br />

GPFS3_n02_b nsd 512 2 yes yes ready up<br />

GPFS4_n02_a nsd 512 2 yes yes ready up<br />

As you can see, in this case all the disks allocated for the GPFS-FS are ready<br />

and available. If some of these disks were unavailable (“down”) this would<br />

prevent the GPFS-FS from mounting. To fix this problem, first check that all disks<br />

are properly connected to the servers and available to the operating system, then<br />

use the mmchdisk command to recover the disks to the ready state.<br />
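For example, a down disk could be brought back with something like the following (a sketch only; the device and disk name are taken from Example 5-39):<br />
mmchdisk /dev/gpfs1 start -d "GPFS1_n01_b"    # restart the named NSD<br />
mmlsdisk /dev/gpfs1                           # confirm that it is ready and up again<br />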

Checking that the bgIO cluster is authorized to mount the GPFS-FS<br />

Use the mmauth show all command to check whether the bgIO cluster is authorized to mount the GPFS-FS, as shown in Example 5-40.<br />

Example 5-40 Checking file system is “exported” on storage cluster (gpfsNSD)<br />

[p630n01][/]> mmauth show all<br />

Cluster name: bgIO.itso.ibm.com<br />

Cipher list: AUTHONLY<br />

SHA digest: 22813e0fa7f4aa76982cd33cf705ee8c085b21a0<br />

File system access: gpfs1 (rw, root allowed)<br />

Cluster name: gpfsNSD (this cluster)<br />

Cipher list: AUTHONLY<br />

SHA digest: d02d0d706c8f7f14ce6366e1d6fe8a1a217ae1c5<br />

File system access: (all rw)<br />

As you can see, in this case the bgIO cluster named bgIO.itso.ibm.com is<br />

authorized to mount the locally created GPFS-FS device called gpfs1.<br />
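If this access were missing, it would have to be granted from the storage cluster (gpfsNSD). The following is a hedged sketch only, assuming the mmauth grant syntax of this GPFS release; check the GPFS manuals before running it:<br />
mmauth grant bgIO.itso.ibm.com -f gpfs1 -a rw    # allow the bgIO cluster read/write access to gpfs1<br />
mmauth show all                                  # verify the result<br />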



5.3.11 References<br />

For more information about GPFS commands and concepts, refer to the GPFS V2.3 documentation, which is also available from the Cluster Information Center:<br />

► Concepts, Planning, and Installation <strong>Guide</strong>, GA22-7968-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html<br />
► Administration and Programming Reference, SA22-7967-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html<br />
► <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>, GA22-7969-02<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html<br />
► Documentation updates<br />
http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23_doc_updates/docerrata.html<br />



Chapter 6. Scenarios<br />


This chapter contains a variety of problem determination scenarios that we<br />

captured while running on the Blue Gene/L system that we used in the<br />

development of this redbook. We constructed each scenario based on the<br />

problem determination methodology that we discuss throughout the book.<br />

We approach each problem with similar patterns:<br />

► A description of the problem<br />

► Detailed checks showing how related problems can be revealed<br />

► How to resolve the problem or to transfer the problem to other scenarios<br />

► Lessons learned<br />



6.1 Introduction<br />

In some scenarios, we intentionally inject an error that we assume could happen in real life and cause problems. Creating these scenarios helps to hone the problem determination procedures, and if we can create the error, there is a good chance that a similar problem can occur in the field. The following list contains the error injection scenarios that are presented in this chapter.<br />

Due to design and usability considerations, we have divided a Blue Gene/L<br />

environment into core system and functional components, and the scenarios are<br />

grouped into categories of problems in:<br />

► Blue Gene/L core system<br />

– Hardware (cards, power supplies, cables and so forth)<br />

– Software (DB2 and processes)<br />

– System configuration (remote command execution: ssh, rsh, NFS)<br />

► File system (NFS and GPFS)<br />

► Job submission (mpirun and LoadLeveler)<br />

Depending on past experience and the actual situation, there is usually more than one way to approach a problem. Our intention is that our problem determination methodology leads you to one of these categories. At the beginning of a category, scanning through the list of scenarios can help you spot a similar (if not identical) problem.<br />

When describing the problem at hand, we try not to give away its cause. However, having a hunch about what has just happened or gone wrong can serve as a starting point. This starting point is chosen from the multiple hypotheses listed in the problem description section.<br />

The starting point leads to a checklist. The checklists that we discuss in this chapter are specific to each category. This time, with a problem at hand, detailed and specific checks are performed at every step of identifying the problem, based on the checklist.<br />

At the end of each scenario, the problem might be resolved or just identified.<br />

Otherwise, some pointer is provided to aid in transferring the problem<br />

determination to another scenario. Because all of this process is carried out on a<br />

new system that has just been set up, we might run into pitfalls and unexpected<br />

findings. These findings are included in the section on what we have learned.<br />

When a new problem is discovered, we use the same methodology. A new<br />

scenario is created and added into a category. Odd problems are gathered under<br />

the miscellaneous scenarios.<br />



6.2 Blue Gene/L core system scenarios<br />

This section looks into problem scenarios that will affect the core Blue Gene/L<br />

components. The core system consists of:<br />

► Blue Gene/L racks<br />

► Functional Network<br />

► Service Network<br />

► Service Node<br />

► Blue Gene/L Database (running on Service Node)<br />

► Blue Gene/L system processes<br />

Here is a list of the scenarios that we tested for the core Blue Gene/L system:<br />

1. Hardware error: Compute card error<br />

2. Functional network: Defective cable<br />

3. Service network: Defective cable<br />

4. Service Node functional network interface down<br />

5. SN service network interface down<br />

6. The /bgl file system full on the SN (no GPFS)<br />

7. The / file system full on the SN<br />

8. The /tmp file system is full on the SN<br />

9. The ciodb daemon is not running on the SN<br />

10. The idoproxy daemon not running on the SN<br />
11. The mmcs_server is not running on the SN<br />
12. DB2 not started on the SN<br />
13. The bglsysdb user OS password changed (Linux)<br />
14. Uncontrolled rack power off<br />

In each scenario, we follow the same process. First, we check that the system is currently operational. We do this by allocating a block in mmcs_db_console, using submit_job to run the hello.rts application, and using free_block to de-allocate the block. After we have proved the system works, we inject the scenario that we want to test and then try the job submission again. This method should trigger the problem, which we then investigate.<br />
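A typical verification session looks roughly like the following. This is a sketch only: the block name matches our one-rack system, and the submit_job arguments (executable and working directory) are illustrative rather than an exact transcript:<br />
mmcs$ allocate R000_128<br />
OK<br />
mmcs$ submit_job /bgl/hello/hello.rts /bgl/hello<br />
OK<br />
mmcs$ free_block R000_128<br />
OK<br />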

Each of the scenarios is split into the following sections.<br />

► Error injection<br />

► <strong>Problem</strong> determination<br />

► Lessons learned<br />



6.2.1 Hardware error: Compute card error<br />

In this scenario, we replaced a compute card in a node card with a compute card that has a defective chip.<br />

Error injection<br />

Power off the rack, and replace a compute card with the faulty compute card.<br />

The discovery process successfully detected the compute card and populated<br />

the DB2 database.<br />

<strong>Problem</strong> determination<br />

Because all resources are available, we try and allocate a block which includes<br />

the faulty compute card. This fails with the message shown in Example 6-1.<br />

Example 6-1 A node fails to boot<br />

mmcs$ allocate R000_J104_32<br />

FAIL<br />

Microloader Assertion<br />

Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />

KERNEL FATAL: Microloader Assertion<br />

Apr 01 16:55:06 (E) [1088451808] test1:R000_J104_32 RAS event:<br />

KERNEL FATAL: VALIDATE_LOAD_IMAGE_CRC_IN_DRAM<br />

Because the log file mentions a RAS event, we then look at the RAS event Web page (shown in Figure 6-1).<br />

Figure 6-1 RAS event indicates a location of the faulty card<br />

From the RAS event we find the location of the failing compute card -<br />

R00-M0-N1-CJ16-U01.<br />



Lessons learned<br />

The error occurred while manipulating a block and was recorded in the mmcs server log file. The RAS event shows detailed information that was not included in the log file.<br />

6.2.2 Functional network: Defective cable<br />

In this scenario, we simulate a cable failure on the functional network. We do this<br />

by removing one external ethernet connection from a Node Card.<br />

Error injection<br />

We physically pulled one of the cables out from the front of R00-M0-N2.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try and allocate the block. It fails with:<br />

mmcs$<br />

no ethernet link<br />

Looking in the bglsn-bgdb0-mmcs_db_server-current.log (latest<br />

mmcs_db_server.log) we see:<br />

Mar 09 13:57:53 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />

FATAL: no ethernet link<br />

We can see a RAS event has been raised, so we look for further evidence in the<br />

RAS database. It shows the same message with a location code:<br />

R00-M0-N2-I:J18-U01. We then look at the physical hardware and find the fault.<br />

Lessons learned<br />

The functional network is essential to the operation of Blue Gene/L. The system will not operate even with one link disabled.<br />

6.2.3 Service network: Defective cable<br />

In this scenario, we simulate a cable failure on the service network. We do this by removing a connection to one of the external ports on the Service Card.<br />

Error injection<br />

We physically pulled the GBit cable out from the front of the Service Card.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try and allocate the block. It fails with the message shown in Example 6-2.<br />



Example 6-2 mmcs message: service card link failure<br />

mmcs$<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Looking in the latest mmcs_db_server.log we find the message shown in<br />

Example 6-3.<br />

Example 6-3 Message from mmcs_db_server log<br />

Mar 09 14:43:32 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

We conclude that we cannot talk to the Service Card over the service network<br />

and fix the problem by replacing the cable.<br />

We then went on to see what happens if we plug the service network GBit cable back in and instead pull the IDo link that connects the Service Card to the service network. We found that the block booted and the application ran normally.<br />
We then moved on to using Discovery. Nothing happened when we unplugged the Service Network port marked ‘IDo’ while discovery was running. When we did the same for the GBit port, the hardware was marked M (missing) in the database.<br />

Lessons learned<br />

We learned the following lessons:<br />

► The Service network is needed to boot a block. However, booting a block<br />

does not use the IDo port on the Service Card.<br />

► A simple ping to the Service Card does not work as a method to see whether it is alive (because the card does not use IP/ICMP communication).<br />

► The idoproxy uses the GBit port on the front of the Service Card, as does the<br />

Discovery process.<br />

► We observed that the System Controller uses the IDo link to initialize the<br />

Service Card, after it initializes everything to use the GBit connection.<br />



6.2.4 Service Node functional network interface down<br />

In this scenario we simulate the functional network interface being disabled or<br />

removed from the Service Node (SN).<br />

Error injection<br />

We have used the following command on the SN to disable the functional<br />

network interface:<br />

bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth3 down<br />

<strong>Problem</strong> determination<br />

When we try and boot the block we get the following message:<br />

mmcs$<br />

Error:unable to mount filesystem.<br />

Looking in the latest mmcs_db_server.log we see:<br />

Mar 09 17:37:43 (E) [1086973152] root:R000_128 RAS event: KERNEL<br />

FATAL: Error: unable to mount filesystem<br />

This message is reported for every I/O node. We know that the I/O nodes mount NFS from the SN. The SN checklist directs us to look at the network configuration, which shows us that eth3 is disabled.<br />
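A minimal sketch of this check and the fix (the interface name eth3 is specific to our SN):<br />
ip link show eth3     # the interface is reported DOWN<br />
ifup eth3             # bring the functional network interface back up<br />
ip addr show eth3     # confirm that the IP address is configured again<br />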

Lessons learned<br />

The NFS file system is mounted on the I/O nodes over the functional network and is required to run jobs. Therefore, the functional network is required to run an application.<br />

6.2.5 SN service network interface down<br />

In this scenario we simulate the service network interface being disabled or<br />

removed from the SN.<br />

Error injection<br />

We have used the following command on the SN to disable the service network<br />

interface:<br />

bglsn:/bgl/BlueLight/logs/BGL # ifconfig eth0 down<br />



<strong>Problem</strong> determination<br />

When we booted the block the message shown in Example 6-4 appears in the<br />

mmcs_db_console.<br />

Example 6-4 The mmcs message: no network connection to service card<br />

mmcs$<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Looking in the latest mmcs_db_server.log, we find the message shown in<br />

Example 6-5.<br />

Example 6-5 Failed connection message recorded mmcs_db_server log<br />

Mar 09 17:04:02 (I) [1084843232] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

Using the knowledge gained from previous scenarios we deduce this is a Service<br />

Network fault. Using the methodology we check the network interfaces on the SN<br />

as shown in Example 6-6.<br />

Example 6-6 Checking interface status<br />

bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />

1: lo: mtu 16436 qdisc noqueue<br />

link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />

inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />

inet6 ::1/128 scope host<br />

valid_lft forever preferred_lft forever<br />

2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />

link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />

inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />

In this case, eth0 is not up on the SN. We bring it back up using the ifup<br />

command, and check again, as in Example 6-7.<br />

Example 6-7 Bringing up the Service Network interface (eth0)<br />

bglsn:/bgl/BlueLight/logs/BGL # ifup eth0<br />

bglsn:/bgl/BlueLight/logs/BGL # ip ad<br />

1: lo: mtu 16436 qdisc noqueue<br />



link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00<br />

inet 127.0.0.1/8 brd 127.255.255.255 scope host lo<br />

inet6 ::1/128 scope host<br />

valid_lft forever preferred_lft forever<br />

2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000<br />

link/ether 00:0d:60:4d:28:ea brd ff:ff:ff:ff:ff:ff<br />

inet 10.0.0.1/16 brd 10.0.255.255 scope global eth0<br />

inet6 fe80::20d:60ff:fe4d:28ea/64 scope link<br />

valid_lft forever preferred_lft forever<br />

After bringing up the interface we can boot the block.<br />

Lessons learned<br />

The service network is required to boot a block because the idoproxy sends the<br />

microloader through the Service Network (IDo bridge).<br />

6.2.6 The /bgl file system full on the SN (no GPFS)<br />

This is a scenario to see what happens when /bgl is 100% full on the SN.<br />

Note: In this scenario we do not use GPFS.<br />

Error injection<br />

Use dd to create a huge file in /bgl on the SN that fills the file system up to 100%.<br />

A similar error would appear if you run out of inodes on the same /bgl file system.<br />
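A sketch of this injection and the corresponding checks (the paths are those used on our SN; the file name is illustrative):<br />
dd if=/dev/zero of=/bgl/bigfile bs=1M    # runs until /bgl is 100% full<br />
df /bgl                                  # confirm 100% use<br />
df -i /bgl                               # the same symptom appears if inodes run out<br />
rm /bgl/bigfile                          # clean up after the test<br />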

<strong>Problem</strong> determination<br />

We submitted the job but did not receive its output. We got the message shown in Example 6-8.<br />

Example 6-8 Job error due to /bgl file system full<br />

mmcs$ list_jobs<br />

OK<br />

JOBID STATUS USERNAME BLOCKID<br />

EXECUTABLE<br />

28 E root R000_128<br />

/bgl/hello/hello.rts<br />

The job is in an error state. We look at the output for the job:<br />

could not open /bgl/hello/R000_128-28.stdout: No space left on<br />

device<br />



Looking in the system logs we see the ciodb records shown in Example 6-9.<br />

Example 6-9 Extract from ciodb records revealing file system full<br />

Mar 13 14:35:35 (I) [1074048864] Starting Job 28<br />

Mar 13 14:35:35 (I) [1079563488] New thread 1079563488, for jobid 28<br />

Mar 13 14:35:35 (I) [1079563488] Jobid is 28, homedir is /bgl/hello<br />

Mar 13 14:35:35 (E) [1079563488] 0x4058bea8<br />

Mar 13 14:35:35 (E) [1079563488] could not open<br />

/bgl/hello/R000_128-28.stdout: No space left on device<br />

Mar 13 14:35:35 (E) [1079563488] Job 28 set to START_ERROR, exit<br />

status= 255, errtext= could not open /bgl/hello/R000_128-28.stdout: No<br />

space left on device<br />

Mar 13 14:35:35 (I) [1079563488] cleanup job polling thread 1079563488<br />

Correlating the messages, we determine that the /bgl file system is full (this file<br />

system is used for this job’s output) by using df /bgl on the SN.<br />

Lessons learned<br />

We learned the following lessons:<br />

► We need space for jobs to write their output (stdout(1), stderr(2)).<br />

► ciodb records any problems while writing the output file. In Example 6-9, (I) means I/O node output and (E) means error. ciodb talks to the ciod daemons on the I/O nodes (which are doing the file I/O operations), and the ciod daemons report back that they are not able to write on the NFS file system (/bgl exported from the SN).<br />

6.2.7 The / file system full on the SN<br />

This is a scenario that tests what happens when / is 100% full on the SN.<br />

Error injection<br />

Use the dd command to create a huge file in / on the SN that fills the file system<br />

(100% reported by the df / command run on the SN).<br />

<strong>Problem</strong> determination<br />

We submitted the job and it ran OK. There were no error messages related to the job.<br />



Lessons learned<br />

We learned the following lessons:<br />

► The root (“/”) file system is not written to when running a job.<br />

► Having / full does not affect running jobs on Blue Gene/L. However, some OS<br />

functionality (Linux) and the database (DB2) will eventually have problems if<br />

they cannot write to ‘/’, and this in turn will have an effect on Blue Gene/L<br />

system processes.<br />

6.2.8 The /tmp file system is full on the SN<br />

This is a scenario to see what happens when /tmp is 100% full on the SN.<br />

Error injection<br />

Use dd to create a huge file in /tmp on the SN that fills the file system up to 100%.<br />

<strong>Problem</strong> determination<br />

Allocating a block fails with the following message:<br />

mmcs$ allocate R000_128<br />

DBBlockController::allocateBlock failed: invalid XML<br />

Looking in the latest mmcs_db_server.log we see the output shown in<br />

Example 6-10.<br />

Example 6-10 Message when allocating a block fails due to /tmp full<br />

Mar 13 12:12:29 (I) [1084843232] root allocate R000_128<br />

Mar 13 12:12:29 (I) [1084843232] root<br />

DBMidplaneController::addBlock(R000_128)<br />

Mar 13 12:12:32 (I) [1084843232] root:R000_128 allocate:<br />

FAIL;DBBlockController::allocateBlock failed: invalid XML<br />

Mar 13 12:12:32 (I) [1084843232] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

Mar 13 12:12:32 (I) [1084843232] root DBBlockController::disconnect()<br />

setBlockState(R000_128, FREE) successful<br />

Using the methodology, we are directed to check whether any file systems are full. We find /tmp 100% full. Freeing up space allows the block to allocate and the job to run. However, some other processes might also be affected (no pty can be created when /tmp is full), so a reboot in maintenance mode might be required for the SN.<br />



Lessons learned<br />

The /tmp directory is used when a block is manipulated. An image of the block in<br />

XML format is written to /tmp. You must have space in /tmp for Blue Gene/L to<br />

work.<br />

6.2.9 The ciodb daemon is not running on the SN<br />

In this scenario, we check what happens when the ciodb daemon is not running and whether this affects starting a job.<br />

Error injection<br />

Even though this scenario might seem unlikely, we considered that this could<br />

actually happen (ciodb daemon not running) due to a bad library (OS upgrade,<br />

Blue Gene/L driver update, and so on).<br />

First we tried to kill the ciodb daemon (kill -9 ciodb_pid), but the bglmaster<br />

daemon respawns the process. Then we decided to attach a debugger to the<br />

process and stop its execution, as shown in Example 6-11.<br />

Example 6-11 Attaching the debugger to the ciodb process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-12:29:03.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [30625]<br />

ciodb started [30626]<br />

mmcs_server started [30627]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=30626<br />

<strong>Problem</strong> determination<br />

We then submit a job and check it for a while, and we see that it just sits there in<br />

starting (S) state (see Example 6-12), JOBID #38.<br />

Example 6-12 Checking the job status<br />

mmcs$ list_blocks<br />

OK<br />

R000_128 root(1) connected<br />

mmcs$ list_jobs<br />

OK<br />



JOBID STATUS USERNAME BLOCKID<br />

EXECUTABLE<br />

38 S root R000_128<br />

/bgl/hello/hello.rts<br />

We further investigate the RAS events, but we cannot find anything relevant.<br />

We looked at the Runtime information in the Web browser and found that the status of job #38 was listed as “Ready to start” and the block was in the initialized (I) state.<br />

We further checked the Configuration information in the Web browser and found<br />

no disabled hardware (everything OK).<br />

Looking in the system logs for ciodb, we see the messages shown in<br />

Example 6-13.<br />

Example 6-13 The ciodb log messages<br />

03/14/06-12:29:03 ./startciodb STARTED<br />

03/14/06-12:29:03 RUN CIODB ./ciodb --useDatabase BGL --dbproperties<br />

db.properties --nortschecking<br />

03/14/06-12:29:03 logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-ciodb-2006-0314-12:29:03.log<br />

Mar 14 12:29:03 (I) [1074052960] ciodb[30626]: started: $Name:<br />

V1R2M1_020_2006 $<br />

Mar 14 12:29:03 (E) [1074052960] No running job records in the database<br />

We look further for the job with JOBID #38, but there is no record for this JOBID. Instead, we see the messages for the job with JOBID #37, which looks OK, as shown in Example 6-14.<br />

Example 6-14 Records from ciodb showing correct execution of Job #37<br />

Mar 14 12:27:02 (I) [1074052960] Starting Job 37<br />

Mar 14 12:27:02 (I) [1079567584] New thread 1079567584, for jobid 37<br />

Mar 14 12:27:02 (I) [1079567584] Jobid is 37, homedir is /bgl/hello/<br />

contacting control node 0 at 172.30.2.1:7000...ok<br />

contacting control node 1 at 172.30.2.2:7000...ok<br />

contacting control node 2 at 172.30.2.5:7000...ok<br />

contacting control node 3 at 172.30.2.6:7000...ok<br />

contacting control node 4 at 172.30.2.7:7000...ok<br />

contacting control node 5 at 172.30.2.8:7000...ok<br />

contacting control node 6 at 172.30.2.3:7000...ok<br />

contacting control node 7 at 172.30.2.4:7000...Mar 14 12:27:04 (I)<br />

[1079567584] Job loaded: 37<br />



Mar 14 12:27:04 (I) [1079567584] About to launch /bgl/hello/hello.rts<br />

Mar 14 12:27:04 (I) [1079567584] Job 37 set to RUNNING<br />

Mar 14 12:27:04 (E) [1079567584] Job 37 set to TERMINATED, exit status=<br />

0, errtext=<br />

Mar 14 12:27:04 (I) [1079567584] cleanup job polling thread 1079567584<br />

We can conclude that ciodb does not seem to be handling our job submission. At this point we suspect ciodb. We try to stop and restart the bglmaster (see Example 6-15).<br />

Example 6-15 Stopping and checking the bglmaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [30625]<br />

ciodb started [30626]<br />

mmcs_server started [30627]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # netstat -a | grep 32035<br />

tcp 0 0 localhost:32035 *:*<br />

LISTEN<br />

tcp 1 0 localhost:32035 localhost:42999<br />

CLOSE_WAIT<br />

tcp 0 0 localhost:42999 localhost:32035<br />

FIN_WAIT2<br />

First, we can see that the bglmaster does not stop: the bglmaster status command returns a socket timeout message for port localhost:32035. We then check with netstat -a | grep 32035 and see a connection in FIN_WAIT2 status.<br />



The connection should be closed on this port for the bglmaster to be able to<br />

finish the stopping request. We use the ps -ef command to see the status of the<br />

system processes, bglmaster, ciodb, idoproxy and mmcs_db_server, as shown<br />

in Example 6-16.<br />

Example 6-16 Checking the bglmaster process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster|grep -v grep<br />

root 30619 1 0 12:29 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />

--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />

/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />

root 21551 26793 0 14:59 pts/5 00:00:00 grep -i idoproxy<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />

root 30626 1 0 12:29 ? 00:00:00 [ciodb] &lt;defunct&gt;<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />

root 21601 26793 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />

We can see that idoproxy and mmcs_db_server are not running. However, because bglmaster would not stop and we still see a defunct ciodb process, we first kill the bglmaster and then the ciodb process:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 30619; kill -9<br />

30626<br />

Now that we have removed the bad processes, we can try and restart them, as<br />

shown in Example 6-17.<br />

Example 6-17 Restarting the bglmaster after cleanup<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-15:01:17.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [21809]<br />

ciodb started [21810]<br />

mmcs_server started [21811]<br />

monitor stopped<br />

perfmon stopped<br />

When the bglmaster started clean, our job (#38) that was stuck in (S)tarting<br />

status ran correctly.<br />



Lessons learned<br />

We learned the following lessons:<br />

► The ciodb daemon controls the submission of the job to the Blue Gene/L. If it<br />

is not running jobs do not start.<br />

► If your jobs are not running, it is a good idea to check your system’s Blue Gene/L daemons using the ps -ef command and to look for any &lt;defunct&gt; Blue Gene/L system processes (a sample check follows).<br />
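For example, a quick check could look like this (a sketch; the process names are those started by bglmaster on our SN):<br />
ps -ef | egrep 'bglmaster|idoproxy|ciodb|mmcs_server' | grep -v grep<br />
ps -ef | grep defunct | grep -v grep    # any zombie system process shows up here<br />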

6.2.10 The idoproxy daemon not running on the SN<br />

In this scenario, we stop the idoproxy daemon to check the impact on starting and running a job.<br />

Error injection<br />

Again, we use the same debugger technique as we did for ciodb: attach the gdb<br />

debugger to the idoproxy process, as in Example 6-18.<br />

Example 6-18 Attaching gdb to the idoproxy process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [12455]<br />

ciodb started [14394]<br />

mmcs_server started [14395]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ido<br />

root 12455 14324 0 10:42 ? 00:00:00 ./idoproxydb<br />

-enableflush -loguserinfo db.properties BlueGene1<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=12455<br />

<strong>Problem</strong> determination<br />

When we try to allocate the block we get the following error in the<br />

mmcs_db_console:<br />

connect: idoproxy communication failure: socket recv timeout<br />

The following message was also found in the mmcs server log:<br />

Mar 14 10:48:33 (I) [1084843232] root:R000_128 allocate:<br />

FAIL;connect: idoproxy communication failure: socket recv timeout<br />



No other message was found in any of the remaining system logs. Moreover,<br />

bglmaster shows everything is running, as shown in Example 6-19.<br />

Example 6-19 Bglmaster status with idoproxy stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [12455]<br />

ciodb started [14394]<br />

mmcs_server started [14395]<br />

monitor stopped<br />

perfmon stopped<br />

Because we suspect an issue with idoproxy (which is controlled by the<br />

bglmaster), we then try to restart the bglmaster:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />

Stopping BGLMaster<br />

....<br />

The restart command hangs while stopping the bglmaster, waiting for the socket on port 32035 (owned by idoproxy) to close the connection (so that it can finish the actual stopping process). However, this does not happen, so we have to open another terminal, kill the “hanging” process, and then restart the bglmaster, as in Example 6-20.<br />

Example 6-20 Recovering from a “hanging” idoproxy daemon<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i bglmaster| grep -v grep<br />

root 14324 1 0 Mar09 ? 00:00:00 ./BGLMaster --consoleip 127.0.0.1<br />

--consoleport 32035 --configfile bglmaster.init --autorestart y --db2profile<br />

/dbhome/bgdb2cli/sqllib/ db2profile --dbproperties db.properties<br />

## >><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at 127.0.0.1:32035<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i idoproxy|grep -v grep<br />

root 12455 1 0 12:29 ? 00:00:00 [idoproxy] &lt;defunct&gt;<br />



bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # kill -9 12455<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i ciodb |grep -v grep<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ps -ef | grep -i mmcs_server |grep -v grep<br />

root 14395 14395 0 15:00 pts/5 00:00:00 grep -i mmcs_server<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

## ><br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0406-17:07:53.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [8260]<br />

ciodb started [8261]<br />

mmcs_server started [8262]<br />

monitor stopped<br />

perfmon stopped<br />

Finally, we can now start mmcs_db_console and run a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► mmcs_db_console requires a connection to idoproxy to work.<br />

► If you are having problems with Blue Gene/L system processes (controlled by<br />

bglmaster), ensure that all old instances of the system processes are<br />

properly cleaned up before restarting bglmaster.<br />

6.2.11 The mmcs_server is not running on the SN<br />

In this scenario, we stop mmcs_db_server to check the impact on starting or<br />

running a job.<br />

Error injection<br />

We use the same technique as we used for ciodb and idoproxy (attach the<br />

debugger, then stop the process), as shown in Example 6-21.<br />



Example 6-21 Attaching the debugger to the mmcs_server process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [21809]<br />

ciodb started [26636]<br />

mmcs_server started [28222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/logs/BGL # gdb --pid=28222<br />

<strong>Problem</strong> determination<br />

We try to connect to the mmcs_server using the mmcs_db_console:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

The mmcs_db_console hangs and never returns a prompt (which means we<br />

cannot submit a job), thus we connect to the SN using another terminal, and start<br />

the problem determination.<br />

First, we checked the RAS events log, but we could not find anything relevant.<br />

Next, we looked at the “Runtime” and “Configuration” information in the Blue<br />

Gene/L Web browser, but we could not find anything related to this issue.<br />

Next, we check the status of the Blue Gene/L system processes:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

Timed out on socket connection to BGLMaster daemon at<br />

127.0.0.1:32035<br />

We follow the same procedure as in 6.2.9, “The ciodb daemon is not running on<br />

the SN” on page 276 (ciodb) and 6.2.10, “The idoproxy daemon not running on<br />

the SN” on page 280 (idoproxy). We have to kill all the bglmaster spawned<br />

processes, then we restart the bglmaster.<br />

We can now start mmcs_db_console. We are able to submit and run a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► The mmcs_server daemon is essential to the running of Blue Gene/L.<br />

► mmcs_db_console is an interface to this process.<br />

► If you are having problems with system processes, ensure that all old system<br />

processes are properly cleaned up before restarting bglmaster.<br />



6.2.12 DB2 not started on the SN<br />

In this scenario, we reboot the SN on a system where the Blue Gene/L DB2 instance is not set to start automatically. This could happen if, during DB2 installation on the SN, automatic restart of the DB2 instances was not selected.<br />

Error injection<br />

We turned off automatic DB2 start at system startup using the command shown<br />

in Example 6-22.<br />

Example 6-22 Turning off automatic DB2 start<br />

bglsn:~ # su -l bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -off bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

Note: If the instance was set to autostart, we would see:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

DB2AUTOSTART=YES<br />

We then rebooted the SN.<br />

<strong>Problem</strong> determination<br />

After the SN rebooted, we try to start the Blue Gene/L system processes, as<br />

shown in Example 6-23.<br />

Example 6-23 Starting bglmaster daemon when DB2 is stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to /bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />

./BGLMaster: error while loading shared libraries: libdb2.so.1: cannot open shared<br />

object file: No such file or directory<br />

bglmaster start command failed: ./BGLMaster --consoleip 127.0.0.1 --consoleport 32035<br />

--configfile bglmaster.init --autorestart y --db2profile ~bgdb2cli/sqllib/db2profile<br />

--dbproperties db.properties 2>&1<br />

>/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0314-18:35:29.log<br />

## ><br />



bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # . /discovery/db.src<br />

SQL30081N A communication error has been detected. Communication protocol<br />

being used: "TCP/IP". Communication API being used: "SOCKETS". Location<br />

where the error was detected: "127.0.0.1". Communication function detecting<br />

the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />

SQLSTATE=08001<br />

The db.src file contains a DB2 connect statement (db2 connect to bgdb0 user bglsysdb using bglsysdb), which returns the error message shown in Example 6-23 (SQL30081N); thus, we can conclude that DB2 was not started.<br />
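A quick way to confirm whether the DB2 instance is up is to look for the DB2 engine process and to attempt a manual connection (a sketch; the instance and database names are those used on our SN):<br />
ps -ef | grep db2sysc | grep -v grep           # the DB2 engine process is absent while the instance is down<br />
su - bglsysdb -c "db2 connect to bgdb0"        # fails while DB2 is stopped<br />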

Next, we tried to see what the RAS drill down shows, and we found the message<br />

shown in Example 6-24 at the top of the Web page.<br />

Example 6-24 Message returned to the Web page when DB2 not running<br />

Warning: odbc_connect(): SQL error: [IBM][CLI Driver] SQL30081N A<br />

communication error has been detected. Communication protocol being<br />

used: "TCP/IP". Communication API being used: "SOCKETS". Location where<br />

the error was detected: "127.0.0.1". Communication function detecting<br />

the error: "connect". Protocol specific error code(s): "111", "*", "*".<br />

SQLSTATE=08001 , SQL state 08001 in SQLConnect in<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/web/ras.php on line 110<br />

Again, the same SQL message (SQL30081N) indicates that the Web interface<br />

cannot communicate with DB2 either.<br />

We then executed the command shown in Example 6-25 to confirm that DB2 is<br />

not running.<br />

Example 6-25 Checking the DB2 processes<br />

bglsn:~ # ps -ef | grep db2<br />

root 7711 1 0 18:24 ? 00:00:00 /opt/IBM/db2/V8.1/bin/db2fmcd<br />

root 8593 1 0 18:30 pts/0 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />

8468A0 5 A<br />

bglsysdb 10119 1 0 18:32 pts/0 00:00:00 /dbhome/bglsysdb/sqllib/bin/db2bp<br />

9735A1000 5 A<br />

root 10725 1 0 18:35 pts/4 00:00:00 /dbhome/bgdb2cli/sqllib/bin/db2bp<br />

10277A0 5 A<br />

root 12387 12339 0 18:46 pts/7 00:00:00 grep db2<br />



Now we start DB2 on the SN; after DB2 starts, we also start the bglmaster daemon, as shown in Example 6-26.<br />

Example 6-26 Starting DB2 and bglmaster<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> db2start<br />

03/14/2006 19:36:50 0 0 SQL1063N DB2START processing was<br />

successful.<br />

SQL1063N DB2START processing was successful.<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster start<br />

bglsysdb@bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin> # ./bglmaster status<br />

idoproxy started [21262]<br />

ciodb started [21263]<br />

mmcs_server started [21264]<br />

monitor stopped<br />

perfmon stopped<br />

We submit a job, which starts and runs to completion. Finally, we make sure that DB2 is going to start automatically the next time the system is started, as shown in Example 6-27.<br />

Example 6-27 Turning on automatic DB2 start<br />

bglsn:~ # su -l bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

/opt/IBM/db2/V8.1/instance/db2iauto -on bglsysdb<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # db2set -i bglsysdb<br />

DB2COMM=tcpip<br />

DB2AUTOSTART=YES<br />

Lessons learned<br />

We learned the following lessons:<br />

► DB2 is the core element of Blue Gene/L. Nothing works if it is not started.<br />

► DB2 should be configured to automatically start on reboot.<br />



6.2.13 The bglsysdb user OS password changed (Linux)<br />

Because the DB2 user (bglsysdb) password authentication is set to “unix” during the installation of the SN, we decided to change the UNIX password and test the effect.<br />

Error injection<br />

We changed the OS password for the bglsysdb user:<br />

bglsn:/bgl/ # passwd bglsysdb<br />

<strong>Problem</strong> determination<br />

We then try to allocate a block so we can run our job. The mmcs_db_console returns the following error:<br />

mmcs$ allocate R000_128<br />

FAIL<br />

lost connection to mmcs_server<br />

use mmcs_server_connect to reconnect<br />

When we check the system logs we find the following messages in the mmcs and<br />

ciodb logs (see Example 6-28).<br />

Example 6-28 Error message in system logs when bglsysdb password changed<br />

--CLI ERROR--------------<br />
cliRC = -1<br />

line = 167<br />

file = DBConnection.cc<br />

SQLSTATE = 08001<br />

Native Error Code = -30082<br />

[IBM][CLI Driver] SQL30082N Attempt to establish connection failed<br />

with security reason "24" ("USERNAME AND/OR PASSWORD INVALID").<br />

SQLSTATE=08001<br />

-------------------------<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

Unable to connect, aborting...<br />

We can see from this that we cannot connect to the database as bglsysdb; the message is “USERNAME AND/OR PASSWORD INVALID”. However, as superuser (root) we can switch to bglsysdb (su - bglsysdb) and confirm that the user context is fine, so it must be the password that changed, and we need to update the db.properties file with the new password.<br />



After changing db.properties with the new password, we need to restart the bglmaster, as shown in Example 6-29.<br />

Example 6-29 Changing the password in the db.properties file<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # vi db.properties<br />

... >; save and exit...<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster restart<br />

Stopping BGLMaster<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0315-14:56:25.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [14561]<br />

ciodb started [14562]<br />

mmcs_server started [14563]<br />

monitor stopped<br />

perfmon stopped<br />

We can now use the mmcs_db_console to allocate a block, as shown in<br />

Example 6-30.<br />

Example 6-30 Allocating a block<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ allocate R000_128<br />

OK<br />

Finally, we run a job.<br />

Lessons learned<br />

DB2 user password control is tied to UNIX authentication. The password that the Blue Gene/L system processes use to talk to DB2 is contained in the db.properties file. If you change the DB2 user password, you MUST update this file.<br />



6.2.14 Uncontrolled rack power off<br />

This time, we decided to power off the rack without doing any of the PrepareForService preparation beforehand.<br />

Error injection<br />

We switched the rack circuit breaker off, waited 20 seconds, and powered back on. This could happen in real life when there are power line fluctuations.<br />

<strong>Problem</strong> <strong>Determination</strong><br />

We try to run a job, but block allocation fails twice in a row—first with<br />

communication failure and next time with initchip error invalid JtagID, as<br />

shown in Example 6-31.<br />

Example 6-31 Trying to allocate a block after a rack power glitch<br />

mmcs$ allocate R000_128<br />

FAIL<br />

connect: idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT<br />

connection lost to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />

mmcs$ allocate R000_128<br />

FAIL<br />

connect: initchip error invalid JtagID<br />

As we can see, idoproxy cannot talk to the Service Card.<br />

We next check the system logs. The mmcs_server log shows the messages in Example 6-32.<br />

Example 6-32 Message in mmcs_server log<br />

Mar 15 17:04:29 (I) [1086321888] root allocate R000_128<br />

Mar 15 17:04:29 (I) [1086321888] root<br />

DBMidplaneController::addBlock(R000_128)<br />

Mar 15 17:04:29 (I) [1086321888] root:R000_128<br />

BlockController::connect()<br />

Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />

BlockController::disconnect() releasing node and ido connections<br />

Mar 15 17:04:33 (I) [1086321888] root:R000_128 allocate: FAIL;connect:<br />

idoproxy communication failure: BGLERR_IDO_PKT_TIMEOUT connection lost<br />

to node/link/service card [R00-M0-N0]<br />

ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL<br />



Mar 15 17:04:33 (I) [1086321888] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

The idoproxy log shows the messages in Example 6-33.<br />

Example 6-33 The idoproxy messages<br />

Mar 15 17:04:29 (I) [1102423264]<br />

root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL OPEN(s)<br />

Mar 15 17:04:33 (E) [1084568800] Send Timeout... IPAddr=10.0.0.18<br />

IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />

Mar 15 17:04:33 (E) [1084568800] packet failure -1... IPAddr=10.0.0.18<br />

IPMask=255.255.0.0 LicensePlate=ff:f2:9f:16:f0:95:00:0d:60:e9:0f:6a<br />

Mar 15 17:04:33 (I) [1102423264]<br />

root:ido://FF:F2:9F:16:F0:95:00:0D:60:E9:0F:6A/CTRL CLOSE<br />

► The RAS errors show this for all Node Cards.<br />

► System processes seem to be running fine, apart from the logged messages (Example 6-32 and Example 6-33).<br />

► Checking DB2 reveals it is up and running.<br />

Because we have communication errors, we then check out the Service Network<br />

and the Functional Network (eth0 and eth3 in our case), using the ifconfig and<br />

ethtool commands, as shown in Example 6-34.<br />

Example 6-34 Checking network interfaces<br />

bglsn:/ # ifconfig<br />

eth0 Link encap:Ethernet HWaddr 00:0D:60:4D:28:EA<br />

inet addr:10.0.0.1 Bcast:10.0.255.255 Mask:255.255.0.0<br />

inet6 addr: fe80::20d:60ff:fe4d:28ea/64 Scope:Link<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:35228867 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:35220594 errors:0 dropped:0 overruns:0 carrier:0<br />

collisions:0 txqueuelen:1000<br />

RX bytes:4595599148 (4382.7 Mb) TX bytes:6839888628 (6523.0<br />

Mb)<br />

Base address:0xe800 Memory:f8120000-f8140000<br />

..... >....<br />

eth3 Link encap:Ethernet HWaddr 00:11:25:08:30:90<br />

inet addr:172.30.1.1 Bcast:172.30.255.255 Mask:255.255.0.0<br />

inet6 addr: fe80::211:25ff:fe08:3090/64 Scope:Link<br />

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1<br />

RX packets:14054383 errors:0 dropped:0 overruns:0 frame:0<br />

TX packets:24871645 errors:0 dropped:0 overruns:0 carrier:0<br />



collisions:0 txqueuelen:1000<br />

RX bytes:7954413066 (7585.9 Mb) TX bytes:20968793643<br />

(19997.3 Mb)<br />

Base address:0xec00 Memory:c0080000-c00a0000<br />

..... >....<br />

bglsn:/ # ethtool eth0<br />

Settings for eth0:<br />

Supported ports: [ TP ]<br />

Supported link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Supports auto-negotiation: Yes<br />

Advertised link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Advertised auto-negotiation: Yes<br />

Speed: 1000Mb/s<br />

Duplex: Full<br />

Port: Twisted Pair<br />

PHYAD: 0<br />

Transceiver: internal<br />

Auto-negotiation: on<br />

Supports Wake-on: umbg<br />

Wake-on: g<br />

Current message level: 0x00000007 (7)<br />

Link detected: yes<br />

bglsn:/ # ethtool eth3<br />

Settings for eth3:<br />

Supported ports: [ TP ]<br />

Supported link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Supports auto-negotiation: Yes<br />

Advertised link modes: 10baseT/Half 10baseT/Full<br />

100baseT/Half 100baseT/Full<br />

1000baseT/Full<br />

Advertised auto-negotiation: Yes<br />

Speed: 1000Mb/s<br />

Duplex: Full<br />

Port: Twisted Pair<br />

PHYAD: 0<br />

Transceiver: internal<br />

Auto-negotiation: on<br />

Supports Wake-on: umbg<br />

Wake-on: g<br />



Current message level: 0x00000007 (7)<br />

Link detected: yes<br />

► From Example 6-34 we can also see that the IP configuration is correct on the<br />

network interfaces.<br />

► Next, we use the ping command over the Functional and Service Networks, and<br />

check the link lights of the RJ45 jacks for the functional and service interfaces.<br />

These verifications do not reveal any problem.<br />

► Following this, we check the lights of the Service and Node Cards. The Node Card<br />

lights are out and the Service Card lights are cycling. This means that the cards<br />

are uninitialized, and we need to run the discovery process to rediscover the<br />

system.<br />

► For this, we do a PrepareForService operation to get the system into a<br />

good state, as shown in Example 6-35:<br />

Example 6-35 Running PrepareForService on the system<br />

bglsn:/discovery # ./PrepareForService R00<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/PrepareForService-2006-03-16-10:47:01.log<br />

Mar 16 10:47:02.912 EST: PrepareForService started<br />

Mar 16 10:47:03.125 EST: Preparing 1 Midplanes in rack R00<br />

Mar 16 10:47:03.702 EST: @ killMidplaneBlocks - kill_midplane_jobs R000<br />

failed (FAIL;command?)<br />

Mar 16 10:47:06.213 EST: Freed any blocks using R000<br />

Mar 16 10:47:06.222 EST: @ killMidplaneBlocks - Retried 1 time(s)<br />

before we were able to 'free any blocks using this midplane' -<br />

Midplane(R000)!<br />

Mar 16 10:47:06.222 EST:<br />

Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception occurred<br />

while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />

mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />

mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />

Mar 16 10:47:11.403 EST: @ buildServiceCardObj - Exception was<br />

(java.io.IOException: Could not contact iDo with<br />

LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />

java.lang.RuntimeException: Communication error: (DirectIDo for<br />

Uninitialized DirectIDo for<br />

FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />

COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />

= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />

= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />

Command = 10, Actual Sync Reply = -1))<br />

..... > .....<br />



Mar 16 10:48:11.470 EST: @ buildServiceCardObj - Exception occurred<br />

while building an iDo for ServiceCard(mLctn(R00-M0-S),<br />

mCardSernum(2033394a3033373900000000594c31304b3530363130304a),<br />

mLp(FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2), mIp(10.0.0.12), mType(2))<br />

Mar 16 10:48:11.471 EST: @ buildServiceCardObj - Exception was<br />

(java.io.IOException: Could not contact iDo with<br />

LP=FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2 and IP=/10.0.0.12 because<br />

java.lang.RuntimeException: Communication error: (DirectIDo for<br />

Uninitialized DirectIDo for<br />

FF:F2:9F:15:1E:4D:00:0D:60:EA:E1:B2@/10.0.0.12:0 is in state =<br />

COMMUNICATION_ERROR, sequenceNumberIsOk = false, ExpectedSequenceNumber<br />

= 0, Reply Sequence Number = -1, timedOut = true, retries = 5, timeout<br />

= 1000, Expected Op Command = 5, Actual Op Reply = -1, Expected Sync<br />

Command = 10, Actual Sync Reply = -1))<br />

..... > .....<br />

► From the error messages (bold in Example 6-35) we can see that the system<br />

cannot talk to the IDo chips, which means that the Service Card is NOT<br />

initialized.<br />

► We now run discovery on the system:<br />

cd /discovery<br />

./SystemController start<br />

./Discovery0 start<br />

./PostDiscovery start<br />

► The lights of the Service Card return to their normal state, but the Node Card<br />

lights do not.<br />

► We try to run PrepareForService again, but it fails because of the previously<br />

failed attempt. We need to close the previous service action manually:<br />

db2 "update bglserviceaction set status = 'C' where id = 2"<br />

► We then run PrepareForService again on the rack.<br />

Note: Unfortunately, this attempt also failed, because our test system does not have<br />

the hardware expected in a typical Blue Gene/L rack (at least 1/2 rack).<br />

To overcome this, we added the -FORCE option and the ServiceAction<br />

completed as expected.<br />



As EndServiceAction did not complete successfully, we decided to manually<br />

mark all the IDo chips as missing (M) in the database, so the discovery process<br />

would pick them up:<br />

db2 "update bglidochip set status = 'M' where ipaddress like<br />

'10.0%'"<br />

Discovery found all the existing hardware. We then stopped the discovery<br />

process and restarted the Blue Gene/L system processes. We could then<br />

allocate a block and submit a job.<br />

Lessons learned<br />

We learned the following lessons:<br />

► If a rack is power cycled without preparation, the rack goes into an<br />

uninitialized state. Without the discovery processes running (especially<br />

SystemController), the Service Card does not get initialized and the system<br />

cannot talk to the IDo chips through the switch on the Service Card. You<br />

should always use PrepareForService and EndServiceAction and do a<br />

controlled power down (a minimal sketch of this sequence follows this list).<br />

► If you have an unplanned rack power outage, leave the rack off and start the<br />

discovery process, which marks all hardware as missing. When this is<br />

complete, power up the rack and let discovery find and initialize the<br />

hardware. When discovery completes, stop it and start the system processes<br />

to bring the system back into production.<br />
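
A minimal sketch of the controlled power-down sequence described above, assuming the same rack (R00) and /discovery scripts used in this scenario; the EndServiceAction invocation shown here is an assumption modeled on the PrepareForService syntax in Example 6-35:<br />

bglsn:/discovery # ./PrepareForService R00<br />

# (power the rack down, perform the work, then power it back up)<br />

bglsn:/discovery # ./EndServiceAction R00<br />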

6.3 File system scenarios<br />

This section addresses problem determination for issues related to both NFS<br />

and GPFS.<br />

Here is the list of file system scenarios that we ran. In each of these scenarios,<br />

we injected a problem manually. We show the steps taken to determine the<br />

problem and the resolution.<br />

1. Port mapper daemon not running on the SN.<br />

2. NFS daemon not running on the SN.<br />

3. GPFS pagepool (wrongly) set to 512 MB on bgIO cluster nodes.<br />

4. Secure shell (ssh) is broken (interactive authentication required).<br />

5. The /bgl file system becomes full.<br />

6. Installation of new Blue Gene/L driver code.<br />

7. Duplicate IP in /etc/hosts.<br />



8. Missing IO node in /etc/hosts.<br />

9. Duplicate entries (additional aliases) for the SN in /etc/hosts.<br />

In each of the scenarios, we first run the system to prove that it is<br />

working. To do this, we use LoadLeveler to run the IOR application, which writes<br />

data to the GPFS file system from all nodes. After this has run successfully, we<br />

inject the problem and rerun the same job. Each of the scenarios is split into<br />

the following sections:<br />

► Error injection<br />

► Problem determination<br />

► Lessons learned<br />

6.3.1 Port mapper daemon not running on the SN<br />

In this scenario, we are not using GPFS or LoadLeveler, just the core system<br />

loading the application from the /bgl file system. We cause the<br />

problem by killing the portmap daemon that is required by the NFS server.<br />

Error injection<br />

The error injected here was to kill the portmap process running on the SN.<br />

Problem determination<br />

Here we ran the “Hello world!” program from an mmcs_db_console session. Here is<br />

the error that we received:<br />

allocate R000_128 == failed with :<br />

mmcs_console : Error: unable to mount filesystem<br />

Example 6-36 presents mmcs_db_server.log when portmapd is not running.<br />

Example 6-36 Messages in mmcs_db_server.log (portmapd not running)<br />

Mar 13 16:02:02 (I) [1084843232] root:R000_128 DBBlockController::waitBoot(R000_128)<br />

Mar 13 16:02:13 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:16 (E) [1086973152] root:R000_128 RAS event: KERNEL FATAL: Error:<br />

unable to mount filesystem<br />

Mar 13 16:02:17 (I) [1084843232] root:R000_128 allocate: FAIL;Error: unable to mount<br />

filesystem<br />



Mar 13 16:02:17 (I) [1084843232] root:R000_128<br />

DBMidplaneController::removeBlock(R000_128)<br />

Mar 13 16:02:17 (I) [1084843232] root BlockController::quiesceMailbox() waiting for<br />

ras events and I/O node shutdown<br />

Mar 13 16:02:17 (I) [1097856224] mmcs DatabaseCommandThread started: block<br />

R000_128, user root, action 3<br />

Mar 13 16:02:17 (I) [1097856224] mmcs setusername root<br />

Mar 13 16:02:17 (I) [1097856224] root db_free R000_128<br />

Mar 13 16:02:17 (I) [1097856224] root DBMidplaneController::addBlock(R000_128)<br />

Mar 13 16:02:17 (I) [1097856224] root:R000_128 DBBlockController::freeBlock()<br />

setBlockState(R000_128, TERMINATING) successful<br />

As we can see, the file system cannot be mounted. Following the problem<br />

determination methodology, we go straight to the NFS checks:<br />

► Check the export list on the SN:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />

mount clntudp_create: RPC: Port mapper failure - RPC: Unable to<br />

receive<br />

The RPC: Port mapper failure - RPC: Unable to receive error indicates a<br />

port mapper daemon issue. To fix this problem, we ran the following command:<br />

/etc/init.d/portmap restart && /etc/init.d/nfsserver restart<br />
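
As a hedged follow-up (not captured in our logs), the restart can be verified by listing the registered RPC programs and the export list on the SN:<br />

bglsn:/ # rpcinfo -p localhost | egrep 'portmapper|mountd|nfs'<br />

bglsn:/ # showmount -e<br />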

Lessons learned<br />

We learned the following lessons:<br />

► If the port mapper daemon dies on the SN, we get the following error in either<br />

the mmcs_db_console or the mmcs_db_server error log:<br />

Error: unable to mount filesystem<br />

► To diagnose the problem, we can run the showmount -e command on the SN.<br />

6.3.2 NFS daemon not running on the SN<br />

In this scenario we use only the core system (SN, racks, networks) loading the<br />

application from the /bgl file system exported by the SN.<br />

Error injection<br />

The error injected here was to kill the nfsd process running on the SN.<br />



Problem determination<br />

Here we ran the “Hello world!” program from an mmcs_db_console session. Here is<br />

the error that we received:<br />

allocate R000_128 == failed with :<br />

mmcs_console : Error: unable to mount filesystem<br />

Now looking at the mmcs_db_server log we see the error:<br />

KERNEL FATAL: Error: unable to mount filesystem<br />

Using the problem determination methodology, we go straight to the NFS checks:<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # showmount -e<br />

mount clntudp_create: RPC: Program not registered<br />

This error indicates a problem with the NFS server. To fix this, we ran the<br />

following command:<br />

/etc/init.d/nfsserver restart<br />
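
A hedged sketch of a quick verification after the restart (not captured in our logs): confirm that the nfs and mountd programs are registered again with the port mapper:<br />

bglsn:/ # rpcinfo -u localhost nfs<br />

bglsn:/ # rpcinfo -u localhost mountd<br />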

Lessons learned<br />

We learned the following lessons:<br />

► If the nfsd daemon dies on the SN, we get the following error in either the<br />

mmcs_db_console or the mmcs_db_server error log when allocating a block:<br />

Error: unable to mount filesystem<br />

► To diagnose the problem, we can run the showmount -e command on the SN.<br />

6.3.3 GPFS pagepool (wrongly) set to 512 MB on bgIO cluster nodes<br />

In this scenario, we change the GPFS pagepool for the I/O nodes in the bgIO<br />

cluster. The pagepool is pinned kernel memory used for file and metadata caching by the<br />

GPFS daemon. Due to the limited memory (RAM) on the I/O nodes, a large<br />

pagepool prevents applications from running (or even prevents the GPFS<br />

daemon from starting on the I/O nodes).<br />

Error injection<br />

Example 6-37 shows the error we injected (no blocks were allocated at this time).<br />

Example 6-37 Changing the GPFS pagepool<br />

bglsn:/bgl/BlueLight/logs/BGL # mmchconfig pagepool=512M<br />

mmchconfig: Command successfully completed<br />

mmchconfig: Propagating the changes to all affected nodes.<br />

This is an asynchronous process.<br />



Problem determination<br />

In this scenario we use IBM LoadLeveler to submit a job (for details see 4.4, “IBM<br />

LoadLeveler” on page 167). We leave LoadLeveler to automatically allocate a<br />

block. We observe that the job did not produce any I/O files after five minutes<br />

(usually, this job runs in under one minute). As a starting point, we check the<br />

mmcs_db_server log:<br />

bglsn:/bgl/BlueLight/logs/BGL # view<br />

bglsn-mmcs_db_server-2006-0316-15:42:04.log<br />

Initially, we did not find any relevant message, so we decided to log on to an I/O<br />

node and check from there (see Example 6-38).<br />

Example 6-38 Checking GPFS on one I/O node<br />

bglsn:/bgl/BlueLight/logs/BGL # ssh root@ionode5<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 2449008 7317536 26% /bgl<br />

172.30.1.33:/bglscratch<br />

36700160 64448 36635712 1% /bglscratch<br />

$ /usr/lpp/mmfs/bin/mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

6 ionode5 down<br />

Note: We could also use the mmgetstate -a command on the SN. This would<br />

return the status of all nodes in the bgIO GPFS cluster. However, we have<br />

chosen to go directly to one of the nodes allocated for the LoadLeveler job.<br />

We then look into the GPFS log (mmfs.log.latest) on the respective I/O node, as<br />

shown in Example 6-39.<br />

Example 6-39 GPFS log on the I/O node<br />

$ cat /var/mmfs/gen/mmfslog<br />

/bin/cat: /proc/kallsyms: No such file or directory<br />

Tue Mar 28 15:56:47 2006: mmfsd initializing. {Version: 2.3.0.10<br />

Built: Jan 16 2006 13:08:25} ...<br />



Tue Mar 28 15:56:48 2006: Not enough memory to allocate internal data<br />

structure.<br />

Tue Mar 28 15:56:48 2006: The mmfs daemon is shutting down abnormally.<br />

Tue Mar 28 15:56:48 2006: mmfsd is shutting down.<br />

Tue Mar 28 15:56:48 2006: Reason for shutdown: LOGSHUTDOWN called<br />

Tue Mar 28 15:56:49 EST 2006 runmmfs starting<br />

Removing old /var/adm/ras/mmfs.log.* files:<br />

Tue Mar 28 15:56:49 EST 2006 runmmfs: respawn 9 waiting 336 seconds<br />

before restarting mmfsd<br />

From the GPFS log we can see that the mmfs daemon is respawning and we also<br />

see the following message, which indicates where the problem might be:<br />

Not enough memory to allocate internal data structure.<br />
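
As an additional hedged check (not part of our original capture), the configured pagepool can be confirmed from the I/O node itself, because mmlsconfig reads the local copy of the GPFS configuration:<br />

$ /usr/lpp/mmfs/bin/mmlsconfig | grep pagepool<br />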

Moreover, 10 minutes after the job submission, we also get the messages shown<br />

in Example 6-40 in the application (job) log file.<br />

Example 6-40 Application messages<br />

test1@bglfen1:/bglscratch/test1> view ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Tue Mar 28 15:51:26 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 4<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />



** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

** error **<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

** error **<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

ERROR in aiori-MPIIO.c (line 130): cannot open file.<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

MPI File does not exist, error stack:<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

ADIOI_BGL_OPEN(54): File /bubu/Examples/IOR/IOR-output-MPIIO-1 does not<br />

exist<br />

** exiting **<br />

** exiting **<br />

** exiting **<br />

** exiting **<br />

From the application output file, we can see that the application is unable to find<br />

the output file in GPFS, which is also a good indication that there is a problem<br />

with GPFS.<br />

Note: The GPFS file system (/bubu) is available on the SN; thus, the<br />

application output file can be found inside the file system. It is only the I/O<br />

nodes that cannot write to (or read from) the GPFS file system.<br />



We can also check one of the I/O node logs in the /bgl/BlueLight/logs/BGL<br />

directory on the SN, as in Example 6-41.<br />

Example 6-41 I/O node log messages<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Starting<br />

GPFS<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Disabling protocol version 1. Could not load host key<br />

Mar 28 15:41:15 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

R00-M0-N0-I:J18-U01 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />

on I/O nod<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode1 :<br />

172.30.2.1<br />

Mar 28 15:51:19 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

[ciod:initialized]<br />

Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0: e<br />

ionode1 : 172.30.2.1<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

ciod: version "Jan 10 2006 16:25:12"<br />

ciod: running in virtual node mode with 32 processors<br />

Mar 28 15:51:20 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$<br />

Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Switching to coprocessor mode<br />

Mar 28 15:51:24 (I) [1088451808] root:RMP28Mr154042300 {0}.0:<br />

Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

15:52:02<br />

Mar 28 15:52:03 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 15:52:02 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

15:56:47<br />



Mar 28 15:56:48 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 15:56:47 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: Mar 28<br />

16:02:30<br />

Mar 28 16:02:30 (I) [1088451808] root:RMP28Mr154042300 {0}.0: ionode1<br />

mmfs: mmfsd: Error=MMFS_ABNORMAL_SHUTDOWN, ID=0x1FB9260D, Tag=0<br />

Mar 28 16:02:30 ionode1 mmfs: Shutting down abnormally due to error in<br />

/project/sprelcs3b/build/rcs3bs010a/src/avs/fs/mmfs/ts/bufmgr/linux/buf<br />

mgr-plat.C line 170 retCode -1, reasonCode 0<br />

bglsn:/bgl/BlueLight/logs/BGL #<br />

From Example 6-41 we can see that GPFS was started at 15:41:15 and it is not<br />

until 15:51:19 that we get the following message from S40gpfs:<br />

: GPFS did not come up on I/O node ionode1 :<br />

Looking in the file /bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d/gpfs, we can see why<br />

from the code shown in Example 6-42.<br />

Example 6-42 Excerpt from GPFS startup script for I/O nodes<br />

# This file will be created by mmfsup.scr to signal that GPFS startup is<br />

# complete<br />

upfile=/tmp/mmfsup.done<br />

# Create mmfsup script that will run when GPFS is ready<br />

cat<br />

..... > .....<br />


then ras_advisory "$0: GPFS did not come up on I/O node $HOSTID"<br />

exit 1<br />

fi<br />

done<br />

rm -f $upfile<br />

echo "$0: GPFS is ready on I/O node $HOSTID_LOC"<br />

;;<br />

stop)<br />

# Set defaults for GPFS configuration variables<br />

GPFS_STARTUP=0<br />

# Obtain overrides from config file<br />

GPFSFILE=/etc/sysconfig/gpfs<br />

[ -r $GPFSFILE ] && . $GPFSFILE<br />

The wait loop in the script explains the timeout: 300 iterations x 2 seconds =<br />

600 seconds. Thus, the 10-minute wait if GPFS fails to come up.<br />

Lessons learned<br />

Always make sure that you follow the Blue Gene/L documentation when changing any<br />

GPFS parameters. Because the I/O nodes have a very particular configuration, you<br />

need to be extra careful with GPFS.<br />
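
Recovery for this scenario is not shown in our logs; as a hedged sketch, it amounts to setting the pagepool back to a value that the I/O nodes can accommodate (128M matches the value used later in Example 6-49), then rebooting the block so that GPFS starts cleanly:<br />

bglsn:/ # mmchconfig pagepool=128M<br />

bglsn:/ # mmgetstate -a<br />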

6.3.4 Secure shell (ssh) is broken<br />

For this scenario, we remove the following files so that it should be impossible for<br />

the root user to communicate between GPFS nodes in the bgIO cluster without<br />

interactive authentication (key acceptance or password prompting):<br />

► /bgl/dist/root/.ssh/known_hosts<br />

► /bgl/dist/root/.ssh/authorized_keys<br />

► /root/.ssh/known_hosts<br />

► /root/.ssh/authorized_keys<br />

Error injection<br />

Example 6-43 shows the way that we injected the error.<br />

Example 6-43 Removing ssh authentication files<br />

bglsn:/bgl/BlueLight/ppcfloor/dist/etc/rc.d/init.d # cd /root/.ssh<br />

bglsn:~/.ssh # ls -lrt<br />

total 38<br />



-rw-r--r-- 1 root root 220 Mar 19 17:12 id_rsa.pub<br />

-rw------- 1 root root 887 Mar 19 17:12 id_rsa<br />

-rw-r--r-- 1 root root 2976 Mar 19 19:12 known_hosts.b4gpfs<br />

-rw-r--r-- 1 root root 440 Mar 28 13:40 authorized_keys<br />

-rw-r--r-- 1 root root 4140 Mar 28 17:31 known_hosts<br />

drwx------ 2 root root 280 Mar 28 17:31 .<br />

drwxr-xr-x 27 root root 1768 Mar 30 14:54 ..<br />

bglsn:~/.ssh # mv known_hosts known_hosts.orig<br />

bglsn:~/.ssh # mv authorized_keys authorized_keys.orig<br />

bglsn:~/.ssh # cd /bgl/dist/root/.ssh/<br />

bglsn:/bgl/dist/root/.ssh # ls -lrt<br />

total 32<br />

drwx------ 3 root root 72 Mar 17 15:57 ..<br />

-rw-r--r-- 1 root root 220 Mar 19 17:05 id_rsa.pub<br />

-rw------- 1 root root 887 Mar 19 17:05 id_rsa<br />

-rw-r--r-- 1 root root 4091 Mar 19 19:14 known_hosts_gpfs<br />

-rw-r--r-- 1 root root 440 Mar 28 13:55 authorized_keys<br />

-rw-r--r-- 1 root root 940 Mar 28 17:33 known_hosts<br />

drwx------ 2 root root 272 Mar 28 17:33 .<br />

bglsn:/bgl/dist/root/.ssh # mv known_hosts known_hosts.orig<br />

bglsn:/bgl/dist/root/.ssh # mv authorized_keys authorized_keys.orig<br />

bglsn:/bgl/dist/root/.ssh #<br />

Problem determination<br />

We check the status of LoadLeveler, as shown in Example 6-44.<br />

Example 6-44 Checking the LoadLeveler status before submitting the job<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.02 8 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />



Next, we submit the LoadLeveler job named ior-gpfs.cmd (see Example 6-45).<br />

Example 6-45 Submitting the LoadLeveler job<br />

test1@bglfen1:/bglscratch/test1> set -o vi<br />

test1@bglfen1:/bglscratch/test1> ls<br />

applications hello-file-2.rts hello.rts ior-gpfs.out hello128.cmd<br />

ello-file.rts ior-gpfs.cmd ior-gpfs.out.ciod-hung-scenario hello.cmd<br />

hello-gpfs.err ior-gpfs.err hello-file-1.rts hello-gpfs.out<br />

ior-gpfs.err.ciod-hung-scenario<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.37" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class<br />

Running On<br />

------------------------ ---------- ----------- -- --- ------------<br />

----------bglfen1.37.0<br />

test1 3/28 10:50 I 50 small<br />

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0<br />

Example 6-46 shows that the LoadLeveler job is actually running.<br />

Example 6-46 LoadLeveler queue shows job is “Running”<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class<br />

Running On<br />

------------------------ ---------- ----------- -- --- ------------<br />

---------bglfen1.37.0<br />

test1 3/28 10:50 R 50 small<br />

bglfen1<br />

Now, we check the IOR job output file (see Example 6-47).<br />

Example 6-47 IOR job output file<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Tue Mar 28 12:11:48 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />



Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-1<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 10<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

7.27 1024.00 32.00 0.104517 17.00 1.59 0<br />

delaying 1 seconds . . .<br />

read 75.77 1024.00 32.00 0.004636 1.68 0.002146 0<br />

delaying 1 seconds . . .<br />

write 7.15 1024.00 32.00 0.027025 17.35 1.51 1<br />

delaying 1 seconds . . .<br />

read 76.39 1024.00 32.00 0.004542 1.67 0.002301 1<br />

delaying 1 seconds . . .<br />

bglsn:/bubu/Examples/IOR # ls -lrt<br />

total 245888<br />

-rw-r--r-- 1 test1 itso 33554432 Mar 21 17:14 IOR-output<br />

drwxrwxrwx 3 root root 32768 Mar 22 10:00 ..<br />

-rw-r--r-- 1 test1 itso 134217728 Mar 23 10:37 IOR-output-MPIIO<br />

drwxr-xr-x 2 test1 itso 32768 Mar 28 12:14 .<br />

-rw-r--r-- 1 test1 itso 133922816 Mar 28 12:14 IOR-output-MPIIO-1<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1><br />

Apparently, the job ran correctly. Assuming that we are not application<br />

specialists, we need to consult the application owner to verify that the output is<br />

what was expected.<br />



We get confirmation that the application’s output is fine; however, we decide to<br />

perform additional checks. First, we do some basic checks on the SN, as shown in<br />

Example 6-48.<br />

Example 6-48 Additional node checks<br />

##### GPFS checks:<br />

bglsn:/bubu/Examples/IOR # df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdb3 70614928 4743632 65871296 7% /<br />

tmpfs 1898508 8 1898500 1% /dev/shm<br />

/dev/sda4 489452 95272 394180 20% /tmp<br />

/dev/sda1 9766544 2448544 7318000 26% /bgl<br />

/dev/sda2 9766608 699636 9066972 8% /dbhome<br />

p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />

p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />

/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />

bglsn:/bubu/Examples/IOR # touch /bubu/foo<br />

bglsn:/bubu/Examples/IOR # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

bglsn:/bubu/Examples/IOR # mmgetstate -a<br />

The authenticity of host 'ionode6 (172.30.2.6)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode7 (172.30.2.7)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode5 (172.30.2.5)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode3 (172.30.2.3)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode1 (172.30.2.1)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode4 (172.30.2.4)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />



Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode2 (172.30.2.2)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode8 (172.30.2.8)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be established.<br />

RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />

Are you sure you want to continue connecting (yes/no)? no<br />

^C<br />

mmgetstate: Interrupt received.==============================<br />

Because we should not be prompted to accept the host identity, we realize that<br />

there is a problem. We double-check using the mmdsh and mmchconfig commands,<br />

as shown in Example 6-49.<br />

Example 6-49 Checking GPFS remote command execution<br />

bglsn:/bubu/Examples/IOR # export WCOLL=/tmp/ionodes<br />

bglsn:/bubu/Examples/IOR # mmdsh date<br />

mmdsh: ionode1 rsh process had return code 1.<br />

mmdsh: ionode2 rsh process had return code 1.<br />

mmdsh: ionode3 rsh process had return code 1.<br />

mmdsh: ionode4 rsh process had return code 1.<br />

mmdsh: ionode5 rsh process had return code 1.<br />

ionode1: ionode1: Connection refused<br />

ionode2: ionode2: Connection refused<br />

ionode3: ionode3: Connection refused<br />

ionode4: ionode4: Connection refused<br />

ionode5: ionode5: Connection refused<br />

ionode6: ionode6: Connection refused<br />

ionode7: ionode7: Connection refused<br />

ionode8: ionode8: Connection refused<br />

bglsn:/bubu/Examples/IOR # mmchconfig pagepool=128M -i<br />

mmchconfig: Command successfully completed<br />

The authenticity of host 'bglsn_fn.itso.ibm.com (172.30.1.1)' can't be<br />

established.<br />

RSA key fingerprint is 52:81:f5:75:be:d6:8e:cf:65:a4:b5:23:46:1a:c6:94.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode2 (172.30.2.2)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode5 (172.30.2.5)' can't be established.<br />



RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode7 (172.30.2.7)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode8 (172.30.2.8)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode4 (172.30.2.4)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode6 (172.30.2.6)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode1 (172.30.2.1)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? The authenticity<br />

of host 'ionode3 (172.30.2.3)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? no<br />

no<br />

no<br />

no<br />

Please type 'yes' or 'no': no<br />

Please type 'yes' or 'no': no<br />

Please type 'yes' or 'no': no<br />

no<br />

no<br />

Please type 'yes' or 'no': 'NO'<br />

Please type 'yes' or 'no': no<br />

^C<br />

mmchconfig: Interrupt received: changes not propagated.<br />

From Example 6-49, we can conclude that we have GPFS authentication<br />

problems. However, this does not affect the job, because authentication is only<br />

required when GPFS executes commands to start/stop daemons and modify<br />

GPFS cluster configuration.<br />



Now that we know that the GPFS configuration commands have a problem, we<br />

check that the block is booted, and try to connect interactively to an I/O node<br />

using ssh, as in Example 6-50.<br />

Example 6-50 Allocating a block w/ GPFS and ssh authentication broken<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

RMP28Mr121102270 root(0) connected<br />

mmcs$ redirect RMP28Mr121102270 on<br />

OK<br />

mmcs$ {i} write_con hostname<br />

FAIL<br />

block not selected<br />

mmcs$ select_block RMP28Mr121102270<br />

OK<br />

$<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />

The authenticity of host 'ionode5 (172.30.2.5)' can't be established.<br />

RSA key fingerprint is 0a:0b:9b:76:dc:f1:a0:55:69:9f:74:31:e1:e9:54:0b.<br />

Are you sure you want to continue connecting (yes/no)? yes<br />

Warning: Permanently added 'ionode5,172.30.2.5' (RSA) to the list of<br />

known hosts.<br />

root@ionode5's password:<br />

Permission denied, please try again.<br />

root@ionode5's password:<br />

Permission denied, please try again.<br />

root@ionode5's password:<br />

Permission denied (publickey,password,keyboard-interactive).<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

Because the root user on the I/O nodes does not have a password, ssh does not<br />

allow us to log in, so we need to check the ssh key authentication.<br />

Before we leave this scenario, let us check whether GPFS can be stopped and<br />

started with the ssh broken, as shown in Example 6-51.<br />

Example 6-51 Checking GPFS can be stopped/started<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />



connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

RMP28Mr121102270 root(0) connected<br />

mmcs$ free RMP28Mr121102270<br />

OK<br />

mmcs$<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmshutdown<br />

Tue Mar 28 13:33:58 EST 2006: mmshutdown: Starting force unmount of<br />

GPFS file Tue Mar 28 13:34:13 EST 2006: mmshutdown: Shutting down GPFS<br />

daemons<br />

Tue Mar 28 13:34:43 EST 2006: mmshutdown: Finished<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn down<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmstartup<br />

Tue Mar 28 13:35:46 EST 2006: mmstartup: Starting GPFS ...<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdb3 70614928 4732208 65882720 7% /<br />

tmpfs 1898508 8 1898500 1% /dev/shm<br />

/dev/sda4 489452 95276 394176 20% /tmp<br />

/dev/sda1 9766544 2448564 7317980 26% /bgl<br />

/dev/sda2 9766608 699636 9066972 8% /dbhome<br />

p630n03:/bglscratch 36700160 64416 36635744 1% /bglscratch<br />

p630n03_fn:/nfs_mnt 104857600 11339904 93517696 11% /mnt<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />

/dev/bubu_gpfs1 1138597888 3269632 1135328256 1% /bubu<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # mmgetstate<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

Even though it might look strange, GPFS started correctly and mounted the<br />

remote file system. This is expected GPFS behavior (see the following note).<br />



Lessons learned<br />

From this scenario, we learn that the ssh authentication within the bgIO cluster is<br />

only required for changes to GPFS configuration and does not affect data traffic<br />

or the stopping and starting of GPFS on a local node.<br />

Note: Because GPFS is started on each node individually when the node is<br />

booted, the fact that ssh authentication is broken does not affect the GPFS<br />

cluster as long as the GPFS configuration file for each node (mmsdrfs) does<br />

not change.<br />
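
To restore the configuration after this scenario, a minimal sketch is simply to move the saved key files back into place (these are the same files and directories shown in Example 6-43):<br />

bglsn:~/.ssh # mv known_hosts.orig known_hosts<br />

bglsn:~/.ssh # mv authorized_keys.orig authorized_keys<br />

bglsn:/bgl/dist/root/.ssh # mv known_hosts.orig known_hosts<br />

bglsn:/bgl/dist/root/.ssh # mv authorized_keys.orig authorized_keys<br />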

6.3.5 The /bgl file system becomes full (Blue Gene/L uses GPFS)<br />

This scenario analyzes what happens when the /bgl file system fills up to 100%. We<br />

ran this scenario again to see how it affects a complex Blue Gene/L system.<br />

Note: In this scenario, our Blue Gene/L system USES GPFS.<br />

Error injection<br />

We used the dd command to create a file called /bgl/largefile that filled the /bgl<br />

file system (100%).<br />
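
A hedged sketch of the injection (the exact dd invocation was not captured; dd keeps writing zeros until the file system reports "No space left on device"):<br />

bglsn:/ # dd if=/dev/zero of=/bgl/largefile bs=1M<br />

bglsn:/ # df /bgl<br />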

Problem determination<br />

We started by running the LoadLeveler job (ior-gpfs.cmd). While LoadLeveler<br />

shows the job as running, we see that the block is still in the process<br />

of booting. Looking at the mmcs_db_server log reveals the GPFS messages shown<br />

in bold in Example 6-52.<br />

Example 6-52 GPFS cannot be started<br />

Mar 27 17:34:33 (I) [1096078560] {0}.0: Mon Mar 27 17:34:32 EST 2006<br />

mmautoload: GPFS is waiting for cluster data reposi<br />

Mar 27 17:34:33 (I) [1096078560] {0}.0: tory<br />

Mar 27 17:37:39 (I) [1096078560] {17}.0: tory<br />

mmautoload: The GPFS environment cannot be initialized.<br />

mmautoload: Correct the problem and use mmstartup to start GPFS.<br />

mmautoload: The GPFS environment cannot be initialized.<br />

Mar 27 17:37:39 (I) [1096078560] {0}.0: Mon Mar 27 17:37:38 EST 2006<br />

mmautoload: GPFS is waiting for cluster data reposi<br />

Mar 27 17:37:39 (I) [1096078560] {0}.0: tory<br />



mmautoload: The GPFS environment cannot be initialized.<br />

mmautoload: Correct the problem and use mmstartup to start GPFS.<br />

mmautoload: The GPFS environment cannot be initialized.<br />

We now check the block status as shown by the mmcs_console (see<br />

Example 6-53).<br />

Example 6-53 Block status<br />

mmcs_console:<br />

list bglblock <br />

==> DBBlock record<br />

_blockid = RMP27Mr173143250<br />

_numpsets = 0<br />

_numbps = 0<br />

_owner = root<br />

_istorus = 000<br />

_sizex = 0<br />

_sizey = 0<br />

_sizez = 0<br />

_description = LoadLeveler Partition<br />

_mode = C<br />

_options =<br />

_status = B<br />

_statuslastmodified = 2006-03-27 17:31:54.131411<br />

_mloaderimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts<br />

_blrtsimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts<br />

_linuximg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf<br />

_ramdiskimg =<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf<br />

_debuggerimg = none<br />

_debuggerparmsize = 0<br />

_createdate = 2006-03-27 17:31:43.673653<br />



We can see that the partition is still in B (booting) state. Next, we check the<br />

LoadLeveler cluster and queue (job) status as shown in Example 6-54.<br />

Example 6-54 LoadLeveler cluster and job status<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 1 1 Run 1 0.00 30 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 1 jobs 1 running<br />

Total Machines 2 machines 1 jobs 1 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.35.0<br />

test1 3/27 16:15 R 50 small bglfen1<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

Next, we test the file system status on the SN (see Example 6-55).<br />

Example 6-55 SN file system status<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

/dev/sdc3 70614928 15459236 55155692 22% /<br />

tmpfs 3955528 8 3955520 1% /dev/shm<br />

p630n03:/nfs_mnt 104857600 11339904 93517696 11% /nfs_mnt<br />

p630n03:/bglscratch 36700160 64384 36635776 1% /bglscratch<br />

bglsn_fn:/bgl 9766560 9766560 0 100% /bgl<br />

p630n03:/bglhome 2621440 98464 2522976 4% /bglhome<br />



As we can see in Example 6-54, the job with the ID bglfen1.35.0 appears to be<br />

in Running state, so we look for more details (Example 6-56).<br />

Example 6-56 Details on LoadLeveler job bglfen1.35.0<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> llq -s bglfen1.35.0<br />

=============== Job Step bglfen1.itso.ibm.com.35.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.35.0<br />

Job Name: bglfen1.itso.ibm.com.35<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: test1<br />

Queue Date: Mon 27 Mar 2006 04:15:20 PM EST<br />

Status: Running<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

Submitting Cluster:<br />

Sending Cluster:<br />

Requested Cluster:<br />

Schedd History:<br />

Outbound Schedds:<br />

Submitting User:<br />

Execution Factor: 1<br />

Dispatch Time: Mon 27 Mar 2006 05:31:44 PM EST<br />

Completion Date:<br />

Completion Code:<br />

Favored Job: No<br />

User Priority: 50<br />

user_sysprio: 0<br />

class_sysprio: 0<br />

group_sysprio: 0<br />

System Priority: -275512<br />

q_sysprio: -275512<br />

Previous q_sysprio: 0<br />

Notifications: Complete<br />

Virtual Image Size: 472 kb<br />

Large Page: N<br />

Checkpointable: no<br />

Ckpt Start Time:<br />

Good Ckpt Time/Date:<br />

Ckpt Elapse Time: 0 seconds<br />

Fail Ckpt Time/Date:<br />

Ckpt Accum Time: 0 seconds<br />

Checkpoint File:<br />



Ckpt Execute Dir:<br />

Restart From Ckpt: no<br />

Restart Same Nodes: no<br />

Restart: yes<br />

Preemptable: no<br />

Preempt Wait Count: 0<br />

Hold Job Until:<br />

RSet: RSET_NONE<br />

Mcm Affinity Options:<br />

Cmd: /usr/bin/mpirun<br />

Args: -exe /bglscratch/test1/applications/IOR/IOR.rts<br />

-args "-f /bglscratch/test1/applications/IOR/ior-inputs"<br />

Env:<br />

In: /dev/null<br />

Out: /bglscratch/test1/ior-gpfs.out<br />

Err: /bglscratch/test1/ior-gpfs.err<br />

Initial Working Dir: /bglscratch/test1/applications/IOR<br />

Dependency:<br />

Resources:<br />

Requirements: (Arch == "PPC64") && (OpSys == "Linux2")<br />

Preferences:<br />

Step Type: Blue Gene<br />

Size Requested: 128<br />

Size Allocated: 128<br />

Shape Requested:<br />

Shape Allocated:<br />

Wiring Requested: MESH<br />

Wiring Allocated: MESH<br />

Rotate: True<br />

Blue Gene Status:<br />

Blue Gene Job Id:<br />

Partition Requested:<br />

Partition Allocated: RMP27Mr173143250<br />

Error Text:<br />

Node Usage: shared<br />

Submitting Host: bglfen1.itso.ibm.com<br />

Schedd Host: bglfen1.itso.ibm.com<br />

Job Queue Key:<br />

Notify User: test1@bglfen1.itso.ibm.com<br />

Shell: /bin/bash<br />

LoadLeveler Group: No_Group<br />

Class: small<br />

Ckpt Hard Limit: undefined<br />

Ckpt Soft Limit: undefined<br />

Cpu Hard Limit: undefined<br />



Cpu Soft Limit: undefined<br />

Data Hard Limit: undefined<br />

Data Soft Limit: undefined<br />

Core Hard Limit: undefined<br />

Core Soft Limit: undefined<br />

File Hard Limit: undefined<br />

File Soft Limit: undefined<br />

Stack Hard Limit: undefined<br />

Stack Soft Limit: undefined<br />

Rss Hard Limit: undefined<br />

Rss Soft Limit: undefined<br />

Step Cpu Hard Limit: undefined<br />

Step Cpu Soft Limit: undefined<br />

Wall Clk Hard Limit: 00:30:00 (1800 seconds)<br />

Wall Clk Soft Limit: 00:30:00 (1800 seconds)<br />

Comment:<br />

Account:<br />

Unix Group: itso<br />

NQS Submit Queue:<br />

NQS Query Queues:<br />

Negotiator Messages:<br />

Bulk Transfer: No<br />

Step Adapter Memory: 0 bytes<br />

Adapter Requirement:<br />

Step Cpus: 0<br />

Step Virtual Memory: 0.000 mb<br />

Step Real Memory: 0.000 mb<br />

==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.35.0<br />

====================<br />

The status of job step is : Running<br />

Since job step status is not Idle, Not Queued, or Deferred, no attempt<br />

has been made to determine why this job step has not been started.<br />

Important: Because job step status is not Idle, Queued, or Deferred, no<br />

attempt has been made to determine why this job step has not been started.<br />

However, as the job stays in Running state for quite some time, we decide to<br />

investigate further. We know from the LoadLeveler job command file that a<br />

GPFS file system is used for storing the files related to running this job. Thus, we<br />

decide to investigate GPFS on the I/O nodes.<br />



We realize that we can find neither the I/O node boot messages nor the GPFS log<br />

messages (which are written to the /bgl file system, NFS exported from the SN),<br />

and we realize that the /bgl file system is full (100%). We identify and remove<br />

/bgl/largefile, and the block continues to boot. After the block booted, we logged<br />

in to one of the I/O nodes (using ssh) and got the messages shown in Example 6-57.<br />

Example 6-57 Checking the GPFS file system<br />

$ df /bubu<br />

Mar 27 17:47:39 (I) [1096078560] {119}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {102}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {17}.0: R00-M0-N0-I:J18-U11<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: R00-M0-N0-I:J18-U01<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O nod<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: e ionode1 : 172.30.2.1<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

ciod: version "Jan 10 2006 16:25:12"<br />

ciod: running in virtual node mode with 32 processors<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$ df /bubu<br />

df: `/bubu': No such file or directory<br />

$ mount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs (rw,v3,rsize=8192,wsize=8192,<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {68}.0: rootfs on / type rootfs (rw)<br />



Mar 27 17:47:39 (I) [1096078560] {51}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {51}.0: df:<br />

Mar 27 17:47:39 (I) [1096078560] {34}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {17}.0: [ciod:initialized]<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0: /var/etc/rc.d/rc3.d/S40gpfs:<br />

GPFS did not come up on I/O node ionode1 : 172.30.2.1<br />

Mar 27 17:47:39 (I) [1096078560] {0}.0:<br />

hard,udp,nolock,addr=172.30.1.1)<br />

172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

$ mount<br />

rootfs on / type rootfs (rw)<br />

/dev/root on / type ext2 (rw)<br />

none on /proc type proc (rw)<br />

172.30.1.1:/bgl on /bgl type nfs<br />

(rw,v3,rsize=8192,wsize=8192,hard,udp,nolock,addr=172.30.1.1)<br />

172.30.1.33:/bglscratch on /bglscratch type nfs<br />

(rw,v3,rsize=32768,wsize=32768,hard,tcp,lock,addr=172.30.1.33)<br />

$<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: df:<br />

Mar 27 17:47:39 (I) [1096078560] {85}.0: `/bubu': No such file or<br />

directory<br />

Because the GPFS file system (/bubu) cannot be found, we check the 10-minute<br />

GPFS start timeout and see that it has expired. So, we have to reboot the<br />

block for GPFS to be able to start and allow the job to run.<br />

Lessons learned<br />

We learned the following lessons:<br />

► If the /bgl file system becomes full, then GPFS has a problem starting and we<br />

see the following message in the mmcs_db_server log (Example 6-52):<br />

The GPFS environment cannot be initialized.<br />

► After the 10-minute timeout period has expired and after fixing the problem<br />

(making some space available in /bgl), we need to reboot the block.<br />



Note: Even though we can manually correct the problem and bring up GPFS,<br />

it is better to reboot the block to allow GPFS to cleanly start and mount the file<br />

system.<br />
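
As a hedged sketch, because the block in this scenario was created by LoadLeveler, rebooting it amounts to freeing the block from mmcs_db_console and resubmitting the job (the block name is the one shown in Example 6-53):<br />

mmcs$ free RMP27Mr173143250<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />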

6.3.6 Installing new Blue Gene/L driver code (driver update)<br />

This scenario documents the process of upgrading the Blue Gene/L code. It is<br />

designed to show what happens if the GPFS code is not also updated at the<br />

same time.<br />

Error injection<br />

The error injected is the update of the Blue Gene/L code itself. This section is<br />

rather long because we present all the checks that we went through before<br />

updating the code, as well as the actions that are required after the update RPMs are<br />

installed.<br />

First we check the code levels:<br />

1. Check the correct code levels and the process that we intend to use.<br />

Example 6-58 shows the Blue Gene/L driver levels.<br />

Example 6-58 Current Blue Gene/L code levels<br />

bglsn:~ # rpm -qa | grep bgl<br />

libglade-0.17-230.1<br />

bglmpi-2006.1.2-1<br />

bglbaremetal-2006.1.2-1<br />

bglos-4.1-0<br />

bglcmcs-2006.1.2-1<br />

libglade2-2.0.1-501.3<br />

bglblrtstool-2006.1.2-1<br />

bglcnk-2006.1.2-1<br />

bgldiag-2006.1.2-1<br />

bglmcp-2006.1.2-1<br />

Example 6-59 shows the code levels for the GPFS packages installed on the SN.<br />

Example 6-59 GPFS for Blue Gene/L<br />

bglsn:~ # rpm -qa | grep gpfs<br />

gpfs.docs-2.3.0-10<br />

gpfs.gpl-2.3.0-10<br />

gpfs.msg.en_US-2.3.0-10<br />

gpfs.base-2.3.0-10<br />



Example 6-60 lists the installed code levels for the GPFS packages for the I/O<br />

nodes.<br />

Note: You do not need to compile the GPL portability layer for the MCP. The<br />

gpfs.gplbin package contains the compiled modules for the portability code.<br />

Example 6-60 GPFS for PPC440 (I/O nodes processor) levels<br />

bglsn:/bgl/BlueLight # rpm --root \<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/bglOS -qa | grep gpfs<br />

gpfs.base-2.3.0-10<br />

gpfs.gpl-2.3.0-10<br />

gpfs.msg.en_US-2.3.0-10<br />

gpfs.docs-2.3.0-10<br />

gpfs.gplbin-2.3.0-11<br />

2. Check for the currently installed version of the Blue Gene/L driver code, as<br />

shown in Example 6-61.<br />

Example 6-61 Current Blue Gene/L driver level<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ version<br />

OK<br />

mmcs_db_server $Name: V1R2M1_020_2006 $ Jan 10 2006 16:23:15<br />

mmcs$<br />

3. Ensure that we have all the required update packages downloaded onto the<br />

SN. Again, we need two sets of packages for GPFS: one for the SN (PPC64,<br />

64-bit) and one for the I/O nodes’ MCP (PPC440, 32-bit), as shown in Example 6-62.<br />

Example 6-62 GPFS for PPC440 (MCP - I/O nodes OS)<br />

bglsn:/mnt/LPP/GPFS # cd PPC64_PTF11<br />

bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # ls -lrt<br />

total 4882<br />

-rw-r--r-- 1 root root 4161526 Mar 31 19:21<br />

gpfs.base-2.3.0-11.ppc64.rpm<br />

-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />

gpfs.docs-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />

gpfs.gpl-2.3.0-11.noarch.rpm<br />



-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />

gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />

bglsn:/mnt/LPP/GPFS/PPC64_PTF11 # cd ../MCP_PTF11<br />

bglsn:/mnt/LPP/GPFS/MCP_PTF11 # ls -lrt<br />

total 5196<br />

-rw-r--r-- 1 root root 4161526 Mar 31 19:21 gpfs.base-2.3.0-11.ppc.rpm<br />

-rw-r--r-- 1 root root 113285 Mar 31 19:21<br />

gpfs.docs-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 331349 Mar 31 19:21<br />

gpfs.gpl-2.3.0-11.noarch.rpm<br />

-rw-r--r-- 1 root root 644241 Mar 31 19:21<br />

gpfs.gplbin-2.3.0-11.ppc.rpm<br />

-rw-r--r-- 1 root root 58882 Mar 31 19:21<br />

gpfs.msg.en_US-2.3.0-11.noarch.rpm<br />

We now review the install process from the readme file that we downloaded from<br />

the Blue Gene/L Web site along with the new driver. Here we present a list of the<br />

actions that are relevant for our configuration:<br />

1. Download RPMs into appropriate directories.<br />

Note: You can use a naming convention of your choice for the download<br />

directories. You just need to make sure you install the right code into the<br />

right directories.<br />

2. Install RPMs and rebuild BLRTS tool chain.<br />

3. If you have any customized scripts that mount your file systems stored in<br />

/bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d directory, you have to re-create<br />

them with each driver upgrade.<br />

4. Stop the control system jobs running on the SN.<br />

5. Update the following symbolic link:<br />

rm /bgl/BlueLight/ppcfloor<br />

ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />

/bgl/BlueLight/ppcfloor<br />

6. Determine the home directory of user bgdb2cli in the file<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile:<br />

echo ~bgdb2cli<br />

Change "INSTHOME=/u/bgdb2cli" to "INSTHOME=X", where X = the result<br />

from the echo ~bgdb2cli command.<br />



In the file /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc change "setenv<br />

INSTHOME /u/bgdb2cli" to "setenv INSTHOME X", where X = the result from the<br />

echo ~bgdb2cli command.<br />

Check the new settings:<br />

bglsn:~ # echo ~bgdb2cli<br />

/dbhome/bgdb2cli<br />

bglsn:~ # grep INSTHOME=<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

INSTHOME=/dbhome/bgdb2cli<br />

7. Rebind the new jar file.<br />

8. Update discovery directory.<br />

Everyone should run these two commands:<br />

cp /bgl/BlueLight/&lt;driver directory&gt;/ppc/bglsys/discovery/runPopIpPool /discovery
cp /bgl/BlueLight/&lt;driver directory&gt;/ppc/bglsys/discovery/ServiceNetwork.cfg /discovery

9. Update the database schema. If you are upgrading from V1R1M1, run:<br />

/bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />

--dbproperties XX --driver V1R2M1<br />

where XX = your db.properties file (for example,<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties).<br />

10.The Blue Gene/L upgrade is now complete. When ready, start the control<br />

system to resume any jobs that you stopped in Step 4.<br />

Having reviewed these steps, we first verify that the system can still run a job using LoadLeveler before starting the upgrade:
test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd
The output file shows that the job ran OK.



Step 1<br />

Now we install the new Blue Gene/L RPMs as shown in Example 6-63.<br />

Example 6-63 Installing new Blue Gene/L driver RPMs<br />

bglsn:/mnt/BGL/BGL_V1R2M2 # rpm -ivh bgl*.rpm<br />

Preparing... ########################################### [100%]<br />

1:bglos ########################################### [ 11%]<br />

2:bglblrtstool ########################################### [ 22%]<br />

=====================================================================================<br />

=<br />

=== RPM has installed but automatic building of the blrts toolchain not successful<br />

===<br />

=== Unable to attempt blrts toolchain build, /bgl/downloads not found<br />

===<br />

=== Follow the manual instructions for building the toolchain<br />

===<br />

=====================================================================================<br />

=<br />

error: %post(bglblrtstool-2006.1.2-2) scriptlet failed, exit status 1<br />

3:bgliontool ########################################### [ 33%]<br />

===================================================================================<br />

=== RPM has installed but automatic building of the IO toolchain not successful ===<br />

=== Unable to attempt blrts toolchain build, /bgl/downloads not found ===<br />

=== Follow the manual instructions to build the IO node toolchain ===<br />

===================================================================================<br />

error: %post(bgliontool-2006.1.2-2) scriptlet failed, exit status 1<br />

4:bglmpi ########################################### [ 44%]<br />

5:bglbaremetal ########################################### [ 56%]<br />

6:bglcmcs ########################################### [ 67%]<br />

7:bglcnk ########################################### [ 78%]<br />

8:bgldiag ########################################### [ 89%]<br />

9:bglmcp ########################################### [100%]<br />

bglsn:/mnt/BGL/BGL_V1R2M2 #<br />

The output from the rpm install command shows that we now need to rebuild the<br />

BLRTS toolchain.<br />

Step 2<br />

To rebuild the BLRTS toolchain, we first have to edit the retrieveToolChains.sh script located in the /bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/ directory, because its original curl -O commands would not work on our system due to our firewall configuration, which only allows Web traffic (HTTP protocol).



We got around this by copying the file to the /bgl/downloads directory and substituting wget commands for the curl -O commands. Example 6-64 shows the modified script.

Example 6-64 Modified retrieveToolChains.sh script<br />

bglsn:/bgl/downloads # cat retrieveToolChains.sh<br />

#! /bin/bash<br />

###################################################################<br />

# Product(s): */<br />

# 5733-BG1 */<br />

# */<br />

# (C) Copyright IBM Corp. 2004, 2004 */<br />

# All rights reserved. */<br />

# US Government Users Restricted Rights - */<br />

# Use, duplication or disclosure restricted */<br />

# by GSA ADP Schedule Contract with IBM Corp. */<br />

# */<br />

# Licensed Materials-Property of IBM */<br />

# */<br />

##################################################################<br />

# This script is to help facilitate the retrieval of the GNU<br />

# components necessary to build toolchains<br />

#<br />

# The script utilizes curl to go to the ftp.gnu.org site and<br />

# ftp the appropriate tarballs to your system. These<br />

# files will be installed in the CWD.<br />

#<br />

# To utilize the script, you will need to insure curl is<br />

# installed on your sytem and your system has ftp access to<br />

# outside sites.<br />

#<br />

# Once you run this script to download the appropriate<br />

# packages, you can install the BlueGene/L patches and<br />

# commence on building the Toolchain.<br />

##############################################################<br />

# Grab all of the Toolchain components necessary to build both<br />

# the blrts and the linux toolchains for BlueGene/L<br />

wget ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />

wget ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />
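If you prefer not to edit the script by hand, a substitution along the following lines achieves the same result. This is only a sketch; it assumes GNU sed and that the copied script still contains the original curl -O commands at the start of their lines:

   cd /bgl/downloads
   sed -i 's/^curl -O /wget /' retrieveToolChains.sh   # replace curl -O with wget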



We used the updated script to retrieve the toolchain, as shown in Example 6-65.<br />

Example 6-65 Retrieving the BLRTS tool chain<br />

bglsn:/bgl/downloads # ./retrieveToolChains.sh<br />

--12:02:45-- ftp://ftp.gnu.org/gnu/binutils/binutils-2.13.tar.gz<br />

=> `binutils-2.13.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/binutils ... done.<br />

==> PASV ... done. ==> RETR binutils-2.13.tar.gz ... done.<br />

Length: 12,790,277 (unauthoritative)<br />

100%[==================================================================<br />

====>] 12,790,277 1.67M/s ETA 00:00<br />

12:02:52 (1.67 MB/s) - `binutils-2.13.tar.gz' saved [12790277]<br />

--12:02:52-- ftp://ftp.gnu.org/gnu/gcc/gcc-3.2/gcc-3.2.tar.gz<br />

=> `gcc-3.2.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/gcc/gcc-3.2 ... done.<br />

==> PASV ... done. ==> RETR gcc-3.2.tar.gz ... done.<br />

Length: 26,963,731 (unauthoritative)<br />

100%[==================================================================<br />

====>] 26,963,731 1.73M/s ETA 00:00<br />

12:03:08 (1.69 MB/s) - `gcc-3.2.tar.gz' saved [26963731]<br />

--12:03:08-- ftp://ftp.gnu.org/gnu/gdb/gdb-5.3.tar.gz<br />

=> `gdb-5.3.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/gdb ... done.<br />

==> PASV ... done. ==> RETR gdb-5.3.tar.gz ... done.<br />

Length: 14,707,600 (unauthoritative)<br />



100%[==================================================================<br />

====>] 14,707,600 1.72M/s ETA 00:00<br />

12:03:16 (1.67 MB/s) - `gdb-5.3.tar.gz' saved [14707600]<br />

--12:03:16-- ftp://ftp.gnu.org/gnu/glibc/glibc-2.2.5.tar.gz<br />

=> `glibc-2.2.5.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />

==> PASV ... done. ==> RETR glibc-2.2.5.tar.gz ... done.<br />

Length: 16,657,505 (unauthoritative)<br />

100%[==================================================================<br />

====>] 16,657,505 1.74M/s ETA 00:00<br />

12:03:26 (1.68 MB/s) - `glibc-2.2.5.tar.gz' saved [16657505]<br />

--12:03:26--<br />

ftp://ftp.gnu.org/gnu/glibc/glibc-linuxthreads-2.2.5.tar.gz<br />

=> `glibc-linuxthreads-2.2.5.tar.gz'<br />

Resolving ftp.gnu.org... 199.232.41.7<br />

Connecting to ftp.gnu.org[199.232.41.7]:21... connected.<br />

Logging in as anonymous ... Logged in!<br />

==> SYST ... done. ==> PWD ... done.<br />

==> TYPE I ... done. ==> CWD /gnu/glibc ... done.<br />

==> PASV ... done. ==> RETR glibc-linuxthreads-2.2.5.tar.gz ...<br />

done.<br />

Length: 226,543 (unauthoritative)<br />

100%[==================================================================<br />

====>] 226,543 1.15M/s<br />

12:03:26 (1.15 MB/s) - `glibc-linuxthreads-2.2.5.tar.gz' saved [226543]<br />



Example 6-66 shows the files that were downloaded into the /bgl/downloads<br />

directory.<br />

Example 6-66 Downloaded files for the tool chain<br />

bglsn:/bgl/downloads # ls -lrt<br />

total 69751<br />

-rwxr-xr-x 1 root root 1913 Apr 1 12:02 retrieveToolChains.sh<br />

-rw-r--r-- 1 root root 12790277 Apr 1 12:02 binutils-2.13.tar.gz<br />

-rw-r--r-- 1 root root 26963731 Apr 1 12:03 gcc-3.2.tar.gz<br />

-rw-r--r-- 1 root root 14707600 Apr 1 12:03 gdb-5.3.tar.gz<br />

-rw-r--r-- 1 root root 16657505 Apr 1 12:03 glibc-2.2.5.tar.gz<br />

-rw-r--r-- 1 root root 226543 Apr 1 12:03<br />

glibc-linuxthreads-2.2.5.tar.gz<br />

We can now proceed to rebuild the BLRTS toolchain. This process can take some time (it took us about half an hour), so in Example 6-67 we show only the last part of it. Because we could not tell from the output alone whether the process was successful, we tested the return code to verify that the tool chain had in fact been built correctly.

Example 6-67 BLRTS tool chain built<br />

bglsn:/bgl/downloads #<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/toolchain/buildBlrtsToolChain<br />

.sh<br />

..... > .....<br />

make[4]: Entering directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />

rver'<br />

n=`echo gdbserver | sed 's,x,x,'`; \<br />

if [ x$n = x ]; then n=gdbserver; else true; fi; \<br />

/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c gdbserver<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/bin/$n; \<br />

/bin/sh /bgl/downloads/gnu/gdb-5.3/install-sh -c -m 644<br />

/bgl/downloads/gnu/gdb-5.3/gdb/gdbserver/gdbserver.1<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/blrts-gnu/man/man1/$n.1<br />

make[4]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb/gdbse<br />

rver'<br />

make[3]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />

make[2]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build/gdb'<br />



make[1]: Leaving directory<br />

`/bgl/downloads/gnu/build-powerpc-bgl-blrts-gnu/gdb-5.3-build'<br />

bglsn:/bgl/downloads # echo $?<br />

0<br />

Step 3<br />

We saved our customized scripts (containing our local NFS and GPFS configuration) so that we can re-create them after the upgrade, as sketched below.
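A minimal sketch of this backup step follows. The backup directory name is our own choice, and we assume that the only site-customized script is our sitefs file in the rc3.d directory mentioned in the readme; adapt the file names to your site:

   mkdir -p /bgl/local-rc3.d-backup
   cp -p /bgl/BlueLight/ppcfloor/dist/etc/rc.d/rc3.d/sitefs /bgl/local-rc3.d-backup/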

Step 4<br />

Because the tool chain build was successful, we stop the control system<br />

(Example 6-68).<br />

Example 6-68 Stopping the bglmaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [11220]<br />

ciodb started [11221]<br />

mmcs_server started [11222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [11220]<br />

ciodb started [11221]<br />

mmcs_server started [11222]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

BGLMaster is not started<br />

We also check the LoadLeveler status and stop it, as shown in Example 6-69.

Example 6-69 Stopping LoadLeveler<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 4318 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />



The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g stop<br />

llctl: Sent stop command to host bglsn.itso.ibm.com<br />

llctl: Sent stop command to host bglfen1.itso.ibm.com<br />

llctl: Sent stop command to host bglfen2.itso.ibm.com<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

04/01 12:49:39 llstatus: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llstatus: 2512-301 An error occurred while receiving data from the LoadL_negotiator<br />

daemon on host bglsn.itso.ibm.com.<br />

Step 5<br />

We now update the ppcfloor symbolic link:<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

/bgl/BlueLight/ppcfloor<br />

Step 6<br />

Next, we check and update db2profile and db2cshrc files, as shown in<br />

Example 6-70.<br />

Example 6-70 Updating the DB2 environment files<br />

bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

# Also in this file ..<br />

bglsn:/ # vi /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc<br />

# Check the original password in the old driver db.properties<br />

database_name=bgdb0<br />

database_user=bglsysdb<br />

database_password=bglsysdb<br />

# database_password=db24bgls<br />

database_schema_name=bglsysdb<br />

system=BGL<br />

min_pool_connections=1<br />

# Web Console Configuration<br />

mmcs_db_server_ip=127.0.0.1<br />



mmcs_db_server_port=32031<br />

mmcs_max_reply_size=8029<br />

mmcs_max_history_size=2097152<br />

mmcs_redirect_server_ip=default<br />

mmcs_redirect_server_port=32032<br />
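As an alternative to editing both files with vi, the INSTHOME update can also be scripted. This is only a sketch; it assumes GNU sed (as shipped with SLES) and uses the home directory reported by echo ~bgdb2cli:

   NEWHOME=$(echo ~bgdb2cli)
   sed -i "s|^INSTHOME=.*|INSTHOME=${NEWHOME}|" /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
   sed -i "s|^setenv INSTHOME .*|setenv INSTHOME ${NEWHOME}|" /bgl/BlueLight/ppcfloor/bglsys/bin/db2cshrc
   grep INSTHOME /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile   # verify the new setting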

Now we update the DB2 user password in the db.properties file that is located in<br />

the /bgl/BlueLight/ppcfloor/bglsys/bin/ directory (use vi) and use this updated<br />

password to rebuild the jar file.<br />

Step 7<br />

We rebind the new jar file.<br />

bglsn:/ # java -cp /bgl/BlueLight/ppcfloor/bglsys/bin/ido.jar<br />

com.ibm.db2.jcc.DB2Binder -url jdbc:db2://127.0.0.1:50001/bgdb0<br />

-user bglsysdb -password bglsysdb -size 200<br />

Step 8<br />

We update the /discovery directory saving the original configuration files, as<br />

shown in Example 6-71.<br />

Example 6-71 Updating the /discovery directory<br />

bglsn:/ # cp /bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/runPopIpPool<br />

/discovery<br />

bglsn:/ # cp<br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/discovery/ServiceNetwork.cfg<br />

/discovery<br />

Step 9<br />

We also update the database schema, as shown in Example 6-72.<br />

Example 6-72 Updating the Blue Gene/L database schema<br />

bglsn:/ # /bgl/BlueLight/ppcfloor/bglsys/schema/updateSchema.pl<br />

--dbproperties /bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

--driver V1R2M1<br />

BlueGene/L database schema update utility, version V1R2M1.<br />

database bgdb0<br />

schema bglsysdb<br />

username bglsysdb<br />

previous driver 020-6<br />

target driver 080-6<br />

verbose level: 0<br />



Log will be written to<br />

/bgl/BlueLight/logs/BGL/updateSchema-2006-04-01-13:18:29.log.<br />

Connected to database bgdb0 as user bglsysdb<br />

Updating to driver 080-6 schema.<br />

Finished updating database schema to driver 080-6.<br />

We have now completed the upgrade process (which constitutes the error<br />

injection for this scenario).<br />

<strong>Problem</strong> determination<br />

To determine the problem, we followed these steps:<br />

1. We start the control system as shown in Example 6-73.<br />

Example 6-73 Starting the bglmaster process<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-13:25:44.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [26694]<br />

ciodb started [26695]<br />

mmcs_server started [26696]<br />

monitor stopped<br />

perfmon stopped<br />

2. We start LoadLeveler (see Example 6-74).<br />

Example 6-74 Starting LoadLeveler<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 6608 PPC64 Linux2<br />



glfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

3. We check that the upgrade was successful (Example 6-75).<br />

Example 6-75 Checking the driver version<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ version<br />

OK<br />

mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />

mmcs$<br />

4. We now try and boot a block:<br />

OK<br />

mmcs_db_server $Name: V1R2M2_120_2006 $ Mar 28 2006 17:54:10<br />

mmcs$ allocate R000_128<br />

This command apparently hung for 10 minutes: the sitefs file still has the entry to start GPFS, and the ~ppcfloor/dist/etc/rc.d/init.d/gpfs script waits through its 10 minute timeout because GPFS cannot start.

Example 6-76 shows the output from an I/O node log before the 10 minutes<br />

GPFS load timeout.<br />

Example 6-76 I/O node log messages showing GPFS cannot start<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0: Starting GPFS<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {51}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0: Disabling<br />

protocol version 1. Could not load host key<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {51}.0:<br />

bglsn:/bgl/BlueLight/logs/BGL #<br />



However, we were able to ssh to the I/O node and run the df command (see<br />

Example 6-77).<br />

Example 6-77 Running the df command on an I/O node<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1640 5882 22% /<br />

/dev/root 7931 1640 5882 22% /<br />

172.30.1.1:/bgl 9766544 4008736 5757808 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

This check proves that the sshd daemon was started. Using the ps -ef command on that I/O node during the 10 minute GPFS timeout, we confirmed that the S40gpfs process was still running (with a sleep thread active).

After the 10 minutes, we see the block booted, but the /bubu file system<br />

(GPFS) is not available (see Example 6-78).<br />

Example 6-78 GPFS startup timeout has expired<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting SSH<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0: Starting GPFS<br />

Apr 01 13:29:00 (I) [1088451808] root:R000_128 {17}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: 2: /usr/lpp/mmfs/bin/mmautoload: not found<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0: Disabling<br />

protocol version 1. Could not load host key<br />

Apr 01 13:29:01 (I) [1088451808] root:R000_128 {17}.0:<br />

Apr 01 13:39:01 (I) [1088451808] root:R000_128 {17}.0:<br />

R00-M0-N0-I:J18-U11 /var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up<br />

on I/O nod<br />

Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />

[ciod:initialized]<br />

Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0: e ionode2 :<br />

172.30.2.2<br />

Starting syslog services<br />

Starting ciod<br />

Starting XL Compiler Environment for I/O node<br />

BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

/bin/sh: can't access tty; job control turned off<br />

$ ciod: version "Mar 28 2006 17:56:13"<br />

ciod: running in virtual node mode with 32 processors<br />



Apr 01 13:39:02 (I) [1088451808] root:R000_128 {17}.0:<br />

/var/etc/rc.d/rc3.d/S40gpfs: GPFS did not come up on I/O node ionode2 :<br />

172.30.2.2<br />

5. We switch the ppcfloor link back to the old driver to see whether we can immediately recover the situation and begin running jobs. First, however, we stop the control system before making the change, as shown in Example 6-79.

Example 6-79 Stopping bglmaster to prepare reverting to previous driver<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

BGLMaster is not started<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

6. We switch the symbolic link, as shown in Example 6-80.<br />

Example 6-80 Reverting to the previous driver version (ppcfloor link)<br />

bglsn:/ # #ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # ln -fs /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 12:53 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />
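Note that the first two ln attempts in Example 6-80 did not actually change the link: without the -n (--no-dereference) option, ln follows the existing ppcfloor symbolic link and creates the new link inside the directory it points to, which is why the link had to be removed and re-created. A one-step alternative, assuming GNU ln as shipped with SLES, is sketched here:

   ln -sfn /bgl/BlueLight/V1R2M1_020_2006-060110/ppc /bgl/BlueLight/ppcfloor
   ls -als /bgl/BlueLight/ppcfloor    # verify that the link now points to the old driver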

7. We then restart the control system (Example 6-81).<br />

Example 6-81 Restarting bglmaster after reverting to previous driver<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster statut<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [28068]<br />

ciodb started [28069]<br />

mmcs_server started [28070]<br />

monitor stopped<br />

perfmon stopped<br />



8. We try to run a job. We start by allocating a block as in Example 6-82.<br />

Example 6-82 Allocating a block (mixed bgl driver versions)<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

mmcs$ allocate R000_128<br />

OK<br />

mmcs$ quit<br />

OK<br />

mmcs_db_console is terminating, please wait...<br />

mmcs_db_console: closing database connection<br />

mmcs_db_console: closed database connection<br />

mmcs_db_console: closing console port<br />

mmcs_db_console: closed console port<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode1<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4008760 5757784 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />



Here we can see that the GPFS file system is again working at the old Blue Gene/L driver level. Now we try to run the job, as shown in Example 6-83.

Example 6-83 Running a job (mixed driver versions)<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.50" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.50.0<br />

test1 4/1 12:56 R 50 small bglfen1<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

test1@bglfen1:/bglscratch/test1> ls -lrt<br />

total 1598796<br />

-rwxr-xr-x 1 test1 itso 7553497 2006-03-21 11:33 hello-file-1.rts<br />

-rwxr--r-- 1 test1 itso 1826286 2006-03-21 15:53 hello.rts<br />

-rwxr-xr-x 1 test1 itso 7553579 2006-03-22 11:05 hello-file-2.rts<br />

drwxr-xr-x 4 test1 itso 4096 2006-03-23 10:16 applications<br />

-rw-r--r-- 1 test1 itso 638 2006-03-23 11:08 hello.cmd<br />

-rw-r--r-- 1 test1 itso 639 2006-03-23 11:16 hello128.cmd<br />

-rw-r--r-- 1 test1 itso 635 2006-03-24 16:48 ior-gpfs.cmd<br />

-rw-r--r-- 1 test1 itso 2643 2006-03-24 16:53 ior-gpfs.out.ciod-hung-scenario<br />

-rw-r--r-- 1 test1 itso 9435 2006-03-24 17:20 ior-gpfs.err.ciod-hung-scenario<br />

-rw-r--r-- 1 test1 itso 637 2006-03-28 17:41 cds_ior-gpfs.cmd<br />

-rwxr-xr-x 1 test1 itso 7546841 2006-03-29 11:47 hello-file.rts<br />

-rw-r--r-- 1 test1 itso 3755 2006-03-29 11:49 core.0<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stdout<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:36 R000_128-233.stderr<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stdout<br />

-rw-r--r-- 1 root root 0 2006-03-29 19:38 R000_128-235.stderr<br />

-rw-r--r-- 1 test1 itso 1536 2006-03-29 19:44 hello-gpfs.out<br />

-rw-r--r-- 1 test1 itso 10415 2006-03-29 19:44 hello-gpfs.err<br />

-rwxr-xr-x 1 test1 itso 7539442 2006-03-30 10:02 hello-world.rts<br />

-rw-r--r-- 1 test1 itso 1605021456 2006-03-30 19:05 hello.out<br />

-rw-r--r-- 1 test1 itso 9339 2006-03-30 19:06 hello.err<br />

-rw-r--r-- 1 test1 itso 0 2006-04-01 15:48 ior-gpfs.out<br />

-rw-r--r-- 1 test1 itso 4284 2006-04-01 15:48 ior-gpfs.err<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Sat Apr 1 14:41:26 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />



Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

6.96 1024.00 32.00 0.108306 17.77 1.71 0<br />

delaying 1 seconds . . .<br />

read 76.59 1024.00 32.00 0.004671 1.67 0.001274 0<br />

delaying 1 seconds . . .<br />

write 6.61 1024.00 32.00 0.026399 19.02 1.35 1<br />

delaying 1 seconds . . .<br />

read 77.22 1024.00 32.00 0.004509 1.65 0.001283 1<br />

delaying 1 seconds . . .<br />

write 6.65 1024.00 32.00 0.024646 18.50 1.85 2<br />

delaying 1 seconds . . .<br />

read 76.83 1024.00 32.00 0.004666 1.66 0.001285 2<br />

delaying 1 seconds . . .<br />

This test proves that the job did indeed run as expected. Now, we switch the<br />

ppcfloor link back to the new driver, as shown in Example 6-84.<br />

Example 6-84 Switching driver versions<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster stop<br />

Stopping BGLMaster<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # cd /<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:31 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M1_020_2006-060110/ppc<br />

bglsn:/ # rm /bgl/BlueLight/ppcfloor<br />

bglsn:/ # ln -s /bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

/bgl/BlueLight/ppcfloor<br />

bglsn:/ # ls -als /bgl/BlueLight/ppcfloor<br />

0 lrwxrwxrwx 1 root root 41 Apr 1 14:46 /bgl/BlueLight/ppcfloor -><br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc<br />

bglsn:/ # ./bglmaster start<br />



-bash: ./bglmaster: No such file or directory<br />

bglsn:/ # cd -<br />

/bgl/BlueLight/ppcfloor/bglsys/bin<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster start<br />

Starting BGLMaster<br />

Logging to<br />

/bgl/BlueLight/logs/BGL/bglsn-bglmaster-2006-0401-14:47:00.log<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./bglmaster status<br />

idoproxy started [28946]<br />

ciodb started [28947]<br />

mmcs_server started [28948]<br />

monitor stopped<br />

perfmon stopped<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin #<br />

9. Having switched back to the latest Blue Gene/L driver, we now upgrade or install the compatible GPFS code just for the I/O nodes to see whether they can then access the GPFS file system (see Example 6-85).

Example 6-85 Installing or updating GPFS for I/O nodes (for new driver)<br />

bglsn:/mnt/LPP/GPFS/MCP_PTF11 # rpm --root<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys \ /bin/bglOS --nodeps -ivh<br />

gpfs*.rpm<br />

Preparing... ########################################### [100%]<br />

1:gpfs.msg.en_US ########################################### [ 20%]<br />

2:gpfs.base ########################################### [ 40%]<br />

3:gpfs.docs ########################################### [ 60%]<br />

4:gpfs.gpl ########################################### [ 80%]<br />

5:gpfs.gplbin ########################################### [100%]<br />

10.We check that the installed or upgraded GPFS for the I/O nodes works (see<br />

Example 6-86).<br />

Example 6-86 Checking the new GPFS for I/O node installation<br />

bglsn:/ # rpm --root<br />

/bgl/BlueLight/V1R2M2_120_2006-060328/ppc/bglsys/bin/bglOS -qa | grep<br />

gpfs<br />

gpfs.base-2.3.0-11<br />

gpfs.gpl-2.3.0-11<br />

gpfs.msg.en_US-2.3.0-11<br />

gpfs.docs-2.3.0-11<br />

gpfs.gplbin-2.3.0-11<br />



11.We test that we can boot a block and that the GPFS file system becomes<br />

available, as shown in Example 6-87.<br />

Example 6-87 Booting a block and checking GPFS availability<br />

mmcs$ allocate R000_128<br />

OK<br />

mmcs$ quit<br />

OK<br />

mmcs_db_console is terminating, please wait...<br />

mmcs_db_console: closing database connection<br />

mmcs_db_console: closed database connection<br />

mmcs_db_console: closing console port<br />

mmcs_db_console: closed console port<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ssh root@ionode5<br />

BusyBox v1.00-rc2 (2006.03.24-21:52+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4025264 5741280 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3304448 1135293440 1% /bubu<br />

We can see that the GPFS file system (/bubu) is available again on the I/O nodes simply by installing the new GPFS code.

12.We try to run a LoadLeveler job, but first we free up the block, as shown in<br />

Example 6-88.<br />

Example 6-88 Unallocating the block for preparing the LoadLeveler run<br />

bglsn:/bgl/BlueLight/ppcfloor/bglsys/bin # ./mmcs_db_console<br />

connecting to mmcs_server<br />

connected to mmcs_server<br />

connected to DB2<br />

mmcs$ list_blocks<br />

OK<br />

R000_128 root(0) connected<br />

mmcs$ free R000_128<br />

OK<br />

test1@bglfen1:/bglscratch/test1> llq<br />

llq: There is currently no job status to report.<br />

test1@bglfen1:/bglscratch/test1> llstatus<br />



Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 1397 PPC64 Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64 Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

test1@bglfen1:/bglscratch/test1> llsubmit ior-gpfs.cmd<br />

llsubmit: The job "bglfen1.itso.ibm.com.51" has been submitted.<br />

test1@bglfen1:/bglscratch/test1> llq<br />

Id Owner Submitted ST PRI Class Running On<br />

------------------------ ---------- ----------- -- --- ------------ ----------bglfen1.51.0<br />

test1 4/1 13:26 R 50 small bglfen2<br />

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted<br />

test1@bglfen1:/bglscratch/test1> tail -f ior-gpfs.out<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Sat Apr 1 15:11:14 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />

clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

7.46 1024.00 32.00 0.100233 16.60 1.51 0<br />

delaying 1 seconds . . .<br />

read 77.16 1024.00 32.00 0.004662 1.65 0.002156 0<br />



delaying 1 seconds . . .<br />

write 6.61 1024.00 32.00 0.034111 18.72 1.87 1<br />

delaying 1 seconds . . .<br />

read 76.49 1024.00 32.00 0.004613 1.67 0.002134 1<br />

delaying 1 seconds . . .<br />

write 6.25 1024.00 32.00 0.024525 20.08 1.67 2<br />

delaying 1 seconds . . .<br />

read 77.15 1024.00 32.00 0.004607 1.65 0.001303 2<br />

delaying 1 seconds . . .<br />

write 6.88 1024.00 32.00 0.023640 18.38 1.27 3<br />

delaying 1 seconds . . .<br />

read 76.10 1024.00 32.00 0.004649 1.68 0.001303 3<br />

delaying 1 seconds . . .<br />

This series of steps proves that a LoadLeveler-initiated job ran successfully as well.

Lessons learned<br />

We learned the following lessons:<br />

► When upgrading the Blue Gene/L driver version, we must also re-install the GPFS code for the I/O nodes.
► If we upgrade the Blue Gene/L driver version and forget to upgrade the GPFS code, or do not have it available for installation, then Blue Gene/L operation can be restored by simply switching the ppcfloor symbolic link back to the original version.
► Again, we found that if there is a problem, GPFS startup on the I/O nodes holds the block boot for its 10 minute timeout.



6.3.7 Duplicate IP addresses in /etc/hosts<br />

This scenario deals with errors in the /etc/hosts file on the SN (duplicate IP<br />

addresses).<br />

Error injection<br />

Here the error is injected with no blocks allocated. We changed the /etc/hosts file on the SN by altering the IP address listed for ionode2 so that it duplicates the address of ionode1, as shown in Example 6-89.

Example 6-89 Duplicate IP address in /etc/hosts<br />

# BGL I/O nodes<br />

172.30.2.1 ionode1<br />

172.30.2.1 ionode2<br />

172.30.2.3 ionode3<br />

172.30.2.4 ionode4<br />

172.30.2.5 ionode5<br />

172.30.2.6 ionode6<br />

172.30.2.7 ionode7<br />

<strong>Problem</strong> determination<br />

When we ran the job this time using the mpirun command, it ran without any<br />

problems, as shown in Example 6-90.<br />

Example 6-90 Job running with mpirun (duplicate I/O node IP address)<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128 -exe<br />

/bglscratch/test1/applications/IOR/IOR.rts -args "-f<br />

/bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />

/bglscratch/test1/applications/IOR<br />

+ mpirun -partition R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts<br />

-args '-f /bglscratch/test1/applications/IOR/ior-inputs' -cwd<br />

/bglscratch/test1/applications/IOR<br />

IOR-2.8.3: MPI Coordinated Test of Parallel I/O<br />

Run began: Wed Apr 5 15:23:49 2006<br />

Command line used: /bglscratch/test1/applications/IOR/IOR.rts -f<br />

/bglscratch/test1/applications/IOR/ior-inputs<br />

Machine: Blue Gene L ionode3<br />

Summary:<br />

api = MPIIO (version=2, subversion=0)<br />

test filename = /bubu/Examples/IOR/IOR-output-MPIIO-0<br />

access = single-shared-file<br />



clients = 128 (16 per node)<br />

repetitions = 5<br />

xfersize = 32768 bytes<br />

blocksize = 1 MiB<br />

aggregate filesize = 128 MiB<br />

delaying 1 seconds . . .<br />

access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) iter<br />

------ --------- ---------- --------- -------- -------- -------- ---write<br />

6.97 1024.00 32.00 0.122974 17.73 1.55 0<br />

delaying 1 seconds . . .<br />

read 76.59 1024.00 32.00 0.004633 1.66 0.002440 0<br />

delaying 1 seconds . . .<br />

write 7.46 1024.00 32.00 0.026446 16.61 1.59 1<br />

delaying 1 seconds . . .<br />

read 76.72 1024.00 32.00 0.004517 1.66 0.001283 1<br />

delaying 1 seconds . . .<br />

write 6.14 1024.00 32.00 0.023408 20.57 1.39 2<br />

delaying 1 seconds . . .<br />

read 75.71 1024.00 32.00 0.004603 1.69 0.001307 2<br />

delaying 1 seconds . . .<br />

write 5.99 1024.00 32.00 0.025519 21.04 1.97 3<br />

delaying 1 seconds . . .<br />

read 75.01 1024.00 32.00 0.004542 1.70 0.001311 3<br />

delaying 1 seconds . . .<br />

write 6.90 1024.00 32.00 0.025010 18.18 1.71 4<br />

delaying 1 seconds . . .<br />

read 75.54 1024.00 32.00 0.004614 1.69 0.001288 4<br />

Max Write: 7.46 MiB/sec (7.83 MB/sec)<br />

Max Read: 76.72 MiB/sec (80.45 MB/sec)<br />

Run finished: Wed Apr 5 15:25:45 2006<br />

test1@bglfen1:/bglscratch/test1><br />

We now login through ssh into an I/O node and check the /etc/hosts file:<br />

$ cat /etc/hosts<br />

172.30.2.1 ionode1<br />

172.30.2.2 ionode2<br />

172.30.2.3 ionode3<br />

As we can see, the /etc/hosts file on the I/O nodes was not affected by the<br />

injected error.<br />



We now investigate possible effects on GPFS (see Example 6-91).<br />

Example 6-91 Investigating GPFS<br />

bglsn:/etc # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

2 ionode1 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />

7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

bglsn:/etc #<br />

We also see that the output of the mmgetstate -a command on the SN still reports all nodes as active.

Lessons learned<br />

Duplicate IP addresses in the /etc/hosts file on the SN have little effect on running jobs. Even though we can assume GPFS is going to be impacted, this is revealed only when a GPFS cluster change occurs. A simple check, such as the one sketched below, can catch duplicates early.
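The following one-line check is a sketch of such a test; it lists any IP address that appears more than once in /etc/hosts (comment lines are skipped):

   awk '!/^#/ && NF {print $1}' /etc/hosts | sort | uniq -d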

6.3.8 Missing I/O node in /etc/hosts<br />

This scenario shows the effect of missing I/O node entries in the /etc/hosts file on the SN.

Error injection<br />

Example 6-92 shows the error injected with no blocks allocated.<br />

Example 6-92 Missing I/O node in /etc/hosts on the SN<br />

bglsn:/ # cat /etc/hosts<br />

.... > ....<br />

172.30.2.1 ionode1<br />

#172.30.2.2 ionode2<br />

172.30.2.3 ionode3<br />

172.30.2.4 ionode4<br />

172.30.2.5 ionode5<br />

172.30.2.6 ionode6<br />

.... > ....<br />



<strong>Problem</strong> determination<br />

When we ran the job using the mpirun command there were no problems. So we<br />

further investigate the state of GPFS on the SN (see Example 6-93).<br />

Example 6-93 The mmgetstate command reveals a missing node<br />

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />

7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

mmgetstate: The following nodes could not be reached: ionode2<br />

Here, we see that ionode2 is not listed in the output, so we check the /etc/hosts file entries to get the IP address of ionode2:

bglsn:/tmp # grep ionode2 /etc/hosts<br />

#172.30.2.2 ionode2<br />

bglsn:/tmp # grep ionode2 /etc/hosts<br />

We then try to log in to ionode2 using ssh to check the GPFS status, as shown in Example 6-94.

Example 6-94 GPFS status<br />

bglsn:/tmp # ssh root@172.30.2.2<br />

BusyBox v1.00-rc2 (2006.01.09-19:48+0000) Built-in shell (ash)<br />

Enter 'help' for a list of built-in commands.<br />

$ df<br />

Filesystem 1K-blocks Used Available Use% Mounted on<br />

rootfs 7931 1644 5878 22% /<br />

/dev/root 7931 1644 5878 22% /<br />

172.30.1.1:/bgl 9766544 4076160 5690384 42% /bgl<br />

172.30.1.33:/bglscratch 36700160 2411424 34288736 7% /bglscratch<br />

/dev/bubu_gpfs1 1138597888 3353600 1135244288 1% /bubu<br />

$ hostname<br />

ionode2<br />



Lessons learned<br />

We learned the following lessons:<br />

► The mmgetstate command relies on system name resolution (in this case, the /etc/hosts file) to be able to query the state of the remote nodes.
► Missing entries in the /etc/hosts file do not necessarily mean that GPFS is unable to load on the I/O nodes. Because the GPFS configuration has not been altered, the outcome actually depends on which node initiated the GPFS communication.

Note: We emphasize again the importance of correct and identical name<br />

resolution for all nodes in all inter-connected GPFS clusters.<br />
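A simple loop such as the one sketched below can be used on the SN to confirm that every I/O node name resolves; the node names are the ones used in our configuration:

   for n in ionode1 ionode2 ionode3 ionode4 ionode5 ionode6 ionode7 ionode8
   do
       getent hosts $n > /dev/null || echo "$n does not resolve"
   done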

6.3.9 Adding an extra alias for the SN in /etc/hosts<br />

This scenario adds an extra alias for the SN in the /etc/hosts file on the SN.

Error injection<br />

Here the error is injected with no blocks allocated.<br />

We changed the /etc/hosts file by adding an SN alias:<br />

# Service Node interfaces<br />

10.0.0.1 bglsn_sn.itso.ibm.com bglsn_sn<br />

#172.30.1.1 bglsn_fn.itso.ibm.com bglsn_fn<br />

172.30.1.1 fluff bglsn_fn.itso.ibm.com bglsn_fn<br />

<strong>Problem</strong> determination<br />

When we ran the job this time using the mpirun command, it ran without any<br />

problems. So, while the job was running, we investigate the state of GPFS from<br />

the SN (see Example 6-95).<br />

Example 6-95 Checking GPFS node status<br />

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

2 ionode1 active<br />

3 ionode2 active<br />

4 ionode3 active<br />

5 ionode4 active<br />

6 ionode5 active<br />



7 ionode6 active<br />

8 ionode7 active<br />

9 ionode8 active<br />

We ran the job, and it was successful. We then stopped GPFS, freed the block,<br />

and restarted GPFS, as shown in Example 6-96.<br />

Example 6-96 Restarting the bgIO GPFS cluster<br />

bglsn:/tmp # mmstartup -a<br />

Wed Apr 5 15:49:25 EDT 2006: mmstartup: Starting GPFS ...<br />

mmdsh: ionode7 rsh process had return code 1.<br />

mmdsh: ionode4 rsh process had return code 1.<br />

mmdsh: ionode1 rsh process had return code 1.<br />

mmdsh: ionode8 rsh process had return code 1.<br />

mmdsh: ionode6 rsh process had return code 1.<br />

mmdsh: ionode3 rsh process had return code 1.<br />

mmdsh: ionode5 rsh process had return code 1.<br />

mmdsh: ionode2 rsh process had return code 1.<br />

ionode7: ssh: connect to host ionode7 port 22: Connection refused<br />

ionode4: ssh: connect to host ionode4 port 22: Connection refused<br />

ionode1: ssh: connect to host ionode1 port 22: Connection refused<br />

ionode8: ssh: connect to host ionode8 port 22: Connection refused<br />

ionode3: ssh: connect to host ionode3 port 22: Connection refused<br />

ionode6: ssh: connect to host ionode6 port 22: Connection refused<br />

ionode5: ssh: connect to host ionode5 port 22: Connection refused<br />

ionode2: ssh: connect to host ionode2 port 22: Connection refused<br />

However, the errors from the mmstartup command were expected because no block was initialized at the time:

bglsn:/tmp # mmgetstate -a<br />

Node number Node name GPFS state<br />

-----------------------------------------<br />

1 bglsn_fn active<br />

bglsn:/tmp #<br />

Lessons learned<br />

We learn that the mmgetstate command relies on entries in the /etc/hosts file to be able to query the state of the remote nodes, but the addition of alias entries in the /etc/hosts file does not mean that GPFS will be unable to load on the I/O nodes (name resolution still works correctly).



6.4 Job submission scenarios<br />

This section presents scenarios related to mpirun and IBM LoadLeveler. We try to address the common problems observed by users of the system; in the real world there might be other problems that we have not addressed in this chapter. Our basic goal is to introduce a common problem determination methodology for the job running environment on Blue Gene/L.

Depending on the job submission method used (mpirun or LoadLeveler), we have divided this section into two subsections, one for running jobs using mpirun and one for LoadLeveler. In each subsection we first make sure that we can submit a job that executes successfully. We then inject the errors, resulting in a job failure, and perform problem determination following the methodology described in Chapter 2, “Problem determination methodology” on page 55.

6.4.1 The mpirun command: scenarios description<br />

In this section, we discuss problems generally encountered while submitting jobs to a Blue Gene/L system. Job submission has several mandatory prerequisites: the Blue Gene/L database (DB2) and the control system processes must be up and running (at the higher end), appropriate partitions must be available (number of nodes, partition shape), and the file systems must be available (at the lower end).

In our scenarios we do not deal with application errors (wrong libraries, execution arguments, bad programming, and so forth); rather, we focus on the job submission action itself and how the job interacts with the system. We tested the following list of scenarios, concentrating on what we consider basic errors observed while submitting jobs on the system:

1. Environment variables not set before submitting jobs<br />

– $MMCS_SERVER_IP variable on FEN<br />

– $BRIDGE_CONFIG_FILE variable on SN<br />

– $DB_PROPERTY variable on SN<br />

– Database user profile not sourced (db2profile)<br />

2. Remote command execution not set correctly (rsh)<br />

Each scenario consists of the following topics:<br />

► Error injection<br />

► <strong>Problem</strong> determination<br />

► Lessons learned<br />



6.4.2 The mpirun command: environment variables not set<br />

As previously mentioned, in this scenario we deal with unset environment<br />

variables.<br />

Scenario 1<br />

Environment variable MMCS_SERVER_IP is not set on the Front-End Node (FEN).<br />

Error injection<br />

We omit the line that sets MMCS_SERVER_IP in the user's profile (~/.bashrc or ~/.cshrc):

test1@bglfen1:~> env |grep MMCS_SERVER_IP<br />

test1@bglfen1:~><br />

As we can see, the environment variable is missing (not set).<br />

<strong>Problem</strong> determination<br />

By default, the mpirun command writes standard output (1) and error (2) on to the<br />

console from where the job was submitted, unless specifically redirected in the<br />

application. In our scenario, the job uses the console, as shown in Example 6-97.<br />

Example 6-97 Job’s stderr(2) on the console (missing $MMCS_SERVER_IP)<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />

/bglscratch/test1/hello.rts -cwd /bglscratch/test1<br />

FE_MPI (ERROR): BG/L Service Node not set:<br />

FE_MPI (ERROR): Please set 'MMCS_SERVER_IP'<br />

env. variable or<br />

FE_MPI (ERROR): specify the Service Node<br />

with '-host' argument.<br />

Usage:<br />

or<br />

mpirun [options]<br />

mpirun [options] binary [arg1 arg2 ...]<br />

Try "mpirun -h" for more details.<br />

We can see the reason for the failure from the command output: we omitted both the $MMCS_SERVER_IP variable and the -host argument. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find the missing environment variable (MMCS_SERVER_IP).
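To fix the problem, either set the variable in the user's profile on the FEN or pass the -host argument to mpirun. The following is only a sketch; bglsn is the Service Node host name in our environment:

   # In ~/.bashrc on the FEN
   export MMCS_SERVER_IP=bglsn    # host name (or IP address) of the SN

   # Or specify the Service Node on the command line instead
   mpirun -host bglsn -partition R000_J108_32 -exe /bglscratch/test1/hello.rts -cwd /bglscratch/test1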



Scenario 2<br />

Environment variable BRIDGE_CONFIG_FILE is not set on the SN.<br />

Error injection<br />

We omit the line with BRIDGE_CONFIG_FILE in user’s profile (~/.bashrc or ~/.cshrc).<br />

Example 6-98 shows the ~/.bashrc file for user test1.<br />

Note: Normally, users (other than the system administrator, root) are not allowed to log in to the SN.

Example 6-98 Environment variable BRIDGE_CONFIG_FILE not set on SN<br />

test1@bglfen1:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

#export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />

<strong>Problem</strong> determination<br />

By default, the mpirun command writes standard output (1) and error (2) on to the<br />

console from where the job was submitted, unless specifically redirected in the<br />

application. In our scenario, the job uses the console, as shown in Example 6-99.<br />

Example 6-99 Jobs stderr(2) for missing BRIDGE_CONFIG_FILE on SN<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_128 -np 128<br />

-exe /bglscratch/test1/hello-world.rts -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

BE_MPI (ERROR): The environment parameter<br />

"BRIDGE_CONFIG_FILE" is not set, set it to point to the configuration<br />

file<br />

BE_MPI (Info) : Starting cleanup sequence<br />

BE_MPI (Info) : BG/L Job alredy terminated /<br />

hasn't been added<br />

BE_MPI (Info) : Partition wasn't allocated by<br />

the mpirun - No need to remove<br />

FE_MPI (Info) : Back-End invoked:<br />

FE_MPI (Info) : - Service Node: bglsn<br />

FE_MPI (Info) : - Back-End pid: 6979 (on<br />

service node)<br />



BE_MPI (Info) : == BE completed ==<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to get machine<br />

serial number (bridge configuration file not found?) (failure #15)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

We can see from the command output the reason for the failure. Also, for clarity, we used the -verbose 1 option. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find a missing environment variable (BRIDGE_CONFIG_FILE not set).

Scenario 3<br />

Environment variable DB_PROPERTY is not set on the SN.<br />

Error injection<br />

We omit the line with DB_PROPERTY in user’s profile (~/.bashrc or ~/.cshrc).<br />

Example 6-100 shows the ~/.bashrc file for user test1.<br />

Example 6-100 Environment variable DB_PROPERTY not set on SN<br />

test1@bglsn:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

#export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />

Problem determination

By default, the mpirun command writes standard output (1) and error (2) onto the console from where the job was submitted, unless specifically redirected in the application. In our scenario, the job uses the console, as shown in Example 6-101.

Example 6-101 Job’s stderr(2) for missing DB_PROPERTY variable on SN<br />

test1@bglfen1:/bglscratch/test1> mpirun -partition R000_J108_32 -exe<br />

/bglscratch/test1/hello.rts -cwd /bglscratch/test1 -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

BE_MPI (ERROR): db.properties file not found<br />



BE_MPI (ERROR): If it is not in the CWD,<br />

please set DB_PROPERTY env. var. to point to the file's location.<br />

BE_MPI (Info) : Starting cleanup sequence<br />

BE_MPI (Info) : BG/L Job alredy terminated /<br />

hasn't been added<br />

BE_MPI (Info) : Partition wasn't allocated by<br />

the mpirun - No need to remove<br />

FE_MPI (Info) : Back-End invoked:<br />

FE_MPI (Info) : - Service Node: bglsn<br />

FE_MPI (Info) : - Back-End pid: 13613 (on<br />

service node)<br />

BE_MPI (Info) : == BE completed ==<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to locate<br />

db.properties file (failure #14)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

We can see from the command output the reason for the failure. Also, for clarity, we used the -verbose 1 option. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find a missing environment variable (DB_PROPERTY not set).
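The checks for Scenarios 2 and 3 can be combined into one small test on the SN. The following is a minimal sketch; the variable names are the ones used above, and the test only verifies that each variable is set and points to a readable file:

# Verify the bridge and database configuration variables on the SN
for var in BRIDGE_CONFIG_FILE DB_PROPERTY; do
    val=$(eval echo \$$var)
    if [ -z "$val" ]; then
        echo "$var is not set"
    elif [ ! -r "$val" ]; then
        echo "$var points to a missing or unreadable file: $val"
    else
        echo "$var is OK: $val"
    fi
done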

Scenario 4<br />

Database user profile not sourced (db2profile) on the SN.<br />

Error injection<br />

We omit sourcing the db2profile in the user's profile (~/.bashrc or ~/.cshrc).

Example 6-102 shows the ~/.bashrc file for user test1.<br />

Example 6-102 db2profile not sourced on the SN<br />

test1@bglsn:~> cat ~/.bashrc<br />

if [ $hstnm = "bglsn" ]<br />

then<br />

export BRIDGE_CONFIG_FILE=/bglhome/test1/bridge_config_file.txt<br />

export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

# source /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile<br />

fi<br />



Problem determination

By default, the mpirun command writes standard output (1) and error (2) onto the console from where the job was submitted, unless specifically redirected in the application. In our scenario, the job uses the console, as shown in Example 6-103.

Example 6-103 db2profile not set on the SN<br />

test1@bglfen1:/bglscratch/test1/applications/IOR> mpirun -partition<br />

R000_128 -np 128 -exe /bglscratch/test1/applications/IOR/IOR.rts -args<br />

"-f /bglscratch/test1/applications/IOR/ior-inputs" -cwd<br />

/bglscratch/test1/applications/IOR<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 148<br />

file = DBConnection.cc<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />



-CLI INVALID HANDLE----cliRC<br />

= -2<br />

line = 167<br />

file = DBConnection.cc<br />

Values used for connection: database is {bgdb0}, user is {bglsysdb},<br />

schema is {bglsysdb}<br />

Unable to connect, aborting...<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- !<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- ! Receiver thread exited<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- ! Error code = 11<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- !<br />

FE_MPI (ERROR): blk_receive_incoming_message()<br />

- Switching to cleanup sequence...<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. One or more threads<br />

died (failure #57)<br />

We can see from the command output the reason for the failure. By referring to “The mpirun checklist” on page 166 and following the checks mentioned there, we find that the db2profile was not sourced on the SN.
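A rough way to test for this condition on the SN before submitting jobs is to check whether the DB2 environment has been initialized in the current shell. The following sketch assumes that db2profile exports the DB2INSTANCE variable and puts the db2 command in the PATH, and it uses the db2profile path of our installation:

# Rough check of whether db2profile has been sourced in this shell
if [ -z "$DB2INSTANCE" ] || ! type db2 >/dev/null 2>&1; then
    echo "DB2 environment not initialized - sourcing db2profile"
    . /bgl/BlueLight/ppcfloor/bglsys/bin/db2profile
fi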

6.4.3 The mpirun command: incorrect remote command<br />

execution (rsh) setup<br />

In this scenario, we simulate bad rsh authentication between one FEN and the SN. This prevents the mpirun front-end (running on the FEN) from talking to the mpirun back-end (running on the SN).

Error injection<br />

The error has been injected by commenting out (with “#”) the line corresponding to user test1@bglfen1 in the ~/.rhosts file of user test1@bglsn (see Example 6-104).

Example 6-104 User’s .rhosts file<br />

test1@bglfen1:~> rsh bglsn ls<br />

Permission denied.<br />

test1@bglsn:~> cat ~/.rhosts<br />



# bglfen1 test1<br />

bglfen2 test1<br />

bglsn test1<br />

Problem determination

We submit a job and check the console (by default, the terminal from where the job was submitted). We see the stderr(2) messages shown in Example 6-105.

Example 6-105 remote command execution fails (FEN to SN)<br />

test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />

/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

Permission denied.<br />

FE_MPI (ERROR): waitForBackendConnections() -<br />

child process died unexpectedly<br />

FE_MPI (ERROR): Failed to get control and data<br />

connections from service node<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to execute<br />

Back-End mpirun on service node (failure #13)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

The error shown in Example 6-105 is observed on the console when a job is submitted using mpirun. The key error messages are Permission denied and Failed to execute Back-End mpirun on service node.

Because we got the error from the mpirun command, and there is little information (stderr(2)) about the reason for the failure (Permission denied), we check according to “The mpirun checklist” on page 166. Going through the steps one by one, we can see that the environment variables are set correctly on the FEN

and SN. We drill down and use the -only_test_protocol flag for mpirun, which<br />

tests the remote shell environment across the FENs and the SN, as shown in<br />

Example 6-106.<br />

Example 6-106 The mpirun argument of only_test_protocol<br />

test1@bglfen1:~> mpirun -np 32 -partition R000_J106_32 -exe<br />

/bglscratch/test1/hello-world.rts -cwd /bglscratch/test1 -verbose 1<br />

-only_test_protocol<br />

FE_MPI (Info) : Initializing MPIRUN<br />



FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

FE_MPI (WARN) :<br />

======================================<br />

FE_MPI (WARN) : = Front-End - Only checking<br />

protocol =<br />

FE_MPI (WARN) : = No actual usage of the BG/L<br />

Bridge =<br />

FE_MPI (WARN) :<br />

======================================<br />

Permission denied.<br />

FE_MPI (ERROR): waitForBackendConnections() -<br />

child process died unexpectedly<br />

FE_MPI (ERROR): Failed to get control and data<br />

connections from service node<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Failed to execute<br />

Back-End mpirun on service node (failure #13)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

The highlighted messages in Example 6-106 indicate the same set of errors that we observed during our initial job run. This points us to a problem with the remote command execution. Depending on the remote shell used (rsh in this case), we proceed to the respective checklist. Because rsh is used by mpirun by default, we decide to execute a simple date command from the FEN to the SN:

test1@bglfen1:~> rsh bglsn date<br />

Permission denied.<br />

This clearly states Permission denied. Therefore, we first check the ~/.rhosts file of the test1 user and detect the missing line for test1@bglfen1. We correct the file and re-submit the job successfully.
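A small sketch that can be run from each FEN before submitting jobs is shown below; it checks that passwordless rsh to the SN works and flags the Permission denied condition. The host name bglsn is from our test environment:

# Verify passwordless rsh from this FEN to the Service Node
out=$(rsh bglsn date 2>&1)
case "$out" in
    *denied*|*refused*) echo "rsh to bglsn failed: $out"
                        echo "Check ~/.rhosts on the SN for a line: $(hostname) $USER" ;;
    *)                  echo "rsh to bglsn is OK: $out" ;;
esac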

Lessons learned<br />

We learned the following lessons:<br />

► Environment variables on the FEN and SN should be set appropriately before submitting jobs on the Blue Gene/L system.
► The users' remote shell environment (rsh in this case) needs to be set up appropriately in order to successfully submit a job on the system.



6.4.4 LoadLeveler: scenarios description<br />

The scenarios in this section deal with problems that you can encounter when running IBM LoadLeveler in a Blue Gene/L environment. We assume that you have basic knowledge of IBM LoadLeveler. See the IBM LoadLeveler Using and Administering Guide, SA22-7881, for reference.

To better understand LoadLeveler problems on Blue Gene/L, a basic knowledge<br />

of Blue Gene/L system administration is also required. The following sections<br />

present the scenarios that we developed and tested for this book.<br />

6.4.5 LoadLeveler: job failed<br />

In this section, we analyze some possible reasons for a LoadLeveler job to fail.<br />

This scenario does not contain manual error injection, because we encountered<br />

these issues during our testing.<br />

Problem description

There are many reasons why a job could fail. In this scenario, we do not look into application-specific failures. Based on the discussion in 4.4.3, “How LoadLeveler plugs into Blue Gene/L” on page 172, an MPI job submitted to the LoadLeveler queue might fail if one of the following happens:

► LoadLeveler cannot talk to mpirun processes<br />

► mpirun processes cannot talk to LoadLeveler<br />

► Some files or file systems are not accessible<br />

► Problems with Blue Gene/L partitions

Detailed checking<br />

If any of the hypothesized problems above occur, there might be error messages recorded in the job stderr(2) file. Therefore, the first place to start is the job output.

To determine where the job saves those files, the job command file is one place to look. The file used in this scenario is listed in Example 4-16 on page 178. It contains the following two lines:

#@ output =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />



LoadLeveler replaces the variables $(schedd_host), $(jobid), and $(stepid) with their values; combined, they make up the job identifier. The two file names are:

/bgl/loadl/out/hello.bglfen2.2.0.out<br />

/bgl/loadl/out/hello.bglfen2.2.0.err<br />

Another way to determine where the output files are stored is to use the llq -l command, which displays a long list of job attributes, as shown in Example 6-107. The output of this command is relevant only if the job is still in the Running (R) state.

Example 6-107 Finding job stdout and stderr(2) from llq -l command<br />

loadl@bglfen1:/bgl/loadlcfg> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen2.2.0              test1      3/30 14:17  R  50  small        bglfen1

1 job step(s) in queue, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted

loadl@bglfen1:/bgl/loadlcfg> llq -l bglfen2.2.0<br />

...<br />

Out: /bglscratch/test1/ior-gpfs.out<br />

Err: /bglscratch/test1/ior-gpfs.err<br />

...<br />

Investigating the stderr(2) and stdout(1) files reveals that the stdout(1) file is empty, which means the job either did not start or did not advance far in execution before it failed. However, there is information in the stderr(2) file; its contents are listed in Example 6-108.

Example 6-108 Job stderr(2) contents<br />

FE_MPI (Info) : Initializing MPIRUN<br />

FE_MPI (Debug): Started, pid=2983,<br />

exe=/bgl/BlueLight/V1R2M1_020_2006-060110/ppc/bglsys/bin/mpirun,<br />

cwd=/bgl/loadlcfg<br />

FE_MPI (Debug): Collecting arguments ...<br />

FE_MPI (Debug): Checking for arguments in the<br />

environment<br />

FE_MPI (Debug): Parsing command line arguments<br />

FE_MPI (Debug): Checking usage...<br />



FE_MPI (Info) : Scheduler interface library<br />

loaded<br />

FE_MPI (Debug): Collecting arguments from<br />

external source (scheduler)<br />

FE_MPI (Debug): Calling external source to<br />

fill parameters...<br />

FE_MPI (Debug): Setting failure #12<br />

FE_MPI (Debug): Closing messaging queues<br />

FE_MPI (Debug): FE threads are down with the<br />

following return codes:<br />

FE_MPI (Debug): FE Sender : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Receiver : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Input : N/A (not<br />

initialized)<br />

FE_MPI (Debug): FE Output : N/A (not<br />

initialized)<br />

FE_MPI (Debug): Child shell process never<br />

started.<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Front-End<br />

initialization failed (failure #12)<br />

FE_MPI (Info) : == FE completed ==<br />

FE_MPI (Info) : == Exit status: 1 ==<br />

In Example 6-108 the two ERROR messages are:<br />

FE_MPI (ERROR): Failure list:<br />

FE_MPI (ERROR): - 1. Front-End<br />

initialization failed (failure #12)<br />

These messages come from mpirun (FE_MPI). They indicate a Front-End initialization failure (failure #12).

From the following line in the job command file, we can see that the job was run with the mpirun option flag -verbose 2:

#@ arguments = -verbose 2 -exe /bgl/hello/hello.rts

The next thing to try is to increase the verbosity to a higher value, such as -verbose 4, to get more details on the ERROR messages. In our case, the -verbose 4 option produced the messages shown in Example 6-108.



From here we get another clue: the option flag value set on the #@ arguments<br />

line in the job command file does not seem to get passed to mpirun, which<br />

means that mpirun cannot talk to LoadLeveler.<br />

Drilling down the LoadLeveler checklist, we scan through the following items:<br />

► LoadLeveler run queue shows normal operation.<br />

► Simple job submission: This is a simple "Hello world!" job that fails. Therefore, the simple job submission check is not relevant.

► Job command file: The file used for this job has been checked and altered as<br />

previously mentioned (-verbose 4) with little impact.<br />

► LoadLeveler processes/daemons seem to be working as expected (llstatus<br />

and llq commands).<br />

► LoadLeveler logs: This is a job failure, so the log files to check are StartLog and StarterLog. The StartLog file contains the messages shown in Example 6-109.

Example 6-109 StartLog file contents<br />

03/23 10:23:24 TI-57 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

READY<br />

03/23 10:23:24 TI-57 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-59 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

RUNNING<br />

03/23 10:23:24 TI-59 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-61 Notification of user tasks termination received<br />

from Starter for job step bglsn.itso.ibm.com.21.0<br />

03/23 10:23:24 TI-62 JOB_STATUS: Job bglsn.itso.ibm.com.21.0 Status =<br />

COMPLETED<br />

03/23 10:23:24 TI-62 QUEUED_STATUS: Step bglsn.itso.ibm.com.21.0<br />

queueing status to schedd at bglsn.itso.ibm.com<br />

03/23 10:23:24 TI-62 Cleanup_dir: Job<br />

/root/execute/bglsn.itso.ibm.com.21.0 Removing directory.<br />



The log messages indicate that the job completed; there is no failure information. Looking into the StarterLog file, however, reveals some error messages, even though it also indicates that the job completed, as shown in Example 6-110.

Example 6-110 StarterLog file contents<br />

03/23 10:23:24 TI-0 bglsn.21.0 llcheckpriv program exited, termsig = 0,<br />

coredump = 0, retcode = 0<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending READY status to<br />

Startd<br />

03/23 10:23:24 TI-0 bglsn.21.0 Main task program started (pid=11862<br />

process count=1).<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending RUNNING status to<br />

Startd<br />

03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />

received from another process on this machine - bglsn.itso.ibm.com.<br />

This daemon "LoadL_starter" is running protocol version (130).<br />

03/23 10:23:24 TI-0 bglsn.21.0 Task exited, termsig = 0, coredump = 0,<br />

retcode = 1<br />

03/23 10:23:24 TI-0 bglsn.21.0 User environment epilog not run, no<br />

program was specified.<br />

03/23 10:23:24 TI-0 bglsn.21.0 Epilog not run, no program was<br />

specified.<br />

03/23 10:23:24 TI-0 bglsn.itso.ibm.com.21.0 Sending COMPLETED status to<br />

Startd<br />

03/23 10:23:24 TI-0 bglsn.21.0 ********** STARTER exiting<br />

***************<br />

The ERROR message Illegal protocol (121)... indicates that some libraries are mismatched. Because all other checks have been performed, the last step is to make sure that the environment variables and library links are set up correctly on the SN. Example 6-111 shows how we checked the variables.

Example 6-111 Checking Blue Gene/L environment variables<br />

loadl@bglsn:~> echo $BRIDGE_CONFIG_FILE<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config<br />

loadl@bglsn:~> echo $DB_PROPERTY<br />

/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties<br />

loadl@bglsn:~> echo $MMCS_SERVER_IP<br />

bglsn.itso.ibm.com<br />



Checking the library links is more subtle, because this problem might occur if a link was removed (easy to detect) or a library binary was replaced (not obvious; requires further investigation). The following error message from the LoadLeveler Starter log does not explicitly point to a specific library:

03/23 10:23:24 TI-0 LoadLeveler: 2539-453 Illegal protocol (121),<br />

received from another process on this machine - bglsn.itso.ibm.com.<br />

This daemon "LoadL_starter" is running protocol version (130).<br />

This message means that you need to check all LoadLeveler-related libraries on both the SN and the FENs. The script shown in Example 4-14 on page 175 can be used for this checking.
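If that script is not at hand, a simpler alternative is to compare checksums of the LoadLeveler shared libraries across the nodes. The following is a minimal sketch; the library directory and the host names are assumptions based on our environment and might differ on your system:

# Compare LoadLeveler library checksums on the SN and the FENs (run from the SN)
LIBDIR=/opt/ibmll/LoadL/full/lib        # assumed LoadLeveler library directory
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "md5sum $LIBDIR/*.so* 2>/dev/null" | sort
done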

In our scenario, we replaced two libraries (binaries) on all nodes, and the problem was resolved.

Lessons learned<br />

We learned the following lessons:<br />

► A simple "Hello world!" job can serve as a LoadLeveler checking tool. If this<br />

job fails, it’s a good indication that mpirun and basic job submission are OK.<br />

Nevertheless, investigating the job stderr(2) and stdout(1) files is a good<br />

place to start. See 4.3, “Submitting jobs using built-in tools” on page 149.<br />

► When the error messages from the job are not explicit, there are usually other<br />

ways to get more debugging information. Because the job goes through<br />

different components such as the LoadLeveler daemons and mpirun<br />

processes, their respective log files should be investigated to find traces (and<br />

errors) of the job. The various logs might reflect a different status for the job, depending on its state. Understanding the job life cycle at the various levels is key to finding errors in the job submission process. See 4.3.3, “Example of submitting a job using mpirun” on page 163.

► Checking the libraries used by LoadLeveler is an easily overlooked step when investigating a job failure. However, when such protocol errors occur, we found that mismatched libraries are often the cause.

6.4.6 LoadLeveler: job in hold state<br />

In this section, we investigate some possible causes of a job being held in the queue for a long time. We do not perform any error injection, because we already encountered some of these issues while working with LoadLeveler.



Problem description

A job submitted to LoadLeveler is going to run in batch mode. Depending on the<br />

resources available, the job might not be able to run immediately after<br />

submission. It is important to check the status of the job in the LoadLeveler<br />

queue. In this scenario, the job is seen in the Hold state. There are many hypotheses. Some of them are as follows:
► A dynamic problem occurs on the node, which the Scheduler is not aware of at the time it dispatches the job.
► When the Starter starts the job, it encounters a problem and the job cannot be started.
► The job could be held by its owner or an administrator for some reason.

Detailed checking<br />

Typically, the job's owner notices that the job has not run for a while after being submitted and that no job output has been produced. The LoadLeveler administrator could also spot the job in the Hold state in the queue. See Example 6-112.

Example 6-112 Job in Hold state<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  H  50  small

1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0 preempted

A quick check on LoadLeveler shows the cluster working normally. See<br />

Example 6-113.<br />

Example 6-113 Normal LoadLeveler cluster status<br />

test1@bglfen1:/bgl/loadl> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 1 0 Idle 0 0.02 109 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.07 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 1 jobs 0 running<br />



Total Machines 2 machines 1 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

When we verify the run queue (llq command), we can see that the queue is OK. If the job is in the Hold state, it might be because a user or an administrator put it there. Example 6-114 shows the queue status.

Example 6-114 Job in Hold state by user<br />

loadl@bglfen1:/bgl/loadl> llq -l bglfen1.3.0 | more<br />

=============== Job Step bglfen1.itso.ibm.com.3.0 ===============<br />

Job Step Id: bglfen1.itso.ibm.com.3.0<br />

Job Name: bglfen1.itso.ibm.com.3<br />

Step Name: 0<br />

Structure Version: 10<br />

Owner: loadl<br />

Queue Date: Thu 06 Apr 2006 10:01:18 AM EDT<br />

Status: User Hold<br />

Reservation ID:<br />

Requested Res. ID:<br />

Scheduling Cluster:<br />

....<br />

Next, we check in the LoadL_config file for the following keywords<br />

(Example 6-115):<br />

► MAX_JOB_REJECT<br />

► ACTION_ON_MAX_REJECT<br />

Example 6-115 MAX_JOB_REJECT and ACTION_ON_MAX_REJECT<br />

# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold. The default value
# is 0, which indicates a rejected job will immediately be canceled
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects
# by specifying a value of -1.
#
MAX_JOB_REJECT = 0
#
# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold
# when the number of rejects reaches the MAX_JOB_REJECT value. When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the
# number of rejects reaches the MAX_JOB_REJECT value. The default
# value is HOLD.
#
ACTION_ON_MAX_REJECT = HOLD

If the job is already in the Hold state, the llq -s command cannot analyze its status. To work around this, we released the job using the llhold -r command (putting it back into the Idle state) and then quickly checked it with llq -s bglfen1.23.0, as shown in Example 6-116.

Example 6-116 Releasing held job to check for reasons<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  I  50  small        (alloc)

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted

test1@bglfen1:/bgl/loadl>llq -s bglfen1.23.0<br />

...<br />

...<br />

==================== EVALUATIONS FOR JOB STEP bglfen1.itso.ibm.com.23.0<br />

====================<br />

Step waiting on following partitions to become FREE:<br />

RMP26Mr151607127<br />

Step waiting on following partitions to become FREE:<br />

RMP26Mr151607127<br />

test1@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.23.0             test1      3/26 14:06  H  50  small

1 job step(s) in queue, 0 waiting, 0 pending, 0 running, 1 held, 0 preempted

From Example 6-116, we can see that the Scheduler allocates the node resources to the job. The status shown by the llq -s command is normal.

The next check is to look into the LoadLeveler StartLog and StarterLog files to<br />

see if there is any information about jobid bglfen1.23.0, as shown in<br />

Example 6-117.<br />

Example 6-117 Job REJECTED according to StartLog<br />

03/26 14:28:59 TI-1667 JOB_START: Received start order from<br />

bglfen1.itso.ibm.com.<br />

03/26 14:28:59 TI-1667 inside StartdStep constructor for 269709664<br />

03/26 14:28:59 TI-1667 shutdown_active_count increasing from 0 to 1<br />

03/26 14:28:59 TI-1667 JOB_START: Step bglfen1.itso.ibm.com.23.0:<br />

Starting. Starter process id = 19333<br />

03/26 14:28:59 TI-4 Starter Table:<br />

StarterPid = 19333, ClientMachine = bglfen1.itso.ibm.com, user = test1,<br />

State = 2048, Flags = 8196, sid = bglfen1.itso.ibm.com.23.0, StateTimer<br />

= 1143401339<br />

03/26 14:28:59 TI-1671 Notification of user tasks termination received<br />

from Starter for job step bglfen1.itso.ibm.com.23.0<br />

03/26 14:28:59 TI-1672 JOB_STATUS: Job bglfen1.itso.ibm.com.23.0 Status<br />

= REJECTED<br />

03/26 14:28:59 TI-1672 QUEUED_STATUS: Step bglfen1.itso.ibm.com.23.0<br />

queueing status to schedd at bglfen1.itso.ibm.com<br />

03/26 14:28:59 TI-1672 Cleanup_dir: Job<br />

/home/loadl/execute/bglfen1.itso.ibm.com.23.0 Removing directory.<br />

03/26 14:28:59 TI-1672 ruid(0) euid(5001)<br />

03/26 14:28:59 TI-1672 shutdown_active_count decreasing from 1 to 0<br />



From the StartLog, we can see that the job is rejected but there is no further<br />

information provided. The next log to check is StarterLog, shown in<br />

Example 6-118.<br />

Example 6-118 Error found in StarterLog: permission denied<br />

03/26 14:07:09 TI-0 ********** STARTER starting up<br />

***********<br />

03/26 14:07:09 TI-0 LoadLeveler: LoadL_starter started, pid = 19333<br />

03/26 14:07:09 TI-0 Sending starter pid 19333.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Prolog not run, no program was<br />

specified.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 run_dir =<br />

/home/loadl/execute/bglfen1.itso.ibm.com.23.0<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Sending request for executable to<br />

Schedd<br />

03/26 14:28:59 TI-0 03/26 14:28:59 TI-0 bglfen1.23.0 User environment<br />

prolog not run, no program was specified.<br />

03/26 14:28:59 TI-0 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglfen1.itso.ibm.com, errno =2.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 llcheckpriv program exited, termsig =<br />

0, coredump = 0, retcode = -2<br />

03/26 14:28:59 TI-0 bglfen1.23.0 LoadL_starter: Cannot open stdout<br />

file. /bgl/loadl/out/hello.bglfen1.23.0.out: Permission denied (13)<br />

03/26 14:28:59 TI-0 bglfen1.23.0 User environment epilog not run, no<br />

program was specified.<br />

03/26 14:28:59 TI-0 bglfen1.23.0 cleanupStdErr: cannot stat<br />

/bgl/loadl/out/hello.bglfen1.23.0.err. rc=-1 errno=2 [No such file or<br />

directory]<br />

03/26 14:28:59 TI-0 bglfen1.23.0 Epilog not run, no program was<br />

specified.<br />

03/26 14:28:59 TI-0 bglfen1.itso.ibm.com.23.0 Sending REJECTED status<br />

to Startd<br />

03/26 14:28:59 TI-0 bglfen1.23.0 ********** STARTER exiting<br />

***************<br />

In the StarterLog, we can see that a “Permission denied” problem is associated with the stdout(1) file: /bgl/loadl/out/hello.bglfen1.23.0.out.

Note: There might not be enough information to pinpoint on which node the job was started. The StartLog and StarterLog files of multiple nodes in the LoadLeveler cluster might have to be checked.



In our case, it turned out that the output directory belongs to user ID loadl, while user ID test1's job is trying to write its stderr(2) and stdout(1) files into it. We corrected the permissions and solved the problem.
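A minimal sketch of the corresponding check is shown below; the output directory is the one from our job command file, and the right fix (ownership change or group write permission) depends on how user IDs are organized on your system:

# Run as the job owner: verify write access to the LoadLeveler output directory
ls -ld /bgl/loadl/out
touch /bgl/loadl/out/.writetest 2>/dev/null \
    && { rm -f /bgl/loadl/out/.writetest; echo "write access is OK"; } \
    || echo "no write access - correct the directory ownership or permissions as root"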

Lessons learned<br />

We learned the following lessons:<br />

► A job can be put into Hold state for many reasons. The job ID is essential for<br />

checking its status.<br />

► The job needs to be released in order for llq -s to assess the<br />

reasons (and get additional clues).<br />

► LoadLeveler NegotiatorLog might not have any information about this job. In<br />

this case, the job was rejected by a Starter on one of the (LoadLeveler)<br />

nodes.<br />

► The most difficult part is to look for the job ID in the StartLog or StarterLog on every node, as in the sketch below.
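The following sketch can ease that search; the host names and the LoadLeveler log directory are taken from our environment and are assumptions elsewhere:

# Search all LoadLeveler nodes for traces of a given job ID
JOBID=bglfen1.itso.ibm.com.23.0
LOGDIR=/home/loadl/log                  # assumed LoadLeveler log directory
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "grep -l $JOBID $LOGDIR/Start*Log* 2>/dev/null"
done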

6.4.7 LoadLeveler: job disappears<br />

In this section, we investigate some possible causes of a job disappearing from<br />

the queue.<br />

Error injection<br />

We pass a wrong argument to the mpirun command in the LoadLeveler job command file hello.cmd. The wrong argument is -verbose 5 (not supported by mpirun). This causes the job to fail quickly and disappear from the LoadLeveler queue.

Problem description

After the job is submitted, if it is a small job and there are no other jobs in the<br />

queue, it might run quickly and disappear from the queue. Sometimes, the job<br />

owner does not get a chance to see the job in Running state in the LoadLeveler<br />

queue. We could identify the following reasons:<br />

► The job might have run quickly (finished successfully) and nothing is wrong.<br />

► The job runs and fails quickly.<br />

► The job can be canceled by a user or an administrator.



Detailed checking<br />

We start by checking the job stderr(2) and stdout(1) files, following the job<br />

command checklist in 4.4.9, “LoadLeveler checklist” on page 186. The job<br />

command file that we used for this job is hello.cmd. It includes the following lines:<br />

#@ output =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).out<br />

#@ error =<br />

/bgl/loadl/out/hello.$(schedd_host).$(jobid).$(stepid).err<br />

The two files (stdout(1) and stderr(2)) are located in /bgl/loadl/out/ directory, as<br />

shown in Example 6-119.<br />

Example 6-119 Checking job output files<br />

test1@bglfen1:/bgl/loadl/out> ls -ltr<br />

-rw-r--r-- 1 loadl loadl 0 Mar 26 16:15 hello.bglfen1.24.0.out<br />

-rw-r--r-- 1 loadl loadl 193 Mar 26 16:15 hello.bglfen1.24.0.err<br />

Notice that the stdout(1) file is empty, thus we check the stderr(2) contents (listed<br />

in Example 6-120).<br />

Example 6-120 Job error in stderr(2) file<br />

FE_MPI (ERROR): Incorrect verbose option: '5'<br />

Usage:
mpirun [options]
or
mpirun [options] binary [arg1 arg2 ...]
Try "mpirun -h" for more details.

The error message comes from the mpirun front-end and indicates that there is a usage error when mpirun is invoked. This leads back to the job command file hello.cmd. The keyword to check is #@ arguments:

#@ arguments = -verbose 5 -exe /bglscratch/test1/hello.rts

That concludes this scenario, because the -verbose option flag is set to 5, which is an invalid value.

If the job is canceled by another user or by the system (LoadLeveler) administrator, the stderr(2) and stdout(1) files show different types of error messages; in that case, the next step would be to check the StartLog and StarterLog files on the node that ran the job.



Lessons learned<br />

We learned the following lessons:<br />

► When a job runs and fails quickly, you might not see it in the queue.<br />

► It is essential to set the output and error keywords in the job command file.<br />

► Check the job stdout(1) and stderr(2) for errors first.<br />

6.4.8 LoadLeveler: Blue Gene/L is absent<br />

In this section, we analyze what happens if LoadLeveler cannot talk to the

Blue Gene/L database. As LoadLeveler uses the bridge API to interrogate and<br />

submit requests to the Blue Gene/L Control system, the libraries needed to<br />

perform these tasks are critical for LoadLeveler.<br />

Error injection<br />

We intentionally remove the symbolic link to a Blue Gene/L provided library that<br />

LoadLeveler uses.<br />

Problem description

If the llstatus command displays the message Blue Gene is absent,<br />

LoadLeveler is not able to communicate with the Blue Gene/L control server<br />

through the bridge API. On the Blue Gene/L side, there are several scenarios<br />

that might render the bridge API unavailable. Those are described in the core<br />

system scenarios, in 6.2, “Blue Gene/L core system scenarios” on page 267. In<br />

this scenario, we focus on the libraries that LoadLeveler uses to communicate<br />

with the bridge API. Even though this can rarely happen, the following can cause<br />

problems with libraries:<br />

► A library binary file can be removed or corrupted.<br />

► A symbolic link pointing to the library can be broken.<br />

► A LoadLeveler or Blue Gene/L upgrade can be performed on some of the nodes, but not on all.

► An upgrade of libraries might also alter the symbolic links to point to the<br />

wrong binaries.<br />

Detailed checking<br />

Before injecting the error, we perform some basic checking to make sure<br />

LoadLeveler is operational and a job is submitted and run successfully (see<br />

Example 6-121).<br />



Example 6-121 LoadLeveler basic checks<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.00 27 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.00 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is present<br />

All machines on the machine_list are present.<br />

loadl@bglfen1:~> llctl -g stop<br />

llctl: Sent stop command to host bglsn.itso.ibm.com<br />

llctl: Sent stop command to host bglfen1.itso.ibm.com<br />

llctl: Sent stop command to host bglfen2.itso.ibm.com<br />

loadl@bglfen1:~><br />

loadl@bglfen1:~> llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

loadl@bglfen1:~><br />

The link to a library binary is then removed. LoadLeveler is started and no errors<br />

are returned. See Example 6-122.<br />

Example 6-122 Starting LoadLeveler<br />

loadl@bglfen1:~> llctl -g start<br />

llctl: Attempting to start LoadLeveler on host bglsn.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />



llctl: Attempting to start LoadLeveler on host bglfen1.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

llctl: Attempting to start LoadLeveler on host bglfen2.itso.ibm.com<br />

LoadL_master 3.3.2.0 rmer0611a 2006/03/15 SLES 9.0 130<br />

CentralManager = bglsn.itso.ibm.com<br />

A user or the administrator checks the status of LoadLeveler. The llstatus command shows Blue Gene is absent, as shown in Example 6-123.

Example 6-123 llstatus message “Blue Gene is absent”<br />

loadl@bglfen1:~> llstatus<br />

Name Schedd InQ Act Startd Run LdAvg Idle Arch<br />

OpSys<br />

bglfen1.itso.ibm.com Avail 0 0 Idle 0 0.07 2 PPC64<br />

Linux2<br />

bglfen2.itso.ibm.com Avail 0 0 Idle 0 0.04 9999 PPC64<br />

Linux2<br />

PPC64/Linux2 2 machines 0 jobs 0 running<br />

Total Machines 2 machines 0 jobs 0 running<br />

The Central Manager is defined on bglsn.itso.ibm.com<br />

The BACKFILL scheduler with Blue Gene support is in use<br />

Blue Gene is absent<br />

All machines on the machine_list are present.<br />

Any jobs submitted at this time will stay in Idle state, as shown in<br />

Example 6-124.<br />

Example 6-124 Job in Idle state<br />

loadl@bglfen1:/bgl/loadl> llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
bglfen1.19.0             loadl      3/24 13:43  I  50  small

1 job step(s) in queue, 1 waiting, 0 pending, 0 running, 0 held, 0 preempted



We first check the LoadLeveler NegotiatorLog, which shows an error opening a library, as shown in Example 6-125.

Example 6-125 The loadBridgeLibrary error in NegotiatorLog

03/28 18:04:48 TI-1 Machine Number index is 0, adapter list size is 0<br />

03/28 18:04:48 TI-1 Machine Number index is 1, adapter list size is 0<br />

03/28 18:04:48 TI-1 Machine Number index is 2, adapter list size is 0<br />

03/28 18:04:48 TI-1 BG: int BgManager::loadBridgeLibrary() - start<br />

03/28 18:04:48 TI-1 int BgManager::loadBridgeLibrary(): Failed to open<br />

library, /usr/lib64/libbglbridge.so, errno=25 (libtableapi.so.1: cannot<br />

open shared object file: No such file or directory)<br />

03/28 18:04:48 TI-1 int BgManager::initializeBg(BgMachine*): Failed to<br />

load Bridge API library<br />

03/28 18:04:48 TI-1 *************************************************<br />

03/28 18:04:48 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

03/28 18:04:48 TI-1 *************************************************<br />

03/28 18:04:48 TI-1<br />

03/28 18:04:48 TI-1 LoadLeveler: LoadL_negotiator started, pid = 19594<br />

03/28 18:04:48 TI-1 void<br />

LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />

master. Transaction route flag is now central manager sending<br />

transaction CMnotifyCmd to Master<br />

03/28 18:04:48 TI-6 LoadLeveler: Listening on port 9614 service<br />

LoadL_negotiator<br />

The message points to the library /usr/lib64/libbglbridge.so, which fails to load because its dependency, the shared object file libtableapi.so.1, can no longer be resolved through the symbolic link that we removed earlier.

Note: Should this error occur in real life, we would check both the link and the<br />

binary file.<br />
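A minimal sketch of such a check is shown below; the library names are the ones from the error message, and any other Blue Gene/L libraries that your level of LoadLeveler links to can be added:

# Inspect the bridge library, its symbolic link targets, and its runtime dependencies
ls -l /usr/lib64/libbglbridge.so* /usr/lib64/libtableapi.so* 2>&1
ldd /usr/lib64/libbglbridge.so | grep "not found"    # no output means all dependencies resolve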

In this scenario, after the link is restored, LoadLeveler detects this, and the next<br />

llstatus command shows the message:<br />

Blue Gene is present.<br />

Jobs then run normally.



Lessons learned<br />

We learned the following lessons:<br />

► As LoadLeveler creates multiple links to a binary shared library, we have to<br />

remove all links to create the problem.<br />

► Without access to the Blue Gene/L bridge API, the jobs in LoadLeveler queue<br />

will stay in Idle state. They cannot be run.<br />

► It is essential to check whether Blue Gene/L is present from LoadLeveler’s<br />

perspective.<br />

6.4.9 LoadLeveler: LoadLeveler cannot start<br />

In this section we investigate possible causes for LoadLeveler daemons not<br />

starting. These daemons include the Central Manager (Master) and the<br />

Negotiator.<br />

Problem description

When LoadLeveler cannot be started or the commands return errors, it means<br />

either the LoadLeveler Master or Negotiator daemons cannot start. This scenario<br />

focuses on the Negotiator problems, which could be one of the following:<br />

► LoadLeveler configuration files cannot be accessed<br />

► The Master daemon cannot start the Negotiator daemon on the Central<br />

Manager node<br />

► The Negotiator daemon does not start, or crashes shortly after starting

► LoadLeveler cannot access the necessary libraries<br />

A similar scenario can be followed to check problems with other daemons as well. However, the libraries, error messages, and log files will be different.

Detailed checking<br />

The common sign of Negotiator problems is that the LoadLeveler commands return error messages similar to the ones shown in Example 6-126.

Example 6-126 LoadLeveler Negotiator error messages<br />

loadl@bglsn:~/cmd> llq<br />

03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

03/21 09:35:52 llq: 2539-463 Cannot connect to bglsn.itso.ibm.com<br />

"LoadL_negotiator" on port 9614. errno = 111<br />

llq: 2512-301 An error occurred while receiving data from the<br />

LoadL_negotiator daemon on host bglsn.itso.ibm.com.<br />



The error messages shown in Example 6-126 are the starting point for this<br />

scenario. We have observed that the LoadLeveler commands do not display<br />

normal information or status. Therefore, the checks in “LoadLeveler cluster and<br />

node status” on page 186 cannot be performed.<br />

This is not a problem related to jobs; thus, the checks related to “LoadLeveler run queue” on page 193 and “Job command file” on page 198 can be skipped for now.

We have to determine which node is the Central Manager node, and which one<br />

has the LoadL_negotiator daemon running. As we have seen, when Blue Gene/L<br />

is involved, this is always the SN. In this case, we can perform the checks in<br />

“LoadLeveler configuration keywords” on page 205 which pinpoint the Central<br />

Manager node.<br />

When on the SN, we focus on the checks in “LoadLeveler processes, logs, and<br />

persistent storage” on page 202. The ps command shows if the Negotiator<br />

daemon is running. Example 6-127 shows that the LoadL_negotiator process is<br />

running. Even so, the process could be starting and crashing repeatedly. Comparing the Negotiator's process ID between two invocations of the ps command might reveal this, if the process ID has changed.

Example 6-127 Looking for the LoadL_negotiator process<br />

bglsn:~ # ps -ef | grep LoadL<br />

loadl 20189 1 0 15:40 ? 00:00:00<br />

/opt/ibmll/LoadL/full/bin/LoadL_master<br />

loadl 20199 20189 0 15:40 ? 00:00:05 LoadL_negotiator -f -c<br />

/tmp -C /tmp<br />

root 24894 24792 0 18:02 pts/22 00:00:00 grep LoadL<br />

If the Negotiator process starts and crashes, a LoadLeveler command should display the same error messages as in Example 6-126. If this happens quickly, the Negotiator might not have enough time to write log information, in which case the MasterLog should be checked first.



As highlighted in Example 6-128, the Master daemon has recorded some errors while starting LoadL_negotiator. As we can see, the Negotiator died shortly after it started.

Example 6-128 Negotiator error messages in MasterLog<br />

04/05 15:23:17 TI-1 CentralManager = bglsn.itso.ibm.com<br />

04/05 15:23:17 TI-1 Inode monitoring will not be performed on<br />

/home/loadl/log because it is a Reiser Filesystem which does not limit<br />

the number of inodes.<br />

04/05 15:23:17 TI-1 LoadLeveler: LoadL_master started, pid = 17689<br />

04/05 15:23:17 TI-11 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

04/05 15:23:17 TI-10 LoadL_negotiator started, pid = 17710<br />

04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

04/05 15:24:21 TI-10 LoadL_negotiator started, pid = 18330<br />

04/05 15:24:27 TI-21 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

04/05 15:30:28 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

04/05 15:30:28 TI-10 LoadL_negotiator started, pid = 18506<br />

04/05 15:30:28 TI-28 Got SHUTDOWN command from "loadl" with uid = 7001<br />

on machine "bglsn.itso.ibm.com"<br />

04/05 15:30:28 TI-28 Master shutting down now.<br />

04/05 15:30:28 TI-30 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator" on port 9614. errno = 111<br />

Another error message in the MasterLog also indicates a problem with<br />

connecting to TCP port 9614. Investigating the NegotiatorLog reveals the same<br />

error messages (see Example 6-129).<br />

Example 6-129 Port 9614 error messages in NegotiatorLog<br />

04/05 15:30:28 TI-1 *************************************************<br />

04/05 15:30:28 TI-1 *** LOADL_NEGOTIATOR STARTING UP ***<br />

04/05 15:30:28 TI-1 *************************************************<br />

04/05 15:30:28 TI-1<br />

04/05 15:30:28 TI-1 LoadLeveler: LoadL_negotiator started, pid = 18506<br />

04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />

for service LoadL_negotiator.<br />

04/05 15:30:28 TI-1 void<br />

LlMachine::queueStreamMaster(OutboundTransAction*): Set destination to<br />

master. Transaction route flag is now central manager sending<br />

transaction CMnotifyCmd to Master<br />



04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />

on this machine.<br />

04/05 15:30:28 TI-6 LoadLeveler: Delaying 1 seconds and retrying ...<br />

04/05 15:30:28 TI-7 LoadLeveler: Listening on port 9612 service<br />

LoadL_negotiator_collector<br />

04/05 15:30:28 TI-6 LoadLeveler: 2539-479 Cannot listen on port 9614<br />

for service LoadL_negotiator.<br />

04/05 15:30:28 TI-6 LoadLeveler: Batch service may already be running<br />

on this machine.<br />

04/05 15:30:28 TI-6 LoadLeveler: Delaying 2 seconds and retrying ...<br />

04/05 15:30:28 TI-8 LoadLeveler: Listening on path<br />

/tmp/negotiator_unix_stream_socket<br />

04/05 15:30:28 TI-10 Dispatching.<br />

Because there is evidently an issue with TCP port 9614, we use the netstat command to check its status, as shown in Example 6-130. The command output shows that TCP port 9614 is in the LISTEN state, which is OK for now. However, at the time the Negotiator had the problem, this port could have been in a different state, such as FIN_WAIT, and thus not available, because the Negotiator starts and terminates rapidly.

Example 6-130 Checking the state of a port or socket<br />

bglsn:/tmp # netstat -an | grep 9614<br />

tcp 0 0 0.0.0.0:9614 0.0.0.0:*<br />

LISTEN<br />

Additional error messages pointing to problems with creating a socket are<br />

revealed in SchedLog and StartLog, as shown in Example 6-131.<br />

Example 6-131 Socket error messages in StartLog and SchedLog<br />

Startlog:<br />

04/05 13:00:18 TI-10 LoadLeveler: 2539-484 Cannot start unix socket on<br />

path /tmp/startd_unix_dgram_socket. errno = 98<br />

04/05 13:00:18 TI-4 Starter Table:<br />

04/05 13:00:18 TI-16 LoadLeveler: 2539-463 Cannot connect to<br />

bglsn.itso.ibm.com "LoadL_negotiator_collector" on port 9612. errno =<br />

111<br />

SchedLog:<br />

04/05 13:22:31 TI-99 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglsn.itso.ibm.com, errno =25.<br />

LoadLeveler: Failed to route task_resource_req_list (43008) in virtual<br />

int Task::encode(LlStream&)<br />



LoadLeveler: Failed to route node_tasks (34006) in virtual int<br />

Node::encode(LlStream&)<br />

LoadLeveler: Failed to route step_nodes (40033) in virtual int<br />

Step::encode(LlStream&)<br />

LoadLeveler: Failed to route StepList Steps (41002) in virtual int<br />

StepList::encode(LlStream&)<br />

LoadLeveler: Failed to route job_steps (22009) in virtual int<br />

Job::encode(LlStream&)<br />

04/05 13:22:31 TI-97 LoadLeveler: 2539-475 Cannot receive command from<br />

client bglsn.itso.ibm.com, errno =25.<br />

LoadLeveler: Failed to route job_environment_vectors (22008) in virtual<br />

int Job::encode(LlStream&)<br />

As it turns out, the key to this problem is that, in a Linux environment, certain daemons open their temporary communication sockets in the /tmp directory, and most LoadLeveler daemons do.

In normal situations, a daemon opens the socket in /tmp and then removes it on clean exit. However, if environment or network problems occur and the daemon exits abnormally, the temporary socket file might be left behind.

The next time the daemon is started, it might not be able to overwrite the same file, especially when it is started under a different user ID. In LoadLeveler's case, this could happen if the loadl user starts LoadLeveler after it was previously started (and did not exit cleanly) by the root user.

As a result, the file left over by the root user cannot be overwritten by another user (loadl). These files need to be cleaned up manually from the /tmp directory on every node before LoadLeveler can be started again.

Note: The socket files created by different daemons have different names, as<br />

shown in Example 6-132.<br />

Example 6-132 Daemons’ socket files in /tmp<br />

bglsn:/tmp # ls -l *socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 15:40 negotiator_unix_stream_socket<br />

bglfen1:/tmp # ls -l *socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_dgram_socket<br />

srwxrwxrwx 1 loadl loadl 0 Apr 5 13:33 startd_unix_stream_socket<br />



To detect the left-over socket files in /tmp of every node, you need to stop<br />

LoadLeveler on all nodes. Use the ps command to make sure no LoadLeveler<br />

daemons are still around. Then, check the /tmp directory on every node for<br />

socket files with names similar to the ones in Example 6-132.<br />

In our case, this was the issue, and we solved it by stopping LoadLeveler on all nodes, cleaning up the leftover socket files, and restarting LoadLeveler.
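A hedged sketch of this cleanup is shown below; the host names are from our environment, and the commands should be run only after verifying that no LoadLeveler daemons are left running:

# Remove leftover LoadLeveler socket files from /tmp on every node
for host in bglsn bglfen1 bglfen2; do
    echo "== $host =="
    rsh $host "ps -ef | grep -c '[L]oadL_'"          # should report 0 remaining daemons
    rsh $host "ls -l /tmp/*_unix_*_socket 2>/dev/null && rm -f /tmp/*_unix_*_socket"
done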

If manual removal of the socket files does not resolve the problem, the next step<br />

is to look into the core files generated by Negotiator (on abnormal exit). We can<br />

identify the core files from the following error messages:<br />

04/05 15:24:21 TI-10 "LoadL_negotiator" on "bglsn.itso.ibm.com" died<br />

due to signal 11, attempting to restart<br />

These errors can be found in the MasterLog shown in Example 6-128 on<br />

page 377.<br />

On a Blue Gene/L system, it is usual to set up processes to redirect their core files into the common directory /bgl/cores (/bgl is NFS mounted on all nodes: SN, FENs, and I/O Nodes).

However, LoadLeveler on Linux has one requirement to be able to generate core files: LoadLeveler has to be started under the root user ID.

If LoadLeveler is not set up to run as the root user, this needs to be changed first. See Appendix A, “Installing and setting up LoadLeveler for Blue Gene/L” on page 409 for a procedure to set up the loadl user ID, then issue the command llctl -g start as root. If the Negotiator daemon crashes with signal 11 again, this time a core file is generated in /bgl/cores/.

Then, the core file can be investigated using the debugger (gdb). Depending on the information revealed by the stack trace of the memory dump, different procedures can be followed. If the stack trace does not have enough debugging information, a debugger-enabled version of the LoadL_negotiator binary has to be used to re-create and generate the core file again.
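As a minimal sketch, the core file can be opened as follows; the binary path (/opt/ibmll/LoadL/full/bin) matches a default installation and the core file name is only an example.

cd /bgl/cores
gdb /opt/ibmll/LoadL/full/bin/LoadL_negotiator ./core.12345
# At the (gdb) prompt:
#   bt             # print the stack trace of the crashed thread
#   info threads   # list the threads captured in the dump
#   quit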

Additional checking can continue with validation of the libraries and their symbolic links. See “Environment variables, network, and library links” on page 206.



Lessons learned
We learned the following lessons:
► If a daemon process exits abnormally, a new process might be spawned automatically. This is defined in the LoadL_config file with the keyword #@RESTART_PER_HOUR.
► A temporary socket file is generated in the /tmp directory by the LoadLeveler daemons. If a daemon exits abnormally, the socket file might be left behind.
► A Negotiator process exiting with signal 11 should generate a core file for debugging purposes. On Linux systems, this core file is generated in the /tmp directory by default. However, on a Blue Gene/L system, this is usually configured by the system administrator to /bgl/cores/.



Chapter 7. Additional topics

This chapter presents two additional topics of interest for a Blue Gene/L environment:
► Cluster Systems Management
► Secure shell

Although a basic Blue Gene/L system can function without either Cluster Systems Management or secure shell, these products are needed for the centralized management and integration of the Blue Gene/L system in your computing environment.



7.1 Cluster Systems Management
This section provides a high-level introduction to Cluster Systems Management (CSM) and its use with Blue Gene/L. We do not provide the detailed information that you need to plan, install, and run CSM. For this, refer to the CSM product documentation. CSM support for Blue Gene/L was released as part of CSM 1.5 in November 2005. The information in this section pertains to this level of CSM. However, be sure to check for updates in the latest CSM documentation, which is available at:
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

7.1.1 Overview of CSM
The Blue Gene Service Node software records configuration, RAS, and environmental information in the Blue Gene DB2 database. It also provides a Web interface and several CLI tools for working with the information that is stored. However, it is up to the Blue Gene administrators and users to watch or check the database, to determine when a problem occurs, and then to take appropriate action.

CSM is an IBM licensed software product that is used to manage clusters consisting of AIX and Linux systems, from one to thousands. CSM has many capabilities, but in this section we focus on just one: its rich event monitoring and automated response capability. With CSM, you can specify what constitutes an event, and what should happen, automatically, if and when that event occurs.

By installing CSM on your Blue Gene SN, you can automate the task of watching the database and taking corrective actions. For example, you could use CSM to watch the status of all the midplanes. If a midplane is marked in error or is marked as missing, the CSM software can detect this event and take whatever action you have specified, automatically. Perhaps you want to be paged or to receive an urgent e-mail when the event occurs. Or perhaps a special script should be run. Or perhaps all of these responses should happen simultaneously. CSM allows you to set up whatever monitoring and automated responses you need.

You might be thinking, “This monitoring and automated response stuff sounds interesting, but I don’t get it. Why drag a cluster management product like CSM into the picture?” Well, one good reason is precisely the monitoring and automated response capabilities of CSM. These capabilities are very powerful and customizable. You can ignore everything else about CSM if you choose.



Over time, as you become more familiar with the other capabilities of CSM and begin to view your Blue Gene/L systems, such as the SN, Front-End Nodes (FENs), and file servers, as a set of systems that you would like to manage from a single point of control, you might find more and more reasons to use other features of CSM.

Moreover, if you have a raft of other IBM systems around (running Linux and AIX), these could be centrally managed along with your Blue Gene system from the same management server. For now, however, we concentrate on a one-node CSM cluster: your Blue Gene SN.

To use CSM with your Blue Gene in the simplest manner possible, begin with a fully installed, fully operational Blue Gene/L system, including SN, FENs, and file servers. Next, obtain CSM through your IBM sales representative, or just grab the free, full-featured 60-day try-and-buy version from:
http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

Follow the instructions in the manual CSM for AIX 5L and Linux V1.5 Planning and Installation Guide, SA23-1344-01, to install and configure the CSM management server software on your SN. Then follow the instructions in the same book for adding the optional CSM support for Blue Gene. When you have done this, your Blue Gene SN will double as a CSM management server and as the lone managed node in the CSM cluster.

Note: The CSM software (server and client) is installed on the SN. If you want to configure your FENs and file servers as managed nodes too, you can do so by installing the CSM client software, but that is optional. However, no CSM software is installed on the Blue Gene I/O or Compute Nodes.

In the sections that follow, we discuss the monitoring and automated response capabilities of CSM and how to use them. You can find more information in the following CSM and Reliable Scalable Cluster Technology (RSCT) product publications (all available at the link previously mentioned):
► CSM for AIX 5L and Linux V1.5 Administration Guide, SA23-1343-01
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference, SA23-1345-01
► RSCT Administration Guide, SA22-7889-10
► RSCT for Linux Technical Reference, SA22-7893-10



7.1.2 Monitoring the Blue Gene/L database with CSM
When CSM is installed and configured on your SN (including the optional Blue Gene support), you can monitor the Blue Gene database for events of interest using a few simple commands.

Assuming we have a condition (BGNodeErr), first you need to associate the condition with a response (BroadcastEventsAnyTime) by running the following command:
# mkcondresp BGNodeErr BroadcastEventsAnyTime

Then, you have to start monitoring the condition by running the command:
# startcondresp BGNodeErr

► A condition is a persistent CSM monitoring construct that identifies what to monitor, and what to monitor for. Basically, BGNodeErr is concerned with the Blue Gene database table TBGLNode, and in particular, with TBGLNode row updates that set the status column to E (Error) or M (Missing).
► A response is a persistent CSM monitoring construct that identifies an action to take. In this example, BroadcastEventsAnyTime is a response that puts up a wall message for each event passed to it.
► An event is a dynamic CSM monitoring construct generated by CSM when a monitored condition's event expression evaluates true (that is, when whatever is being monitored for occurs).

By running startcondresp BGNodeErr, you are effectively telling CSM to monitor the Blue Gene database table TBGLNode for row updates that set the status column to E or M. This means that CSM is expected to generate an event whenever either type of row update occurs. Because you ran mkcondresp BGNodeErr BroadcastEventsAnyTime before that, CSM also knows that it should pass all such events to the BroadcastEventsAnyTime response, which, by design, puts up a wall message for each event passed to it.
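As a minimal sketch, the whole sequence on the management server looks like the following; the lscondresp verification step is our optional addition.

# On the CSM management server (the Blue Gene SN)
mkcondresp BGNodeErr BroadcastEventsAnyTime   # associate condition with response
startcondresp BGNodeErr                       # start monitoring the condition
lscondresp BGNodeErr                          # optional: the pair should show as active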

BGNodeErr is a predefined condition. CSM provides many predefined conditions. To get a list, simply run lscondition. To learn what a particular condition is for, run lscondition condition_name, as shown in Example 7-1.

Example 7-1 Displaying condition information
# lscondition BGNodeErr
Displaying condition information:
condition 1:
Name = "BGNodeErr"
Node = "c96m5sn02"
MonitorStatus = "Not monitored"
ResourceClass = "IBM.Sensor"
EventExpression = "SD.Uint32>0"
EventDescription = "An event will be generated when the node status is \"Error\" or \"Missing\" for an I/O or Compute Node in the Blue Gene system."
RearmExpression = ""
RearmDescription = ""
SelectionString = "Name==\"BGNodeErr\""
Severity = "c"
NodeNames = {}
MgtScope = "m"

BroadcastEventsAnyTime is a predefined response. CSM provides many predefined responses. To get a list, simply run lsresponse. To learn what a particular response does, run lsresponse response_name, as shown in Example 7-2.

Example 7-2 Displaying response information
# lsresponse BroadcastEventsAnyTime
Displaying response information:
ResponseName = "BroadcastEventsAnyTime"
Node = "c96m5sn02"
Action = "wallEvent"
DaysOfWeek = 1-7
TimeOfDay = 0000-2400
ActionScript = "/usr/sbin/rsct/bin/wallevent"
ReturnCode = 0
CheckReturnCode = "n"
EventType = "b"
StandardOut = "n"
EnvironmentVars = ""
UndefRes = "n"

7.1.3 Customizing the monitoring capabilities of CSM
The monitoring capabilities and the predefined conditions and responses that come with CSM are powerful and useful. However, CSM is also customizable. Before talking about the commands that you can use to customize the monitoring capabilities of CSM, we need to describe in more detail how CSM monitoring of the Blue Gene database works.



To simplify the earlier monitoring discussion, we purposely neglected to mention a few things. If you examine the output of lscondition BGNodeErr shown in Example 7-1, you will notice that there is no mention of the TBGLNode table, or of our interest in row updates that set the status column to E or M.

So where is this encoded? And how does the monitoring of the Blue Gene database really work? Let us consider the diagram in Figure 7-1.

Figure 7-1 Blue Gene CSM monitoring diagram (components, bottom to top: the Blue Gene database (DB2) and the Service Node software, a database trigger and a database stored procedure, and the CSM sensor, condition, and response)

At the base of the diagram are the Blue Gene database and the SN software that writes to the database. Everything above that comes with CSM or is created by CSM when you run various commands. At the top are two CSM monitoring constructs that we have talked about already: a response and a condition.

Below these is a CSM monitoring construct called a sensor, which we explain later. And below the sensor are two DB2 constructs: a stored procedure and a trigger. The sensor, condition, and response used are up to you. You can use predefined ones, ones that you define, or a mix of the two types. We discuss how to define your own later. The trigger and stored procedure are created automatically for you by CSM when you start monitoring with the startcondresp command mentioned in the previous section.

The large up-pointing arrow on the right simply indicates the overall flow. In layman's terms, the trigger watches the database for the thing of interest to happen. If and when that thing happens, the trigger gathers pertinent data and passes it to the stored procedure. The stored procedure is just a middleman that passes the data to the sensor.

When the condition becomes aware of new data in the sensor, the condition evaluates its EventExpression, which is based on sensor data. If the EventExpression evaluates true, an event is generated and passed to the response. The response then does its thing: it puts up a wall message, sends an e-mail, or runs a script, whatever it is defined to do.

Whoa! That is pretty complicated! Can’t you just insert your own database trigger and stored procedure into the database and initiate a desired action directly? The answer is "Yes," but you would need strong DB2 database administrator and programmer skills and the willingness to invest the time and effort to develop and test the necessary DB2 constructs and code.

Before introducing the CSM commands used to create custom monitoring constructs, let’s take a closer look at the construct called a sensor. As the diagram shows, a condition is paired with a sensor. Look again at the output of lscondition BGNodeErr shown in Example 7-1. Two attributes specify the sensor with which the condition BGNodeErr is paired:
ResourceClass = "IBM.Sensor"
SelectionString = "Name==\"BGNodeErr\""

That is, condition BGNodeErr is paired with a sensor of the same name. You can obtain sensor details by running the lssensor sensor_name command, as shown in Example 7-3.

Example 7-3 Displaying sensor information
# lssensor BGNodeErr
Name = BGNodeErr
ActivePeerDomain =
Command = /opt/csm/csmbin/bgmanage_trigger -t TBGLNODE -C STATUS -o u -x "n.STATUS = 'E' OR n.STATUS = 'M'" -p LOCATION,STATUS BGNodeErr
ConfigChanged = 0
ControlFlags = 5
Description = This sensor is updated when the node status is "Error" or "Missing" in the Blue Gene system. Use "SD.Uint32>0" as the event expression in all corresponding conditions.
ErrorExitValue = 1
ExitValue = 0
Float32 = 0
Float64 = 0
Int32 = 0
Int64 = 0
NodeNameList = {c96m5sn02}
RefreshInterval = 0
SD = [,0,0,0,0,0,0]
SavedData =
String =
Uint32 = 0
Uint64 = 0
UserName = bglsysdb

Here, you see that the sensor BGNodeErr's Command attribute identifies TBGLNode as the Blue Gene table of interest, and n.STATUS = 'E' OR n.STATUS = 'M' as the column values to watch for. When you start monitoring, the sensor's Command (bgmanage_trigger) is called with the arguments shown. It is bgmanage_trigger's job to create the required DB2 trigger and stored procedure.

The purpose of this exposé is to point out that three main CSM monitoring constructs are involved in monitoring the Blue Gene/L database: a sensor, a condition, and a response. There are many predefined ones that you can use, but if these do not meet your needs entirely, you can define your own.

7.1.4 Defining your own CSM monitoring constructs
In general, you define custom sensors with the CSM mksensor command. However, for Blue Gene/L sensors, you should use the CSM bgmksensor command instead, because it understands the Blue Gene/L database and provides all the flags and options that are needed for creating a Blue Gene/L database sensor. You define custom conditions and responses using the standard CSM mkcondition and mkresponse commands.



For example, suppose that you want to monitor the TBGLFanEnvironment table for high fan temperatures. CSM provides no predefined sensor for this type of monitoring. However, it is easy to define your own. On the SN, you can run this command:
# bgmksensor -t TBGLFanEnvironment -o i -x "n.temperature>35" -p location,temperature BGFanTempHi

This translates to:
Create a Blue Gene sensor named BGFanTempHi. Whenever a row is inserted in the TBGLFanEnvironment table with a temperature above 35 degrees Celsius, the sensor caches the fan location and temperature and notifies all conditions that care. (The values to cache are specified with the -p flag, and are the values passed to the response in the generated event notification.)

You then create a condition to monitor this new sensor. For example, run a command similar to the following on your CSM management server:
# mkcondition -d 'Generate an event when the temperature of a Blue Gene fan module rises above 35 degrees Celsius.' -r IBM.Sensor -s 'Name="BGFanTempHi"' -m m -e "SD.Uint32>0" BGFanTempHi

This command translates to:
Create a condition named BGFanTempHi to monitor the sensor with the same name. (The -m and -e flags must simply be set as shown for all conditions created to monitor Blue Gene sensors.)

To start monitoring, run the following commands on your CSM management server:
# mkcondresp BGFanTempHi MsgEventsToRootAnyTime "E-mail root off-shift" LogCSMEventsAnyTime
# startcondresp BGFanTempHi

After you have done all this, CSM creates the necessary DB2 trigger and stored procedure for you and turns monitoring on for condition BGFanTempHi automatically.
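Putting the pieces together, the complete customization sequence on the SN can be sketched as follows; the commands are the ones shown above, and the final lscondresp line is an optional verification step that we add here.

# Sketch: define and activate a custom high fan temperature monitor (run on the SN)
bgmksensor -t TBGLFanEnvironment -o i -x "n.temperature>35" \
           -p location,temperature BGFanTempHi
mkcondition -d 'Generate an event when the temperature of a Blue Gene fan module rises above 35 degrees Celsius.' \
            -r IBM.Sensor -s 'Name="BGFanTempHi"' -m m -e "SD.Uint32>0" BGFanTempHi
mkcondresp BGFanTempHi MsgEventsToRootAnyTime "E-mail root off-shift" LogCSMEventsAnyTime
startcondresp BGFanTempHi
lscondresp BGFanTempHi      # optional check: the condition/response pairs should be active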



To verify that all is working properly, check the following items:
► As root, run the following command:
lsaudrec -l
Near or at the bottom of the output, you should see the information shown in Example 7-4.

Example 7-4 Checking monitoring conditions
# lsaudrec -l
.....>>> Omitted lines <<<.....


With monitoring on, each time a row is added to the TBGLFanEnvironment table with a temperature above 35 degrees Celsius, an event is generated. Due to the mkcondresp command (see “Monitoring the Blue Gene/L database with CSM” on page 386), the event kicks off three responses on the management server:
– A message to root announcing the event (Example 7-7).

Example 7-7 Message to root user when event BGFanTempHi happens
Message from root@c96m5sn02 on at 15:01 ...
Critical Event occurred:
Condition: BGFanTempHi
Node: c96m5sn02.ppd.pok.ibm.com
Resource: BGFanTempHi
Resource Class: Sensor
Resource Attribute: SD
Attribute Type: CT_SD_PTR
Attribute Value: [location=J302 temperature=3.6E1,0,1,0,0,0,0]
Time: Friday 07/21/06 15:00:59

– An e-mail to the root user (on the CSM management server) with the event details, if the event occurs off-shift.
– Logging of the event in the /var/log/csm/systemEvents file.

The e-mail and log entry are shown in Example 7-8.

Example 7-8 The e-mail and log entry for the BGFanTempHi event
Friday 07/21/06 15:00:59
Condition Name: BGFanTempHi
Severity: Critical
Event Type: Event
Expression: SD.Uint32>0
Resource Name: BGFanTempHi
Resource Class: IBM.Sensor
Data Type: CT_SD_PTR
Data Value: ["location=J302 temperature=3.6E1",0,1,0,0,0,0]
Node Name: c96m5sn02.ppd.pok.ibm.com
Node NameList: {c96m5sn02.ppd.pok.ibm.com}
Resource Type: 0



7.1.5 Miscellaneous related information
If you are interested in examining the DB2 trigger and stored procedure that CSM creates for you, run bgmksensor with the -v flag (bgmksensor formulates the SQL statements used to create the trigger and stored procedure before they are actually needed so that it can test them), or query the Blue Gene database directly (after you have started monitoring with the startcondresp command).

The database trigger name is derived from the sensor name by appending the _CSM suffix (for example, for a sensor named BGFanTempHi, there will be a trigger named BGFanTempHi_CSM when that sensor is actively used in monitoring).

The database stored procedure story is actually more complicated than we have described. We created two stored procedures, one named COMMON_BGP and the other COMMON_BGP_ext. The trigger calls COMMON_BGP, which in turn calls COMMON_BGP_ext. COMMON_BGP exists to catch any SQL exceptions that occur. COMMON_BGP_ext calls a utility named refresh_sensor in a shared library named bgrefresh_sensor.so. refresh_sensor writes the data into the sensor.

We created yet another DB2 construct that we have not mentioned because it plays a minor role: a DB2 sequence. Its name is also derived from the sensor name by appending _CSM (for example, for a sensor named BGFanTempHi, there will be a sequence named BGFanTempHi_CSM). The trigger uses the sequence to obtain a new number for each event it forwards to COMMON_BGP.

After you have started monitoring, you can log on to the SN as bglsysdb and run the commands shown in Example 7-9.

Example 7-9 Examining the constructs created by CSM in the Blue Gene database
bglsysdb@bglsn~> db2 connect to bgdb0
bglsysdb@bglsn~> db2 "select text from syscat.triggers where trigname = 'BGFANTEMPHI_CSM'"
bglsysdb@bglsn~> db2 "select procname,text from syscat.procedures where procname like 'COMMON_BGP%'"
bglsysdb@bglsn~> db2 "select * from syscat.sequences where seqname = 'BGFANTEMPHI_CSM'"

7.1.6 Conclusion
CSM brings powerful and customizable monitoring and automated response capabilities to the Blue Gene/L environment. By exploiting them, you can minimize or eliminate much of the manual problem determination work that often faces the Blue Gene/L system administrator.

Furthermore, because these capabilities are extensions to existing CSM capabilities, you can easily monitor and set up automated responses for non-Blue Gene/L database problems as well. For example, get paged when the /var file system on your SN fills up; send an urgent e-mail to the appropriate person when the number of users on a FEN crosses some threshold; run a script when the network adapter on a File Server is being overwhelmed. There are many other examples. Moving beyond monitoring, CSM offers a wealth of other capabilities that could help you manage your Blue Gene/L systems, such as distributed command execution, configuration file management, software maintenance, and so forth.

7.2 Secure shell
This section begins with a short introduction to cryptographic techniques in a computing environment, continues with a secure shell overview, and ends with an example of how to use secure shell in a clustering environment.

7.2.1 Basic cryptography
One of the biggest problems in a networking environment is to design and implement a security mechanism that allows the whole computing environment to function properly, without interruptions, and that also ensures reliable data manipulation (data you can trust). Depending on the information (data) travelling across networks, you need to make sure that:
► The data gets from sender to receiver unaltered: data integrity.
► The data cannot be interpreted (understood) by anyone eavesdropping on the communication channel: data privacy.
► The data arriving at the receiver really comes from whom the receiver thinks it is coming from: data authenticity.

For all these reasons and more, a series of cryptographic techniques has been developed and implemented. These techniques are based on mathematical algorithms translated (programmed) into computer language. Thus, security has become an important component of a highly available and reliable computing environment.



The cryptographic techniques are employed for data integrity, data privacy, and data authenticity in various forms and complexities, depending on the data security required:
► Authentication (verifying identities)
The parties communicating need to know and verify (with a reasonable level of trust) each other's identity.
► Authorization (access control)
After the parties' identities have been established, this establishes what actions we allow someone to perform.
► Data signing
In addition to establishing the identities at the beginning of the communication session, it is also necessary to avoid the "man in the middle" attack.
► Data encryption
If we also want to prevent someone else (a third party) from understanding the data, we need to encrypt it in such a way that it can only be decrypted at the destination.
► Accountability
For multiple reasons, we need to be able to trace back any system activity.

It is also important to establish the effective level of security, in order to ensure an acceptable performance level. Sometimes too much security can be disruptive, because the systems can spend more time enforcing security than computing data.

The encryption-based mechanisms used in securing network communication are:
► Symmetric key
► Public/private key pair (also known as PKI, Public Key Infrastructure)
► Hash functions
► Combinations of these techniques
► Hardware cryptography



Symmetric key cryptography
Symmetric key cryptography (Figure 7-2) uses a single (secret) key known by both parties in the communication. It is relatively fast, but it has a drawback: the shared secret (key) must somehow be distributed to the communicating parties. If this shared secret is distributed over the (insecure) network, additional precautions must be taken.

Figure 7-2 Symmetric key encryption (Nina sends a message to Dan over an unsecured network; both parties use Dan and Nina's identical key)

Algorithms used for symmetric key implementations include:
► Data Encryption Standard (DES) algorithm
The most commonly used bulk cipher or block cipher algorithm, which was developed by IBM.
► Commercial Data Masking Facility (CDMF)
A method to shrink a 56-bit DES key to a 40-bit key suitable for export, which was also designed by IBM.
► Triple DES
The same as DES, but the information is encrypted three times in a row, using a different key each time.
► RC2/RC4 algorithms
RC2 is a block cipher (similar to DES); RC4 is a stream cipher (possibly with a 40-bit key). Both were developed by Ron Rivest (RSA Data Security) and permit variable-length keys.
► International Data Encryption Algorithm (IDEA)
Has a 128-bit key and is not a government-imposed standard. Pretty Good Privacy (PGP) uses IDEA and is freely available.

Public Key Infrastructure
Public Key Infrastructure (PKI) is based on a pair of asymmetric keys for each party in the communication:
► A private key, which never leaves the host where it was generated.
► A public key, which is sent over the network to the corresponding party for various reasons, such as digital signature, authentication, and so forth.

In PKI, information encrypted with one key (either public or private) can only be decrypted with its pair. There is no need to securely share a secret between the sender and receiver; however, this mechanism is much less efficient than symmetric key encryption, thus it is not suitable for bulk data encryption.

The only widely used general-purpose public key mechanism is the Rivest, Shamir, and Adleman (RSA) algorithm, which relies on the factorization of large numbers and is the property of RSA Data Security, Inc.

Figure 7-3 presents a simplified diagram of how PKI is used. In this case, Nina is sending an encrypted message (data) to Dan. For this, Nina uses Dan's public key, thus ensuring that the data can only be decrypted by Dan. We do not show here other mechanisms, such as how to make sure that Nina has received Dan's public key, or how Dan can be sure that the message comes from Nina and that it has not been altered while travelling through the communication channel.



Figure 7-3 PKI - Nina sending a message to Dan (Nina encrypts with Dan's public key; Dan decrypts using his own private key; an initial key exchange, from Dan to Nina only, sends Dan's public key over the unsecured network)

Secure shell is based on PKI and uses several of these techniques to make sure that the data that gets from Nina to Dan can be trusted.

7.2.2 Secure shell basics

Secure shell is a client-server tool that makes secure system administration operations possible across complex networks. Secure shell employs a number of cryptographic techniques and provides a series of facilities (functions, protocols, and so forth) for system administrators to use in their job.

Besides basic remote login, it also provides functions such as remote command execution, remote file copy, and more sophisticated techniques, such as tunnelling (an encrypted communication channel) for various other applications.

In this section, we provide basic information about the secure shell server and client.



Secure shell server
The secure shell server runs as a service (daemon) on the host system and provides the infrastructure that allows incoming clients to connect and perform various administrative actions.

Figure 7-4 shows the major dependencies of the secure shell server (sshd). Usually, sshd is started as a service at system boot time using the /etc/rc.d/init.d/sshd script (depending on the init runtime mode).

Figure 7-4 Secure shell server dependencies (the sshd daemon, listening on TCP port 22, depends on the server keys (rsa1, rsa2, dsa), the server configuration files, the SSL libraries, the server binaries and commands, the user authentication configuration, and the system authentication files and libraries)

These dependencies are:
► Secure socket layer (SSL) libraries, commands, and headers
These provide the cryptographic functions and commands for various operations (key generation, encryption/decryption).
► Server binaries and scripts
This is the actual server code (executables, libraries, and scripts).


► Server configuration files
These are used by the server to create its runtime environment and are usually located in the /etc/ssh directory.
► System authentication files and libraries
The secure shell server uses these files to pass authentication requests to the system (user/password).
► Server keys
As a daemon (service), the ssh server runs under the root user ID and represents an entity with its own identity; thus it has its own pair of keys (actually, it has three pairs of keys: rsa1, rsa2, and dsa), also known as the ssh host keys.
► User authentication files
These are actually located in each user's ~/.ssh directory, and they represent the identities (users) known to the server that can be authenticated using their public keys (the ssh server "knows" a client's identity based on the client's public key).
► TCP port 22
This is the default port used by the ssh server to listen for incoming requests. It is configurable in /etc/ssh/sshd_config.
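For illustration only, a minimal /etc/ssh/sshd_config excerpt looks like the following; the values shown are common defaults, not settings taken from the systems described in this book.

# /etc/ssh/sshd_config (excerpt) - illustrative values only
Port 22                      # TCP port on which sshd listens
Protocol 2                   # accept only SSH protocol version 2
PermitRootLogin yes          # needed here because root runs remote commands
PubkeyAuthentication yes     # allow public key (unprompted) authentication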

Secure shell client
The secure shell client provides (among others) a set of tools used for executing administrative tasks on remote systems. On a Linux (and AIX) system these are:
► /usr/bin/scp
Secure remote copy program; replaces /usr/bin/rcp.
► /usr/bin/sftp
Secure file transfer program; replaces /usr/bin/ftp. However, it has fewer features than classic ftp.
► /usr/bin/slogin
Secure remote login program; replaces /usr/bin/rlogin.
► /usr/bin/ssh
Secure shell program; replaces /usr/bin/rsh.



Secure shell client programs also depend on the SSL libraries to perform data encryption/decryption. The client software has a general configuration file, /etc/ssh/ssh_config, and a directory for each user (~/.ssh) for storing the user's pair(s) of keys and authentication files:
► known_hosts
This file stores the public keys of the servers the user has connected to. Besides the keys, you can also specify certain options for connections to remote servers.
► authorized_keys
Even though this file is stored in the user's ~/.ssh directory, it is actually used by the local ssh daemon, and it contains the public keys of the remote identities (user@host) allowed to execute commands on the local operating system. It can also contain various options, such as command redirection, login parameters, and so forth.

Note: The ~/.ssh/authorized_keys (or authorized_keys2) file is actually used by the local ssh server (sshd).
► The user's ssh client configuration file (optional), if you want to specify additional configuration parameters (besides the ones in /etc/ssh/ssh_config).

7.2.3 Sample configuration in a cluster environment
In a cluster environment, because the nodes (systems) have to work together, some type of remote command execution must be set up between the nodes. Generally, such a program must allow known identities (users and services) to access remote services across the network based on unprompted authentication, that is, establishing the remote identity without an interactive prompt for a user name or password.



Figure 7-5 presents the files involved in remote command execution using secure shell with unprompted authentication.

Figure 7-5 The ssh files involved in un-prompted authentication (on node1, the ssh client, user root@node1 uses /usr/bin/ssh together with ~/.ssh/id_rsa, ~/.ssh/id_rsa.pub, and ~/.ssh/known_hosts; on node2, the ssh server, sshd listens on TCP port 22 and uses /etc/ssh/ssh_host_rsa_key, /etc/ssh/ssh_host_rsa_key.pub, and root@node2's ~/.ssh/authorized_keys)

In the diagram, we assume that user root@node1 wants to execute a command (date) on node2, also as user root (root@node2):
root@node1_# ssh root@node2 date

We assume that no previous configuration was done and that no initial trust has been established. Thus, the following happens:
► User root@node1 sends its identity to the server, specifying that it wants to connect as root@node2.
► The server (sshd) on node2 sends its public key (/etc/ssh/ssh_host_rsa_key.pub) to the terminal that was used to initiate the connection, asking the user (root@node1) whether to accept the key (you have to explicitly type in "yes").
► When accepted, the key is stored in the ~/.ssh/known_hosts file of the root@node1 user, and a session key is generated. The session key is used to encrypt the information transmitted during this session (until the connection closes).

Note: At this point in time, the ~/.ssh directory might not even exist. When you accept the server's key, the directory is created along with the known_hosts file.
► The server looks for the root@node2 user's ~/.ssh/authorized_keys file.
► If this file exists, the server checks it for root@node1's public key (which has not yet been created on node1).
► Because it cannot authenticate the user, it passes control to the system authentication, which prompts for the root@node2 password.
► When the password is typed in correctly, the date command is executed and its result is returned to the root@node1 user's terminal.

However, if the configuration and initial trust have been performed, the date command is executed without prompting the user for a password.

To achieve this, you have to do the following:
► On node1, as root, generate the ssh client keys, type rsa, no passphrase:
root@node1 #/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
► On node1, as the root user, grab node2's ssh server key and store it in the local root user's known_hosts file:
root@node1 #/usr/bin/ssh-keyscan -t rsa node2 >> ~/.ssh/known_hosts
► On node1, send the root user's public key previously generated to root@node2:
root@node1 #/usr/bin/scp ~/.ssh/id_rsa.pub root@node2:~/.ssh/node1_rsa.pub
► On node2, add the public key received to the root user's authorized_keys file:
root@node2 #cat ~/.ssh/node1_rsa.pub >> ~/.ssh/authorized_keys

Now, you can execute remote commands as root@node2, from root@node1, without a password.
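The same four steps, run from node1 only, can be sketched as follows; this assumes that root can still type a password for the two prompted steps, that the ~/.ssh directory already exists on node2, and that the helper file name node1_rsa.pub is the one used in the steps above.

#!/bin/bash
# Sketch: one-way unprompted ssh from root@node1 to root@node2
NODE=node2
/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''          # client key pair
/usr/bin/ssh-keyscan -t rsa $NODE >> ~/.ssh/known_hosts    # trust node2's host key
/usr/bin/scp ~/.ssh/id_rsa.pub root@$NODE:~/.ssh/node1_rsa.pub
/usr/bin/ssh root@$NODE \
    "cat ~/.ssh/node1_rsa.pub >> ~/.ssh/authorized_keys"   # last prompted step
/usr/bin/ssh root@$NODE date                               # should no longer prompt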

Attention: For simplicity, our example refers to a one-way configuration, that is, the root user, using the ssh client on node1, executes a command (without interactive authentication) as root on node2.

If you want user root@node2 to be able to run a remote command (without interactive authentication) as root on node1, you need to configure symmetric files on both nodes.

Objective
Because many of today's clusters are connected through multiple networks, we would like our cluster to be as secure as possible and, at the same time, to allow access to remote resources in the cluster as seamlessly as possible. For this purpose, secure shell offers a reasonable solution with a good compromise for the security/administrative overhead ratio; thus, it is the de facto standard tool for remote administration tasks. However, some basic security knowledge and a good understanding of the secure shell client-server implementation are required to achieve good results.

Our objective is to set up a cluster for remote command execution for the root user on all nodes without interactive authentication, using a single set of keys for all ssh servers (daemons) running on every node in the cluster.

For our exercise, we used the sample cluster shown in Figure 7-6.

Figure 7-6 Sample cluster configuration (three nodes: p630n01 at 172.16.1.31, p630n02 at 172.16.1.32, and p630n03 at 172.16.1.33)

The three nodes run AIX 5.3 TL4, openssl 0.9.7-2g, and openssh 4.1. As it is outside the scope of this material, we do not explain how to install the software or configure the basic OS.

Checking and setting up the configuration
We started from scratch with ssh (no previous connection, nor any authentication configuration). We performed the following steps:
► On node p630n01, we checked the server configuration for the location of the server keys, then generated three pairs of keys for the secure shell server (rsa1, rsa2, and dsa):
root@p630n01_#/usr/bin/ssh-keygen -t rsa1 -f /etc/ssh/ssh_host_key -N ''
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
root@p630n01_#/usr/bin/ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -N ''
► We restarted the ssh daemon (sshd) on p630n01 so that it picks up the new keys.
► We propagated all three pairs of keys in the /etc/ssh directory to p630n02 and p630n03 (using scp).



► We restarted the ssh daemons on p630n02 and p630n03.

Note: If you work remotely from a single node, it is a good idea to restart the entire node if you cannot restart just the ssh daemon.
► To avoid any conflicts, we wiped out the contents of the root user's ~/.ssh/known_hosts file on p630n01:
root@p630n01_# > ~/.ssh/known_hosts
► We gathered the ssh server public keys from all nodes into a new known_hosts file. For this, we used a file (my_nodes) containing the host names (IP labels) of nodes p630n01, p630n02, and p630n03 (a text file, one host name per line):
root@p630n01_# /usr/bin/ssh-keyscan -t rsa -f my_nodes > ~/.ssh/known_hosts

Note: The format of the known_hosts file uses one line per known host. Using the ssh-keyscan command assumes that all nodes are up and running (sshd as well). However, if not all nodes are up, or if you have tens of nodes (or more) in your cluster and you plan to use a single pair of keys for all machines in your cluster, you can use wildcard characters to specify the host part of the key in the known_hosts file, using one line (and one key) to cover all hosts in a specified IP range. For details, check the ssh man pages.
► We generated the root user's pair of keys. For this, we chose the rsa2 (which is, in fact, the rsa type) algorithm:
root@p630n01_#/usr/bin/ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
► We copied the public key into a fresh authorized_keys file (if you already have such a file, you need to check it for duplicates and correct entries):
root@p630n01_# cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
► We propagated the entire contents of the ~/.ssh directory to all nodes in the cluster (p630n01, p630n02, and p630n03).
► We tested the authentication.
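A minimal sketch of the propagation and test steps follows; it assumes the my_nodes file described above, that the host keys in /etc/ssh and root's ~/.ssh directory have already been prepared on p630n01, and that the first pass might still prompt for passwords until the keys are in place. The sshd restart method varies by operating system; the AIX SRC commands shown are an example.

#!/bin/bash
# Sketch: push the common ssh host keys and root's ~/.ssh to every node, then test
for node in $(cat my_nodes); do
    scp /etc/ssh/ssh_host_*key* root@$node:/etc/ssh/   # same host keys everywhere
    scp -r ~/.ssh root@$node:~/                        # known_hosts, id_rsa*, authorized_keys
    ssh root@$node "stopsrc -s sshd; startsrc -s sshd" # restart sshd (AIX SRC example)
done
for node in $(cat my_nodes); do
    ssh root@$node date                                # must not prompt for a password
done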

7.2.4 Using ssh in a Blue Gene/L environment
In a Blue Gene/L environment, secure shell is used to allow remote command execution between the SN and the I/O Nodes (mainly, but not only, for GPFS). It can also be used for mpirun (between the FENs and the SN).

As a particularity in Linux, when using remote command execution between the FENs and the SN, the rsh/rshd pair does not allow you to log in interactively to the SN, because the rlogind (remote login daemon) and telnetd are not usually running on the SN (the default configuration in SUSE SLES9). However, if you plan to use ssh as the remote command execution program between the FENs and the SN, you need a special configuration for the authorized_keys file of the users allowed to execute remote commands on the SN.

As previously mentioned, the format of the authorized_keys file allows you to customize the behavior of remote command execution with ssh (see the ssh man pages).

Tip: Using the no-pty option in front of the public key in the authorized_keys file allows you to execute remote commands without being prompted for a password. However, you will not get a pseudo-terminal (pty) if you try to open a shell (ssh without any command).
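For illustration, such an authorized_keys entry is a single line similar to the following; the key material is abbreviated and the trailing comment field is only an example.

# ~/.ssh/authorized_keys on the SN (one entry per line; key shortened here)
no-pty ssh-rsa AAAAB3NzaC1yc2EAAA...truncated...== root@bglfen1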

For detailed information about Open Secure Shell, credits, and the latest news, see:
http://www.openssh.org


Appendix A. Installing and setting up LoadLeveler for Blue Gene/L

This appendix describes the steps for installing and setting up LoadLeveler for the Blue Gene/L system built during the writing of this redbook. It shows one way of setting up and configuring a LoadLeveler cluster. Special attention is given to Blue Gene/L specific items and procedures, such as:
► The additional 32-bit library RPM
► The specific option flag when running the installation script
► Blue Gene/L configuration keywords
► Blue Gene/L environment variables

At the end of the setup process, the message Blue Gene is present is a key indication of a successful installation and configuration of LoadLeveler.



Installing LoadLeveler on the SN and FENs
This section explains how to install LoadLeveler on the SN and FENs.

Obtaining the rpms
The RPMs can come from the provided CD-ROM. The upgrade rpms can also be downloaded from the IBM service support Web site:
http://www14.software.ibm.com/webapp/set2/sas/f/loadleveler/download/unix.html

There are five required rpms for LoadLeveler. Example A-1 lists the rpms for the IBM System p platform running SLES9.
Example: A-1 LoadLeveler RPMs<br />

LoadL-full-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-so-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-so-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm<br />

LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm<br />

Note: LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm is required for Blue<br />

Gene/L only.<br />

On the download Web site, click View to see information about the upgrade<br />

(Figure A-1).<br />

Figure A-1 Downloading LoadLeveler rpms from the Web<br />

In addition to the five RPMs, the Java rpm has to be present in the same<br />

directory. Although the system has a later version of Java installed, the following<br />

rpm is required:<br />

IBMJava2-JRE-ppc64-1.4.2-0.0.ppc64.rpm<br />



Note: Without this Java rpm, the install_ll script does not run. See IBM TWS LoadLeveler Installation Guide, GI10-0763-02, for further information.

Installing the rpms
First, the following command installs the LoadLeveler license in the directory /opt/ibmll/LoadL/:
rpm -ivh LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm

Then, locate the install_ll script in /opt/ibmll/LoadL/sbin/ and invoke it as follows:
./install_ll -y -b -d <rpm_directory>
where <rpm_directory> is the directory containing the LoadLeveler rpms.

Note: The option flag -b is important; it tells install_ll to install the LoadL-full-lib-SLES9-PPC-3.3.1.0-3.ppc.rpm. Without this flag specified, install_ll installs only the rpms for a regular SLES9 node.

The same installation procedure has to be repeated on all nodes. If available, dsh can be used to perform the installation on remote nodes.
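A minimal loop-based sketch of repeating the installation on the Front-End Nodes from the SN is shown below; the host names match the environment described later in this appendix, and the rpm directory path is only an example.

#!/bin/bash
# Sketch: repeat the LoadLeveler installation on the FENs over ssh
RPMDIR=/bgl/loadl_rpms      # directory holding the LoadLeveler (and Java) rpms - example path
for node in bglfen1 bglfen2; do
    ssh $node "rpm -ivh $RPMDIR/LoadL-full-license-SLES9-PPC64-3.3.1.0-3.ppc64.rpm"
    ssh $node "/opt/ibmll/LoadL/sbin/install_ll -y -b -d $RPMDIR"
done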

Setting up the LoadLeveler cluster
This section explains how to set up the LoadLeveler cluster.

LoadLeveler user and group IDs
LoadLeveler requires one user ID to be the administrator. It is also recommended that a group ID be created for the purpose of running LoadLeveler. If a group ID does not already exist, create a loadl group with the following command:
groupadd -g 7000 loadl

Note: The group ID 7000 is arbitrary. However, it has to be the same on all nodes in the cluster.

If a user ID does not already exist, create a loadl user ID with the following command:
useradd -d /home/loadl -g loadl -u 7001 -p loadl loadl



Create the home directory for user loadl. This is a local file system:
mkdir /home/loadl

LoadLeveler configuration
Create a common (NFS mounted) directory to contain the LoadLeveler configuration files:
mkdir /bgl/loadlcfg

Copy the two provided sample files, LoadL_admin and LoadL_config, from /opt/ibmll/LoadL/full/samples/ to /bgl/loadlcfg and make the appropriate changes.

Create the file /etc/LoadL.cfg with content similar to Example A-2.

Example: A-2 Content of /etc/LoadL.cfg
LoadLUserid = loadl
LoadLGroupid = loadl
LoadLConfig = /bgl/loadlcfg/LoadL_config

Create the LoadLeveler directories, such as log, spool, and execute, under /home/loadl by issuing the following command as user loadl:
/opt/ibmll/LoadL/full/bin/llinit -local /home/loadl

Note: The llinit command also creates a link to the LoadLeveler bin directory from /home/loadl, so that user loadl can invoke the LoadLeveler commands such as llctl, llstatus, llq, and so forth, without requiring the directory to be added to the $PATH variable.

Again, the llinit command has to be run on all nodes.

Enable rsh on all nodes. Then, start LoadLeveler with the following command:
llctl -g start



Enabling Blue Gene/L capabilities in LoadLeveler
Up to this point, we have a regular LoadLeveler cluster on these Linux nodes. LoadLeveler does not know anything about Blue Gene/L yet.

To tell LoadLeveler about the Blue Gene/L system, add these Blue Gene/L specific keywords to the global configuration file /bgl/loadlcfg/LoadL_config:
BG_ENABLED = true
BG_ALLOW_LL_JOBS_ONLY = false
BG_MIN_PARTITION_SIZE = 32
BG_CACHE_PARTITIONS = true

Now, stop and start LoadLeveler, and it recognizes Blue Gene/L. However, it might say "Blue Gene is absent", because LoadLeveler cannot find the relevant Blue Gene/L libraries yet. See Figure 4-12 on page 172 for the message Blue Gene is present displayed by the llstatus command.

To create the appropriate symbolic links for the libraries that LoadLeveler needs, run the following script as the root user:
/home/loadl/bglinks

The system administrator has to create this file and run it once on every node. The contents of the script are described in this redbook; see 4.4.5, “Making the Blue Gene/L libraries available to LoadLeveler” on page 173.

Setting Blue Gene/L specific environment variables
Set up the following environment variables in the .bashrc for user loadl:
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export DB_PROPERTY=/bgl/BlueLight/ppcfloor/bglsys/bin/db.properties
export MMCS_SERVER_IP=bglsn.itso.ibm.com

Note: The actual directory paths vary on different systems.

Now stop and restart LoadLeveler. It should say Blue Gene is present. See Figure 4-12 on page 172 and 4.4.9, “LoadLeveler checklist” on page 186 for detailed descriptions of LoadLeveler status checking.
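As a minimal check sequence (the commands are the ones used throughout this appendix; the grep filter is our addition):

# Restart LoadLeveler and confirm that Blue Gene support is active
su - loadl -c "llctl -g stop"
su - loadl -c "llctl -g start"
su - loadl -c "llstatus" | grep -i "blue gene"   # should report: Blue Gene is present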



Example A-3 presents a sample LoadL_admin file we used in our environment.

Example: A-3 Sample LoadL_admin file
# LoadL_admin file: Remove comments and edit this file to suit your installation.
# This file consists of machine, class, user, group and adapter stanzas.
# Each stanza has defaults, as specified in a "defaults:" section.
# Default stanzas are used to set specifications for fields which are
# not specified.
# Class, user, group, and adapter stanzas are optional. When no adapter
# stanzas are specified, LoadLeveler determines adapters dynamically. Refer to
# Using and Administering LoadLeveler for detailed information about
# keywords and associated values. Also see LoadL_admin.1 in the
# ~loadl/samples directory for sample stanzas.
#############################################################################
# DEFAULTS FOR MACHINE, CLASS, USER, AND GROUP STANZAS:
# Remove initial # (comment), and edit to suit.
#
default: type = machine

default: type = class                     # default class stanza
         wall_clock_limit = 30:00         # default wall clock limit

default: type = user                      # default user stanza
         default_class = No_Class         # default class = No_Class (not optional)
         default_group = No_Group         # default group = No_Group (not optional)
         default_interactive_class = inter_class

default: type = group                     # default group stanza
#        priority = 0                     # default GroupSysprio
#        maxjobs = -1                     # default maximum jobs group is allowed
#                                         # to run simultaneously (no limit)
#        maxqueued = -1                   # default maximum jobs group is allowed
#                                         # on system queue (no limit). Does not
#                                         # limit jobs submitted.
#############################################################################
# MACHINE STANZAS:
# These are the machine stanzas; the first machine is defined as
# the central manager. mach1:, mach2:, etc. are machine name labels -
# revise these placeholder labels with the names of the machines in the
# pool, and specify any schedd_host and submit_only keywords and values
# (true or false), if required.
#############################################################################
bglsn.itso.ibm.com:   type = machine
                      schedd_host = true
                      central_manager = true
bglfen1.itso.ibm.com: type = machine
                      schedd_host = true
bglfen2.itso.ibm.com: type = machine
                      schedd_host = true

Appendix A. Installing and setting up LoadLeveler for Blue Gene/L 415


Example A-4 presents a sample LoadL_config file we used in our environment.<br />

Example: A-4 Sample LoadL_config file<br />

#<br />

# Machine Description<br />

#<br />

ARCH = PPC64<br />

#<br />

# Blue Gene Specific Settings<br />

#<br />

BG_ENABLED = true<br />

BG_ALLOW_LL_JOBS_ONLY = false<br />

#BG_ALLOW_LL_JOBS_ONLY = true<br />

BG_MIN_PARTITION_SIZE = 32<br />

BG_CACHE_PARTITIONS = true<br />

#<br />

# Specify LoadLeveler Administrators here:<br />

#<br />

LOADL_ADMIN = loadl root<br />

#<br />

# Default to starting LoadLeveler daemons when requested<br />

#<br />

START_DAEMONS = TRUE<br />

#<br />

# Machine authentication<br />

#<br />

# If TRUE, only connections from machines in the ADMIN_LIST are accepted.
# If FALSE, connections from any machine are accepted. Default if not
# specified is FALSE.

#<br />

MACHINE_AUTHENTICATE = FALSE<br />

#<br />

# Specify which daemons run on each node<br />

#<br />

SCHEDD_RUNS_HERE = True<br />

STARTD_RUNS_HERE = True<br />

# Specify pathnames<br />



#<br />

RELEASEDIR = /opt/ibmll/LoadL/full<br />

LOCAL_CONFIG = $(tilde)/LoadL_config.local<br />

ADMIN_FILE = /bgl/loadlcfg/LoadL_admin<br />

LOG = $(tilde)/log<br />

SPOOL = $(tilde)/spool<br />

EXECUTE = $(tilde)/execute<br />

HISTORY = $(SPOOL)/history<br />

RESERVATION_HISTORY = $(SPOOL)/reservation_history<br />

BIN = $(RELEASEDIR)/bin<br />

LIB = $(RELEASEDIR)/lib<br />

#<br />

# Specify port numbers<br />

#<br />

MASTER_STREAM_PORT = 9616<br />

NEGOTIATOR_STREAM_PORT = 9614<br />

SCHEDD_STREAM_PORT = 9605<br />

STARTD_STREAM_PORT = 9611<br />

COLLECTOR_DGRAM_PORT = 9613<br />

STARTD_DGRAM_PORT = 9615<br />

MASTER_DGRAM_PORT = 9617<br />

#<br />

# Specify a scheduler type: LL_DEFAULT, API, BACKFILL, GANG<br />

# API specifies that internal LoadLeveler scheduling algorithms be<br />

# turned off and LL_DEFAULT specifies that the original internal<br />

# LoadLeveler scheduling algorithm be used.<br />

#<br />

SCHEDULER_TYPE = BACKFILL<br />

#<br />

# Specify accounting controls<br />

# To turn reservation data recording on, add the flag A_RES to ACCT<br />

#<br />

ACCT = A_OFF A_RES<br />

ACCT_VALIDATION = $(BIN)/llacctval<br />

GLOBAL_HISTORY = $(SPOOL)<br />

#<br />

# Specify checkpointing intervals<br />

#<br />

MIN_CKPT_INTERVAL = 900<br />

MAX_CKPT_INTERVAL = 7200<br />

# perform cleanup of checkpoint files once a day<br />

# 24 hrs x 60 min/hr x 60 sec/min = 86400 sec/day<br />



CKPT_CLEANUP_INTERVAL = 86400<br />

# sample source for the ckpt file cleanup program is shipped with LoadLeveler

# and is found in: /usr/lpp/LoadL/full/samples/llckpt/rmckptfiles.c<br />

#<br />

# compile the source and indicate the location of the executable<br />

# as shown in the following example<br />

CKPT_CLEANUP_PROGRAM = /u/mylladmin/bin/rmckptfiles<br />

# LoadL_KeyboardD Macros<br />

#<br />

KBDD = $(BIN)/LoadL_kbdd<br />

KBDD_LOG = $(LOG)/KbdLog<br />

MAX_KBDD_LOG = 64000<br />

KBDD_DEBUG =<br />

#<br />

# Specify whether to start the keyboard daemon<br />

#<br />

#if HAS_X<br />

X_RUNS_HERE = True<br />

#else<br />

X_RUNS_HERE = False<br />

#endif<br />

#<br />

# LoadL_StartD Macros<br />

#<br />

STARTD = $(BIN)/LoadL_startd<br />

STARTD_LOG = $(LOG)/StartLog<br />

MAX_STARTD_LOG = 64000<br />

STARTD_DEBUG =<br />

POLLING_FREQUENCY = 5<br />

POLLS_PER_UPDATE = 24<br />

JOB_LIMIT_POLICY = 120<br />

JOB_ACCT_Q_POLICY = 300<br />

PROCESS_TRACKING = FALSE<br />

PROCESS_TRACKING_EXTENSION = $(BIN)<br />

#ifdef KbdDeviceName<br />

KBD_DEVICE = KbdDeviceName<br />

#endif<br />

#ifdef MouseDeviceName<br />



MOUSE_DEVICE = MouseDeviceName<br />

#endif<br />

#<br />

# LoadL_SchedD Macros<br />

#<br />

SCHEDD = $(BIN)/LoadL_schedd<br />

SCHEDD_LOG = $(LOG)/SchedLog<br />

MAX_SCHEDD_LOG = 64000<br />

SCHEDD_DEBUG =<br />

SCHEDD_INTERVAL = 120<br />

CLIENT_TIMEOUT = 30<br />

#<br />

# Negotiator Macros<br />

#<br />

NEGOTIATOR = $(BIN)/LoadL_negotiator<br />

NEGOTIATOR_DEBUG = D_NEGOTIATE D_FULLDEBUG<br />

NEGOTIATOR_LOG = $(LOG)/NegotiatorLog<br />

MAX_NEGOTIATOR_LOG = 64000<br />

NEGOTIATOR_INTERVAL = 60<br />

MACHINE_UPDATE_INTERVAL = 300<br />

NEGOTIATOR_PARALLEL_DEFER = 300<br />

NEGOTIATOR_PARALLEL_HOLD = 300<br />

NEGOTIATOR_REDRIVE_PENDING = 90<br />

NEGOTIATOR_RESCAN_QUEUE = 90<br />

NEGOTIATOR_REMOVE_COMPLETED = 0<br />

NEGOTIATOR_CYCLE_DELAY = 0<br />

NEGOTIATOR_CYCLE_TIME_LIMIT = 0<br />

#<br />

# Sets the interval between recalculation of the SYSPRIO values<br />

# for all the jobs in the queue<br />

#<br />

NEGOTIATOR_RECALCULATE_SYSPRIO_INTERVAL = 0<br />

#<br />

# GSmonitor Macros<br />

#<br />

GSMONITOR = $(BIN)/LoadL_GSmonitor<br />

GSMONITOR_DEBUG =<br />

GSMONITOR_LOG = $(LOG)/GSmonitorLog<br />

MAX_GSMONITOR_LOG = 64000<br />

#<br />

# Starter Macros<br />



#<br />

STARTER = $(BIN)/LoadL_starter<br />

STARTER_DEBUG =<br />

STARTER_LOG = $(LOG)/StarterLog<br />

MAX_STARTER_LOG = 64000<br />

#<br />

# LoadL_Master Macros<br />

#<br />

MASTER = $(BIN)/LoadL_master<br />

MASTER_LOG = $(LOG)/MasterLog<br />

MASTER_DEBUG =<br />

MAX_MASTER_LOG = 64000<br />

RESTARTS_PER_HOUR = 12<br />

PUBLISH_OBITUARIES = TRUE<br />

OBITUARY_LOG_LENGTH = 25<br />

#<br />

# Specify whether log files are truncated when opened<br />

#<br />

TRUNC_MASTER_LOG_ON_OPEN = False<br />

TRUNC_STARTD_LOG_ON_OPEN = False<br />

TRUNC_SCHEDD_LOG_ON_OPEN = False<br />

TRUNC_KBDD_LOG_ON_OPEN = False<br />

TRUNC_STARTER_LOG_ON_OPEN = False<br />

TRUNC_NEGOTIATOR_LOG_ON_OPEN = False<br />

TRUNC_GSMONITOR_LOG_ON_OPEN = False<br />

#<br />

# Machine control expressions and macros<br />

#<br />

OpSys : "$(OPSYS)"<br />

Arch : "$(ARCH)"<br />

Machine : "$(HOST).$(DOMAIN)"<br />

#<br />

# Expressions used to control starting and stopping of foreign jobs

#<br />

MINUTE = 60<br />

HOUR = (60 * $(MINUTE))<br />

StateTimer = (CurrentTime - EnteredCurrentState)<br />

BackgroundLoad = 0.7<br />

HighLoad = 1.5<br />

StartIdleTime = 15 * $(MINUTE)<br />

ContinueIdleTime = 5 * $(MINUTE)<br />



MaxSuspendTime = 10 * $(MINUTE)<br />

MaxVacateTime = 10 * $(MINUTE)<br />

KeyboardBusy = KeyboardIdle < $(POLLING_FREQUENCY)<br />

CPU_Idle = LoadAvg <= $(BackgroundLoad)
CPU_Busy = LoadAvg >= $(HighLoad)

#<br />

# See Using and Administering LoadLeveler for an explanation of these<br />

# control expressions<br />

#<br />

# START : $(CPU_Idle) && KeyboardIdle > $(StartIdleTime)<br />

# SUSPEND : $(CPU_Busy) || $(KeyboardBusy)<br />

# CONTINUE : $(CPU_Idle) && KeyboardIdle > $(ContinueIdleTime)<br />

# VACATE : $(StateTimer) > $(MaxSuspendTime)<br />

# KILL : $(StateTimer) > $(MaxVacateTime)<br />

START : T<br />

SUSPEND : F<br />

CONTINUE : T<br />

VACATE : F<br />

KILL : F<br />

#<br />

# The following (default) expression for SYSPRIO creates a FIFO job queue.

#<br />

SYSPRIO: 0 - (QDate)<br />

#MACHPRIO: 0 - (1000 * (LoadAvg / (Cpus * Speed)))<br />

#<br />

# The following (default) expression for MACHPRIO orders<br />

# machines by load average.<br />

#<br />

MACHPRIO: 0 - (LoadAvg)<br />

#<br />

# The MAX_JOB_REJECT value determines how many times a job can be
# rejected before it is canceled or put on hold. The default value
# is 0, which indicates a rejected job will immediately be canceled
# or placed on hold. MAX_JOB_REJECT may be set to unlimited rejects
# by specifying a value of -1.



#<br />

MAX_JOB_REJECT = 0<br />

#<br />

# When ACTION_ON_MAX_REJECT is HOLD, jobs will be put on user hold<br />

# when the number of rejects reaches the MAX_JOB_REJECT value. When
# ACTION_ON_MAX_REJECT is CANCEL, jobs will be canceled when the

# number of rejects reaches the MAX_JOB_REJECT value. The default<br />

# value is HOLD.<br />

#<br />

ACTION_ON_MAX_REJECT = HOLD<br />

# Filesystem Monitor Interval and Thresholds
# Monitoring interval is in minutes and should be set according to how
# fast the filesystem grows
FS_INTERVAL = 30
# File System space thresholds are specified in Bytes. Scaling factors
# such as K, M and G are allowed.
FS_NOTIFY = 750KB,1MB
FS_SUSPEND = 500KB,750KB
FS_TERMINATE = 100MB,100MB
# File System inode thresholds are specified in number of inodes. Scaling
# factors such as K, M and G are allowed.
INODE_NOTIFY = 1K,1.1K
INODE_SUSPEND = 500,750
INODE_TERMINATE = 50,50



Appendix B. The sitefs file

This appendix includes the sitefs file that we used. We created it from the
example sitefs file that is shown in the file:

/bgl/BlueLight/ppcfloor/docs/ionode.README

We added the lines that we needed and saved the file in the following directory
on the Service Node so that it survived any upgrades to the Blue Gene/L driver:

/bgl/dist/etc/rc.d/init.d

The /bgl/dist/etc/rc.d/init.d/sitefs file

The following listing shows the sitefs file as we used it in our environment.

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

# by GSA ADP Schedule Contract with IBM Corp.<br />

#<br />

# Licensed Materials-Property of IBM<br />

# -------------------------------------------------------------<br />

#<br />

# -------------------------------------------------------------<br />

# NOTE: The PATH environment variable is set to the following<br />

# upon entry to this script:<br />

# /bin.rd:/sbin.rd:/usr/bin:/bin:/usr/sbin:/sbin<br />

# The /bin.rd and /sbin.rd directories contain many of<br />

# the busybox commands and Blue Gene specific programs.<br />

# The /bin, /sbin, and /usr directories are symbolic<br />

# links to the NFS-mounted MCP.<br />

#--------------------------------------------------------------<br />

#<br />

#<br />

# /etc/init.d/syslog<br />

# Default-Stop:<br />

# Description: Start the system logging daemons<br />

### END INIT INFO<br />

# Source config file, if it exists.<br />

test -f /etc/sysconfig/syslog || exit 6<br />

. /etc/sysconfig/syslog<br />

BINDIR=/sbin.rd<br />

case "$SYSLOG_DAEMON" in<br />

syslog-ng)<br />

syslog=syslog-ng<br />

config=/etc/syslog-ng/syslog-ng.conf<br />

params="$SYSLOG_NG_PARAMS"<br />

;;<br />

*)<br />

syslog=syslogd<br />

config=/etc/syslog.conf<br />

params="$SYSLOGD_PARAMS"<br />

# Add additional sockets to SYSLOGD_PARAMS<br />

# Extract the names of the variables beginning with SYSLOGD_ADDITIONAL_SOCKET
SYSLOGD_ADDITIONAL_SOCKET_LIST=`grep SYSLOGD_ADDITIONAL_SOCKET /etc/sysconfig/syslog | sed s/=.*//`

for variable in ${SYSLOGD_ADDITIONAL_SOCKET_LIST}; do<br />

eval value=\$$variable<br />

test -n "${value}" && test -d ${value%/*} && \<br />

params="$params -a $value"<br />

done<br />

;;<br />

esac<br />

syslog_pid="/var/run/${syslog}.pid"<br />

# check config and programs<br />

test -x ${BINDIR}/$syslog || exit 5<br />

test -x ${BINDIR}/klogd || exit 5<br />

# If there is no config file in the ramdisk, create a simple one<br />

# that logs important messages to /dev/console<br />

if [ ! -e ${config} ]; then<br />

# Note: "*.warn" produces numerous boot messages that may affect<br />

boot performance.<br />

# Therefore, warnings (and higher log levels) are not logged,<br />

by default.<br />

echo "authpriv.none;*.emerg;*.alert;*.crit;*.err /dev/console" >><br />

${config}<br />

fi<br />

#<br />

# Do not translate symbol addresses for 2.6 kernel<br />

#<br />

case `uname -r` in<br />

0.*|1.*|2.[0-4].*)<br />

#!/bin/sh<br />

# -------------------------------------------------------------<br />

# Product(s):<br />

# 5733-BG1<br />

#<br />

# (C)Copyright IBM Corp. 2004, 2005<br />

# All rights reserved.<br />

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

# by GSA ADP Schedule Contract with IBM Corp.<br />

#<br />

# Licensed Materials-Property of IBM<br />

# -------------------------------------------------------------<br />



# -------------------------------------------------------------<br />

# NOTE: The PATH environment variable is set to the following<br />

#!/bin/sh<br />

# -------------------------------------------------------------<br />

# Product(s):<br />

# 5733-BG1<br />

#<br />

# (C)Copyright IBM Corp. 2004, 2005<br />

# All rights reserved.<br />

# US Government Users Restricted Rights -<br />

# Use, duplication or disclosure restricted<br />

chmod +x /var/mmfs/etc/mmfsup.scr<br />

# Start GPFS and wait for it to come up<br />

rm -f $upfile<br />

/usr/lpp/mmfs/bin/mmautoload<br />

retries=300<br />

until test -e $upfile<br />

do sleep 2<br />

let retries=$retries-1<br />

if [ $retries -eq 0 ]<br />

then ras_advisory "$0: GPFS did not come up on I/O node<br />

$HOSTID"<br />

exit 1<br />

fi<br />

done<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />

# It mounts a filesystem on /scratch, mounts /home for user files
# (applications), creates a symlink for /tmp to point into some directory
# in /scratch using the IP address of the I/O node as part of the directory
# name to make it unique to this I/O node, and sets up environment
# variables for ciod.

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />



# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />

# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#-------------------------------------------------------------------
mountSiteFs()

{<br />

# Make the directory to be mounted over<br />

[ -d $3 ] || mkdir $3<br />

# Set to attempt the mount 5 times<br />

ATTEMPT_LIMIT=5<br />

ATTEMPT=1<br />

# Attempt the mount up to 5 times.<br />

# Echo a message to the I/O node log for each failed attempt.<br />

until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />

sleep $ATTEMPT<br />

let ATTEMPT=$ATTEMPT+1<br />

done<br />

# If all attempts failed, send a fatal RAS event so the block fails to
# boot. If the mount worked, echo a message to the I/O node log.

if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />

echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />

exit<br />

else<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />

fi<br />

}<br />

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />



# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.30.1.33<br />

# First reset status of this service<br />

rc_reset<br />

# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />

start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...
mountSiteFs $SITEFS /bglscratch /bglscratch tcp,rsize=32768,wsize=32768,async

# Mount a home file system...<br />

# mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192<br />

# Arrange something for tmp...<br />

# - make a unique directory in the scratch file system.<br />

# - rename the original /tmp to /tmp.original.<br />

# - point /tmp to the unique directory.<br />

# tmpdir=/scratch/ionodetmp/$BGL_IP<br />

# [ -d $tmpdir ] || mkdir -p $tmpdir<br />

# mv /tmp /tmp.original<br />

# ln -s $tmpdir /tmp<br />

# Setup environment variables for ciod
# echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod
# echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod
# Uncomment the following line to not start the NTP daemon
# rm /etc/sysconfig/xntp
# Uncomment the following line to not start the syslogd and klogd
# daemons
# rm /etc/sysconfig/syslog
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.
echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs

# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >><br />

/etc/sysconfig/gpfs<br />

rc_status -v<br />

;;<br />

stop)<br />

echo Unmounting site filesystems<br />

# Put /tmp back<br />

# rm /tmp<br />

# mv /tmp.original /tmp<br />

# Unmount the scratch and home file systems<br />

umount -f /bglscratch<br />

# umount -f /home<br />

;;<br />

restart)<br />

;;<br />

status)<br />

;;<br />

esac<br />

rc_exit<br />

#------End of script---------<br />





Appendix C. The ionode.README file

This file might well change between releases. You can find the current version of
the ionode.README file that is compatible with the version of the code currently
in use on your Blue Gene/L system at:

/bgl/BlueLight/ppcfloor/docs/ionode.README

This appendix includes the ionode.README file. This file contains useful
information about the I/O node bootup sequence. It is installed when the
bglmcp*.rpm is installed. The version shown here is compatible with release
V2R2M1.


/bgl/BlueLight/ppcfloor/docs/ionode.README file

Copyright Notice<br />

All Rights Reserved Legend<br />

US Government Users Restricted Rights Notice<br />

I/O Node Startup and Shutdown Scripts<br />

=====================================<br />

This file contains a summary of the startup and shutdown scripts executed
on an I/O node.

"Dist" directories<br />

------------------<br />

There are three "distribution" directories involved during startup and<br />

shutdown.<br />

$BGL_DISTDIR<br />

The "dist" subdir located at the top of the Blue Gene driver install<br />

tree.<br />

This is referred to as the "system dist" or just "dist". This is<br />

normally<br />

located at "/bgl/BlueLight//ppc/dist".<br />

$BGL_SITEDISTDIR<br />

The "/bgl/dist" subdir. This is referred to as the "site dist", as it<br />

is<br />

intended for site-specific customization of the I/O node startup and<br />

shutdown.<br />

$BGL_OSDIR<br />

The Mini-Control Program (MCP) subdir. This contains many of the<br />

executables needed by programs that run in the I/O node. The I/O node<br />

/bin, /sbin, /lib, and /usr directories are links to directories within<br />

this subdir. This is normally located at "/bgl/OS/x.y", where "x.y" is<br />

the<br />

version and modification level of the MCP.<br />

All of these are treated as the root of a filesystem even though they<br />

are<br />

NOT explicitly exported and mounted.<br />

432 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong>


Contents of the ramdisk<br />

-----------------------<br />

The ramdisk contains a subset of files located in the system dist<br />

directories. These files remain in the system dist tree for the<br />

convenience of an administrator to read.<br />

/bgl NFS mount point of /bgl for bootstrap<br />

/bin/* shell commands implemented by busybox<br />

/sbin/* shell commands implemented by busybox<br />

/dev/* special device files<br />

/lib empty on startup (busybox is static linked)<br />

/proc mounted proc filesystem<br />

/proc/personality binary personality config (read by ciod)<br />

/proc/personality.sh shell personality config<br />

/tmp temp in ramdisk (very small)<br />

/etc/fstab minimal fstab<br />

/etc/group minimal group (note: ciod does not need group names defined)

/etc/passwd minimal passwd (note: ciod does not need names defined)

/etc/inittab inittab for starting startup and shutdown scripts and console shell

/etc/protocols minimal protocols<br />

/etc/rpc minimal rpc<br />

/etc/services minimal services<br />

/etc/sysconfig/xntp minimal NTP options file<br />

/etc/ntp.conf minimal NTP configuration file<br />

/etc/sysconfig/syslog minimal syslog options file<br />

/etc/syslog.conf minimal syslog configuration file<br />

/etc/rc.dist defines $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR

/etc/rc.shutdown first stage shutdown script<br />

/etc/rc.d/rc.sysinit first stage sysinit script<br />

/etc/rc.d/rc.ras verbose, ras_advisory and ras_fatal functions<br />

/etc/rc.d/rc.network bring up network and default route<br />

/etc/rc.d/rc3.d dir of start (S*) and shutdown (K*) scripts<br />

The last few "rc" scripts are of interest in this document.<br />

Startup flow between rc scripts
-------------------------------
The I/O node startup begins with /sbin/init which reads /etc/inittab. The
sysinit rule in inittab is coded to run /etc/rc.d/rc.sysinit. From here
the flow is as follows. Note that BGL_SITEDISTDIR is normally /bgl/dist.

/etc/rc.d/rc.sysinit
  mounts /proc
  includes /proc/personality.sh (e.g. $BGL_IP, $BGL_FS, etc)
  includes /etc/rc.d/rc.ras
  run /etc/rc.d/rc.network to bring up network and default route
  test for ethernet link status
  mount /bgl
  includes /etc/rc.dist to define $BGL_DISTDIR, $BGL_SITEDISTDIR, and $BGL_OSDIR
  run second stage startup from $BGL_DISTDIR/etc/rc.d/rc.sysinit2

$BGL_DISTDIR/etc/rc.d/rc.sysinit2
  NOTE: this second stage startup exists only in NFS
  replace empty /lib with symlink to $BGL_OSDIR/lib
  replace empty /usr with symlink to $BGL_OSDIR/usr
  replace empty /etc/rc.d/rc3.d with symlink under $BGL_DISTDIR
  load the tree device driver
  run $BGL_DISTDIR/etc/rc.d/rc3.d/S* start scripts and
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* start scripts
  run $BGL_SITEDISTDIR/etc/rc.local

Note that the start scripts run by rc.sysinit2 are selected from both the
installation dist directory as well as the site dist directory and run in
numeric order. Therefore the site dist directory can contain scripts that
start before Blue Gene system software such as ciod. If a start script in
the site dist directory has the same name as a start script in the
installation dist directory, only the script in the installation dist
directory is run. Start scripts having the same number are run
alphabetically.
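The merge-and-order rule described above can be pictured with a small shell sketch. This is not the actual rc.sysinit2 code, only an illustration of the documented behavior (the system dist copy wins when both directories contain a script with the same name, and the merged set runs in sorted name order):

#!/bin/sh
# Illustrative only: emulate the documented start-script selection order
for name in `ls $BGL_DISTDIR/etc/rc.d/rc3.d/S* \
                $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S* 2>/dev/null \
             | xargs -n1 basename | sort -u`; do
    if [ -x $BGL_DISTDIR/etc/rc.d/rc3.d/$name ]; then
        $BGL_DISTDIR/etc/rc.d/rc3.d/$name start      # system dist copy wins
    else
        $BGL_SITEDISTDIR/etc/rc.d/rc3.d/$name start  # site-only script
    fi
done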

Shutdown flow between rc scripts
--------------------------------
When a block is freed, the shutdown rule in /etc/inittab is coded to run
/etc/rc.shutdown. From here, the flow is as follows:

/etc/rc.shutdown
  run $BGL_SITEDISTDIR/etc/rc.local.shutdown
  run $BGL_DISTDIR/etc/rc.d/rc3.d/K* shutdown scripts and
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/K* shutdown scripts
      in numeric order
  unmount /bgl and any remaining mounted file systems

Shell variables during startup and shutdown
-------------------------------------------
The special file /proc/personality.sh is a shell script that sets shell
variables to node specific values. These are used by the rc.* scripts
and may also be useful within site-written start and shutdown scripts.
These are not exported variables so each script will need to source this
file (or export the variables to other scripts). See the example script
below.

The file /etc/rc.dist contains three additional variable definitions not
found in /proc/personality.sh and may also be useful. These are marked
with an (*) below.

BGL_MAC Ethernet mac address of this I/O node<br />

BGL_IP IP address of this I/O node<br />

BGL_NETMASK Netmask to be used for this I/O node<br />

BGL_BROADCAST Broadcast address for this I/O node<br />

BGL_GATEWAY Gateway address for the default route for this I/O node<br />

BGL_MTU MTU for this I/O node<br />

BGL_FS Fileserver IP address for /bgl<br />

BGL_EXPORTDIR Fileserver export directory for /bgl<br />

BGL_LOCATION Location string for this I/O node<br />

BGL_PSETNUM PSet number for this I/O node (0..$BGL_NUMPSETS-1)<br />

BGL_NUMPSETS Total number of PSets in this block (i.e. # of I/O nodes)

BGL_NODESINPSET Number of compute nodes served by this I/O node<br />

BGL_{X,Y,Z}SIZE Size of block in nodes<br />

BGL_VIRTUALNM 1 if the block is running in virtual node mode<br />



BGL_BLOCKID Name of block<br />

BGL_VERBOSE Defined if the block is created with the "io_node_verbose" option

BGL_SNIP Functional network IP address of the service node<br />

BGL_MEMSIZE Bytes of RAM for this I/O node<br />

BGL_VERSION Blue Gene personality version<br />

BGL_DISTDIR(*) Path to top of dist directory in NFS<br />

BGL_SITEDISTDIR(*) Path to top of site dist directory in NFS (/bgl/dist)

BGL_OSDIR(*) Path to I/O node MCP subdir (/bgl/OS/x.y)<br />
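As a minimal illustration of how a site-written start script can use these values (the variable names are those documented above; the message written to the console is only an example), the script sources the personality file before referring to them:

#!/bin/sh
# Example only: pick up node-specific values, then use them
. /proc/personality.sh
echo "I/O node $BGL_IP at $BGL_LOCATION serves $BGL_NODESINPSET compute nodes" > /dev/console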

Environment Variables used by ciod<br />

----------------------------------<br />

The following environment variables are used by ciod:<br />

CIOD_RDWR_BUFFER_SIZE=value
This value specifies the size, in bytes, of each buffer used by ciod to
issue read and write system calls. One buffer is allocated for each
compute node CPU associated with this I/O node. The default size, if not
specified, is 87600 bytes. A larger size may improve performance. If you
are using a GPFS file server, it is recommended that this buffer size match
the GPFS block size. Experiment with different sizes, such as 262144
(256K) or 524288 (512K), to get the best performance.

DEBUG_SOCKET_STARTUP={ ALL, }
When specified, this variable causes ciod to start up sockets for debugging
all blocks, or the specified block.

These variables can be specified in the ciod start script,
$BGL_DISTDIR/etc/rc.d/rc3.d/S50ciod. It is recommended that you create
a /etc/sysconfig/ciod file in the ramdisk that specifies these variables.
You can create this file in your $BGL_SITEDISTDIR/etc/rc.d/rc3.d/Sxxxxxx
script that runs before ciod is started. Refer to "Typical Site
Customization" in this README for an example.

Support for Network Time Protocol (NTP)<br />



---------------------------------------<br />

During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S15xntpd script
runs to set the time and date, and to start the NTP daemon (NTPD). There
are three things you can do to configure this support:

1. Set up your timezone file. The I/O node looks for the timezone file<br />

/etc/localtime. This is a symbolic link to<br />

$BGL_SITEDISTDIR/etc/localtime. Thus, to set up your timezone file,<br />

copy your service node's /etc/localtime file to<br />

$BGL_SITEDISTDIR/etc/localtime. If the timezone file is not found,<br />

the<br />

time is assumed to be UTC.<br />

2. Set up options used by the S15xntpd script. These options are<br />

located<br />

in file /etc/sysconfig/xntp. The default file contains two options:<br />

XNTPD_INITIAL_NTPDATE="AUTO-2"<br />

Specifies which NTP servers will be queried for the initial time<br />

and<br />

date.<br />

"AUTO" Query all of the NTP servers listed in the NTPD<br />

configuration file<br />

"AUTO-n" Query the first "n" NTP servers listed in the NTPD<br />

configuration file<br />

"" Don't perform the initial query at all<br />

"address1 address2 ..." Query the specified NTP servers<br />

The NTPD configuration file is /etc/ntp.conf.<br />

The default is "AUTO-2".<br />

XNTPD_OPTIONS=""<br />

Parameters for the /sbin/ntpd NTP daemon. The following<br />

parameters<br />

are supported:<br />

/sbin/ntpd [ -abdgmnqx ] [ -c config_file ] [ -e e_delay ]<br />

[ -f freq_file ] [ -k key_file ] [ -l log_file ]<br />

[ -p pid_file ] [ -r broad_delay ] [ -s statdir ]<br />

[ -t trust_key ] [ -v sys_var ] [ -V<br />

default_sysvar ]<br />

[ -P fixed_process_priority ]<br />

The default is no parameters.<br />

Normally, you should not have to change these options. If you wish to
change them, you need to do this in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding lines similar to the following:

rm /etc/sysconfig/xntp<br />

echo "XNTPD_INITIAL_NTPDATE=\"AUTO-2\"" >> /etc/sysconfig/xntp<br />

echo "XNTPD_OPTIONS=\"-c /etc/ntp.conf\"" >> /etc/sysconfig/xntp<br />

3. Set up configuration information for the NTP daemon. This is stored in
   the /etc/ntp.conf file. The default file is created in the ramdisk by
   the S15xntpd script and contains the following:

   restrict default nomodify
   server $BGL_SNIP

   The default NTP server is the service node. Normally, you should not
   need to change these options. If you wish to supply your own file, you
   need to create it in your site customization script (S10sitefs -
   refer to "Typical Site Customization" in this README) by adding lines
   similar to the following:

echo "restrict default nomodify" >> /etc/ntp.conf<br />

echo "server $BGL_SNIP" >> /etc/ntp.conf<br />

For more information on NTP, refer to http://www.ntp.org/<br />

If you do not want NTP started on the I/O nodes, you can remove the
/etc/sysconfig/xntp file from the ramdisk in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding the following line to that script:

rm /etc/sysconfig/xntp<br />

Support for Syslog<br />

------------------<br />

During I/O node startup, the $BGL_DISTDIR/etc/rc.d/rc3.d/S45syslog script
runs to start up the syslogd and klogd daemons. There are three things you
can do to configure this support:

1. Set up options used by the S45syslog script. These options are<br />

located<br />

in file /etc/sysconfig/syslog. The default file contains the<br />

following<br />

options:<br />

KERNEL_LOGLEVEL=1<br />

Specifies the default logging level for kernel messages (0-7).<br />

Kernel<br />

messages having a priority level equal to or more severe than<br />

(less<br />

than) this value are logged by syslog. The logging levels are as<br />

follows:<br />

KERN_EMERG 0 System is unusable<br />

KERN_ALERT 1 Action must be taken immediately<br />

KERN_CRIT 2 Critical conditions<br />

KERN_ERR 3 Error conditions<br />

KERN_WARNING 4 Warning conditions<br />

KERN_NOTICE 5 Normal but significant condition<br />

KERN_INFO 6 Informational<br />

KERN_DEBUG 7 Debug-level messages<br />

The default is "1".<br />

SYSLOGD_PARAMS=""<br />

Parameters for the /sbin/syslogd daemon. The following parameters<br />

are<br />

supported:<br />

/sbin/syslog [ -a socket ] [ -d ] [ -h ]<br />

[ -p socket ] [ -f config file ] [ -n ]<br />

[ -l hostlist ] [ -m interval ] [ -r ]<br />

[ -t ] [ -s domainlist ] [ -v ]<br />

The default is no parameters. However, the S45syslog script<br />

always<br />

specifies "-n" because it is required when starting syslogd from<br />

the<br />

init process.<br />

KLOGD_PARAMS="-2"<br />

Parameters for the /sbin/klogd daemon. The following parameters<br />

are<br />

supported:<br />

/sbin/klogd [ -c n ] [ -d ] [ -f fname ] [ -iI ] [ -n ] [ -o<br />

]<br />

Appendix C. The ionode.README file 439


]<br />

to<br />

by<br />

440 IBM System Blue Gene Solution: <strong>Problem</strong> <strong>Determination</strong> <strong>Guide</strong><br />

[ -k fname ] [ -v ] [ -x ] [ -2 ] [ -s ] [ -p<br />

The default is "-2" and "-c KERNEL_LOGLEVEL".<br />

SYSLOG_DAEMON="syslogd"<br />

The name of the syslog daemon. The default is "syslogd".<br />

Normally, you should not have to change these options. If you wish to
change them, you need to do this in your site customization script
(S10sitefs - refer to "Typical Site Customization" in this README) by
adding lines similar to the following:

rm /etc/sysconfig/syslog<br />

echo "KERNEL_LOGLEVEL=1" >> /etc/sysconfig/syslog<br />

echo "SYSLOGD_PARAMS=\"\"" >> /etc/sysconfig/syslog<br />

echo "KLOGD_PARAMS=\"-2\"" >> /etc/sysconfig/syslog<br />

echo "SYSLOG_DAEMON=\"syslogd\"" >> /etc/sysconfig/syslog<br />

2. Set up configuration information for the syslog daemon. This is stored
   in the /etc/syslog.conf file. The default file is created in the
   ramdisk by the S45syslog script and contains the following:

   authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console

   This default file logs messages having priority level 0, 1, 2, or 3 to
   the I/O node console (the I/O node log file). If you wish to supply
   your own file, you need to create it in your site customization script
   (S10sitefs - refer to "Typical Site Customization" in this README) by
   adding lines similar to the following:

   echo "authpriv.none;*.emerg;*.alert;*.crit;*.err; /dev/console" \
        >> /etc/syslog.conf

3. If you want to send I/O node messages to a remote syslog server (using
   @server-name in the /etc/syslog.conf file), consider the following:
   a. When starting syslogd on the remote server, specify "-r -m 0" to
      enable it.
   b. If the remote server is logging to files, be aware that the files
      can get large. Consider using a utility to manage these files, such
      as logrotate.


For more information, refer to the man pages for syslog, syslogd, klogd,
and logrotate.

If you do not want syslog started on the I/O nodes, you can remove the
/etc/sysconfig/syslog file from the ramdisk in your site customization
script (S10sitefs - refer to "Typical Site Customization" in this README)
by adding the following line to that script:

rm /etc/sysconfig/syslog<br />

Support for GPFS<br />

----------------<br />

During I/O node startup, the GPFS client may optionally be started.
Assuming the appropriate GPFS setup has been done, the following describes
the steps for configuring the I/O node scripts to start GPFS and explains
the flow of these scripts.

In your site customization script (S10sitefs - refer to "Typical Site<br />

Customization" in this README), add the following line:<br />

echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />

This creates the /etc/sysconfig/gpfs file in the I/O node ramdisk and<br />

specifies that you want the GPFS client to be started.<br />

You may specify the following additional options in the same manner (the
default values are shown here, so you only need to specify these if you
want different values):

echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >> /etc/sysconfig/gpfs<br />

GPFS_VAR_DIR specifies the pathname to the "var" directory to be used by
the GPFS client. A per-I/O node directory is created under GPFS_VAR_DIR
for storing log files and configuration information. Note that if the
GPFS_VAR_DIR resides in an NFS exported file system, that export should
specify "no_root_squash" so that the I/O node root user can write to the
GPFS_VAR_DIR directory.

GPFS_CONFIG_SERVER specifies the host name or IP address of the primary<br />

GPFS cluster configuration server node. The GPFS configuration file<br />

(/var/mmfs/gen/mmsdrfs) for an I/O node is retrieved from this node, if<br />

necessary.<br />

When you specify to start the GPFS client, as described above, the<br />

following occurs during I/O node startup:<br />

1) The SSH daemon is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S16sshd
   script. SSH is used by GPFS to communicate among the I/O nodes and the
   service node. This script will also use the $BGL_SITEDISTDIR/etc/hosts
   file to set the hostname of this I/O node.

2) The GPFS client is started by the $BGL_DISTDIR/etc/rc.d/rc3.d/S40gpfs
   script.

Typical Site Customization<br />

--------------------------<br />

A Blue Gene site will normally have a script that customizes the startup
and shutdown. This script could perform the following functions:

1. Mount (during startup) and unmount (during shutdown) high performance
   file servers for application use. To ensure the file servers are
   available to applications,
   - the mount must occur after portmap is started by the S05nfs script
     and before ciod is started by the S50ciod script.
   - if using S45syslog to start syslog that logs to files in a mounted
     file system, the mount must occur before syslog is started.
   - the unmount must occur after ciod is ended by the K50ciod script and
     before portmap is ended by the K95nfs script.

   Recommendations for NFS mounts:
   a. The mount should be retried several times to accommodate a busy file
      server.


   b. Mount options:
      tcp - This provides automatic flow control when it detects that packets
            are being dropped due to network congestion or server overload.
      rsize,wsize - These specify read and write NFS buffer sizes. 8192 is
            the minimum recommended size, and 32768 is the maximum
            recommended size. If your server is not built with a kernel
            compiled with a 32768 size, it will negotiate down to what it
            can support. In general, the larger the size, the better the
            performance. However, depending on the capacity of your network
            and server, 32768 may be too large and cause excessive slowdowns
            or hangs during times of heavy I/O. This is something that each
            site needs to tune.
      async - Specifying this option may improve write performance, although
            there is greater risk for losing data if the file server crashes
            and is unable to get the data written to disk.

2. Point /tmp to a larger file system. The default /tmp is in a small
   ramdisk. Your applications may require a larger /tmp.

3. Set ciod environment variables. Refer to "Environment Variables used
   by ciod" in this README for details.

4. Set NTP parameters. Refer to "Support for Network Time Protocol (NTP)"
   in this README for details.

5. Set syslog parameters. Refer to "Support for Syslog" in this README for
   details.

6. Set GPFS parameters. Refer to "Support for GPFS" in this README for
   details.

Your site customization script should be in the



$BGL_SITEDISTDIR/etc/rc.d/rc3.d directory. For example, call it "sitefs".
Then, in order to properly place it in the startup and shutdown sequence,
symbolic links should be created to this script as follows:

ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/S10sitefs
ln -s $BGL_SITEDISTDIR/etc/rc.d/rc3.d/sitefs \
      $BGL_SITEDISTDIR/etc/rc.d/rc3.d/K90sitefs

The S10sitefs and K90sitefs links are sequenced such that they meet the
mount requirements described above.

During startup, S10sitefs will be called with the "start" parameter.<br />

During shutdown, K90sitefs will be called with the "stop" parameter.<br />

The following is an example of such a site customization script:<br />

#!/bin/sh<br />

#<br />

# Sample sitefs script.<br />

#<br />

# It mounts a filesystem on /scratch, mounts /home for user files<br />

# (applications), creates a symlink for /tmp to point into some<br />

directory<br />

# in /scratch using the IP address of the I/O node as part of the<br />

directory<br />

# name to make it unique to this I/O node, and sets up environment<br />

# variables for ciod.<br />

#<br />

. /proc/personality.sh<br />

. /etc/rc.status<br />

#-------------------------------------------------------------------<br />

# Function: mountSiteFs()<br />

#<br />

# Mount a site file system<br />

# Attempt the mount up to 5 times.<br />

# If all attempts fail, send a fatal RAS event so the block fails<br />

# to boot.<br />

#<br />

# Parameter 1: File server IP address<br />

# Parameter 2: Exported directory name<br />



# Parameter 3: Directory to be mounted over<br />

# Parameter 4: Mount options<br />

#------------------------------------------------------------------mountSiteFs()<br />

{<br />

# Make the directory to be mounted over<br />

[ -d $3 ] || mkdir $3<br />

# Set to attempt the mount 5 times<br />

ATTEMPT_LIMIT=5<br />

ATTEMPT=1<br />

# Attempt the mount up to 5 times.<br />

# Echo a message to the I/O node log for each failed attempt.<br />

until test $ATTEMPT -gt $ATTEMPT_LIMIT || mount $1:$2 $3 -o $4; do<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 failed"<br />

sleep $ATTEMPT<br />

let ATTEMPT=$ATTEMPT+1<br />

done<br />

# If all attempts failed, send a fatal RAS event so the block fails<br />

to<br />

# boot. If the mount worked, echo a message to the I/O node log.<br />

if test $ATTEMPT -gt $ATTEMPT_LIMIT; then<br />

echo "Mounting $1:$2 on $3 failed" > /proc/rasevent/kernel_fatal<br />

exit<br />

else<br />

echo "Attempt number $ATTEMPT to mount $1:$2 on $3 worked"<br />

fi<br />

}<br />

#-------------------------------------------------------------------<br />

# Script: sitefs()<br />

#<br />

# Perform site-specific functions during startup and shutdown<br />

#<br />

# Parameter 1: "start" - perform startup functions<br />

# "stop" - perform shutdown functions<br />

#-------------------------------------------------------------------<br />

# Set to ip address of site fileserver<br />

SITEFS=172.32.1.1<br />

# First reset status of this service<br />

rc_reset<br />



# Handle startup (start) and shutdown (stop)<br />

case "$1" in<br />

start)<br />

echo Mounting site filesystems<br />

# Mount a scratch file system...
mountSiteFs $SITEFS /scratch /scratch tcp,rsize=32768,wsize=32768,async

# Mount a home file system...<br />

mountSiteFs $SITEFS /home /home tcp,rsize=8192,wsize=8192<br />

# Arrange something for tmp...<br />

# - make a unique directory in the scratch file system.<br />

# - rename the original /tmp to /tmp.original.<br />

# - point /tmp to the unique directory.<br />

tmpdir=/scratch/ionodetmp/$BGL_IP<br />

[ -d $tmpdir ] || mkdir -p $tmpdir<br />

mv /tmp /tmp.original<br />

ln -s $tmpdir /tmp<br />

# Setup environment variables for ciod
echo "export CIOD_RDWR_BUFFER_SIZE=262144" >> /etc/sysconfig/ciod
echo "export DEBUG_SOCKET_STARTUP=ALL" >> /etc/sysconfig/ciod
# Uncomment the following line to not start the NTP daemon
# rm /etc/sysconfig/xntp
# Uncomment the following line to not start the syslogd and klogd
# daemons
# rm /etc/sysconfig/syslog
# Uncomment the first line to start GPFS.
# Optionally uncomment the other lines to change the defaults for
# GPFS.

# echo "GPFS_STARTUP=1" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_VAR_DIR=/bgl/gpfsvar" >> /etc/sysconfig/gpfs<br />

# echo "GPFS_CONFIG_SERVER=\"$BGL_SNIP\"" >><br />

/etc/sysconfig/gpfs<br />

rc_status -v

;;<br />

stop)<br />

echo Unmounting site filesystems<br />

# Put /tmp back<br />

rm /tmp<br />

mv /tmp.original /tmp<br />

# Unmount the scratch and home file systems<br />

umount -f /scratch<br />

umount -f /home<br />

;;<br />

restart)<br />

;;<br />

status)<br />

;;<br />

esac<br />

rc_exit<br />

#------End of script---------<br />

A more advanced script could select a fileserver based on the I/O node
location string (e.g. R01-M0-NE-I:J18-U01). Note that BGL_LOCATION comes
from sourcing /proc/personality.sh.

case "$BGL_LOCATION" in<br />

R00-*) SITEFS=172.32.1.1;; # rack 0 NFS server<br />

R01-*) SITEFS=172.32.1.2;; # rack 1 NFS server<br />

R02-*) SITEFS=172.32.1.3;; # rack 2 NFS server<br />

R03-*) SITEFS=172.32.1.4;; # rack 3 NFS server<br />

R04-*) SITEFS=172.32.1.5;; # rack 4 NFS server<br />

R05-*) SITEFS=172.32.1.6;; # rack 5 NFS server<br />

esac<br />

The script could use the $BGL_IP (ip addr of the I/O node) to compute the
fileserver(s). It also can use $BGL_BLOCKID so that different blocks can
mount special fileservers.



Support for Subnets<br />

-------------------<br />

Each midplane or group of midplanes can belong to a different subnet. To
take advantage of this, there are two things that need to be done:

1. Specify IP addresses for specific I/O nodes. This is done in the
   BglIpPool database table.
   The I/O nodes in the midplane(s) belonging to a particular subnet must
   have IP addresses belonging to that subnet. When the BglIpPool table is
   populated, the I/O node locations must be specified along with the IP
   addresses as in the following example:

db2 "INSERT INTO BglIpPool<br />

(location,machineSerialNumber,ipAddress)<br />

VALUES('R01-M0-N4-I:J18-U01','BGL', '172.30.100.244') "<br />

   If the BglIpPool table already exists, and subnet support is being
   added, the following steps must be done:
   a. The BglIpPool table must be deleted and recreated with the correct
      location-ipAddress pairings, as described above.

b. The IpAddress field within the BglNode table must be set to NULL:<br />

db2 "update bglnode set IpAddress = NULL"<br />

c. PostDiscovery must be re-run.<br />

2. Specify subnet information for each midplane in the BglMidplaneSubnet
   table, as in the following example:

   db2 "INSERT INTO BglMidplaneSubnet \
        (posInMachine,ipAddress,broadcast,nfsIpAddress) \
        VALUES('R000','172.27.138.254','172.27.138.255','172.27.96.117') "

   FIELD           DESCRIPTION                                     IONODE SCRIPT VARIABLE
   posInMachine    The midplane specification
   ipAddress       The IP address of the gateway for the subnet    BGL_GATEWAY
   broadcast       The broadcast address for the subnet            BGL_BROADCAST
   nfsIpAddress    The IP address of the /bgl file server for      BGL_FS
                   this midplane

   The ipAddress and broadcast fields are related to subnet support. The
   nfsIpAddress field enables multiple file servers to host /bgl, for
   improved performance.

Extracting the ramdisk from ramdisk.elf<br />

---------------------------------------<br />

For those who may wish to examine the contents of the I/O node's ramdisk,
here are instructions for extracting a ramdisk.img.gz from the ramdisk.elf
that is shipped with the Blue Gene driver, and mounting the ramdisk.

The ramdisk.img.gz is stored in the .data section of a standard ELF
image. It is stored as raw data with an 8-byte header that is needed at
runtime, which needs to be stripped off with the dd command.

cd <br />

objcopy --only-section .data --output-target binary \<br />

/bgl/BlueLight//ppc/bglsys/bin/ramdisk.elf image.tmp<br />

dd if=image.tmp of=ramdisk.img.gz bs=8 skip=1<br />

rm -f image.tmp<br />

The ramdisk image has been extracted from ramdisk.elf.<br />

Uncompress it.<br />

gunzip ramdisk.img.gz<br />

If gunzip works, skip this step and go to the "loopback mount" step.<br />

If gunzip fails with "unexpected end of file", truncate 1 byte from<br />

ramdisk.img.gz using the following commands, where xxxxxx is the size<br />

of ramdisk.img.gz minus 1 (as calculated from "ls -l ramdisk.img.gz"<br />

output):<br />

dd if=ramdisk.img.gz of=ramdisk.img2.gz count=1 bs=xxxxxx<br />



mv ramdisk.img2.gz ramdisk.img.gz<br />

Loopback mount ramdisk.img on directory "r".<br />

mkdir r<br />

mount -o loop ramdisk.img r<br />

Run commands to examine the ramdisk under directory "r".<br />

For example:<br />

find r -ls<br />
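When you have finished examining the contents, the loopback mount can be cleaned up again. This follow-up step is a generic suggestion rather than part of the README text:

umount r
rmdir r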




Abbreviations and acronyms<br />

BPM Bulk Power Module<br />

AIX Advanced Interactive Executive<br />

API Application Programming Interface<br />

ARP Address Resolution Protocol<br />

ASIC Application Specific Integrated Circuit

BGL Blue Gene Light<br />

BIST Built-In Self Test<br />

BLRTS Blue Gene Run Time System<br />

BPE Bulk Power Enclosure<br />

CDMF Commercial Data masking Facility<br />

CLI Command Line Interface<br />

CN Compute Node<br />

CNK Compute Node Kernel<br />

CPU Central Processing Unit<br />

CSM Cluster Systems Management

CWD Current Working Directory<br />

DES Data Encryption System<br />

DNS Domain Name System<br />

EOF End Of File<br />

FEN front-end Node<br />

FIFO First-In-First-Out<br />

FQDN Fully Qualified Domain Name<br />

GPFS General Parallel File System<br />

GPL GNU Public License<br />

GSA Global Storage Architecture<br />

GUI Graphical User Interface<br />

HPC High Performance computing<br />

HTML Hyper Text Markup Language<br />

IBM International Business Machines Corporation
IDEA International Data Encryption Algorithm
IEEE Institute of Electrical and Electronics Engineers

IO Input Output<br />

IP Internet Protocol<br />

ICMP Internet Control Message Protocol<br />

ITSO International Technical Support Organization
JTAG Joint Test Action Group

LED Light Emitting Diode<br />

LPAR Logical Partition<br />

LUN Logical Unit Number<br />

MCP Mini-Control Program<br />

MMCS Midplane Management Control System

MPI Message Passing Interface<br />

MSB Most Significant Bit<br />

MTU Maximum Transmission Unit<br />

NFS Network File System<br />

NSD Network Shared Disk<br />

NTP Network Time Protocol<br />

NUMA Non-Uniform Memory Access<br />

PGP Pretty Good Privacy<br />

PKI Public Key Infrastructure<br />

PPC Power PC®<br />

RAM Random Access Memory<br />

RAS Reliability Availability Serviceability<br />

RPC Remote Procedure Call<br />

RPM RedHat Package Manager<br />

RSA Rivest Shamir and Adelman<br />

RSCT Reliable Scalable Clustering Technology


RSH Remote Shell<br />

SHA Secure Hash Algorithm<br />

SLES SUSE Linux Enterprise Server<br />

SMP Symmetric Multi-Processing<br />

SN Service Node<br />

SP System Parallel<br />

SQL Structured Query Language<br />

SRAM Static Random Access Memory<br />

SSH Secure Shell<br />

SSL Secure Socket Layer<br />

TCP Transmission Control Protocol<br />

TCP/IP Transmission Control Protocol /<br />

Internet Protocol<br />

TWS Tivoli Workload Scheduler<br />

UDP User Datagram Protocol

UID User ID<br />

URL Universal Resource Locator<br />

VLSI Very Large Scale Integration<br />

XML Extensible Markup Language



Related publications<br />

IBM Redbooks<br />

The publications that we list in this section are considered particularly suitable for<br />

a more detailed discussion of the topics that we cover in this redbook.<br />

For information about ordering these publications, see “How to get IBM<br />

Redbooks” on page 454. Note that some of the documents referenced here<br />

might be available in softcopy only.<br />

► Unfolding the IBM eServer Blue Gene Solution, SG24-6686<br />

► Blue Gene/L: System Administration, SG24-7178<br />

Other publications<br />

These publications are also relevant as further information sources:<br />

► IBM General Parallel File System Concepts, Planning, and Installation Guide,
  GA22-7968-02
► IBM General Parallel File System Administration and Programming
  Reference, SA22-7967-02
► IBM General Parallel File System Problem Determination Guide,
  GA22-7969-02
► IBM LoadLeveler Using and Administering Guide, SA22-7881
► CSM for AIX 5L and Linux V1.5 Planning and Installation Guide,
  SA23-1344-01
► CSM for AIX 5L and Linux V1.5 Administration Guide, SA23-1343-01
► CSM for AIX 5L and Linux V1.5 Command and Technical Reference,
  SA23-1345-01
► RSCT Administration Guide, SA22-7889-10
► RSCT for Linux Technical Reference, SA22-7893-10


Online resources<br />

These Web sites and URLs are also relevant as further information sources:<br />

► GPFS Concepts, Planning, and Installation Guide, GA22-7968-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins10.html
► GPFS Administration and Programming Reference, SA22-7967-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1adm10/bl1adm10.html
► GPFS Problem Determination Guide, GA22-7969-02
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23/bl1pdg10/bl1pdg10.html
► GPFS Documentation updates
  http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs23_doc_updates/docerrata.html
► Cluster Systems Management documentation and updates
  http://www14.software.ibm.com/webapp/set2/sas/f/csm/home.html

How to get IBM Redbooks<br />
You can search for, view, or download Redbooks, Redpapers, Hints and Tips, draft publications, and Additional materials, as well as order hardcopy Redbooks or CD-ROMs, at this Web site:<br />
ibm.com/redbooks<br />
Help from IBM<br />
IBM Support and downloads<br />
ibm.com/support<br />
IBM Global Services<br />
ibm.com/services<br />



Index<br />

Symbols<br />

I/O node 218<br />

~/.rhosts 153–154<br />

A<br />

accountability 396<br />

allocate_block 162<br />

alternate Central Manager 171, 203<br />

Apache 120<br />

authentication 396, 398<br />

authorization 396<br />

authorized_keys 303, 402<br />

B<br />

back-end mpirun 150, 184<br />

Barrier 189<br />

Barrier and Global interrupt 21<br />

BG_ALLOW_LL_JOBS_ONLY 173, 183<br />

BG_CACHE_PARTIONS 173<br />

BG_CACHE_PARTITIONS 173<br />

BG_ENABLED 173<br />

BG_MIN_PARTITION_SIZE 173<br />

bgIO cluster 244, 251, 263<br />

BGL_DISTDIR 213<br />

BGL_EXPORTDIR 213<br />

BGL_OSDIR 213<br />

BGL_SITEDISTDIR 213<br />

BGLBASEPARTITION 67<br />

BGLBLOCK 36, 92<br />

BGLEVENTLOG 36<br />

BGLJOB 41<br />

bglmaster 31, 43, 62, 89, 115, 281, 283<br />

BGLMIDPLANE 68<br />

BGLNODECARDCOUNT 77<br />

BGLPROCESSORCARDCOUNT 77<br />

BGLSERVICEACTION 49<br />

BGLSERVICECARDCOUNT 70<br />

bglsysdb 287<br />

bgmksensor 390, 394<br />

BGWEB 57–59, 68, 71, 77, 80, 86, 108<br />

blc_powermodules 132<br />

blc_temperatures 132<br />

blc_voltages 132<br />

bll_lbist 133<br />

bll_lbist_linkreset 133<br />

bll_lbist_pgood 133<br />

bll_powermodules 134<br />

bll_temperatures 134<br />

bll_voltages 134<br />

block 93, 158, 162–163, 218<br />

Block Information 162<br />

Block information 123–124<br />

Block initialization 106<br />

Block monitoring 100<br />

blrts 7, 145<br />

blrts tool chain 145, 148, 322<br />

Blue Gene Light Runtime System 7<br />

Blue Gene/L 15, 199<br />

Blue Gene/L driver 339, 342<br />

Blue Gene/L libraries 173<br />

Blue Gene/L processes 56<br />

Blue Gene/L, core 56, 82–83<br />

Boot process 36<br />

booting a block 138<br />

BPE 15<br />

BPM 15, 451<br />

bridge API 170, 179–180<br />

bridge.config 176<br />

BRIDGE_CONFIG_FILE 155, 167, 175, 349, 351<br />

bs_trash_0 132<br />

bs_trash_1 132<br />

Bulk power enclosure 15<br />

Bulk power module 15<br />

Bulk power modules 114<br />

BusyBox 33<br />

C<br />

CableDiscovery 44, 46<br />

CDMF 397<br />

Central Manager 168–170, 178, 203, 375–376<br />

chkconfig 152<br />

ciod 34, 39, 162, 212, 274<br />

ciodb 31, 34, 41, 89, 136, 274, 280<br />

cipherList 259<br />

clock card 15, 68<br />



clock signal 11<br />

cluster 168<br />

Cluster Systems Management. See CSM.<br />

Cluster Wide File System 164<br />

CNK 7, 35, 38, 145, 212<br />

coaxial port 15<br />

collective communication 142–143, 189<br />

collective communication and computation 142<br />

collective network 21, 26, 32, 41<br />

compilers 141, 144<br />

compilers, IBM 145<br />

compute card 2, 7–8, 80, 127, 268<br />

Compute node kernel 7, 34, 212<br />

COMPUTENODES 81<br />

condition 386, 388<br />

control system server logs 99<br />

co-processor 7<br />

cryptographic techniques 395<br />

CSM 384<br />

CSM client 385<br />

CSM cluster 385<br />

D<br />

data encryption 396<br />

data signing 396<br />

database 29, 60<br />

database browser 119, 130<br />

database view 70<br />

db.properties 176, 287<br />

DB_PROPERTY 156, 167, 175, 349, 352<br />

DB2 29, 58, 87<br />

DB2 commands 161<br />

DB2 database 69, 119, 384<br />

DB2 statements 111<br />

DB2 stored procedure 391<br />

DB2 trigger 391<br />

db2cshrc 330<br />

db2iauto 88<br />

db2profile 156, 167, 330, 349<br />

db2set 88<br />

DES 397<br />

dgemm160 132<br />

dgemm160e 132<br />

dgemm3200 132<br />

dgemm3200e 132<br />

diagnostic data 31<br />

diagnostic tests 31<br />

diagnostics 106, 113, 119, 128, 131–132, 135<br />


digital signature 398<br />

discovery 12, 30, 43, 46, 50, 67<br />

discovery process 104<br />

dr_bitfail 132<br />

driver update 320<br />

dumpconv 235<br />

E<br />

emac_dg 132<br />

EndServiceAction 46, 49, 53, 122, 294<br />

environment data 30<br />

environmental information 114, 118, 126<br />

ethtool 84, 290<br />

event 386<br />

F<br />

fan card 13<br />

fan unit 14<br />

fans 114<br />

FEN (Front-End Node) 24, 32, 61, 103, 144<br />

File Server 217<br />

file systems 32, 64<br />

free_block 267<br />

front-end mpirun 150, 183<br />

Functional Ethernet 21<br />

functional network 24, 29, 56, 60, 267, 269<br />

G<br />

gcc compiler 148<br />

General Parallel File System. See GPFS.<br />

gensmallblock 95<br />

gi_single_chip 132<br />

gidcm 132<br />

global barrier and interrupt network 27–28<br />

GPFS 24, 56, 65, 211–212, 226, 230, 234, 244,<br />

266, 294, 303, 329, 346<br />

GPFS access patterns 226<br />

GPFS cross-cluster authentication 246–247<br />

GPFS Portability Layer 230, 321<br />

GPFS_STARTUP 229<br />

H<br />

hardware browser 121<br />

hardware monitor 113–114<br />

hardware problems 106<br />

Hash function 396<br />

hosts.equiv 153–154


I<br />

I/O card 77, 135<br />

I/O node 7–8, 24, 27, 33, 39, 62, 77, 95, 145, 149,<br />

168, 212, 237–238, 318<br />

I/O node boot sequence 212–213<br />

I²C 21–22, 24<br />

IBM compilers 145<br />

IBM XL compilers 145<br />

ibmcmp 214<br />

IDEA 398<br />

IDo 9, 22, 28, 36<br />

IDo chip 28, 30–31, 43<br />

IDo link 270<br />

idoproxy 36, 43, 89, 280–281, 283, 289<br />

idoproxydb 31, 136<br />

ifconfig 290<br />

Initializing the partition 182<br />

installation dist 214<br />

intake plenum 13<br />

Inter-Integrated Circuit 24<br />

IRecv 142<br />

J<br />

job command file 317<br />

job cycle 176<br />

job ID 193, 197–198<br />

job information 123, 125, 162<br />

job monitoring 100<br />

job runtime information 165<br />

job setup 157<br />

job state 363<br />

job states 163<br />

job status 96, 161–162<br />

job submission 36, 66, 141, 149, 158, 178, 349<br />

job tracking 158<br />

job_type 199<br />

JOBNAME 198<br />

jobstepid 193<br />

journaling file system 226<br />

JTAG 21–22<br />

K<br />

known_hosts 303, 402<br />

L<br />

ldconfig 173<br />

ldd 173<br />

link card 17, 22, 114<br />

link card chips 71<br />

linpack 133<br />

linuximg 33<br />

list_jobs 139<br />

llcancel 193<br />

llhold 366<br />

llinit 204<br />

llq 178, 193, 197, 203, 365<br />

llstatus 61, 172, 186, 188–189, 192, 203, 371<br />

llsubmit 36, 66, 178, 193, 198–199<br />

LoadL_admin 171, 203, 206<br />

LoadL_config 172, 178, 204, 207<br />

LoadLeveler 56, 66, 109, 124, 141, 162, 167–168,<br />

170, 205, 295, 358<br />

daemons 202<br />

job classes 205<br />

job command file 198<br />

job state 177<br />

job submission 178–179<br />

queue 182<br />

run queue 193<br />

LOCAL_CONFIG 205<br />

location 47<br />

location codes 3<br />

LocationString 46<br />

lscondition 386<br />

lsresponse 387<br />

lssensor 389<br />

lxtrace 235<br />

M<br />

massive parallel system 15<br />

Master 375<br />

MCP 7, 33, 35, 38, 212–213<br />

mem_l2_coherency 133<br />

Message Passing Interface 141–142<br />

methodology 2, 56, 106<br />

microloader 35, 38<br />

midplane 6, 8, 17, 22, 68–70, 75, 158, 189<br />

midplane information 123<br />

Midplane Management Control System 30, 149<br />

mini-control program 212<br />

mkcondition 390<br />

mkcondresp 386<br />

mkresponse 390<br />

mksensor 390<br />



mmauth 247–248, 261<br />

mmchconfig 247, 308<br />

mmchdisk 263<br />

MMCS 30, 141, 162<br />

MMCS console 113, 136<br />

mmcs_console 66, 96<br />

mmcs_db_console 91, 96, 136, 163, 220, 242, 252,<br />

267, 282–283, 295<br />

mmcs_db_server 31, 36, 39, 136, 282<br />

mmcs_server 89, 283<br />

MMCS_SERVER_IP 155, 166, 175, 349<br />

mmdsh 308<br />

mmfslinux 235<br />

mmgetstate 255, 298, 347–348<br />

mmlscluster 259<br />

mmremotefs 249–250, 257<br />

mmstartup 245<br />

monitoring tools 106<br />

MPI 56, 141, 147<br />

MPI library 142, 146<br />

MPI_Barrier 142–143<br />

MPI_Comm_rank 144<br />

MPI_Comm_size 144<br />

MPI_COMM_WORLD 142, 144<br />

MPI_Finalize 144<br />

MPI_Init 144<br />

MPI_IRecv 142<br />

MPI_Isend 142<br />

MPI_Recv 142<br />

MPI_Reduce 142<br />

MPI_Scatter 142<br />

MPI_Send 142<br />

mpicc 147<br />

mpicxx 147<br />

mpif77 147<br />

mpirun 36, 41, 66, 141, 149–150, 155, 158, 160,<br />

162–165, 170–171, 177, 183, 187, 200–201, 266,<br />

343, 347, 350<br />

only_test_protocol 356<br />

-verbose 370<br />

mpirun checklist 166<br />

ms_gen_short 133<br />

N<br />

naming convention 19<br />

Negotiator 168, 170, 189, 375<br />

NegotiatorLog 208<br />

netstat 208<br />


network switches 103<br />

network tuning parameters 223<br />

NFS 65, 90, 211–212, 215, 225, 266, 294, 296, 329<br />

NFS servers 215<br />

nfsd 219<br />

node card 8, 30, 48, 76, 104, 114<br />

node definition file 251<br />

node startup 212<br />

no-pty 155<br />

NSD 231<br />

NUMNODECARDS 77<br />

NUMSERVICECARDS 70<br />

O<br />

operational logs 113<br />

P<br />

pagepool 226, 244, 259, 261, 294, 297<br />

parallel 199<br />

parallel programming environment 141–142<br />

partition 180, 358<br />

partitions 158, 180, 197<br />

point-to-point 142<br />

point-to-point communication 142<br />

port mapper 294<br />

portmap 219<br />

PostDiscovery 43, 46, 54<br />

postdiscovery 50<br />

power_module_stress 133<br />

PowerPC 144<br />

PowerPC 440 7, 148<br />

PPC64 144<br />

ppcfloor 335<br />

predefined partitions 160<br />

PrepareForService 46–47, 50, 122, 289, 292, 294<br />

printf () 144<br />

problem definition 56<br />

problem determination methodology 55, 104, 254<br />

problem monitor 123<br />

processor card 80, 82<br />

Public Key Infrastructure (PKI) 396, 398<br />

Q<br />

quorum 244<br />

R<br />

rack 3, 104, 131, 158, 267


racks 66<br />

ramdisk 38<br />

RAS 30<br />

RAS data 30<br />

RAS event 36, 38, 108, 115, 118–119, 126–127,<br />

268<br />

Redbooks Web site 454<br />

Contact us xv<br />

Reliable Scalable Cluster Technology 385<br />

remote command execution 266, 399<br />

remote execution environment 151<br />

remote file copy 399<br />

remote shell 100, 151<br />

response 386, 388<br />

rpc.mountd 219<br />

RPM 147<br />

RSCT 385<br />

rsh 100<br />

rshd 153<br />

rundiag 135<br />

runtime 123<br />

S<br />

schedd 171, 185, 187<br />

scheduler 168<br />

scp 401<br />

secure shell 102, 151, 154, 383, 399<br />

secure socket layer 400<br />

sensor 388<br />

serial 199<br />

server logs 61<br />

service action 46, 121–122<br />

service card 11, 69, 114<br />

Service Network 21, 29, 267, 269–270, 273<br />

Service Node 22, 29, 32, 34, 44, 56, 59, 83, 103,<br />

135, 144, 155, 232, 237–238, 267, 273<br />

Service Node database 58<br />

serviceactionid 53<br />

sftp 401<br />

showmount 91, 218–219, 296<br />

site configuration 158<br />

site dist 214<br />

sitefs 217, 221, 259<br />

slogin 401<br />

SRAM 38<br />

ssh 401<br />

sshd 103, 155, 214, 334<br />

SSL 400<br />

Start daemon 168<br />

startcondresp 386, 389, 394<br />

startd 171–172, 182<br />

starter 172<br />

Starter process 168<br />

StarterLog 368<br />

status LED 12<br />

stderr 25<br />

stdin 25<br />

stdout 25<br />

submit_job 36, 66, 96, 141, 149, 267<br />

Symmetric key 396<br />

sysctl.conf 223<br />

system processes 267<br />

systemcontroller 43, 46, 50<br />

T<br />

TBGLIPPOOL 60<br />

TBGLJOB 111<br />

TBGLJOB_HISTORY 111<br />

TBLSERVICECARD 69<br />

ti_abist 133<br />

ti_edramabist 133<br />

ti_lbist 133<br />

Torus 17, 21, 26, 189<br />

tr_connectivity 133<br />

tr_loopback 133<br />

tr_multinode 133<br />

tracedev 235<br />

Triple DES 397<br />

ts_connectivity 133<br />

ts_loopback_0 133<br />

ts_loopback_1 133<br />

ts_multinode 133<br />

V<br />

virtual node 7<br />

W<br />

Web interface 113<br />

write_con 137, 221, 242<br />

X<br />

xinetd 152<br />



XL compilers 148<br />

XLC/XLF 148<br />

XLC/XLF compilers 148<br />

xlmass 149<br />

xlsmp 149<br />





IBM System Blue Gene Solution<br />
Problem Determination Guide<br />
Learn detailed procedures through illustrations<br />
Use sample scenarios as helpful resources<br />
Discover GPFS installation hints and tips<br />

SG24-7211-00 ISBN 0738496766<br />

Back cover<br />

This IBM Redbook is intended as a problem determination guide for system administrators in a High Performance Computing environment. It can help you find a solution to issues that you encounter on your IBM eServer Blue Gene system.<br />
This redbook includes a problem determination methodology along with the problem determination tools that are available with the basic IBM eServer Blue Gene Solution. It also discusses additional software components that are required for integrating your Blue Gene system in a complex computing environment.<br />
This redbook also describes a GPFS installation procedure that we used in our test environment, as well as several scenarios, developed following the proposed problem determination methodology, that describe possible issues and their resolution.<br />
Finally, this redbook includes a short introduction to integrating your Blue Gene system in a High Performance Computing environment managed by IBM Cluster Systems Management, as well as to using secure shell in such an environment.<br />

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION®<br />

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE<br />
IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.<br />
For more information:<br />
ibm.com/redbooks
